Hadoop Structure

When you install Hadoop on Linux/WSL, all files usually live under a single root folder (such as /usr/local/hadoop or /opt/hadoop; on macOS with Homebrew, /opt/homebrew/opt/hadoop).
For example, if Hadoop is installed at /usr/local/hadoop, then the folder structure looks something like this:

```text
/usr/local/hadoop
├── bin
├── sbin
├── etc
│   └── hadoop
├── share
│   └── hadoop
│       ├── common
│       ├── hdfs
│       ├── mapreduce
│       └── yarn
├── lib
├── logs
└── tmp
```
Let’s explore each directory in detail.

1. bin/

Location:
/usr/local/hadoop/bin/
Purpose:
Contains Hadoop client commands (executables) that you use to interact with HDFS or run jobs.
These are the commands you usually type in the terminal.
Examples of commands here:
  • hadoop
  • hdfs
  • yarn
  • mapred
When you run:

```bash
hdfs dfs -ls /
```

the hdfs command comes from this folder.

2. sbin/

Location:
/usr/local/hadoop/sbin/
Purpose:
Contains system-level scripts to start/stop Hadoop services.
Common scripts:
  • start-dfs.sh: Starts HDFS daemons (NameNode, DataNode)
  • stop-dfs.sh: Stops HDFS daemons
  • start-yarn.sh: Starts YARN daemons (ResourceManager, NodeManager)
  • stop-yarn.sh: Stops YARN daemons
You use these to get your cluster up and running.

3. etc/hadoop/

Location:
/usr/local/hadoop/etc/hadoop/
Purpose:
This is your Hadoop configuration directory.
All important config files that you manually edit are located here.
Let’s go through the major ones:

3.1 core-site.xml

  • Defines core Hadoop settings (e.g., default filesystem URI)
  • Example: fs.defaultFS = hdfs://localhost:9000
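Putting that setting into a file, a minimal core-site.xml for a single-node setup might look like this (a sketch only; the host and port should match your environment):

```xml
<configuration>
  <property>
    <!-- URI that clients use to reach the default filesystem (HDFS) -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```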

3.2 hdfs-site.xml

  • Configures NameNode and DataNode settings
  • Settings include replication factor, data directory paths, etc.
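As a sketch, a minimal hdfs-site.xml for a single-node cluster might set replication to 1 and point at local data directories (the paths below are illustrative and match the data-directory layout described later in this guide):

```xml
<configuration>
  <property>
    <!-- Single machine, so keep only one copy of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- Where the NameNode keeps its metadata -->
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/data/namenode</value>
  </property>
  <property>
    <!-- Where the DataNode stores actual blocks -->
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/data/datanode</value>
  </property>
</configuration>
```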

3.3 yarn-site.xml

  • Configures YARN resource manager and node manager
  • Controls memory allocation, scheduling, etc.
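For a single-node setup, a minimal yarn-site.xml usually enables the MapReduce shuffle service; the memory figure below is only an example and should be tuned to your machine:

```xml
<configuration>
  <property>
    <!-- Required so MapReduce jobs can shuffle map output to reducers -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <!-- Total memory (MB) this NodeManager may hand out to containers -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
</configuration>
```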

3.4 mapred-site.xml

  • Configures MapReduce execution settings
  • Example: sets MapReduce framework to run on YARN
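The key setting is usually a single property; a minimal mapred-site.xml might look like:

```xml
<configuration>
  <property>
    <!-- Run MapReduce jobs on YARN instead of the legacy local runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```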

3.5 hadoop-env.sh

  • Sets environment variables
  • Most important: JAVA_HOME must be set correctly here
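For example, on Ubuntu/WSL with OpenJDK 11, the relevant line in hadoop-env.sh might look like this (the exact path is an assumption; it depends on which JDK you installed):

```bash
# Adjust to your actual JDK location; this path is typical for
# OpenJDK 11 installed via apt on Ubuntu/WSL.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```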

4. share/hadoop/

Location:
/usr/local/hadoop/share/hadoop/
Contains libraries and JARs for:
  • Hadoop Common
  • HDFS
  • MapReduce
  • YARN
Each subdirectory contains code and dependent files required to run jobs.
Example:

```text
/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-3.3.1.jar
```
You generally don't need to modify anything in this directory.

5. lib/

Location:
/usr/local/hadoop/lib/
Contains native libraries used by Hadoop at runtime.
Again — you don't edit this directly, but Hadoop uses it internally.

6. logs/

Location:
/usr/local/hadoop/logs/
Logs generated by Hadoop during runtime for different daemons.
Example logs (the file names include your username and hostname):
  • hadoop-<user>-namenode-<host>.log
  • hadoop-<user>-datanode-<host>.log
  • yarn-<user>-resourcemanager-<host>.log
When things go wrong, you check these logs.

7. tmp/

Location:
/usr/local/hadoop/tmp/
This directory is used as temporary storage for Hadoop.
If you configured this directory in core-site.xml:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/tmp</value>
</property>
```

Hadoop stores NameNode metadata and temporary files here.

8. Data Directories (NameNode/DataNode)

These folders are created by you and referenced in configuration inside hdfs-site.xml.
Example entries in your config:
```xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/data/namenode</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/data/datanode</value>
</property>
```
These directories store persistent data such as HDFS metadata (NameNode) and actual blocks (DataNode).
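Before the first start, these directories are created and the NameNode storage is formatted once; a sketch, assuming the paths above:

```bash
# Create the local directories referenced in hdfs-site.xml
mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode

# One-time initialization of the NameNode metadata store.
# WARNING: re-running this on an existing cluster erases HDFS metadata.
hdfs namenode -format
```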

What Hadoop Provides After Installation

When Hadoop is fully installed and configured, it gives you four major layers of functionality along with several tools and daemons.

1. Hadoop Distributed File System (HDFS)

What it is:
A distributed file system that lets you store large files across multiple machines
Key components:
  • NameNode (Master)
Stores metadata — file paths, permissions, block locations
Think of it as a "file system manager"
  • Secondary NameNode
Assists NameNode by taking periodic snapshots of metadata
Note: Not a standby node; just used for checkpointing
  • DataNode (Worker)
Stores actual data blocks
Each DataNode talks to the NameNode and serves read/write requests
What you can do:
  • Upload files into HDFS
  • Read data from HDFS like a local file system, but distributed
Commands:
hdfs dfs -ls, -put, -get, -rm, etc.
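A short sketch of these commands in action, assuming a running cluster and a local file named sample.txt (all paths are illustrative):

```bash
hdfs dfs -mkdir -p /user/hadoop            # create a directory in HDFS
hdfs dfs -put sample.txt /user/hadoop/     # upload a local file
hdfs dfs -ls /user/hadoop                  # list the directory
hdfs dfs -get /user/hadoop/sample.txt .    # download it back
hdfs dfs -rm /user/hadoop/sample.txt       # delete it from HDFS
```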

2. YARN (Yet Another Resource Negotiator)

What it is:
A cluster management layer that manages resources and schedules jobs.
Key components:
  • ResourceManager (Master)
Allocates resources to applications (like Spark or MapReduce jobs)
  • NodeManager (Worker)
Runs on every node and manages resources for a single machine
Communicates with ResourceManager
Why it’s important:
Without YARN, Hadoop can't execute and manage jobs efficiently in a distributed cluster.
Web UI example:
Check status at: http://localhost:8088/cluster

3. MapReduce (Optional Processing Engine)

What it is:
A processing model to handle large-scale data processing in parallel (older method — now often replaced by Spark)
Key components:
  • Map Phase: Splits and transforms input data
  • Reduce Phase: Aggregates or combines mapped data
You get:
  • mapred command-line tool
  • Ability to run MapReduce jobs on your cluster
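Hadoop ships a jar of example jobs under share/hadoop/mapreduce/ that makes a convenient smoke test; the classic one is wordcount (the version number and input/output paths below are illustrative and must match your install):

```bash
# Run the bundled wordcount example on a directory already in HDFS
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar \
  wordcount /user/hadoop/input /user/hadoop/output

# Results land in part-r-* files inside the output directory
hdfs dfs -cat /user/hadoop/output/part-r-00000 | head
```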

4. Hadoop Common

What it is:
A collection of utility libraries and tools used across Hadoop modules
Includes:
  • Java libraries
  • Configuration utilities
  • System tools
These are used internally by HDFS, YARN, and MapReduce.

Full List of Hadoop Daemons Installed

  • NameNode: Manages HDFS metadata
  • DataNode: Stores blocks of data
  • Secondary NameNode: Supports the NameNode with checkpointing
  • ResourceManager: Manages YARN cluster resources
  • NodeManager: Runs tasks on each node (as assigned by the ResourceManager)
  • JobHistoryServer (if enabled): Stores history of finished MapReduce jobs
You can check which of these are running using:

```bash
jps
```

Example output:

```text
NameNode
DataNode
ResourceManager
NodeManager
SecondaryNameNode
```

Hadoop Commands You Get

After installation, you get these commands available in bin/:
  • hdfs → Hadoop File System commands
Example: hdfs dfs -ls /
  • hadoop → General Hadoop command wrapper
Example: hadoop version
  • yarn → For submitting and managing YARN jobs
Example: yarn application -list
  • mapred → Commands for MapReduce jobs
Example: mapred job -status <job_id>

What You Can Do After Setup

  • Store any large files in HDFS
  • Run distributed processing jobs
  • Connect Spark, Hive, or other engines to use Hadoop infrastructure
  • Build end-to-end data pipelines

Guide: Import data.csv into HDFS on WSL

1. Make Sure Hadoop is Running

First, start all required Hadoop services:
```bash
start-dfs.sh
start-yarn.sh
```

Confirm the services using:

```bash
jps
```

Expected output includes:

```text
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
```

2. Locate data.csv in WSL

Your Windows files are accessible in WSL under /mnt/c.
The full path to your CSV file in WSL will be:
```text
/mnt/c/Users/harsh/Downloads/data.csv
```

To check if the file exists:

```bash
ls /mnt/c/Users/harsh/Downloads/
```
You should see data.csv in the output.

3. Copy File to WSL Home (Optional)

You can either work directly from /mnt/c or copy it to your WSL user directory.
To copy:
```bash
cp /mnt/c/Users/harsh/Downloads/data.csv ~/data.csv
```

Then confirm:

```bash
ls ~
```

4. Create a Directory in HDFS to Upload Data

Create a directory in HDFS where your CSV will be stored:
```bash
hdfs dfs -mkdir -p /user/hadoop/data
```

5. Upload CSV into HDFS

If your file is still in the Windows path:
```bash
hdfs dfs -put /mnt/c/Users/harsh/Downloads/data.csv /user/hadoop/data/
```

Or, if you copied it into your WSL home:

```bash
hdfs dfs -put ~/data.csv /user/hadoop/data/
```

6. Verify Upload

To list the file in HDFS:
```bash
hdfs dfs -ls /user/hadoop/data
```

You should see something like:

```text
-rw-r--r-- 1 hadoop supergroup XXXXXXX /user/hadoop/data/data.csv
```

7. View Data in HDFS

To see first few lines of the file in HDFS:
```bash
hdfs dfs -cat /user/hadoop/data/data.csv | head
```