Hadoop Structure

When you install Hadoop on Linux/WSL, all files usually live under a single root folder (such as /usr/local/hadoop or /opt/hadoop; on macOS with Homebrew, /opt/homebrew/opt/hadoop).
For example, if Hadoop is installed at /usr/local/hadoop, then the folder structure looks something like this:

```text
/usr/local/hadoop
├── bin
├── sbin
├── etc
│   └── hadoop
├── share
│   └── hadoop
│       ├── common
│       ├── hdfs
│       ├── mapreduce
│       └── yarn
├── lib
├── logs
└── tmp
```
Let’s explore each directory in detail.

1. bin/

Location:
/usr/local/hadoop/bin/
Purpose:
Contains Hadoop client commands (executables) that you use to interact with HDFS or run jobs.
These are the commands you usually type in the terminal.
Examples of commands here:
  • hadoop
  • hdfs
  • yarn
  • mapred
When you run:

```bash
hdfs dfs -ls /
```

the hdfs command comes from this folder.

2. sbin/

Location:
/usr/local/hadoop/sbin/
Purpose:
Contains system-level scripts to start/stop Hadoop services.
Common scripts:
  • start-dfs.sh: Starts HDFS daemons (NameNode, DataNode)
  • stop-dfs.sh: Stops HDFS daemons
  • start-yarn.sh: Starts YARN daemons (ResourceManager, NodeManager)
  • stop-yarn.sh: Stops YARN daemons
You use these to get your cluster up and running.

3. etc/hadoop/

Location:
/usr/local/hadoop/etc/hadoop/
Purpose:
This is your Hadoop configuration directory.
All important config files that you manually edit are located here.
Let’s go through the major ones:

3.1 core-site.xml

  • Defines core Hadoop settings (e.g., default filesystem URI)
  • Example: fs.defaultFS = hdfs://localhost:9000
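Putting that setting into a file, a minimal core-site.xml for a single-node setup might look like this (a sketch only; the host and port should match your environment):

```xml
<configuration>
  <property>
    <!-- URI that clients use to reach the default filesystem (HDFS) -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```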

3.2 hdfs-site.xml

  • Configures NameNode and DataNode settings
  • Settings include replication factor, data directory paths, etc.
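As a sketch, a minimal hdfs-site.xml for a single-node cluster might set replication to 1 and point at local data directories (the paths below are illustrative and match the data-directory layout described later in this guide):

```xml
<configuration>
  <property>
    <!-- Single machine, so keep only one copy of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- Where the NameNode keeps its metadata -->
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/data/namenode</value>
  </property>
  <property>
    <!-- Where the DataNode stores actual blocks -->
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/data/datanode</value>
  </property>
</configuration>
```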

3.3 yarn-site.xml

  • Configures YARN resource manager and node manager
  • Controls memory allocation, scheduling, etc.
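For a single-node setup, a minimal yarn-site.xml usually enables the MapReduce shuffle service; the memory figure below is only an example and should be tuned to your machine:

```xml
<configuration>
  <property>
    <!-- Required so MapReduce jobs can shuffle map output to reducers -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <!-- Total memory (MB) this NodeManager may hand out to containers -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
</configuration>
```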

3.4 mapred-site.xml

  • Configures MapReduce execution settings
  • Example: sets MapReduce framework to run on YARN
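The key setting is usually a single property; a minimal mapred-site.xml might look like:

```xml
<configuration>
  <property>
    <!-- Run MapReduce jobs on YARN instead of the legacy local runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```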

3.5 hadoop-env.sh

  • Sets environment variables
  • Most important: JAVA_HOME must be set correctly here
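For example, on Ubuntu/WSL with OpenJDK 11, the relevant line in hadoop-env.sh might look like this (the exact path is an assumption; it depends on which JDK you installed):

```bash
# Adjust to your actual JDK location; this path is typical for
# OpenJDK 11 installed via apt on Ubuntu/WSL.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```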

4. share/hadoop/

Location:
/usr/local/hadoop/share/hadoop/
Contains libraries and JARs for:
  • Hadoop Common
  • HDFS
  • MapReduce
  • YARN
Each subdirectory contains code and dependent files required to run jobs.
Example:

```text
/usr/local/hadoop/share/hadoop/hdfs/hadoop-hdfs-3.3.1.jar
```
You generally don't need to modify anything in this directory.

5. lib/

Location:
/usr/local/hadoop/lib/
Contains native libraries used by Hadoop at runtime.
Again — you don't edit this directly, but Hadoop uses it internally.

6. logs/

Location:
/usr/local/hadoop/logs/
Logs generated by Hadoop during runtime for different daemons.
Example logs (the file names include your username and hostname):
  • hadoop-<user>-namenode-<host>.log
  • hadoop-<user>-datanode-<host>.log
  • yarn-<user>-resourcemanager-<host>.log
When things go wrong, you check these logs.

7. tmp/

Location:
/usr/local/hadoop/tmp/
This directory is used as temporary storage for Hadoop.
If you configured this directory in core-site.xml:

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/tmp</value>
</property>
```

Hadoop stores NameNode metadata and temporary files here.

8. Data Directories (NameNode/DataNode)

These folders are created by you and referenced in configuration inside hdfs-site.xml.
Example entries in your config:
```xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/data/namenode</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/data/datanode</value>
</property>
```
These directories store persistent data such as HDFS metadata (NameNode) and actual blocks (DataNode).
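Before the first start, these directories are created and the NameNode storage is formatted once; a sketch, assuming the paths above:

```bash
# Create the local directories referenced in hdfs-site.xml
mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode

# One-time initialization of the NameNode metadata store.
# WARNING: re-running this on an existing cluster erases HDFS metadata.
hdfs namenode -format
```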

What Hadoop Provides After Installation

When Hadoop is fully installed and configured, it gives you four major layers of functionality along with several tools and daemons.

1. Hadoop Distributed File System (HDFS)

What it is:
A distributed file system that lets you store large files across multiple machines
Key components:
  • NameNode (Master)
Stores metadata — file paths, permissions, block locations
Think of it as a "file system manager"
  • Secondary NameNode
Assists NameNode by taking periodic snapshots of metadata
Note: Not a standby node; just used for checkpointing
  • DataNode (Worker)
Stores actual data blocks
Each DataNode talks to the NameNode and serves read/write requests
What you can do:
  • Upload files into HDFS
  • Read data from HDFS like a local file system, but distributed
Commands:
hdfs dfs -ls, -put, -get, -rm, etc.
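A short sketch of these commands in action, assuming a running cluster and a local file named sample.txt (all paths are illustrative):

```bash
hdfs dfs -mkdir -p /user/hadoop            # create a directory in HDFS
hdfs dfs -put sample.txt /user/hadoop/     # upload a local file
hdfs dfs -ls /user/hadoop                  # list the directory
hdfs dfs -get /user/hadoop/sample.txt .    # download it back
hdfs dfs -rm /user/hadoop/sample.txt       # delete it from HDFS
```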

2. YARN (Yet Another Resource Negotiator)

What it is:
A cluster management layer that manages resources and schedules jobs.
Key components:
  • ResourceManager (Master)
Allocates resources to applications (like Spark or MapReduce jobs)
  • NodeManager (Worker)
Runs on every node and manages resources for a single machine
Communicates with ResourceManager
Why it’s important:
Without YARN, Hadoop can't execute and manage jobs efficiently in a distributed cluster.
Web UI example:
Check status at: http://localhost:8088/cluster

3. MapReduce (Optional Processing Engine)

What it is:
A processing model to handle large-scale data processing in parallel (older method — now often replaced by Spark)
Key components:
  • Map Phase: Splits and transforms input data
  • Reduce Phase: Aggregates or combines mapped data
You get:
  • mapred command-line tool
  • Ability to run MapReduce jobs on your cluster
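Hadoop ships a jar of example jobs under share/hadoop/mapreduce/ that makes a convenient smoke test; the classic one is wordcount (the version number and input/output paths below are illustrative and must match your install):

```bash
# Run the bundled wordcount example on a directory already in HDFS
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar \
  wordcount /user/hadoop/input /user/hadoop/output

# Results land in part-r-* files inside the output directory
hdfs dfs -cat /user/hadoop/output/part-r-00000 | head
```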

4. Hadoop Common

What it is:
A collection of utility libraries and tools used across Hadoop modules
Includes:
  • Java libraries
  • Configuration utilities
  • System tools
These are used internally by HDFS, YARN, and MapReduce.

Full List of Hadoop Daemons Installed

  • NameNode: Manages HDFS metadata
  • DataNode: Stores blocks of data
  • Secondary NameNode: Supports the NameNode with checkpointing
  • ResourceManager: Manages YARN cluster resources
  • NodeManager: Runs tasks on each node (as assigned by the ResourceManager)
  • JobHistoryServer (if enabled): Stores history of finished MapReduce jobs
You can check which of these are running using:

```bash
jps
```

Example output:

```text
NameNode
DataNode
ResourceManager
NodeManager
SecondaryNameNode
```

Hadoop Commands You Get

After installation, you get these commands available in bin/:
  • hdfs → Hadoop File System commands
Example: hdfs dfs -ls /
  • hadoop → General Hadoop command wrapper
Example: hadoop version
  • yarn → For submitting and managing YARN jobs
Example: yarn application -list
  • mapred → Commands for MapReduce jobs
Example: mapred job -status <job_id>

What You Can Do After Setup

  • Store any large files in HDFS
  • Run distributed processing jobs
  • Connect Spark, Hive, or other engines to use Hadoop infrastructure
  • Build end-to-end data pipelines

Guide: Import data.csv into HDFS on WSL

1. Make Sure Hadoop is Running

First, start all required Hadoop services:
```bash
start-dfs.sh
start-yarn.sh
```

Confirm the services using:

```bash
jps
```

Expected output includes:

```text
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
```

2. Locate data.csv in WSL

Your Windows files are accessible in WSL under /mnt/c.
The full path to your CSV file in WSL will be:
```text
/mnt/c/Users/harsh/Downloads/data.csv
```

To check if the file exists:

```bash
ls /mnt/c/Users/harsh/Downloads/
```
You should see data.csv in the output.

3. Copy File to WSL Home (Optional)

You can either work directly from /mnt/c or copy it to your WSL user directory.
To copy:
```bash
cp /mnt/c/Users/harsh/Downloads/data.csv ~/data.csv
```

Then confirm:

```bash
ls ~
```

4. Create a Directory in HDFS to Upload Data

Create a directory in HDFS where your CSV will be stored:
```bash
hdfs dfs -mkdir -p /user/hadoop/data
```

5. Upload CSV into HDFS

If your file is still in the Windows path:
```bash
hdfs dfs -put /mnt/c/Users/harsh/Downloads/data.csv /user/hadoop/data/
```

Or, if you copied it into your WSL home:

```bash
hdfs dfs -put ~/data.csv /user/hadoop/data/
```

6. Verify Upload

To list the file in HDFS:
```bash
hdfs dfs -ls /user/hadoop/data
```

You should see something like:

```text
-rw-r--r-- 1 hadoop supergroup XXXXXXX /user/hadoop/data/data.csv
```

7. View Data in HDFS

To see first few lines of the file in HDFS:
```bash
hdfs dfs -cat /user/hadoop/data/data.csv | head
```