HDFS Overview

HDFS is the storage layer of Hadoop. Its job is simple to say, but massive to implement:

Store huge datasets reliably across many machines, and let processing frameworks (MapReduce, Spark, Hive) read/write efficiently.

1. Why HDFS Exists

Traditional local filesystems (ext4, NTFS) are not made for:

Very large files (GB–TB)
Distributed clusters
Automatic replication
Streaming large files for analytics

A single machine = single point of failure.

If that machine dies, your data dies.

HDFS solves that by:

Splitting files into blocks
Storing blocks across many machines
Keeping replicas automatically
Recovering from failures automatically

This is why Big Data systems use HDFS.

2. HDFS Architecture Overview

Very simple to visualize:

NameNode

Stores metadata (not actual data)
File names, permissions, block locations
Keeps mapping:

/movies/moviesdata.jsonl → block1 on DN1, block2 on DN3, block3 on DN2

DataNode

Stores the actual data blocks
Sends heartbeat to NameNode
Sends block reports regularly

3. How HDFS Stores Files Internally (Blocks)

This is the core of HDFS.

Default block size:

128 MB (older versions had 64MB)

When you upload a file:

text

Suppose file size = 350MB
Block size = 128MB

HDFS breaks it like this:

text

Block 1 = 128 MB
Block 2 = 128 MB
Block 3 = 94 MB

Blocks are distributed across DataNodes.

HDFS never stores the entire file on a single machine.

That’s why large datasets become manageable.

4. Replication (Fault Tolerance)

Default replication factor = 3

Meaning each block is stored 3 times on 3 different DataNodes.

So:

text

Block 1 → DN1, DN2, DN3
Block 2 → DN3, DN4, DN1
Block 3 → DN2, DN4, DN5

If one DataNode goes down, you still have two copies.

NameNode constantly ensures replication.

If a block replica is lost, NameNode automatically creates a new replica somewhere else.

5. Write Pipeline (How data enters HDFS)

This is important for deep understanding.

When you run:

text

hdfs dfs -put file.txt /movies/

Process:

Client asks NameNode: "I want to write a file."
NameNode replies with:

"Here are 3 DataNodes for the first block."

Client sends block data to DataNode 1.
DataNode 1 streams it to DataNode 2.
DataNode 2 streams it to DataNode 3.
DataNode 3 sends acknowledgment back up the chain.
Client asks for next block location.

So data flows like a pipeline:

text

Client → DN1 → DN2 → DN3

This is extremely efficient.

6. Read Pipeline (How HDFS returns data)

When you run:

text

hdfs dfs -cat /movies/file.txt

Process:

Client asks NameNode: "Where are the blocks?"
NameNode sends list of block locations.
Client picks the nearest DataNode.
Reads block1, block2, block3 sequentially.
Reconstructs file on the client side.

NameNode does NOT supply data.

Only DataNodes do.

That’s why NameNode is light and DataNodes handle heavy IO.

7. NameNode Internals (Very important)

NameNode stores:

1. File system namespace

Directory structure
File names
Permissions

2. Block map

Mapping of blockID → DataNode list

3. Two critical files:

fsimage (checkpointed metadata snapshot)
edits (log of recent filesystem operations)

Combined, these define the entire HDFS namespace.

NameNode memory

Metadata is held in RAM for speed.

This is why NameNode needs high memory.

8. DataNode Internals

DataNodes store actual data in block form inside:

text

dfs.data.dir

Each block is a file on Linux filesystem.

DataNode sends 2 kinds of information regularly:

1. Heartbeats (every 3 seconds)

"Yes, I'm alive."

2. Block reports

"What blocks I have."

If DataNode stops sending heartbeat for 10 minutes:

NameNode marks it as dead
Re-replicates blocks elsewhere

9. Block Placement Strategy

HDFS places replicas intelligently:

Replication factor = 3

Usual placement:

One replica on local rack
One on a different rack
One on same rack as the second

This avoids data loss even if an entire rack fails.

10. Heartbeats

Heartbeat = health signal.

If heartbeat missing:

DataNode declared dead
Blocks on it are considered lost
NameNode replicates from remaining copies

11. Rack Awareness

Hadoop cluster has racks of machines.

HDFS knows which DataNode is in which rack.

Goal:

Avoid putting all replicas in the same rack
Reduce network traffic
Improve fault tolerance

Rack awareness is critical in large clusters.

12. HDFS Commands in Context

Commands you use daily interact with NameNode and DataNode.

Example:

Create directory

text

hdfs dfs -mkdir /movies

NameNode updates namespace.

Upload file

text

hdfs dfs -put file.jsonl /movies

NameNode tells DataNodes where to store blocks.

Check block locations

text

hdfs fsck /movies/file.jsonl -files -blocks -locations

NameNode returns block metadata.

13. HDFS vs Linux Filesystem

Feature	Linux FS	HDFS
Use case	Local storage	Distributed big data storage
File size	MB–GB	GB–TB–PB
Fault tolerance	No	Yes (replication)
Data blocks	No	Yes
Single machine	Yes	No
Append	Limited	Supported
Streaming large reads	Not optimized	Optimized

HDFS is not a replacement for Linux FS.

You use HDFS for analytics datasets, not OS files.