HDFS Overview
HDFS is the storage layer of Hadoop.
Its job is simple to say, but massive to implement:
Store huge datasets reliably across many machines,
and let processing frameworks (MapReduce, Spark, Hive) read/write efficiently.
1. Why HDFS Exists
Traditional local filesystems (ext4, NTFS) are not made for:
- Very large files (GB–TB)
- Distributed clusters
- Automatic replication
- Streaming large files for analytics
A single machine = single point of failure.
If that machine dies, your data dies.
HDFS solves that by:
- Splitting files into blocks
- Storing blocks across many machines
- Keeping replicas automatically
- Recovering from failures automatically
This is why Big Data systems use HDFS.
2. HDFS Architecture Overview
Very simple to visualize:
NameNode
- Stores metadata (not actual data)
- File names, permissions, block locations
- Keeps mapping:
/movies/moviesdata.jsonl → block1 on DN1, block2 on DN3, block3 on DN2
DataNode
- Stores the actual data blocks
- Sends heartbeat to NameNode
- Sends block reports regularly
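One way to picture how the NameNode learns block locations is as an inversion of the DataNodes' block reports. The sketch below is a simplified model, not NameNode code: `build_block_map` and the sample reports are hypothetical names chosen to match the mapping shown above.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Invert per-DataNode block reports into a block -> DataNodes mapping."""
    block_map = defaultdict(set)
    for datanode, blocks in block_reports.items():
        for block_id in blocks:
            block_map[block_id].add(datanode)
    # Sort the DataNode names so the result is stable and easy to read.
    return {block: sorted(nodes) for block, nodes in block_map.items()}


# Hypothetical reports: each DataNode lists the blocks it currently stores.
reports = {
    "DN1": ["block1"],
    "DN2": ["block3"],
    "DN3": ["block2"],
}

block_map = build_block_map(reports)
print(block_map)
```

With replication enabled, each block would simply appear in several reports and map to several DataNodes.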
3. How HDFS Stores Files Internally (Blocks)
This is the core of HDFS.
Default block size:
128 MB (older versions had 64MB)
When you upload a file:
```text
Suppose file size = 350MB
Block size = 128MB
```
HDFS breaks it like this:
```text
Block 1 = 128 MB
Block 2 = 128 MB
Block 3 = 94 MB
```
Blocks are distributed across DataNodes.
A multi-block file does not have to fit on any single machine; its blocks can be spread across the cluster.
That’s why large datasets become manageable.
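The block arithmetic above can be checked with a few lines of Python. This is only an illustration of the splitting rule; `split_into_blocks` is a hypothetical helper, and sizes are whole megabytes for simplicity.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size of each block for a file of the given size."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds the leftover data.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(350))  # [128, 128, 94]
```

Note that the final block holds only the remaining 94 MB; it is not padded to the full block size.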
4. Replication (Fault Tolerance)
Default replication factor = 3
Meaning each block is stored 3 times on 3 different DataNodes.
So:
```text
Block 1 → DN1, DN2, DN3
Block 2 → DN3, DN4, DN1
Block 3 → DN2, DN4, DN5
```
If one DataNode goes down, you still have two copies.
NameNode constantly ensures replication.
If a block replica is lost, NameNode automatically creates a new replica somewhere else.
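To make the re-replication idea concrete, here is a minimal sketch that finds which blocks fall below the replication factor when a DataNode dies. It reuses the example placement above; `under_replicated` is a hypothetical helper, not an HDFS API.

```python
REPLICATION_FACTOR = 3

# Example placement from the text above.
block_locations = {
    "block1": {"DN1", "DN2", "DN3"},
    "block2": {"DN3", "DN4", "DN1"},
    "block3": {"DN2", "DN4", "DN5"},
}


def under_replicated(locations, dead_nodes, target=REPLICATION_FACTOR):
    """Map each under-replicated block to the number of replicas it is missing."""
    dead = set(dead_nodes)
    missing = {}
    for block, nodes in locations.items():
        live = nodes - dead
        if len(live) < target:
            missing[block] = target - len(live)
    return missing


missing = under_replicated(block_locations, dead_nodes={"DN1"})
print(missing)  # {'block1': 1, 'block2': 1}
```

In this model, losing DN1 leaves block1 and block2 with two live copies each, so each needs one new replica; block3 is unaffected.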
5. Write Pipeline (How data enters HDFS)
This is important for deep understanding.
When you run:
```text
hdfs dfs -put file.txt /movies/
```
Process:
- Client asks NameNode: "I want to write a file."
- NameNode replies with:
"Here are 3 DataNodes for the first block."
- Client sends block data to DataNode 1.
- DataNode 1 streams it to DataNode 2.
- DataNode 2 streams it to DataNode 3.
- DataNode 3 sends acknowledgment back up the chain.
- Client asks for next block location.
So data flows like a pipeline:
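```text
Client → DN1 → DN2 → DN3
```
The client sends each block only once; the DataNodes forward it to one another, so the client's outbound traffic is not multiplied by the replication factor.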
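The pipeline can be modeled in a few lines. This is a toy simulation under stated assumptions: the block is cut into small packets (the names `p1`, `p2` are made up), each DataNode stores a packet and hands it to the next, and the acknowledgment path is simply the pipeline reversed. `write_block` is a hypothetical function, not part of any HDFS client.

```python
def write_block(packets, pipeline):
    """Simulate streaming packets through a chain of DataNodes.

    Returns what each DataNode stored and the path each acknowledgment took.
    """
    stored = {datanode: [] for datanode in pipeline}
    acks = []
    for packet in packets:
        # Forward direction: Client -> DN1 -> DN2 -> DN3
        for datanode in pipeline:
            stored[datanode].append(packet)
        # Acknowledgments travel back up the chain: DN3 -> DN2 -> DN1
        acks.append((packet, pipeline[::-1]))
    return stored, acks


stored, acks = write_block(["p1", "p2"], ["DN1", "DN2", "DN3"])
print(stored)
print(acks)
```

Every DataNode ends up with the same packets in the same order, while the client only ever talks to the first node in the chain.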
6. Read Pipeline (How HDFS returns data)
When you run:
```text
hdfs dfs -cat /movies/file.txt
```
Process:
- Client asks NameNode: "Where are the blocks?"
- NameNode sends list of block locations.
- Client picks the nearest DataNode.
- Reads block1, block2, block3 sequentially.
- Reconstructs file on the client side.
NameNode does NOT supply data.
Only DataNodes do.
That’s why NameNode is light and DataNodes handle heavy IO.
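"Nearest" can be sketched as a distance over the cluster topology. The example below assumes a simple two-level model (same host, same rack, different rack) with distances 0, 2 and 4; the rack and host names are invented, and `pick_nearest` is a hypothetical helper rather than client code.

```python
def network_distance(a, b):
    """Distance between two (rack, host) locations in a two-level topology."""
    if a == b:
        return 0  # same machine
    if a[0] == b[0]:
        return 2  # same rack, different machine
    return 4      # different racks


def pick_nearest(client, replicas):
    """Choose the replica location closest to the client."""
    return min(replicas, key=lambda replica: network_distance(client, replica))


client = ("rack1", "host7")
replicas = [("rack2", "host3"), ("rack1", "host2"), ("rack3", "host9")]

nearest = pick_nearest(client, replicas)
print(nearest)  # ('rack1', 'host2')
```

A replica on the client's own rack wins over replicas on other racks, which keeps read traffic off the inter-rack links where possible.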
7. NameNode Internals (Very important)
NameNode stores:
1. File system namespace
- Directory structure
- File names
- Permissions
2. Block map
Mapping of blockID → DataNode list
3. Two critical files:
- fsimage (checkpointed metadata snapshot)
- edits (log of recent filesystem operations)
Combined, these define the entire HDFS namespace.
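The relationship between the two files can be illustrated by replaying a log on top of a snapshot. This is a deliberately tiny model: the namespace is just a set of paths, the operation names (`create`, `mkdir`, `delete`) are assumptions for the example, and the real on-disk formats are far richer.

```python
def replay_edits(fsimage, edits):
    """Rebuild the current namespace from a snapshot plus a log of operations."""
    namespace = set(fsimage)
    for operation, path in edits:
        if operation in ("create", "mkdir"):
            namespace.add(path)
        elif operation == "delete":
            namespace.discard(path)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return namespace


# Snapshot taken at the last checkpoint (hypothetical contents).
fsimage = {"/", "/movies"}
# Operations recorded since that checkpoint.
edits = [
    ("create", "/movies/file.jsonl"),
    ("mkdir", "/tmp"),
    ("delete", "/tmp"),
]

namespace = replay_edits(fsimage, edits)
print(sorted(namespace))  # ['/', '/movies', '/movies/file.jsonl']
```

A checkpoint in this model would amount to saving `namespace` as the new snapshot and starting an empty log.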
NameNode memory
Metadata is held in RAM for speed.
This is why NameNode needs high memory.
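A back-of-the-envelope estimate shows why memory grows with the number of files and blocks. The figure of roughly 150 bytes per namespace object is a commonly quoted rule of thumb, not a value from this document; treat it, and the helper name `estimate_namenode_heap_bytes`, as assumptions.

```python
import math

# Assumption: rough rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(num_files, avg_file_size_mb,
                                 block_size_mb=128,
                                 bytes_per_object=BYTES_PER_OBJECT):
    """Estimate metadata memory: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(avg_file_size_mb / block_size_mb))
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


estimate = estimate_namenode_heap_bytes(num_files=1_000_000, avg_file_size_mb=350)
print(f"{estimate / 1e6:.0f} MB")  # 600 MB
```

The estimate ignores directories and JVM overhead, so it is best read as a lower bound on the order of magnitude.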
8. DataNode Internals
DataNodes store actual data in block form inside:
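```text
dfs.data.dir
```
In Hadoop 2 and later this property is named dfs.datanode.data.dir; dfs.data.dir is the older, deprecated name.
Each block is stored as an ordinary file on the DataNode's local Linux filesystem.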
DataNode sends 2 kinds of information regularly:
1. Heartbeats (every 3 seconds)
"Yes, I'm alive."
2. Block reports
"What blocks I have."
If a DataNode stops sending heartbeats for about 10 minutes (10 minutes 30 seconds with default settings):
- NameNode marks it as dead
- Re-replicates blocks elsewhere
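The dead-node timeout is derived from two settings rather than configured directly. The sketch below assumes the usual formula of twice the recheck interval plus ten heartbeat intervals, with the defaults of dfs.heartbeat.interval (3 seconds) and dfs.namenode.heartbeat.recheck-interval (300000 ms); the function name is hypothetical.

```python
def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Time without heartbeats after which a DataNode is considered dead."""
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


timeout = dead_node_timeout_seconds()
print(timeout)  # 630.0
```

With the defaults this gives 630 seconds, which is where the figure of roughly 10 minutes comes from.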
9. Block Placement Strategy
HDFS places replicas intelligently:
Replication factor = 3
Usual placement:
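- One replica on the local rack (on the writer's own node when the client runs on a DataNode)
- One on a node in a different rack
- One on a different node in the same rack as the second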
This avoids data loss even if an entire rack fails.
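The placement rule above can be sketched as a small function. This is a simplified model under stated assumptions: the writer runs on a DataNode, locations are (rack, host) pairs with invented names, and `choose_replica_targets` is a hypothetical helper that ignores load, disk space and other factors a real policy would weigh.

```python
import random


def choose_replica_targets(writer, nodes_by_rack, rng=random):
    """Pick three (rack, host) targets following the usual placement rule."""
    local_rack, _ = writer
    # Candidate remote racks need two nodes to host replicas two and three.
    remote_racks = [rack for rack, hosts in nodes_by_rack.items()
                    if rack != local_rack and len(hosts) >= 2]
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second_host, third_host = rng.sample(nodes_by_rack[remote_rack], 2)
    return [
        writer,                      # replica 1: local rack (writer's node)
        (remote_rack, second_host),  # replica 2: a different rack
        (remote_rack, third_host),   # replica 3: same rack as replica 2
    ]


cluster = {
    "rack1": ["host1", "host2"],
    "rack2": ["host3", "host4"],
    "rack3": ["host5", "host6"],
}
targets = choose_replica_targets(("rack1", "host1"), cluster, rng=random.Random(0))
print(targets)
```

Two racks hold the three replicas, so the block survives the loss of either rack while only one copy has to cross the inter-rack link during the write.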
10. Heartbeats
Heartbeat = health signal.
If heartbeats stop arriving:
- The DataNode is declared dead
- The replicas stored on it are considered lost
- NameNode re-replicates the affected blocks from the remaining copies
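Detecting a dead DataNode boils down to comparing the time of its last heartbeat against a timeout. The following is a minimal sketch with made-up timestamps in seconds; `find_dead_nodes` is a hypothetical helper, and the 600-second default stands in for the roughly ten-minute window, which is configurable in practice.

```python
def find_dead_nodes(last_heartbeat, now, timeout_s=600):
    """Return DataNodes whose last heartbeat is older than the timeout."""
    return sorted(
        datanode
        for datanode, timestamp in last_heartbeat.items()
        if now - timestamp > timeout_s
    )


# Hypothetical timestamps (seconds) of the most recent heartbeat per node.
last_heartbeat = {"DN1": 1500, "DN2": 1620, "DN3": 300}

dead = find_dead_nodes(last_heartbeat, now=1630)
print(dead)  # ['DN3']
```

Only DN3 has been silent for longer than the timeout, so only its replicas would be scheduled for re-replication.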
11. Rack Awareness
A Hadoop cluster is organized into racks of machines.
HDFS knows which DataNode is in which rack once an administrator supplies a topology mapping.
Goal:
- Avoid putting all replicas in the same rack
- Reduce network traffic
- Improve fault tolerance
Rack awareness is critical in large clusters.
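Rack information is typically supplied by a small script that the cluster calls with host addresses and that prints one rack path per address. The sketch below assumes that contract (configured through net.topology.script.file.name); the IP addresses and rack names are invented for illustration.

```python
import sys

# Hypothetical mapping; a real deployment would load this from inventory data.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return the rack path for each host, falling back to a default rack."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hosts arrive as command-line arguments; print one rack per argument.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

Unknown hosts fall into a single default rack, which means that without a proper mapping every node looks like it sits in the same rack.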
12. HDFS Commands in Context
Commands you use daily interact with NameNode and DataNode.
Example:
Create directory
```text
hdfs dfs -mkdir /movies
```
NameNode updates namespace.
Upload file
```text
hdfs dfs -put file.jsonl /movies
```
NameNode tells the client which DataNodes should store each block.
Check block locations
```text
hdfs fsck /movies/file.jsonl -files -blocks -locations
```
NameNode returns block metadata.
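If you want to drive these commands from a script, one option is a thin wrapper around the `hdfs` executable. This is a sketch, assuming `hdfs` is on the PATH of a configured client machine; `hdfs_command` and `run_if_available` are hypothetical helpers, and nothing is executed when the executable is missing.

```python
import shutil
import subprocess


def hdfs_command(*args):
    """Build the argument list for an hdfs CLI invocation."""
    return ["hdfs", *args]


def run_if_available(argv):
    """Run the command and return its stdout, or None if hdfs is not installed."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


# The three commands from this section, expressed as argument lists.
commands = [
    hdfs_command("dfs", "-mkdir", "/movies"),
    hdfs_command("dfs", "-put", "file.jsonl", "/movies"),
    hdfs_command("fsck", "/movies/file.jsonl", "-files", "-blocks", "-locations"),
]

for argv in commands:
    print(" ".join(argv))
```

Passing arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.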
13. HDFS vs Linux Filesystem
| Feature | Linux FS | HDFS |
|---|---|---|
| Use case | Local storage | Distributed big data storage |
| File size | MB–GB | GB–TB–PB |
| Fault tolerance | No | Yes (replication) |
| Block size | Small (typically 4 KB) | Large (128 MB by default) |
| Single machine | Yes | No |
| Writes | Append and in-place edits | Append only (no in-place edits) |
| Streaming large reads | Not optimized | Optimized |
HDFS is not a replacement for Linux FS.
You use HDFS for analytics datasets, not OS files.