MapReduce Things



1. WordCount

WordCount is the simplest and most important MapReduce example. It demonstrates how MapReduce processes text and counts how many times each word appears.

What WordCount Does

  • Reads text files
  • Breaks each line into words
  • Emits (word, 1) from the mapper
  • Reducer adds the counts
  • Produces final output as word → frequency
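The steps above can be sketched in plain Python. This is a toy, single-machine emulation of the map → shuffle → reduce flow (the real job runs in Java, distributed across the cluster):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word on the line
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    # Sum the 1s emitted for each occurrence of the word
    return (word, sum(counts))

lines = ["big data is big", "data moves fast"]
pairs = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())
# result == {"big": 2, "data": 2, "is": 1, "moves": 1, "fast": 1}
```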

Why It Is Important

  • Teaches the MapReduce flow: Map → Shuffle → Reduce
  • Shows distributed text processing
  • Works for logs, CSV, JSON, and plain text

How to Run WordCount

Input file example in HDFS:

```text
/data/data.csv
```

Commands:

  • For Windows

```bash
hdfs dfs -rm -r /data/output_wc
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar wordcount /data/data.csv /data/output_wc
```

  • For Mac

```bash
hdfs dfs -rm -r /data/output_wc
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar wordcount /data/data.csv /data/output_wc
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

View the output:

```bash
hdfs dfs -cat /data/output_wc/part-r-00000 | head
```

2. Grep

Grep searches for a specific pattern inside a distributed text file stored in HDFS.

What Grep Does

  • Scans each line
  • Looks for a pattern (example: "error")
  • Outputs each matching string together with how many times it matched
  • Useful for searching logs at scale
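The behavior described above can be emulated in a few lines of Python. This is a toy, single-process sketch: the map phase finds regex matches per line and the counts are aggregated, much as the distributed job does (the sample log lines are made up for illustration):

```python
import re
from collections import Counter

def grep_map(lines, pattern):
    # Map phase: emit every regex match found on every line
    regex = re.compile(pattern)
    for line in lines:
        for match in regex.findall(line):
            yield match

lines = [
    "2024-01-01 INFO startup complete",
    "2024-01-01 error: disk full",
    "2024-01-02 error: timeout",
]
# Reduce phase: count occurrences of each matched string
counts = Counter(grep_map(lines, "error"))
# counts == Counter({"error": 2})
```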

Why It Is Important

  • Demonstrates pattern matching in distributed mode
  • Helps with log analysis
  • Good for filtering large datasets

How to Run Grep

Search for "error" inside your CSV:

  • For Windows

```bash
hdfs dfs -rm -r /data/grep_output
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar grep /data/data.csv /data/grep_output "error"
```

  • For Mac

```bash
hdfs dfs -rm -r /data/grep_output
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar grep /data/data.csv /data/grep_output "error"
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

See results:

```bash
hdfs dfs -cat /data/grep_output/part-r-00000
```

3. Sort

Sort performs a full distributed sort on your data using MapReduce.

What Sort Does

  • Mapper reads lines and emits them as keys
  • Shuffle sorts keys globally
  • Reducers output fully sorted data
  • Useful for teaching shuffle and global ordering
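The mechanism in the list above can be sketched as a toy, single-machine emulation in Python: the mapper turns each line into a key, and the shuffle delivers keys in sorted order (in the real job, sorted globally across all reducers):

```python
def sort_map(lines):
    # Mapper: each input line becomes a key with an empty value
    return [(line, None) for line in lines]

def shuffle_sort(pairs):
    # The shuffle delivers keys to reducers in sorted order;
    # the reducers simply write them back out
    return sorted(pairs, key=lambda kv: kv[0])

lines = ["pear", "apple", "mango"]
output = [key for key, _ in shuffle_sort(sort_map(lines))]
# output == ["apple", "mango", "pear"]
```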

Why It Is Important

  • Shows how MapReduce can sort massive datasets
  • Tests the shuffle/sort performance of the cluster

How to Run Sort

Note: the bundled sort example expects SequenceFile input by default, so it may not accept a plain-text CSV without additional -inFormat/-outFormat options.

  • For Windows

```bash
hdfs dfs -rm -r /data/sort_output
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar sort /data/data.csv /data/sort_output
```

  • For Mac

```bash
hdfs dfs -rm -r /data/sort_output
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar sort /data/data.csv /data/sort_output
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

View sorted output:

```bash
hdfs dfs -cat /data/sort_output/part-r-00000 | head
```

4. TeraSort Suite

The TeraSort suite includes three programs:
  1. TeraGen
  2. TeraSort
  3. TeraValidate
This suite is used worldwide for Hadoop cluster benchmarking.

4.1 TeraGen

TeraGen generates large synthetic data (GBs or TBs) used for sorting tests.

What It Does

  • Creates high-volume random key/value data
  • Used to test HDFS write speed
  • Input for TeraSort
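A rough, single-process sketch of what TeraGen produces: fixed-width rows, each with a random 10-byte key and filler padding the row to 100 bytes. (This is only an illustration; the real TeraGen format differs in its exact byte layout and is written in parallel by many map tasks.)

```python
import random

def teragen_rows(n, seed=42):
    # Toy stand-in for TeraGen: each row is a random 10-byte key
    # plus fixed-width filler, 100 bytes per row
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        key = bytes(rng.randrange(32, 127) for _ in range(10))
        payload = str(i).zfill(88).encode()  # filler to pad the row
        rows.append(key + b"\t" + payload + b"\n")
    return rows

rows = teragen_rows(1000)
# 1000 rows x 100 bytes each
```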

Command Example

Generate data in HDFS:

  • For Windows

```bash
hdfs dfs -rm -r /data/teragen
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar teragen 1000000 /data/teragen
```

  • For Mac

```bash
hdfs dfs -rm -r /data/teragen
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar teragen 1000000 /data/teragen
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

(Here 1,000,000 is the number of 100-byte rows to generate, about 100 MB. Adjust as needed.)

4.2 TeraSort

TeraSort sorts the massive dataset generated by TeraGen.

What It Does

  • Performs a high-performance distributed sort
  • Tests cluster capability
  • Validates Hadoop shuffle and partitioning

Command

  • For Windows

```bash
hdfs dfs -rm -r /data/terasort_out
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar terasort /data/teragen /data/terasort_out
```

  • For Mac

```bash
hdfs dfs -rm -r /data/terasort_out
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar terasort /data/teragen /data/terasort_out
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

4.3 TeraValidate

Verifies that the sorted output of TeraSort is correct.

Command

  • For Windows

```bash
hdfs dfs -rm -r /data/teravalidate
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar teravalidate /data/terasort_out /data/teravalidate
```

  • For Mac

```bash
hdfs dfs -rm -r /data/teravalidate
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar teravalidate /data/terasort_out /data/teravalidate
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

5. RandomWriter

RandomWriter generates large volumes of random data in HDFS.

What RandomWriter Does

  • Writes massive random key/value records
  • Does not read input
  • Used for testing HDFS I/O performance
  • Often used before TeraSort to generate input

Why It Exists

  • For cluster stress testing
  • For generating benchmark datasets
  • For demonstrating parallel data generation

How to Run RandomWriter

  • For Windows

```bash
hdfs dfs -rm -r /data/random_data
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar randomwriter /data/random_data
```

  • For Mac

```bash
hdfs dfs -rm -r /data/random_data
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar randomwriter /data/random_data
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

This generates very large files (by default, roughly 10 GB per node), so be careful with storage space.

6. Pi Estimator

This program estimates the value of π using the Monte Carlo method.

What It Does

  • Runs multiple map tasks
  • Each mapper generates random points
  • Reducer aggregates the results
  • Computes an approximation of π
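The Monte Carlo scheme in the list above can be emulated in a few lines of Python. Each "map task" throws random darts at the unit square and counts how many land inside the quarter-circle; the "reduce" step aggregates the counts (a toy, single-process version; task and sample counts here are illustrative):

```python
import random

def pi_map(samples, seed):
    # One "map task": count random points inside the unit quarter-circle
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

def pi_reduce(inside_counts, samples_per_task):
    # Aggregate across tasks: pi ~= 4 * (points inside) / (total points)
    total = samples_per_task * len(inside_counts)
    return 4.0 * sum(inside_counts) / total

num_tasks, samples = 5, 100_000
estimate = pi_reduce([pi_map(samples, seed) for seed in range(num_tasks)], samples)
```

More tasks and more samples per task give a tighter estimate, at the cost of more compute.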

Why It Is Important

  • Shows how mathematical computations can scale
  • Demonstrates parallel simulation
  • Useful for understanding statistical MapReduce jobs

How to Run Pi

  • For Windows

```bash
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar pi 5 1000
```

  • For Mac

```bash
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar pi 5 1000
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

Explanation: 5 is the number of map tasks and 1000 is the number of samples per task, so the job uses 5,000 random points in total.