MapReduce Things



1. WordCount

WordCount is the simplest and most important MapReduce example. It demonstrates how MapReduce processes text and counts how many times each word appears.

What WordCount Does

  • Reads text files
  • Breaks each line into words
  • Emits (word, 1) from the mapper
  • Reducer adds the counts
  • Produces final output as word → frequency
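The steps above can be sketched in plain Python. This is a toy, single-machine emulation of the map → shuffle → reduce flow (the real job runs in Java, distributed across the cluster):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word on the line
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    # Sum the 1s emitted for each occurrence of the word
    return (word, sum(counts))

lines = ["big data is big", "data moves fast"]
pairs = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())
# result == {"big": 2, "data": 2, "is": 1, "moves": 1, "fast": 1}
```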

Why It Is Important

  • Teaches the MapReduce flow: Map → Shuffle → Reduce
  • Shows distributed text processing
  • Works for logs, CSV, JSON, and plain text

How to Run WordCount

Input file example in HDFS:

```text
/data/data.csv
```

Commands:

  • For Windows

```bash
hdfs dfs -rm -r /data/output_wc
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar wordcount /data/data.csv /data/output_wc
```

  • For Mac

```bash
hdfs dfs -rm -r /data/output_wc
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar wordcount /data/data.csv /data/output_wc
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

View the output:

```bash
hdfs dfs -cat /data/output_wc/part-r-00000 | head
```

2. Grep

Grep searches for a specific pattern inside a distributed text file stored in HDFS.

What Grep Does

  • Scans each line
  • Looks for a pattern (example: "error")
  • Outputs each matching string together with how many times it matched
  • Useful for searching logs at scale
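The behavior described above can be emulated in a few lines of Python. This is a toy, single-process sketch: the map phase finds regex matches per line and the counts are aggregated, much as the distributed job does (the sample log lines are made up for illustration):

```python
import re
from collections import Counter

def grep_map(lines, pattern):
    # Map phase: emit every regex match found on every line
    regex = re.compile(pattern)
    for line in lines:
        for match in regex.findall(line):
            yield match

lines = [
    "2024-01-01 INFO startup complete",
    "2024-01-01 error: disk full",
    "2024-01-02 error: timeout",
]
# Reduce phase: count occurrences of each matched string
counts = Counter(grep_map(lines, "error"))
# counts == Counter({"error": 2})
```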

Why It Is Important

  • Demonstrates pattern matching in distributed mode
  • Helps with log analysis
  • Good for filtering large datasets

How to Run Grep

Search for "error" inside your CSV:

  • For Windows

```bash
hdfs dfs -rm -r /data/grep_output
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar grep /data/data.csv /data/grep_output "error"
```

  • For Mac

```bash
hdfs dfs -rm -r /data/grep_output
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar grep /data/data.csv /data/grep_output "error"
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

See results:

```bash
hdfs dfs -cat /data/grep_output/part-r-00000
```

3. Sort

Sort performs a full distributed sort on your data using MapReduce.

What Sort Does

  • Mapper reads lines and emits them as keys
  • Shuffle sorts keys globally
  • Reducers output fully sorted data
  • Useful for teaching shuffle and global ordering
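The mechanism in the list above can be sketched as a toy, single-machine emulation in Python: the mapper turns each line into a key, and the shuffle delivers keys in sorted order (in the real job, sorted globally across all reducers):

```python
def sort_map(lines):
    # Mapper: each input line becomes a key with an empty value
    return [(line, None) for line in lines]

def shuffle_sort(pairs):
    # The shuffle delivers keys to reducers in sorted order;
    # the reducers simply write them back out
    return sorted(pairs, key=lambda kv: kv[0])

lines = ["pear", "apple", "mango"]
output = [key for key, _ in shuffle_sort(sort_map(lines))]
# output == ["apple", "mango", "pear"]
```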

Why It Is Important

  • Shows how MapReduce can sort massive datasets
  • Tests the shuffle/sort performance of the cluster

How to Run Sort

Note: the bundled sort example expects SequenceFile input by default, so it may not accept a plain-text CSV without additional -inFormat/-outFormat options.

  • For Windows

```bash
hdfs dfs -rm -r /data/sort_output
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar sort /data/data.csv /data/sort_output
```

  • For Mac

```bash
hdfs dfs -rm -r /data/sort_output
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar sort /data/data.csv /data/sort_output
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

View sorted output:

```bash
hdfs dfs -cat /data/sort_output/part-r-00000 | head
```

4. TeraSort Suite

The TeraSort suite includes three programs:
  1. TeraGen
  2. TeraSort
  3. TeraValidate
This suite is used worldwide for Hadoop cluster benchmarking.

4.1 TeraGen

TeraGen generates large synthetic data (GBs or TBs) used for sorting tests.

What It Does

  • Creates high-volume random key/value data
  • Used to test HDFS write speed
  • Input for TeraSort
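A rough, single-process sketch of what TeraGen produces: fixed-width rows, each with a random 10-byte key and filler padding the row to 100 bytes. (This is only an illustration; the real TeraGen format differs in its exact byte layout and is written in parallel by many map tasks.)

```python
import random

def teragen_rows(n, seed=42):
    # Toy stand-in for TeraGen: each row is a random 10-byte key
    # plus fixed-width filler, 100 bytes per row
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        key = bytes(rng.randrange(32, 127) for _ in range(10))
        payload = str(i).zfill(88).encode()  # filler to pad the row
        rows.append(key + b"\t" + payload + b"\n")
    return rows

rows = teragen_rows(1000)
# 1000 rows x 100 bytes each
```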

Command Example

Generate data in HDFS:

  • For Windows

```bash
hdfs dfs -rm -r /data/teragen
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar teragen 1000000 /data/teragen
```

  • For Mac

```bash
hdfs dfs -rm -r /data/teragen
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar teragen 1000000 /data/teragen
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

(Here 1,000,000 is the number of 100-byte rows to generate, about 100 MB. Adjust as needed.)

4.2 TeraSort

TeraSort sorts the massive dataset generated by TeraGen.

What It Does

  • Performs a high-performance distributed sort
  • Tests cluster capability
  • Validates Hadoop shuffle and partitioning

Command

  • For Windows

```bash
hdfs dfs -rm -r /data/terasort_out
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar terasort /data/teragen /data/terasort_out
```

  • For Mac

```bash
hdfs dfs -rm -r /data/terasort_out
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar terasort /data/teragen /data/terasort_out
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

4.3 TeraValidate

Verifies that the sorted output of TeraSort is correct.

Command

  • For Windows

```bash
hdfs dfs -rm -r /data/teravalidate
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar teravalidate /data/terasort_out /data/teravalidate
```

  • For Mac

```bash
hdfs dfs -rm -r /data/teravalidate
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar teravalidate /data/terasort_out /data/teravalidate
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

5. RandomWriter

RandomWriter generates large volumes of random data in HDFS.

What RandomWriter Does

  • Writes massive random key/value records
  • Does not read input
  • Used for testing HDFS I/O performance
  • Often used before TeraSort to generate input

Why It Exists

  • For cluster stress testing
  • For generating benchmark datasets
  • For demonstrating parallel data generation

How to Run RandomWriter

  • For Windows

```bash
hdfs dfs -rm -r /data/random_data
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar randomwriter /data/random_data
```

  • For Mac

```bash
hdfs dfs -rm -r /data/random_data
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar randomwriter /data/random_data
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

This generates very large files (by default, roughly 10 GB per node), so be careful with storage space.

6. Pi Estimator

This program estimates the value of π using the Monte Carlo method.

What It Does

  • Runs multiple map tasks
  • Each mapper generates random points
  • Reducer aggregates the results
  • Computes an approximation of π
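The Monte Carlo scheme in the list above can be emulated in a few lines of Python. Each "map task" throws random darts at the unit square and counts how many land inside the quarter-circle; the "reduce" step aggregates the counts (a toy, single-process version; task and sample counts here are illustrative):

```python
import random

def pi_map(samples, seed):
    # One "map task": count random points inside the unit quarter-circle
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

def pi_reduce(inside_counts, samples_per_task):
    # Aggregate across tasks: pi ~= 4 * (points inside) / (total points)
    total = samples_per_task * len(inside_counts)
    return 4.0 * sum(inside_counts) / total

num_tasks, samples = 5, 100_000
estimate = pi_reduce([pi_map(samples, seed) for seed in range(num_tasks)], samples)
```

More tasks and more samples per task give a tighter estimate, at the cost of more compute.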

Why It Is Important

  • Shows how mathematical computations can scale
  • Demonstrates parallel simulation
  • Useful for understanding statistical MapReduce jobs

How to Run Pi

  • For Windows

```bash
hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar pi 5 1000
```

  • For Mac

```bash
hadoop jar /opt/homebrew/Cellar/hadoop/3.4.2/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.2.jar pi 5 1000
# If this fails, replace the jar directory with /opt/homebrew/opt/hadoop/libexec/share/hadoop/mapreduce/ and run again.
```

Explanation: 5 is the number of map tasks and 1000 is the number of samples per task, so the job uses 5,000 random points in total.