Spark Setup


We follow these principles:
  • Download Spark in a temporary folder
  • Install Spark permanently in /opt/spark
  • Configure environment variables
  • Validate Spark installation
  • Validate Spark ↔ Hadoop integration
This is the exact setup used in real Hadoop + Spark environments.
If you are on macOS, you can skip the steps below; Spark was already configured as part of [Hadoop_Installation(for Mac).md](../Hadoop_Installation(for Mac).md). Jump to step 7 in this file to test connectivity and confirm everything works.

1. Download Spark in a Temporary Folder (/tmp)

Why /tmp?

  • It is a safe, temporary workspace
  • Downloading in /tmp avoids cluttering your home directory
  • If the download gets corrupted, deleting it is safe

Command

```bash
cd /tmp
```

2. Download Spark (Hadoop 3 Compatible Build)

```bash
wget https://downloads.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz
```

Explanation

  • wget downloads files from the internet.
  • This is the official Apache Spark 4.x version built for Hadoop 3.
  • The file will be stored in /tmp.
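Optionally, you can verify the archive downloaded intact. Apache publishes a SHA-512 checksum alongside each release; a minimal sketch, assuming the checksum file follows the usual `.sha512` naming convention (if `sha512sum -c` complains about the file format, compare the printed hashes manually):
```bash
cd /tmp
# Fetch the checksum published alongside the release
wget https://downloads.apache.org/spark/spark-4.0.1/spark-4.0.1-bin-hadoop3.tgz.sha512
# Compare it against the downloaded archive
sha512sum -c spark-4.0.1-bin-hadoop3.tgz.sha512
```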

3. Create the Final Installation Directory

Spark should NOT be installed inside /tmp.
Instead, we place it in a permanent system location:
```bash
cd
sudo mkdir -p /opt/spark
```

Explanation

  • /opt is the conventional location for manually installed software (industry standard)
  • -p creates the folder (and any missing parents) without failing if it already exists
  • sudo is required because /opt belongs to root
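At this point the folder exists but is still owned by root (we take ownership in step 5). A quick way to confirm it was created:
```bash
# Long listing of the directory itself; owner and group will show root root for now
ls -ld /opt/spark
```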

4. Extract Spark Into the Installation Folder

```bash
sudo tar -xzf /tmp/spark-4.0.1-bin-hadoop3.tgz -C /opt/spark --strip-components=1
```

Explanation

  • tar -xzf extracts a gzip-compressed .tgz archive
  • -C /opt/spark tells tar to extract inside that folder
  • --strip-components=1 removes the top-level folder inside the archive
(so files are placed cleanly directly under /opt/spark)
After this step:
  • your Spark installation = /opt/spark
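To confirm the extraction landed where expected, list the folder; a standard Spark distribution has top-level directories such as bin, conf, jars, and sbin:
```bash
# You should see Spark's own folders here, not a nested spark-4.0.1-bin-hadoop3 directory
ls /opt/spark
```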

5. Give Yourself Ownership of Spark Folder

```bash
sudo chown -R $(whoami):$(whoami) /opt/spark
```

Why this is necessary

  • /opt/spark belongs to root
  • You want to run Spark without sudo
  • This command gives you full control over the folder
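You can confirm the ownership change with stat (GNU coreutils syntax, as on most Linux distributions):
```bash
# Should print your username as both owner and group, e.g. "youruser:youruser"
stat -c '%U:%G' /opt/spark
```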

6. Add Spark + Hadoop Variables to .bashrc

Edit your .bashrc:
```bash
nano ~/.bashrc
```
Add these lines at the bottom:
```bash
# Spark Setup
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin

# Hadoop Config (required for HDFS access)
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
```

Explanation

  • SPARK_HOME tells Linux where Spark is installed
  • Adding Spark’s bin folder to PATH allows running:
      • spark-shell
      • pyspark
      • spark-submit
  • HADOOP_CONF_DIR is critical:
      • Spark uses Hadoop’s config files to access HDFS and YARN
      • Without it, Spark cannot read hdfs:/// paths
Reload .bashrc:
```bash
source ~/.bashrc
```
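A quick sanity check that the new variables are live in your shell:
```bash
# Both should print non-empty paths
echo "$SPARK_HOME"
echo "$HADOOP_CONF_DIR"
# Should resolve to /opt/spark/bin/spark-shell
which spark-shell
```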

7. Validate Spark Installation

Check version:
```bash
spark-shell --version
```
If Spark prints its version → installation is correct.
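The same version banner is available through spark-submit, which you will use later for batch jobs:
```bash
# Prints the Spark version banner, like spark-shell --version
spark-submit --version
```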

8. Launch Spark Shell to Confirm It Starts

```bash
spark-shell
```
If the interactive shell opens, Spark is successfully installed.
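If you prefer a non-interactive smoke test, you can pipe a one-line job into the shell; a minimal sketch, assuming spark-shell is on your PATH:
```bash
# Runs a tiny local job and exits; the job should print 10
echo 'println(spark.range(10).count())' | spark-shell
```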

9. Test Spark ↔ HDFS connection

Inside spark-shell:
  • For Windows
```scala
val data = sc.textFile("hdfs:///data/data.csv")
data.take(5)
```
  • For Mac
```scala
val data = sc.textFile("hdfs://localhost:9000/<foldername>/data.csv")
data.take(5)
```
Check the <foldername> with "hdfs dfs -ls /"; you'll see something like "/nitish/data.csv" or "/data.csv".
If this prints lines from HDFS → Spark can read your Hadoop data and is ready to replace MapReduce.
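As one last end-to-end check, you can submit an example job that ships with Spark. A minimal sketch: the jar name below assumes the Spark 4.0.1 / Scala 2.13 build, so adjust it to match what you find under $SPARK_HOME/examples/jars:
```bash
# Submit the bundled SparkPi example; look for "Pi is roughly ..." in the output
spark-submit --class org.apache.spark.examples.SparkPi \
  "$SPARK_HOME"/examples/jars/spark-examples_2.13-4.0.1.jar 10
```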