Big Data Survival Guide

Spark Overview

Apache Spark is a modern, high-performance, distributed data processing engine designed to work with big data. It allows developers to process large datasets quickly using parallel computation across clusters.

Spark was created to overcome the limitations of MapReduce, especially its slow, disk-heavy processing model. Spark keeps intermediate data in memory wherever possible, which makes it much faster for many workloads. Its higher-level APIs also make it easier to use and suitable for a broader range of applications, including streaming, machine learning, graph processing, and SQL analytics.

Spark is widely used in companies for large-scale data analytics, real-time pipelines, ETL, and machine learning workflows.

Why Spark is Needed

Big data creates challenges that single-machine programs cannot handle:

Data volumes exceed RAM and disk limits of one computer
Processing must run in parallel
Jobs must be fault tolerant
Processing must support batch, streaming, and ML workloads
Developers need simple APIs (SQL, Python, DataFrames)

Spark meets all of these requirements using a distributed cluster architecture.

Spark vs MapReduce

Feature	Spark	MapReduce
Processing model	In-memory	Disk-based
Speed	Often 10x to 100x faster on in-memory, iterative workloads; gains vary by job	Slow because of constant disk I/O
Ease of use	High (Python, Scala, SQL, DataFrames)	Low (Java-heavy, verbose)
Real-time processing	Near real-time (Structured Streaming, micro-batches)	No (batch only)
Machine learning support	Built-in MLlib	Not built-in
Iterative algorithms	Very fast	Very slow
API flexibility	SQL, DataFrames, RDD, Streaming	Only Map/Reduce paradigm
Fault tolerance	Yes	Yes
Usage today	Very high	Mostly legacy

Key insight:

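MapReduce writes map output to local disk and the output of every job to HDFS, so a multi-step pipeline reads and writes disk at each step, which slows it down. Spark keeps intermediate data in memory between steps where it can, avoiding repeated disk operations.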
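
To make the difference concrete, here is a minimal plain-Python sketch (not Spark or Hadoop code). The function names `run_disk_based` and `run_in_memory` and the three-step pipeline are made up for illustration; temporary JSON files stand in for HDFS.

```python
import json
import os
import tempfile

# A hypothetical three-step pipeline; each step takes and returns a list.
STEPS = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x for x in xs if x % 4 == 0],
]


def run_disk_based(data, steps):
    """Write every intermediate result to a file and read it back."""
    disk_writes = 0
    with tempfile.TemporaryDirectory() as tmp:
        for i, step in enumerate(steps):
            data = step(data)
            path = os.path.join(tmp, f"step_{i}.json")
            with open(path, "w") as f:
                json.dump(data, f)
            disk_writes += 1
            with open(path) as f:
                data = json.load(f)
    return data, disk_writes


def run_in_memory(data, steps):
    """Pass each intermediate result straight to the next step."""
    for step in steps:
        data = step(data)
    return data, 0  # no intermediate files were written


disk_result, disk_writes = run_disk_based(list(range(10)), STEPS)
mem_result, mem_writes = run_in_memory(list(range(10)), STEPS)
print(disk_result == mem_result, disk_writes, mem_writes)
```

Both versions produce the same result; the disk-based one simply pays for one write and one read per step.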

Understanding Spark at a High Level

Spark operates on clusters of machines. It distributes data across multiple nodes and executes processing tasks in parallel.

Spark consists of three main actors:

Driver Program

This is the main application that you write.
It creates the SparkSession, builds execution plans, and coordinates the executors.

Cluster Manager

Allocates resources to Spark.
Examples: YARN, Kubernetes, Standalone.

Executors

Worker processes that run tasks.
They perform actual data processing.
They store data in memory for fast reuse.

Spark is designed around the idea of data parallelism and distributed execution.
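
The division of labor between the three actors can be imitated on one machine. The sketch below is a toy model in plain Python, not Spark code: a thread pool plays the role of the executors handed out by a cluster manager, and the hypothetical `driver` function splits the data, hands out tasks, and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Distribute the elements round-robin over the partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """Work done by one 'executor' on one partition: sum of squares."""
    return sum(x * x for x in partition)


def driver(data, num_executors=4):
    """Plan the work, run one task per partition, combine the results."""
    partitions = split_into_partitions(data, num_executors)
    # The pool stands in for executors allocated by a cluster manager.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(task, partitions))
    return sum(partial_results)


total = driver(list(range(1, 101)))
print(total)
```

Each task only sees its own partition, which is the essence of data parallelism: the same function runs on different slices of the data at the same time.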

Key Concepts in Spark

1. RDD (Resilient Distributed Dataset)

RDD is the fundamental data structure in Spark, representing a distributed collection of elements.

Characteristics:

Immutable
Distributed across cluster
Partitioned
Fault tolerant
Supports functional transformations like map, filter, reduce

RDDs are important for low-level control but are not commonly used in modern Spark compared to DataFrames.
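
The characteristics listed above can be illustrated with a small stand-in class. `ToyRDD` is a hypothetical name and this is plain Python, not the PySpark API; it is also evaluated eagerly for simplicity, whereas real RDD transformations are lazy.

```python
from functools import reduce


class ToyRDD:
    """An immutable, partitioned collection with functional operations."""

    def __init__(self, partitions):
        # Tuples make the stored data immutable.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, data, num_partitions=2):
        data = list(data)
        return cls(data[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        # Returns a new ToyRDD; the original is left untouched.
        return ToyRDD([fn(x) for x in p] for p in self._partitions)

    def filter(self, pred):
        return ToyRDD([x for x in p if pred(x)] for p in self._partitions)

    def reduce(self, fn):
        # Reduce inside each partition first, then merge the partial results.
        partials = [reduce(fn, p) for p in self._partitions if p]
        return reduce(fn, partials)

    def collect(self):
        return [x for p in self._partitions for x in p]


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
total = evens_squared.reduce(lambda a, b: a + b)
print(total)
```

In PySpark the same chain would be written against a real RDD with `map`, `filter`, and `reduce`, and the partitions would live on different executors.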

2. DataFrames

A DataFrame is a distributed table with named columns.

It is the most commonly used API in Spark.

Advantages:

Highly optimized
Easy to use (similar to pandas DataFrames or SQL tables)
Uses Catalyst Optimizer
Allows SQL queries
Best for production workloads

3. Dataset (Scala/Java only)

Strongly-typed version of DataFrames used in Scala/Java.

4. DAG (Directed Acyclic Graph)

Spark does not execute operations immediately.

Instead, it builds a DAG representing all transformations.

Two types of operations:

Transformations (lazy)

Examples: map, filter, groupBy

Actions (trigger execution)

Examples: show, count, collect

The DAG is submitted when an action is called.
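
Lazy evaluation is easier to see in a small model. The following sketch is plain Python with a hypothetical `LazyPipeline` class, not Spark's implementation: transformations only record what should happen, and nothing runs until an action such as `collect` or `count` is called.

```python
class LazyPipeline:
    """Records transformations; runs them only when an action is called."""

    def __init__(self, source, ops=()):
        self._source = source      # must be re-iterable, e.g. a range or list
        self._ops = tuple(ops)     # the recorded plan (a linear 'DAG')

    # Transformations: return a new pipeline, do no work.
    def map(self, fn):
        return LazyPipeline(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._ops + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):   # a filter rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger execution of the recorded plan.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def traced_double(x):
    calls.append(x)   # lets us observe when the function really runs
    return x * 2


pipeline = LazyPipeline(range(5)).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)   # still 0: nothing has executed yet
result = pipeline.collect()        # the action runs the whole plan
calls_after_action = len(calls)
print(calls_before_action, result, calls_after_action)
```

Because the full plan is known before anything runs, an engine built this way has the chance to reorder or combine steps before executing them.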

5. Stages and Tasks

When a DAG is executed:

Spark breaks the DAG into stages at shuffle boundaries
Each stage is divided into tasks, one per partition
Tasks run in parallel across executors
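
A rough sketch of this breakdown, in plain Python rather than Spark's scheduler code: the hypothetical `split_into_stages` starts a new stage at every operation that needs a shuffle, and `plan_tasks` creates one task per stage and partition. It assumes, for simplicity, that every stage has the same number of partitions, which is not always true in Spark.

```python
# Operations that require a shuffle (an assumption for this toy model).
WIDE_OPS = {"groupBy", "join", "reduceByKey", "sortBy"}


def split_into_stages(ops):
    """Cut a linear list of operations into stages at each wide operation."""
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)   # close the stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


def plan_tasks(stages, num_partitions):
    """One task per (stage, partition) pair."""
    return [
        (stage_id, partition_id)
        for stage_id in range(len(stages))
        for partition_id in range(num_partitions)
    ]


ops = ["read", "filter", "map", "groupBy", "map", "sortBy", "write"]
stages = split_into_stages(ops)
tasks = plan_tasks(stages, num_partitions=4)
print(stages)
print(len(tasks))
```

Operations inside one stage can run back to back on the same partition without moving data; only the stage boundaries require data exchange.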

Spark Architecture in Detail

1. Driver Program

Creates SparkSession
Builds logical and physical plans
Requests resources from the Cluster Manager and schedules tasks on executors
Handles overall coordination

2. Cluster Manager

Allocates CPU/RAM
Launches executors
Manages resource lifecycle

Types:

Standalone
YARN
Kubernetes
Mesos (deprecated since Spark 3.2)

3. Executors

Run tasks in parallel
Cache RDDs/DataFrames in memory
Return results to the driver

Spark Execution Model (Internal Lifecycle)

Step-by-step internal flow:

User writes Spark code (Python/Scala/SQL).
Driver creates a logical plan.
Catalyst Optimizer optimizes the logical plan.
Spark generates a physical plan.
DAG Scheduler breaks the plan into stages.
Tasks are created for each stage.
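Cluster Manager allocates executors (normally when the application starts, or on demand with dynamic allocation).
Executors run tasks in parallel.
A shuffle moves data between executors at stage boundaries when grouping, joining, or sorting is needed.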
Final results returned or written to storage.
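
The optimization step in this flow can be illustrated with a single rewrite rule. The sketch below is a toy in plain Python, not Catalyst: plans are lists of tuples, and the hypothetical `push_filters_down` moves a filter in front of a preceding column selection, in the spirit of predicate pushdown. Catalyst applies many more rules than this one.

```python
def push_filters_down(plan):
    """Rule: run a filter before the select that precedes it."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            if plan[i][0] == "filter" and plan[i - 1][0] == "select":
                plan[i - 1], plan[i] = plan[i], plan[i - 1]
                changed = True
    return plan


def execute(plan, table):
    """Run a plan; also report how many rows the select step had to touch."""
    rows = []
    rows_touched_by_select = 0
    for step in plan:
        kind = step[0]
        if kind == "scan":
            rows = list(table)
        elif kind == "select":
            columns = step[1]
            rows_touched_by_select += len(rows)
            rows = [{c: row[c] for c in columns} for row in rows]
        elif kind == "filter":
            _, column, value = step
            rows = [row for row in rows if row[column] == value]
    return rows, rows_touched_by_select


# Made-up sample data for the illustration.
TABLE = [
    {"user": "ana", "country": "DE", "amount": 10},
    {"user": "bo", "country": "US", "amount": 25},
    {"user": "cy", "country": "DE", "amount": 7},
    {"user": "di", "country": "FR", "amount": 3},
]

logical_plan = [("scan",), ("select", ("user", "country")), ("filter", "country", "DE")]
optimized_plan = push_filters_down(logical_plan)

naive_rows, naive_work = execute(logical_plan, TABLE)
optimized_rows, optimized_work = execute(optimized_plan, TABLE)
print(optimized_plan)
print(naive_rows == optimized_rows, naive_work, optimized_work)
```

The optimized plan returns the same rows while the later step processes fewer of them, which is the general goal of a query optimizer.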

Spark Ecosystem Components

Spark Core

Foundation of Spark: RDD, scheduling, memory management, fault tolerance.

Spark SQL

Provides DataFrames and SQL engine.

Includes Catalyst Optimizer and Tungsten Execution Engine.

Spark Streaming and Structured Streaming

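Processes data streams in near real time. Structured Streaming, the current API, uses micro-batches by default; the older DStream-based Spark Streaming is considered legacy.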

MLlib

Built-in machine learning library:

Classification
Regression
Clustering
Recommendation
Pipelines

GraphX

Graph computation framework (PageRank, graph analysis).

Spark with Hadoop

Spark does not replace Hadoop; it complements it.

Spark can:

Read data from HDFS
Run on YARN
Write results back to HDFS
Use Hadoop cluster resources

Spark + HDFS is a common architecture used in industry.

Why Spark Is So Fast

In-memory computation

Data stays in RAM instead of writing to disk repeatedly.

DAG optimization

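Because the whole DAG is known before execution, Spark can pipeline consecutive transformations into one stage and avoid materializing intermediate results.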

Catalyst Optimizer

Efficient query planning for SQL/DataFrames.

Tungsten Execution Engine

Highly optimized memory and CPU usage.

Reduced disk usage

Only spills to disk when necessary.
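
The benefit of keeping data in memory shows up most clearly when the same dataset is used by more than one action. The sketch below is a plain-Python toy with a hypothetical `Dataset` class, not Spark's caching mechanism: without `cache()`, every action recomputes the data from the source; with it, the data is computed once and reused.

```python
class Dataset:
    """Recomputes its data for every action unless cache() was called."""

    def __init__(self, compute):
        self._compute = compute
        self._use_cache = False
        self._cached = None

    def cache(self):
        self._use_cache = True
        return self

    def _materialize(self):
        if not self._use_cache:
            return self._compute()           # recompute every time
        if self._cached is None:
            self._cached = self._compute()   # compute once, keep in memory
        return self._cached

    # Two example actions.
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


load_calls = []


def expensive_load():
    load_calls.append(1)   # stands in for a slow read plus transformations
    return [x * x for x in range(1000)]


uncached = Dataset(expensive_load)
uncached.count()
uncached.total()
loads_without_cache = len(load_calls)

load_calls.clear()
cached = Dataset(expensive_load).cache()
cached.count()
cached.total()
loads_with_cache = len(load_calls)
print(loads_without_cache, loads_with_cache)
```

This is why caching matters most for iterative algorithms and interactive analysis, where the same data is touched many times.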

Spark API Layers (Beginner Overview)

1. RDD API

Low-level, functional programming style.

Good for custom logic and control.

2. DataFrame API

High-level, SQL-like operations.

Most commonly used.

3. Spark SQL

Execute SQL queries directly using Spark engine.

4. Streaming API

For real-time data streams.

Shuffling in Spark (Critical for Internals)

Shuffle happens in operations like:

groupBy
reduceByKey
join
sortBy

Shuffle involves:

moving data across executors
sorting
repartitioning

It is expensive and should be minimized.
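
To see why shuffles are costly and how they can be reduced, consider this plain-Python sketch (a toy model, not Spark's shuffle implementation; all function names are made up). Records are routed to a reduce partition by hashing the key, so all values for one key end up together. Pre-aggregating on the map side before the shuffle, which is what `reduceByKey` does in Spark, cuts the number of records that have to move.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # crc32 is stable across runs, unlike Python's salted hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(map_outputs, num_partitions):
    """Route every record to its reduce partition; count records moved."""
    reduce_inputs = [[] for _ in range(num_partitions)]
    moved = 0
    for records in map_outputs:
        for key, value in records:
            reduce_inputs[partition_for(key, num_partitions)].append((key, value))
            moved += 1
    return reduce_inputs, moved


def combine_locally(records):
    """Map-side combine: sum values per key within one map partition."""
    sums = defaultdict(int)
    for key, value in records:
        sums[key] += value
    return list(sums.items())


def reduce_by_key(reduce_inputs):
    totals = {}
    for records in reduce_inputs:
        for key, value in records:
            totals[key] = totals.get(key, 0) + value
    return totals


# Output of two hypothetical map partitions.
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]

naive_inputs, naive_moved = shuffle(map_outputs, 2)
combined_inputs, combined_moved = shuffle(
    [combine_locally(p) for p in map_outputs], 2
)
print(reduce_by_key(naive_inputs) == reduce_by_key(combined_inputs))
print(naive_moved, combined_moved)
```

The final counts are identical, but fewer records cross the shuffle in the combined version. On real data with many repeated keys the saving can be much larger.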

Spark Deployment Modes

Local Mode

Runs on a single machine with multiple threads.

Standalone Mode

Spark manages its own cluster.

YARN Mode

Uses Hadoop’s resource manager.

Kubernetes Mode

Popular in cloud environments.

Fault Tolerance in Spark

Spark achieves fault tolerance via:

RDD lineage (recomputing lost partitions)
Replication of cached data (when a replicated storage level such as MEMORY_ONLY_2 is used)
Re-running failed tasks on other executors
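
Lineage-based recovery can be modeled in a few lines. The sketch below is plain Python with a hypothetical `LineageDataset` class, not Spark internals: the dataset remembers its source partitions and the functions applied to them, so a lost partition can be rebuilt by replaying those functions on just that partition.

```python
class LineageDataset:
    """Remembers how each partition is derived so it can be recomputed."""

    def __init__(self, source_partitions, transforms=()):
        self._source = source_partitions     # stable input, e.g. files on HDFS
        self._transforms = tuple(transforms)  # the lineage
        self._computed = {}                   # partition index -> rows in memory
        self.compute_log = []                 # which partitions were (re)computed

    def map(self, fn):
        return LineageDataset(self._source, self._transforms + (fn,))

    def compute_partition(self, idx):
        if idx not in self._computed:
            rows = self._source[idx]
            for fn in self._transforms:
                rows = [fn(r) for r in rows]
            self._computed[idx] = rows
            self.compute_log.append(idx)
        return self._computed[idx]

    def lose_partition(self, idx):
        """Simulate an executor failure that drops one partition."""
        self._computed.pop(idx, None)

    def collect(self):
        return [
            row
            for idx in range(len(self._source))
            for row in self.compute_partition(idx)
        ]


source = [[1, 2], [3, 4], [5, 6]]
dataset = LineageDataset(source).map(lambda x: x * 10).map(lambda x: x + 1)

before_failure = dataset.collect()
dataset.lose_partition(1)
after_failure = dataset.collect()   # only partition 1 is recomputed
print(before_failure == after_failure, dataset.compute_log)
```

Only the lost partition is recomputed; the partitions that survived are reused as they are.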

Spark Beginner Roadmap

Understand Spark architecture
Learn RDD basics
Move to DataFrames
Learn Spark SQL
Practice joins, groupBy, aggregations
Understand shuffle
Learn caching and persistence
Practice reading/writing to HDFS
Learn basic Spark Streaming
Build a mini project