Spark Overview

6 min readEdit on GitHub

Spark Overview

Apache Spark is a modern, high-performance, distributed data processing engine designed to work with big data. It allows developers to process large datasets quickly using parallel computation across clusters.
Spark was created to overcome the limitations of MapReduce, especially its slow, disk-heavy processing model. Spark performs most operations in memory, making it dramatically faster, easier to use, and suitable for a broader range of applications, including streaming, machine learning, graph processing, and SQL analytics.
Spark is widely used in companies for large-scale data analytics, real-time pipelines, ETL, and machine learning workflows.

Why Spark is Needed

Big data creates challenges that single-machine programs cannot handle:
  • Data volumes exceed RAM and disk limits of one computer
  • Processing must run in parallel
  • Jobs must be fault tolerant
  • Processing must support batch, streaming, and ML workloads
  • Developers need simple APIs (SQL, Python, DataFrames)
Spark meets all of these requirements using a distributed cluster architecture.

Spark vs MapReduce

FeatureSparkMapReduce
Processing modelIn-memoryDisk-based
Speed10x to 100x fasterSlow because of constant disk I/O
Ease of useHigh (Python, Scala, SQL, DataFrames)Low (Java-heavy, verbose)
Real-time processingYes (Structured Streaming)No
Machine learning supportBuilt-in MLlibNot built-in
Iterative algorithmsVery fastVery slow
API flexibilitySQL, DataFrames, RDD, StreamingOnly Map/Reduce paradigm
Fault toleranceYesYes
Usage todayVery highMostly legacy
Key insight:
MapReduce writes intermediate results to HDFS after every step, which slows it down. Spark keeps data in memory, avoiding repeated disk operations.

Understanding Spark at a High Level

Spark operates on clusters of machines. It distributes data across multiple nodes and executes processing tasks in parallel.
Spark consists of three main actors:
  1. Driver Program
  • This is the main application that you write.
  • It creates SparkSessions, builds execution plans, and coordinates workers.
  1. Cluster Manager
  • Allocates resources to Spark.
  • Examples: YARN, Kubernetes, Standalone.
  1. Executors
  • Worker processes that run tasks.
  • They perform actual data processing.
  • They store data in memory for fast reuse.
Spark is designed around the idea of data parallelism and distributed execution.

Key Concepts in Spark

1. RDD (Resilient Distributed Dataset)

RDD is the fundamental data structure in Spark, representing a distributed collection of elements.
Characteristics:
  • Immutable
  • Distributed across cluster
  • Partitioned
  • Fault tolerant
  • Supports functional transformations like map, filter, reduce
RDDs are important for low-level control but are not commonly used in modern Spark compared to DataFrames.

2. DataFrames

A DataFrame is a distributed table with named columns.
It is the most commonly used API in Spark.
Advantages:
  • Highly optimized
  • Easy to use (similar to Pandas SQL tables)
  • Uses Catalyst Optimizer
  • Allows SQL queries
  • Best for production workloads

3. Dataset (Scala/Java only)

Strongly-typed version of DataFrames used in Scala/Java.

4. DAG (Directed Acyclic Graph)

Spark does not execute operations immediately.
Instead, it builds a DAG representing all transformations.
Two types of operations:
  • Transformations (lazy)
Examples: map, filter, groupBy
  • Actions (trigger execution)
Examples: show, count, collect
The DAG is submitted when an action is called.

5. Stages and Tasks

When a DAG is executed:
  • Spark breaks the DAG into stages
  • Stages are further divided into tasks
  • Tasks run in parallel across executors

Spark Architecture in Detail

1. Driver Program

  • Creates SparkSession
  • Builds logical and physical plans
  • Sends tasks to Cluster Manager
  • Handles overall coordination

2. Cluster Manager

  • Allocates CPU/RAM
  • Launches executors
  • Manages resource lifecycle
Types:
  • Standalone
  • YARN
  • Kubernetes
  • Mesos

3. Executors

  • Run tasks in parallel
  • Cache RDDs/DataFrames in memory
  • Return results to the driver

Spark Execution Model (Internal Lifecycle)

Step-by-step internal flow:
  1. User writes Spark code (Python/Scala/SQL).
  2. Driver creates a logical plan.
  3. Catalyst Optimizer optimizes the logical plan.
  4. Spark generates a physical plan.
  5. DAG Scheduler breaks the plan into stages.
  6. Tasks are created for each stage.
  7. Cluster Manager allocates executors.
  8. Executors run tasks in parallel.
  9. Shuffle occurs if grouping/joining is needed.
  10. Final results returned or written to storage.

Spark Ecosystem Components

Spark Core

Foundation of Spark: RDD, scheduling, memory management, fault tolerance.

Spark SQL

Provides DataFrames and SQL engine.
Includes Catalyst Optimizer and Tungsten Execution Engine.

Spark Streaming and Structured Streaming

Allows real-time data processing using micro-batches.

MLlib

Built-in machine learning library:
  • Classification
  • Regression
  • Clustering
  • Recommendation
  • Pipelines

GraphX

Graph computation framework (PageRank, graph analysis).

Spark with Hadoop

Spark does not replace Hadoop; it complements it.
Spark can:
  • Read data from HDFS
  • Run on YARN
  • Write results back to HDFS
  • Use Hadoop cluster resources
Spark + HDFS is a common architecture used in industry.

Why Spark Is So Fast

  1. In-memory computation
Data stays in RAM instead of writing to disk repeatedly.
  1. DAG optimization
Eliminates unnecessary operations and reorganizes tasks.
  1. Catalyst Optimizer
Efficient query planning for SQL/DataFrames.
  1. Tungsten Execution Engine
Highly optimized memory and CPU usage.
  1. Reduced disk usage
Only spills to disk when necessary.

Spark API Layers (Beginner Overview)

1. RDD API

Low-level, functional programming style.
Good for custom logic and control.

2. DataFrame API

High-level, SQL-like operations.
Most commonly used.

3. Spark SQL

Execute SQL queries directly using Spark engine.

4. Streaming API

For real-time data streams.

Shuffling in Spark (Critical for Internals)

Shuffle happens in operations like:
  • groupBy
  • reduceByKey
  • join
  • sortBy
Shuffle involves:
  • moving data across executors
  • sorting
  • repartitioning
It is expensive and should be minimized.

Spark Deployment Modes

  1. Local Mode
Runs on a single machine with multiple threads.
  1. Standalone Mode
Spark manages its own cluster.
  1. YARN Mode
Uses Hadoop’s resource manager.
  1. Kubernetes Mode
Popular in cloud environments.

Fault Tolerance in Spark

Spark achieves fault tolerance via:
  1. RDD lineage (recomputing lost partitions)
  2. Replication (when caching)
  3. Re-running failed tasks on other executors

Spark Beginner Roadmap

  1. Understand Spark architecture
  2. Learn RDD basics
  3. Move to DataFrames
  4. Learn Spark SQL
  5. Practice joins, groupBy, aggregations
  6. Understand shuffle
  7. Learn caching and persistence
  8. Practice reading/writing to HDFS
  9. Learn basic Spark Streaming
  10. Build a mini project