Spark Overview
6 min readEdit on GitHub
Spark Overview
Apache Spark is a modern, high-performance, distributed data processing engine designed to work with big data. It allows developers to process large datasets quickly using parallel computation across clusters.
Spark was created to overcome the limitations of MapReduce, especially its slow, disk-heavy processing model. Spark performs most operations in memory, making it dramatically faster, easier to use, and suitable for a broader range of applications, including streaming, machine learning, graph processing, and SQL analytics.
Spark is widely used in companies for large-scale data analytics, real-time pipelines, ETL, and machine learning workflows.
Why Spark is Needed
Big data creates challenges that single-machine programs cannot handle:
- Data volumes exceed RAM and disk limits of one computer
- Processing must run in parallel
- Jobs must be fault tolerant
- Processing must support batch, streaming, and ML workloads
- Developers need simple APIs (SQL, Python, DataFrames)
Spark meets all of these requirements using a distributed cluster architecture.
Spark vs MapReduce
| Feature | Spark | MapReduce |
|---|---|---|
| Processing model | In-memory | Disk-based |
| Speed | 10x to 100x faster | Slow because of constant disk I/O |
| Ease of use | High (Python, Scala, SQL, DataFrames) | Low (Java-heavy, verbose) |
| Real-time processing | Yes (Structured Streaming) | No |
| Machine learning support | Built-in MLlib | Not built-in |
| Iterative algorithms | Very fast | Very slow |
| API flexibility | SQL, DataFrames, RDD, Streaming | Only Map/Reduce paradigm |
| Fault tolerance | Yes | Yes |
| Usage today | Very high | Mostly legacy |
Key insight:
MapReduce writes intermediate results to HDFS after every step, which slows it down. Spark keeps data in memory, avoiding repeated disk operations.
Understanding Spark at a High Level
Spark operates on clusters of machines. It distributes data across multiple nodes and executes processing tasks in parallel.
Spark consists of three main actors:
- Driver Program
- This is the main application that you write.
- It creates SparkSessions, builds execution plans, and coordinates workers.
- Cluster Manager
- Allocates resources to Spark.
- Examples: YARN, Kubernetes, Standalone.
- Executors
- Worker processes that run tasks.
- They perform actual data processing.
- They store data in memory for fast reuse.
Spark is designed around the idea of data parallelism and distributed execution.
Key Concepts in Spark
1. RDD (Resilient Distributed Dataset)
RDD is the fundamental data structure in Spark, representing a distributed collection of elements.
Characteristics:
- Immutable
- Distributed across cluster
- Partitioned
- Fault tolerant
- Supports functional transformations like map, filter, reduce
RDDs are important for low-level control but are not commonly used in modern Spark compared to DataFrames.
2. DataFrames
A DataFrame is a distributed table with named columns.
It is the most commonly used API in Spark.
Advantages:
- Highly optimized
- Easy to use (similar to Pandas SQL tables)
- Uses Catalyst Optimizer
- Allows SQL queries
- Best for production workloads
3. Dataset (Scala/Java only)
Strongly-typed version of DataFrames used in Scala/Java.
4. DAG (Directed Acyclic Graph)
Spark does not execute operations immediately.
Instead, it builds a DAG representing all transformations.
Two types of operations:
- Transformations (lazy)
Examples: map, filter, groupBy
- Actions (trigger execution)
Examples: show, count, collect
The DAG is submitted when an action is called.
5. Stages and Tasks
When a DAG is executed:
- Spark breaks the DAG into stages
- Stages are further divided into tasks
- Tasks run in parallel across executors
Spark Architecture in Detail
1. Driver Program
- Creates SparkSession
- Builds logical and physical plans
- Sends tasks to Cluster Manager
- Handles overall coordination
2. Cluster Manager
- Allocates CPU/RAM
- Launches executors
- Manages resource lifecycle
Types:
- Standalone
- YARN
- Kubernetes
- Mesos
3. Executors
- Run tasks in parallel
- Cache RDDs/DataFrames in memory
- Return results to the driver
Spark Execution Model (Internal Lifecycle)
Step-by-step internal flow:
- User writes Spark code (Python/Scala/SQL).
- Driver creates a logical plan.
- Catalyst Optimizer optimizes the logical plan.
- Spark generates a physical plan.
- DAG Scheduler breaks the plan into stages.
- Tasks are created for each stage.
- Cluster Manager allocates executors.
- Executors run tasks in parallel.
- Shuffle occurs if grouping/joining is needed.
- Final results returned or written to storage.
Spark Ecosystem Components
Spark Core
Foundation of Spark: RDD, scheduling, memory management, fault tolerance.
Spark SQL
Provides DataFrames and SQL engine.
Includes Catalyst Optimizer and Tungsten Execution Engine.
Spark Streaming and Structured Streaming
Allows real-time data processing using micro-batches.
MLlib
Built-in machine learning library:
- Classification
- Regression
- Clustering
- Recommendation
- Pipelines
GraphX
Graph computation framework (PageRank, graph analysis).
Spark with Hadoop
Spark does not replace Hadoop; it complements it.
Spark can:
- Read data from HDFS
- Run on YARN
- Write results back to HDFS
- Use Hadoop cluster resources
Spark + HDFS is a common architecture used in industry.
Why Spark Is So Fast
- In-memory computation
Data stays in RAM instead of writing to disk repeatedly.
- DAG optimization
Eliminates unnecessary operations and reorganizes tasks.
- Catalyst Optimizer
Efficient query planning for SQL/DataFrames.
- Tungsten Execution Engine
Highly optimized memory and CPU usage.
- Reduced disk usage
Only spills to disk when necessary.
Spark API Layers (Beginner Overview)
1. RDD API
Low-level, functional programming style.
Good for custom logic and control.
2. DataFrame API
High-level, SQL-like operations.
Most commonly used.
3. Spark SQL
Execute SQL queries directly using Spark engine.
4. Streaming API
For real-time data streams.
Shuffling in Spark (Critical for Internals)
Shuffle happens in operations like:
- groupBy
- reduceByKey
- join
- sortBy
Shuffle involves:
- moving data across executors
- sorting
- repartitioning
It is expensive and should be minimized.
Spark Deployment Modes
- Local Mode
Runs on a single machine with multiple threads.
- Standalone Mode
Spark manages its own cluster.
- YARN Mode
Uses Hadoop’s resource manager.
- Kubernetes Mode
Popular in cloud environments.
Fault Tolerance in Spark
Spark achieves fault tolerance via:
- RDD lineage (recomputing lost partitions)
- Replication (when caching)
- Re-running failed tasks on other executors
Spark Beginner Roadmap
- Understand Spark architecture
- Learn RDD basics
- Move to DataFrames
- Learn Spark SQL
- Practice joins, groupBy, aggregations
- Understand shuffle
- Learn caching and persistence
- Practice reading/writing to HDFS
- Learn basic Spark Streaming
- Build a mini project