What Is Big Data

10 min readEdit on GitHub

What is Big Data

Big Data refers to extremely large and complex datasets that cannot be easily stored, processed, or analyzed using traditional data processing tools (like relational databases) or a single machine.
But Big Data is not only about “size”. It is a concept that describes data with the following challenges:
  • Too large to store on a single machine
  • Too fast to process with conventional tools
  • Too varied in structure/form to fit into old systems
In simple terms:
Anything that goes beyond the capacity of your current tools to efficiently store, manage, and analyze — THAT becomes Big Data.
Example:
A bank receives thousands of transactions every second, a social media platform collects millions of likes, comments, and videos per minute — that is Big Data.

Why “Big”? Is it only about size?

No. Big Data involves three major limitations:

1. Storage Limitation

If you try to store terabytes of data on a single machine, it’ll get slow, costly, and fail under load. Big Data systems (like Hadoop) store data across distributed clusters, not one system.

2. Processing Limitation

Traditional SQL systems follow a vertical scaling model (bigger machines). Big Data processing runs in parallel across many cheaper machines — this is called horizontal scaling.

3. Analysis Limitation

Traditional DBs are limited to structured data and require schema beforehand. Big Data systems support structured, semi-structured, and unstructured data — meaning you can store logs, social media text, images, videos, anything together.

Key Characteristics of Big Data (Expanding the 5 Vs)

1. Volume

  • Data size is enormous, ranging from terabytes to petabytes.
  • Example: Facebook generates over 4 petabytes of data per day.

2. Velocity

  • Data flows in at high speed — sometimes in milliseconds.
  • Example: Stock market feeds, sensor data, GPS events.

3. Variety

  • Data comes in many formats:
  • Structured: Tables, spreadsheets
  • Semi-structured: JSON files, XML
  • Unstructured: Images, videos, text, click logs

4. Veracity

  • Data can be messy or incomplete.
  • Big Data systems must identify and clean junk, duplicates, spam.

5. Value

  • The purpose of handling Big Data is to derive business value that drives decision making:
  • Predicting customer behavior
  • Detecting fraud
  • Generating personalized recommendations

How Traditional and Big Data Systems Differ

FeatureTraditional SystemsBig Data Systems
Data ScaleMBs to GBsTBs to PBs and beyond
HardwareSingle server, vertical scalingCommodity cluster, horizontal scaling
Data TypeStructured onlyAny type (structured, semi, unstructured)
ProcessingSerial / centralizedParallel / distributed
Query ExecutionEverything processed upfrontLazy evaluation, on-demand

Where Does Big Data Come From?

  • Social Media: Likes, comments, shares
  • IoT Devices: Sensors, smart homes, wearables
  • Logs: App logs, server logs, clickstreams
  • E-Commerce: User sessions, checkout, click patterns
  • Healthcare: Patient data, scans, trackers
  • Finance: ATM transactions, trading data

Simple Example of Big Data

Imagine Flipkart during a Big Billion Day sale:
  • Millions of users browse and order simultaneously.
  • Each click, add-to-cart, payment, delivery update generates data.
  • They need systems that can:
  • Store massive data instantly
  • Process user behavior real-time
  • Show personalized offers
  • Detect fraudulent transactions
Traditional tools cannot handle this scale and dynamic, so Big Data architecture is used.

Big Data in One Line

Big Data is your massive, fast-moving, varied, raw data that builds powerful insights through distributed storage and parallel processing systems.

Big Data Technologies and Tools Overview

Big Data solutions are not a single software — they are built using multiple tools working together. Think of it like an organised system with different layers and tools doing different jobs like storing, processing, movement, querying, and visualization of massive datasets.
Let’s break it down into logical layers that make sense:

1. Data Storage Technologies

Big Data requires a special kind of storage that is:
  • Distributed (data is stored in chunks across multiple machines)
  • Fault-tolerant (if one machine fails, you don’t lose anything)
  • Scalable (you can add machines anytime to grow storage)

Main Technologies:

HDFS (Hadoop Distributed File System)

  • The backbone of most on-premise Big Data systems
  • Stores very large files across multiple nodes
  • Automatically replicates files for fault-tolerance

S3 (Amazon Simple Storage Service)

  • Cloud storage service used widely for Big Data storage
  • Durable and scalable — commonly used with modern Big Data processing tools

Google Cloud Storage, Azure Blob Storage

  • Same concept as S3 but by Google and Microsoft respectively
  • Support integration with cloud Big Data processing tools

2. Data Processing Frameworks

This is where raw data is transformed, analyzed, aggregated, and prepared for insights.

Apache Hadoop MapReduce

  • One of the first Big Data processing frameworks
  • Works by breaking tasks into smaller subtasks and running them in parallel
  • Comparatively slow due to disk-based processing

Apache Spark

  • Most popular Big Data processing engine today
  • In-memory processing (much faster than MapReduce)
  • Can handle machine learning, real-time data, and SQL operations
  • Supports multiple languages: Python, Java, Scala

Apache Hive

  • Provides SQL-like interface for Big Data
  • Used to perform batch queries on large datasets stored in HDFS or object storage
  • Not for real-time; used majorly for data warehousing

Apache Pig

  • Scripting platform for data transformations on Hadoop
  • Uses a language called Pig Latin for ETL operations
  • Less used today, replaced by Spark and Hive in most setups

3. Real-Time Data Ingestion and Processing

Data that comes in continuously (like logs, IoT sensor data, streaming data) needs special tools:

Apache Kafka

  • Distributed messaging and data streaming platform
  • Acts like a backbone to send and receive data between systems
  • Handles millions of events per second
  • Used by Netflix, Uber, LinkedIn, etc.

Apache Spark Streaming

  • Extension of Spark
  • Allows real-time data processing in small micro-batches
  • Can pull from Kafka, sockets, etc. and run transformations
  • Designed for true real-time processing (event-by-event)
  • Low-latency, highly scalable

4. NoSQL Databases (Non-Relational Databases)

These aren't like SQL databases. They’re designed to handle schema-less, unstructured, or semi-structured data at scale with fast read/write operations.

MongoDB

  • Document-based database (stores JSON-like documents)
  • Flexible schema
  • Good for real-time apps and analytics databases

Apache Cassandra

  • Highly available, scalable wide-column store
  • Runs across multiple data centers very easily
  • Best for write-heavy workloads (e.g., logging systems)

Redis

  • In-memory data store, used for caching and fast lookups
  • Useful when you need very quick access

5. Query and Data Exploration Tools

These tools help run SQL-like queries on Big Data without writing complex code:

Presto (now Trino)

  • Distributed SQL query engine
  • Can query data from HDFS, S3, Kafka, Cassandra — all in one place
  • Very fast, used by Facebook, Netflix, Uber

Apache Impala

  • SQL engine built for Hadoop
  • Supports interactive queries on large datasets

Apache Drill

  • Schema-free SQL engine
  • You can query directly on files like JSON, Parquet without ETL

6. Machine Learning and Analytics Tools

Apache Spark MLlib

  • Machine learning library within Spark
  • Offers scalable ML algorithms like regression, clustering, etc.

TensorFlow + Big Data

  • Can be used with Big Data pipelines for deep learning workloads

H2O.ai

  • Built for scalable machine learning on Big Data systems

7. Big Data Orchestration & Workflow Management

To automate workflows and manage data pipelines:

Apache Airflow

  • Workflow scheduler and orchestrator
  • Helps automate ETL pipelines, tasks, and monitoring

Azkaban, Oozie

  • DAG-based workflow engines for Hadoop pipelines

8. Visualization and Reporting Tools

After processing Big Data, you need dashboards and reports:

Tableau, Power BI

  • Business intelligence dashboard tools
  • Connect well with Big Data systems through connectors

Apache Superset

  • Open-source alternative to Tableau
  • Works on SQL databases, Presto, Hive, and others

Putting it All Together — A Big Data Stack Example

A real-world Big Data system may have:
text
Data Sources -> Kafka -> Spark -> S3 -> Hive -> Presto -> Tableau
Each piece plays a role — one stores, one analyzes, one queries, one displays.

Why Traditional Systems Can't Handle Big Data?

Traditional systems (like relational databases, single server setups, or local storage) are built for structured, moderate-sized data on single machines. But Big Data systems need to handle extremely large, fast, and varied data — across multiple machines, and that’s where traditional systems fail.
Let’s understand the limitations point by point.

1. Limited Storage Capacity

Traditional systems are usually hosted on a single server or vertically scaled machines. But Big Data requires terabytes to petabytes of storage — and traditional machines just can't handle this scale.
Example:
  • A typical SQL server might handle up to a few GBs before performance drops.
  • What if a government stores 10 years of CCTV footage or medical scans for millions of patients? Traditional systems can't store this reliably.
Big Data systems like HDFS store data distributed across many machines, so storage grows as needed.

2. Cannot Handle Data Variety

Traditional databases (like MySQL, Oracle) are made for structured data — tables with fixed columns and rows.
Big Data includes:
  • Structured data (tables)
  • Semi-structured data (JSON, XML)
  • Unstructured data (videos, logs, images, audio, social media text)
Example:
  • Instagram stores pictures, videos, comments, tags, messages — all kinds of data. Traditional databases can't uniformly store this chaos.
Big Data tools like MongoDB, HDFS, and Kafka can handle any type of data.

3. Slow Processing for Large Data

Traditional RDBMS process data in a single-threaded, monolithic fashion. They ingest and process row by row.
Big Data requires:
  • Parallel processing across thousands of nodes
  • Fast in-memory computation (not disk-based)
Example:
  • A bank analyzing 10 years of transactions to detect fraud patterns would take days with traditional systems.
  • With Big Data (Spark + distributed storage), the same task finishes in minutes.

4. Scalability Is Hard and Expensive

Traditional systems scale vertically — i.e., you need bigger servers (more RAM, CPU, SSD) which becomes increasingly costly and infeasible.
Big Data systems scale horizontally — you just add more commodity machines to the cluster and you're good.
Example:
  • Want to handle double the data load? Traditional: Buy a $50,000 server.
  • Big Data: Add new $500 nodes to the existing cluster. Easy and cheap.
Big Data = Scale out, not scale up.

5. Real-Time Processing Limitations

Most traditional systems are designed for batch processing — i.e., process when data is available.
Big Data requires real-time analytics:
  • Fraud detection in banks
  • Traffic updates on Google Maps
  • Stock price prediction
Traditional systems can't ingest and process streaming data as it arrives. That’s why tools like Kafka + Spark Streaming + Flink exist.

6. High Costs and Maintenance

Traditional database licensing and scaling costs are extremely high. There’s a limit after which adding resources doesn’t improve performance significantly.
Big Data systems often use open-source, commodity hardware and cloud storage — making them cheaper to build and maintain.

Real-Life Systems That Cannot Use Traditional Approaches

Use CaseWhy Traditional Fails
Netflix RecommendationHuge volume, needs real-time processing
Uber Ride AllocationNeeds millisecond-level matching logic
Flipkart Sale DayMillions of transactions, massive logs per second
Smart City IoT sensorsContinuous data feed from thousands of devices