Unit I: Big Data Notes (English Summary)

1. Introduction to Big Data

Big Data refers to very large and complex datasets that cannot be processed by ordinary database systems. These datasets originate from social media, sensors, transactions, and logs.

Characteristics (5 Vs)

Volume – Extremely large amounts, from terabytes to zettabytes.
Velocity – Data arrives rapidly (real-time or near real-time).
Variety – Includes structured, semi-structured, and unstructured data.
Veracity – The quality and trustworthiness of data.
Value – Extracting useful information to support decision-making.

Example: Twitter generates millions of tweets daily (high velocity and volume). Video streaming apps produce petabytes of data (volume and variety).

2. Introduction to Big Data Platform

A Big Data platform is an environment where data is stored, managed, and analyzed. Key components include:

Distributed Storage (HDFS, S3): Splitting data across many machines.
Processing Frameworks (MapReduce, Spark): Parallel processing of data.
Data Ingestion Tools (Flume, Kafka): Bringing data into the system.
Query Engines (Hive, Impala): Providing SQL-like access.
Workflow Orchestration (Oozie, Airflow).

Real World Use: Netflix stores user activity logs in HDFS and processes them with Spark to recommend movies.

3. Challenges of Conventional Systems

Traditional RDBMS face issues with Big Data because of:

Scalability: Vertical scaling is expensive.
Performance: Queries slow down on large tables.
Data Variety: Rigid schemas make unstructured data hard.
Cost: Licensing and hardware expenses rise quickly.

Example: A bank’s RDBMS handles customer records well, but struggles when data from ATMs, mobile apps, and fraud sensors flood in.

4. Intelligent Data Analysis

This is the process of deriving insights, patterns, or predictions from data using statistics and machine learning.

Descriptive Analysis: What happened? (e.g., sales report)
Predictive Analysis: What will happen? (e.g., demand forecast)
Prescriptive Analysis: What should be done? (e.g., action suggestions)

Techniques: data mining, classification, clustering, regression, deep learning.

Example: E-commerce sites predict next product purchase (predictive). Fraud detection classifies transactions as legitimate or fraud.

5. Nature of Data

Types

Structured: Fixed schema (tables, CSV). Example: bank transactions.
Semi-structured: Some tags, flexible schema (JSON, XML). Example: log files.
Unstructured: No format (text, images, videos). Example: customer reviews.

Understanding data type guides storage and tool choices.

6. Analytic Processes and Tools

Pipeline steps:

Data Collection: Gather from sources.
Data Cleaning: Remove errors, handle missing values.
Data Transformation: Convert into usable format.
Modeling/Analysis: Apply statistical or ML models.
Evaluation: Check model results.
Deployment/Reporting: Share results via dashboards or reports.

Common Tools

Hadoop Ecosystem: HDFS, MapReduce, Hive, Pig.
Apache Spark: Fast in-memory processing.
NoSQL DBs: MongoDB, Cassandra (flexible schemas).
BI Tools: Tableau, Power BI.
Programming: Python, R.

Real-world pipeline: Retail company collects POS data, cleans it, loads into HDFS, runs Spark jobs for daily sales trends, and visualizes in Tableau.

7. Analysis vs Reporting

Item	Analysis	Reporting
Purpose	Extract insights, test hypotheses	Summarize past events
Data	Raw, detailed data	Aggregated data
Tools	Statistical/ML tools	BI/reporting software
Output	Models, patterns, predictions	Tables, charts, dashboards

Example: An analyst uses Spark to segment customers (analysis). Marketing gets a monthly sales dashboard (reporting).

Big Data is very large, fast, and varied data that conventional systems cannot handle. A platform with distributed storage and processing enables analysis, turning raw data into useful knowledge. Understanding data types and analytic processes is essential. Analysis goes beyond reporting by uncovering new patterns and predictions.