Course Outline

4 min readEdit on GitHub

Course Outline

Module 1: Foundations of Data Systems

1.1 Understanding Data

  • What is data?
  • How raw data becomes useful information
  • Types of data:
  • Structured (e.g., tables)
  • Unstructured (e.g., images, videos)
  • Semi-structured (e.g., JSON, XML)
  • Simple examples from daily life

1.2 Systems of Record vs Systems of Analysis

  • What is OLTP (online transaction processing)?
  • What is OLAP (online analytical processing)?
  • Difference in purpose (day-to-day usage vs insights and reporting)

Module 2: Databases and Their Design

2.1 What is a Database?

  • What is a database and why we use it
  • Difference between storing data in normal files vs database systems

2.2 DBMS (Database Management System)

  • What is a DBMS?
  • Basic DBMS functions: storing, querying, updating, managing data

2.3 Types of Databases

  • Relational Databases (e.g., MySQL, PostgreSQL): store data in tables
  • NoSQL Databases (short overview only):
  • Key-value stores (e.g., Redis)
  • Document stores (e.g., MongoDB)
  • Column-based stores (e.g., Cassandra)
  • Graph-based stores (e.g., Neo4j)

2.4 Overview of Strengths & Limitations

  • Relational databases work well with structured data, but can struggle with very large or flexible datasets
  • NoSQL databases handle large, flexible or fast-changing data, but may not fully support transactions in all cases

Module 3: System Scalability and Distributed Architecture

3.1 Scaling Basics

  • Vertical vs horizontal scaling
  • Load balancing basics

3.2 Distributed Systems Concepts

  • What makes a system “distributed”?
  • CAP Theorem (Consistency, Availability, Partition tolerance)
  • Data replication and sharding
  • Fault tolerance

Module 4: Linux Essentials for Data Engineering

4.1 Introduction to Linux

  • Why Linux for data platforms
  • Differences from Windows OS

4.2 Key Concepts

  • File system hierarchy
  • Permissions and ownership
  • Processes and networking basics

4.3 Essential Commands

  • Navigating file system (ls, cd, pwd)
  • File operations (cat, cp, mv)
  • User and permissions (chmod, chown)
  • Service management and monitoring

Module 5: Hadoop Ecosystem Essentials

5.1 Introduction to Hadoop

  • Why Hadoop was invented
  • Hadoop architectural overview

5.2 Key Components

  • HDFS: Storage layer fundamentals
  • YARN: Resource management
  • MapReduce: Parallel processing (conceptual only)

5.3 Hadoop File System Basics

  • Block storage
  • Replication and fault tolerance

5.4 Practical Setup

  • Running Hadoop on WSL
  • Getting familiar with HDFS commands

Module 6: Apache Spark for Data Processing

6.1 Introduction to Spark

  • What is Apache Spark?
  • Why Spark replaced MapReduce

6.2 Spark Building Blocks

  • RDDs (Resilient Distributed Datasets)
  • DataFrames and Datasets
  • Spark SQL overview

6.3 Spark Execution Model

  • DAGs (Directed Acyclic Graphs)
  • Lazy execution and fault tolerance

6.4 Running Spark on Hadoop

  • Configuring Spark with Hadoop (YARN cluster mode on WSL)
  • Using Spark to process data from HDFS

Module 7: Introduction to Cloud Computing

7.1 Why Cloud for Big Data

  • Cloud-native vs on-premise infrastructure
  • Elastic scaling and cost models

7.2 AWS Overview

  • Understanding billing and free tier limits
  • Core services for data:
  • S3
  • EC2
  • IAM
  • Glue (optional)

7.3 Setting Up on AWS Free Tier

  • Create account and secure it
  • Build an S3 bucket for storage
  • Deploy a small EC2 instance for data pipeline testing

Module 8: Cloud-Based Data Engineering Workflow

8.1 Storage Layer

  • Storing raw data in AWS S3

8.2 Compute Layer

  • Running Spark on EC2 (or local setup with WSL)
  • Using AWS Glue or PySpark scripts on cloud

8.3 Data Delivery

  • Exporting processed data for visualization
  • Integration with Power BI / QuickSight (optional)

Module 9: Putting It All Together

9.1 End-to-End Pipeline

  • Data ingestion → Storage → Processing → Serving
  • Local Hadoop + Spark + S3 workflow

9.2 Deployment Strategy

  • On WSL for local
  • On AWS for cloud

9.3 Cost Optimization

  • Controlling usage within free tier
  • Best practices for storage, compute, and transfers