Chapter 13: Big Data & Scalability (optional but valuable)

Big Data & Scalability, explained like we're wrapping up our marathon session in Hyderabad (it's around 5:53 PM IST on January 29, 2026, sun's down, lights on in Hi-Tech City, and we're talking about the stuff that handles real scale). This chapter is optional because most entry-to-mid-level DS roles in India (2026) still run on Pandas/NumPy plus cloud notebooks for datasets up to roughly 10–50 GB. But once you hit production pipelines, telecom logs, UPI transaction streams, or customer behavior at scale (think Jio's millions of daily recharges, PhonePe's fraud logs, or Flipkart's clickstreams), Pandas hits out-of-memory errors and crashes, and that's where Spark/PySpark saves the day.

We’ll keep it practical: why big data matters in India 2026, Hadoop overview (legacy but foundational), PySpark deep dive with examples, and hands-on tips for large datasets.

Why Big Data & Scalability Matter in 2026 (India Context)

  • Datasets exploding: Telecom (Jio/Airtel) → petabytes of call detail records (CDR), recharge patterns. Fintech → UPI pushing toward a billion txns/day. E-commerce → clickstream logs >100 GB daily.
  • Pandas limit: Single machine, RAM-bound (~16–64 GB typical laptop/server). 50 GB+ CSV → crash or crawl.
  • Spark advantage: Distributed — split work across 10–100+ machines (cluster). Fault-tolerant (node fails → retry), scalable (add nodes), in-memory fast.
  • 2026 reality: Databricks (Spark-based) + Snowflake + GCP BigQuery dominate cloud. PySpark = Python-friendly entry to big data jobs (₹15–40 LPA mid-level in Hyderabad/Bangalore).

Pandas vs PySpark quick vibe check (2026):

  • <5–10 GB, EDA/exploration → Pandas (faster, familiar).
  • 10 GB and up, or production ETL/pipelines → PySpark (scales, fault-tolerant).
  • Hybrid: PySpark for the heavy lifting → convert the (now small) result to Pandas for final viz/modeling (see the snippet below).
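
A minimal sketch of the hybrid pattern (the path and column names are placeholders):

    # Heavy aggregation stays in Spark; only the small result moves to Pandas
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("HybridDemo").getOrCreate()

    txns = spark.read.parquet("s3a://my-bucket/transactions/")   # hypothetical path

    monthly = (txns.groupBy("customer_id", "month")
                   .agg(F.sum("amount").alias("total_spend")))

    pdf = monthly.toPandas()   # safe only because the aggregated result is small
    print(pdf.head())          # from here on: normal Pandas / matplotlib / scikit-learn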

Hadoop Ecosystem Overview

Hadoop (2006 era) started the big data revolution — open-source, commodity hardware clusters. In 2026 it's legacy infrastructure (most companies have migrated to cloud-native platforms like Databricks/Snowflake), but understanding it helps in interviews and on resumes (many enterprises still run Hadoop clusters).

Core 4 components (Hadoop 3.x era):

  1. HDFS (Hadoop Distributed File System) — Distributed storage.
    • Files split into 128–256 MB blocks → replicated (default 3x) across nodes.
    • NameNode (master: metadata), DataNodes (slaves: store blocks).
    • Fault-tolerant: If node dies, replicas on others serve data.
  2. YARN (Yet Another Resource Negotiator) — Resource manager & scheduler.
    • Handles cluster resources (CPU, memory).
    • Applications (Spark jobs, MapReduce) request containers → YARN allocates.
  3. MapReduce — Original batch processing framework.
    • Map → process data in parallel. Reduce → aggregate.
    • Slow (disk-heavy), largely replaced by Spark.
  4. Hadoop Common — Utilities, libraries shared by all.

Popular ecosystem tools (still relevant in legacy setups):

  • Hive — SQL on Hadoop (HiveQL → MapReduce/Spark under hood). Great for analysts.
  • Pig — Scripting for ETL (Pig Latin).
  • HBase — NoSQL column store (real-time read/write on HDFS).
  • Sqoop — Import/export RDBMS ↔ Hadoop.
  • Flume/Kafka — Streaming ingestion.
  • Oozie — Workflow scheduler.
  • Zookeeper — Coordination service.

2026 status: Pure Hadoop rare. Spark-on-YARN or Databricks (Spark + Delta Lake) dominates. Know HDFS/YARN/MapReduce concepts — interviewers ask “How does Spark differ from MapReduce?” (in-memory vs disk, DAG vs 2-stage).

Introduction to Spark (PySpark)

Apache Spark (2010s) — faster, general-purpose successor to MapReduce. In-memory processing, unified engine (batch, streaming, ML, SQL, graphs).

Key concepts:

  • RDD (Resilient Distributed Dataset) — low-level, immutable, fault-tolerant collection. Rarely used directly now.
  • DataFrame / Dataset — higher-level, like a Pandas table (the typed Dataset API is Scala/Java only, so in PySpark you work with DataFrames). Optimized under the hood (Catalyst optimizer, Tungsten execution).
  • SparkSession — entry point (replaces old SparkContext + SQLContext).
  • Lazy evaluation — transformations build a plan; actions trigger computation (see the small sketch after this list).
  • Partitions — data split across cluster (parallelism).
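
A tiny sketch of lazy evaluation in action (the file path and column name are placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("LazyDemo").getOrCreate()
    df = spark.read.parquet("events.parquet")                  # placeholder path

    # Transformations: nothing executes yet, Spark only builds a logical plan
    heavy_users = df.filter(F.col("data_mb") > 1000).select("customer_id")

    # Action: only now does Spark optimize the plan and run it across partitions
    print(heavy_users.count())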

Install PySpark locally (for learning; production runs on a cluster):

Bash
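
A typical local setup (assumes Python 3 and a Java JDK, e.g. 8/11/17, are already installed):

    pip install pyspark      # pulls in Spark itself via pip
    pyspark                  # optional: launches an interactive PySpark shell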

Basic PySpark script (word count classic — scale to GBs easily):

Python
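
A minimal sketch (the input file path is a placeholder; any large text file works):

    # word_count.py -- count words in a text file with the DataFrame API
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    lines = spark.read.text("sample.txt")          # one row per line, column "value"

    # Split each line on whitespace, explode to one word per row, then count
    word_counts = (
        lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
             .filter(F.col("word") != "")
             .groupBy("word")
             .count()
             .orderBy(F.col("count").desc())
    )

    word_counts.show(10)   # action: only now does Spark actually run the job
    spark.stop()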

Spark on large datasets (e.g., scaled Telco churn logs — imagine 100 GB of daily customer events):

  1. Read from cloud storage (S3/GCS/Azure Blob):

Python
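
A sketch assuming a hypothetical bucket of daily Parquet event logs (cloud credentials/connectors already configured on the cluster):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ChurnLogs").getOrCreate()

    # Parquet read from S3 (path is hypothetical; needs the hadoop-aws connector)
    events = spark.read.parquet("s3a://my-telco-bucket/events/dt=2026-01-29/")

    # CSV also works, but it's slower and schema inference rescans the data
    # events = spark.read.csv("gs://my-bucket/events/*.csv", header=True, inferSchema=True)

    events.printSchema()
    print(events.count())    # action: triggers the actual distributed read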

  2. EDA at scale (no Pandas crash):

Python
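
A sketch continuing with the events DataFrame from step 1 (column names like customer_id, circle, data_mb are assumptions):

    from pyspark.sql import functions as F

    events.select("event_type").distinct().show()

    # Null counts per column, computed on the cluster (nothing big hits the driver)
    events.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in events.columns]
    ).show()

    # Aggregations stay distributed; only the small summary comes back
    (events.groupBy("circle")
           .agg(F.countDistinct("customer_id").alias("users"),
                F.avg("data_mb").alias("avg_data_mb"))
           .orderBy(F.col("users").desc())
           .show(20))

    events.describe("data_mb").show()   # quick numeric summary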

  3. Handle big joins (e.g., join customer master + transaction logs):

Python
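
A sketch reusing the same events DataFrame; customers is a hypothetical, much smaller master table:

    from pyspark.sql import functions as F
    from pyspark.sql.functions import broadcast

    customers = spark.read.parquet("s3a://my-telco-bucket/customer_master/")  # placeholder

    # Broadcasting the small side avoids shuffling the huge table across the network
    joined = events.join(broadcast(customers), on="customer_id", how="left")

    # For two genuinely large tables, let Spark shuffle-join and lean on AQE (Spark 3+)
    # spark.conf.set("spark.sql.adaptive.enabled", "true")

    (joined.groupBy("plan_type")
           .agg(F.sum("data_mb").alias("total_data_mb"))
           .show())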

  4. ML at scale (MLlib):

Python
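
A sketch of a churn model with the DataFrame-based MLlib API (the feature table and column names are assumptions for illustration):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # A prepared feature table with numeric columns and a 0/1 "churn" label (hypothetical)
    features_df = spark.read.parquet("s3a://my-telco-bucket/churn_features/")

    assembler = VectorAssembler(
        inputCols=["monthly_charges", "tenure_months", "data_mb"],  # illustrative features
        outputCol="features",
    )
    lr = LogisticRegression(featuresCol="features", labelCol="churn")
    pipeline = Pipeline(stages=[assembler, lr])

    train, test = features_df.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)      # training runs distributed across the cluster

    preds = model.transform(test)
    auc = BinaryClassificationEvaluator(labelCol="churn").evaluate(preds)
    print(f"AUC: {auc:.3f}")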

Tips for large datasets (a short tuning sketch follows the list):

  • Use Parquet/ORC format (columnar, compressed) — 5–10x smaller/faster than CSV.
  • Cache/persist hot data: df.cache() or df.persist().
  • Tune partitions: spark.conf.set("spark.sql.shuffle.partitions", 200) (match cluster cores).
  • Avoid UDFs if possible — use built-in functions (faster).
  • Monitor Spark UI (localhost:4040) — spot skew, spills to disk.
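
A short tuning sketch pulling a few of these tips together (paths and the partition count are placeholders to adjust for your cluster):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TuningDemo").getOrCreate()

    # Default is 200; roughly match your cluster's total cores
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    raw = spark.read.csv("raw_events.csv", header=True, inferSchema=True)   # placeholder path

    # Convert once to Parquet; downstream jobs read the smaller columnar copy
    raw.write.mode("overwrite").parquet("events_parquet/")

    hot = spark.read.parquet("events_parquet/").cache()   # keep frequently used data in memory
    hot.count()                                           # action that materializes the cache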

Real-world India example: Telecom churn — process 1 TB CDR logs daily → aggregate monthly usage per user → join with CRM → train churn model weekly on Databricks/Spark cluster.
