Chapter 8: Machine Learning Fundamentals

Machine Learning Fundamentals, explained like we’re back in that Airoli café — screens side by side, me running code cells live while you ask questions. This is the chapter where we finally build and evaluate models using the Telco Customer Churn dataset we prepped in Chapter 7. In 2026, even with GenAI tools helping generate code, understanding why a model works (or fails) is what separates junior from mid/senior roles in Indian companies (Paytm, PhonePe, Jio, startups in Navi Mumbai).

We’ll cover supervised learning only (regression + classification), evaluation metrics (especially important for imbalanced churn ~27%), overfitting concepts, and tuning — all with code you can copy-paste into your Jupyter notebook.

Quick setup reminder (from Chapter 7):

  • Load cleaned df (Churn = 0/1, dropped customerID, encoded categoricals, scaled numerics, etc.)
  • Split data:
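Here's a minimal sketch of that split, assuming the cleaned DataFrame from Chapter 7 is loaded as df with Churn encoded as 0/1 (adjust column names if yours differ):

```python
from sklearn.model_selection import train_test_split

# Assumption: df is the cleaned Telco DataFrame from Chapter 7, Churn already 0/1
X = df.drop('Churn', axis=1)
y = df['Churn']

# stratify=y keeps the ~27% churn ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Keep X_test and y_test aside until the final evaluation; everything below trains and tunes on the training portion only.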

1. Supervised Learning Overview

Supervised = we have labeled data (features X → target y). Goal: learn mapping X → y so it generalizes to new data.

  • Regression: Predict continuous number (e.g., MonthlyCharges if missing, or house price).
  • Classification: Predict category (here: Churn Yes/No → binary classification).

2. Regression: Linear, Ridge, Lasso, Polynomial

Even though our churn target is binary, let's first demo regression on a continuous target from the same dataset: predicting TotalCharges. We won't carry this into the churn project (churn is a classification problem); it's here just for completeness so you see the regression workflow once.

Linear Regression (baseline):

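A rough sketch of that demo, assuming we treat TotalCharges as the continuous target and the remaining prepped columns as features (column names follow the Chapter 7 DataFrame; adjust if yours differ):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative only: predict TotalCharges from the other prepped features
X_reg = df.drop(['Churn', 'TotalCharges'], axis=1)
y_reg = df['TotalCharges']

Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

lin = LinearRegression()
lin.fit(Xr_train, yr_train)
preds = lin.predict(Xr_test)

print('R²  :', r2_score(yr_test, preds))
print('RMSE:', mean_squared_error(yr_test, preds) ** 0.5)
```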

Ridge / Lasso — Regularized linear (prevent overfitting, handle multicollinearity):

  • Ridge (L2): Shrinks coefficients → good when features correlated (tenure & charges_per_month_trend).
  • Lasso (L1): Can set some coefficients to zero → feature selection.
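A sketch of both, reusing the TotalCharges split from the linear regression block above; the alpha values are illustrative starting points, not tuned:

```python
from sklearn.linear_model import Ridge, Lasso

# Ridge (L2): shrinks all coefficients toward zero, but none exactly to zero
ridge = Ridge(alpha=1.0)
ridge.fit(Xr_train, yr_train)
print('Ridge R²:', ridge.score(Xr_test, yr_test))

# Lasso (L1): can set weak coefficients exactly to zero (built-in feature selection)
lasso = Lasso(alpha=0.1)
lasso.fit(Xr_train, yr_train)
print('Lasso R²:', lasso.score(Xr_test, yr_test))
print('Features dropped by Lasso:', (lasso.coef_ == 0).sum())
```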

Polynomial Regression — Captures non-linear relationships (e.g., the effect of tenure may curve rather than stay linear):

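A sketch with a degree-2 pipeline on the same split; the degree is an assumption you would tune:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# degree=2 adds squared and interaction terms; higher degrees overfit quickly
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(Xr_train, yr_train)
print('Polynomial R²:', poly_model.score(Xr_test, yr_test))
```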

For churn → we use Logistic instead (next).

3. Classification: Logistic Regression, Decision Trees, Random Forest

Logistic Regression — A linear model for binary classification (passes the linear combination of features through a sigmoid to output a probability between 0 and 1).

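A minimal sketch on the churn split from the setup block; max_iter=1000 and class_weight='balanced' are sensible assumptions for this dataset, not requirements:

```python
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' compensates for the ~27% churn minority
log_reg = LogisticRegression(max_iter=1000, class_weight='balanced')
log_reg.fit(X_train, y_train)

y_pred_lr = log_reg.predict(X_test)              # hard 0/1 predictions
y_prob_lr = log_reg.predict_proba(X_test)[:, 1]  # churn probability per customer
```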

Decision Tree — Simple, interpretable, but overfits easily.

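A sketch with max_depth capped at 5 (an illustrative value) so the train/test gap stays visible but controlled:

```python
from sklearn.tree import DecisionTreeClassifier

# Without max_depth the tree keeps splitting until it memorizes the training set
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)

print('Train accuracy:', tree.score(X_train, y_train))
print('Test accuracy :', tree.score(X_test, y_test))
```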

Random Forest — Ensemble of trees (bagging + feature randomness) → robust, less overfitting.

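A sketch with 200 trees and balanced class weights; both settings are assumptions you can tune later:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,          # number of trees in the ensemble
    class_weight='balanced',   # handle the churn imbalance
    random_state=42,
    n_jobs=-1,                 # use all CPU cores
)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]  # used for ROC-AUC below
```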

Feature importance (great for insights):

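A quick way to pull importances from the fitted forest in the previous block:

```python
import pandas as pd

# Impurity-based importances; index comes from the training feature names
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```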

4. Model Evaluation: Accuracy, Precision, Recall, F1, ROC-AUC, Confusion Matrix

Churn is imbalanced (~73% No, ~27% Yes) → accuracy alone is misleading (a model that predicts No for every customer scores ~73% accuracy yet catches zero churners).

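A sketch computing all of these for the random forest predictions from Section 3 (y_pred_rf and y_prob_rf):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix,
                             classification_report)

# Label-based metrics use the hard predictions...
print('Accuracy :', accuracy_score(y_test, y_pred_rf))
print('Precision:', precision_score(y_test, y_pred_rf))
print('Recall   :', recall_score(y_test, y_pred_rf))
print('F1       :', f1_score(y_test, y_pred_rf))
# ...while ROC-AUC needs the predicted probabilities
print('ROC-AUC  :', roc_auc_score(y_test, y_prob_rf))

# Rows = actual (No, Yes), columns = predicted (No, Yes)
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
```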

Interpretation for churn:

  • High Recall → catch most churners (business wants to retain them → offer discounts).
  • High Precision → when we predict churn, it’s correct (don’t waste offers on loyal customers).
  • F1 balances both.
  • ROC-AUC good overall (threshold-independent).

Typical results on Telco (after tuning):

  • Logistic: AUC ~0.84
  • RF: AUC ~0.85–0.87 (better)

5. Cross-validation, Overfitting/Underfitting, Bias-Variance Tradeoff

Overfitting — Model memorizes the training data → great train score, poor test score.

Underfitting — Model too simple → poor on both train and test.

Bias-Variance Tradeoff:

  • High bias (underfit): Simple model misses patterns.
  • High variance (overfit): Complex model fits noise.

Cross-validation (k-fold) — More reliable than a single train-test split: the data is split into k folds, each fold takes one turn as the validation set, and we average the k scores.

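A sketch using 5-fold stratified CV on the random forest; scoring='roc_auc' because of the class imbalance:

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Stratified folds preserve the ~27% churn ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X_train, y_train, cv=cv, scoring='roc_auc')

print('AUC per fold:', scores.round(3))
print('Mean AUC: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))
```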

Plot learning curves (train vs val error):

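A sketch using scikit-learn's learning_curve helper on the random forest; the train sizes and fold count are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    rf, X_train, y_train, cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 5), n_jobs=-1
)

# A persistent gap between the curves suggests high variance (overfitting);
# both curves plateauing low suggests high bias (underfitting)
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Train AUC')
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation AUC')
plt.xlabel('Training set size')
plt.ylabel('ROC-AUC')
plt.legend()
plt.show()
```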

6. Hyperparameter Tuning (GridSearchCV, RandomizedSearchCV)

GridSearchCV — Exhaustive (slow on big grid).

RandomizedSearchCV — Samples randomly (faster, often better).

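A sketch of both searches over a small, illustrative random forest grid; the parameter values are assumptions, not recommendations:

```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 2, 5],
}

# GridSearchCV: tries all 27 combinations, each evaluated with 5-fold CV
grid = GridSearchCV(rf, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)
print('Best params:', grid.best_params_)
print('Best CV AUC:', grid.best_score_)

# RandomizedSearchCV: samples only n_iter combinations from the same space
rand = RandomizedSearchCV(rf, param_grid, n_iter=10, cv=5,
                          scoring='roc_auc', n_jobs=-1, random_state=42)
rand.fit(X_train, y_train)
print('Best params (random):', rand.best_params_)

best_model = grid.best_estimator_  # already refit on the full training set
```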

In 2026: use Optuna or other Bayesian optimization tools for faster tuning, but GridSearchCV and RandomizedSearchCV are still interview classics.

Final Project Tip: In your notebook:

  • Compare Logistic vs Tree vs RF (with/without tuning).
  • Pick best model (usually tuned RF).
  • Save: import joblib; joblib.dump(best_model, 'churn_rf_model.pkl')
  • Business story: “Model catches 82% of churners (recall), allowing targeted retention offers → potential ₹X crore saved.”

That’s Chapter 8 — the core of predictive modeling!
