Chapter 8: Machine Learning Fundamentals
Machine Learning Fundamentals, explained like we’re back in that Airoli café — screens side by side, me running code cells live while you ask questions. This is the chapter where we finally build and evaluate models using the Telco Customer Churn dataset we prepped in Chapter 7. In 2026, even with GenAI tools helping generate code, understanding why a model works (or fails) is what separates junior from mid/senior roles in Indian companies (Paytm, PhonePe, Jio, startups in Navi Mumbai).
We’ll cover supervised learning only (regression + classification), evaluation metrics (especially important for imbalanced churn ~27%), overfitting concepts, and tuning — all with code you can copy-paste into your Jupyter notebook.
Quick setup reminder (from Chapter 7):
- Load cleaned df (Churn = 0/1, dropped customerID, encoded categoricals, scaled numerics, etc.)
- Split data:
```python
from sklearn.model_selection import train_test_split

X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```
1. Supervised Learning Overview
Supervised = we have labeled data (features X → target y). Goal: learn mapping X → y so it generalizes to new data.
- Regression: Predict continuous number (e.g., MonthlyCharges if missing, or house price).
- Classification: Predict category (here: Churn Yes/No → binary classification).
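Before touching any model, take one quick look at the target itself; a minimal check, assuming the cleaned df from Chapter 7 is loaded:

```python
# Sanity-check the target before modeling (assumes the cleaned df from Chapter 7)
print(df['Churn'].value_counts(normalize=True).round(3))  # expect roughly 0.73 (No) vs 0.27 (Yes)
print(df.shape)                                           # rows x columns after encoding
```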
2. Regression: Linear, Ridge, Lasso, Polynomial
Even though our target is binary, let's first run a quick regression demo on a continuous column (MonthlyCharges) just to get comfortable with the API; churn itself needs classification, which comes next.
Linear Regression (baseline):
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Toy regression: predict MonthlyCharges (continuous) from tenure
# (MonthlyCharges was scaled in Chapter 7, so the MAE is in scaled units; demo only)
lin_reg = LinearRegression()
lin_reg.fit(X_train[['tenure']], X_train['MonthlyCharges'])

preds = lin_reg.predict(X_test[['tenure']])
print("MAE:", mean_absolute_error(X_test['MonthlyCharges'], preds))
print("R²:", r2_score(X_test['MonthlyCharges'], preds))
```
Ridge / Lasso — Regularized linear (prevent overfitting, handle multicollinearity):
- Ridge (L2): Shrinks coefficients → good when features correlated (tenure & charges_per_month_trend).
- Lasso (L1): Can set some coefficients to zero → feature selection.
```python
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)  # for classification we'd use Logistic; regression API demo only

lasso = Lasso(alpha=0.01)    # higher alpha → more shrinkage
```
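To see Lasso's "coefficients to zero" behaviour in practice, here's a small sketch that fits the demo Lasso above and counts the surviving features (same caveat: for churn we'd normally use logistic regression):

```python
import pandas as pd

lasso.fit(X_train, y_train)  # demo fit only, same caveat as the Ridge above

coef = pd.Series(lasso.coef_, index=X_train.columns)
print("Non-zero coefficients:", int((coef != 0).sum()), "of", len(coef))
print(coef[coef != 0].abs().sort_values(ascending=False).head(10))
# Raising alpha pushes more coefficients to exactly zero → built-in feature selection
```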
Polynomial Regression — Capture non-linear (e.g., tenure effect curves):
```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train[['tenure']])
X_test_poly = poly.transform(X_test[['tenure']])

lin_reg_poly = LinearRegression()
lin_reg_poly.fit(X_train_poly, y_train)
```
For churn → we use Logistic instead (next).
3. Classification: Logistic Regression, Decision Trees, Random Forest
Logistic Regression — Linear for binary (uses sigmoid to output probability 0–1).
```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
log_reg.fit(X_train, y_train)

probs = log_reg.predict_proba(X_test)[:, 1]  # probability of churn=1
preds = log_reg.predict(X_test)
```
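One practical note: predict() applies a 0.5 cutoff to those probabilities, but nothing forces you to keep it. For churn you often lower the threshold to catch more churners at the cost of some precision; a small sketch (the 0.35 value is purely illustrative):

```python
from sklearn.metrics import precision_score, recall_score

threshold = 0.35  # illustrative; tune this on validation data, not the test set
preds_low_threshold = (probs >= threshold).astype(int)

print("Recall    @0.50:", round(recall_score(y_test, preds), 3))
print("Recall    @0.35:", round(recall_score(y_test, preds_low_threshold), 3))
print("Precision @0.35:", round(precision_score(y_test, preds_low_threshold), 3))
# Lower threshold → more customers flagged as churners → recall up, precision usually down
```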
Decision Tree — Simple, interpretable, but overfits easily.
```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                              random_state=42, class_weight='balanced')
tree.fit(X_train, y_train)
```
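To back up the "interpretable" claim, you can dump the learned rules as plain text; a quick sketch using sklearn's export_text (the exact column names depend on your Chapter 7 encoding):

```python
from sklearn.tree import export_text

# Print the top few levels of the learned if/else rules
print(export_text(tree, feature_names=list(X_train.columns), max_depth=3))
# Reads like a flowchart: contract type and tenure usually show up near the root
```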
Random Forest — Ensemble of trees (bagging + feature randomness) → robust, less overfitting.
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, max_depth=10, min_samples_split=5,
                            class_weight='balanced', random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
```
Feature importance (great for insights):
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
sns.barplot(x=importances.values, y=importances.index)
plt.title('Random Forest Feature Importance - Churn Drivers')
plt.show()
# Expect Contract_Month-to-month, tenure and MonthlyCharges near the top
```
4. Model Evaluation: Accuracy, Precision, Recall, F1, ROC-AUC, Confusion Matrix
Churn is imbalanced (~73% No, 27% Yes) → accuracy misleading (predict all No → 73% acc but useless).
```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix, ConfusionMatrixDisplay)

def evaluate_model(model, X_test, y_test, name="Model"):
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]

    print(f"\n=== {name} ===")
    print("Accuracy: ", round(accuracy_score(y_test, preds), 4))
    print("Precision:", round(precision_score(y_test, preds), 4))  # of predicted churners, how many really churned?
    print("Recall:   ", round(recall_score(y_test, preds), 4))     # of real churners, how many did we catch?
    print("F1:       ", round(f1_score(y_test, preds), 4))         # harmonic mean of precision and recall
    print("ROC-AUC:  ", round(roc_auc_score(y_test, probs), 4))    # area under ROC curve (0.5 random, 1.0 perfect)

    cm = confusion_matrix(y_test, preds)
    disp = ConfusionMatrixDisplay(cm, display_labels=['No Churn', 'Churn'])
    disp.plot(cmap='Blues')
    plt.title(f'Confusion Matrix - {name}')
    plt.show()
```
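With the helper in place, evaluating all three classifiers is one line each (assuming the fitted log_reg, tree and rf from section 3):

```python
# Same evaluation, three models, directly comparable output
evaluate_model(log_reg, X_test, y_test, name="Logistic Regression")
evaluate_model(tree, X_test, y_test, name="Decision Tree")
evaluate_model(rf, X_test, y_test, name="Random Forest")
```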
Interpretation for churn:
- High Recall → catch most churners (business wants to retain them → offer discounts).
- High Precision → when we predict churn, it’s correct (don’t waste offers on loyal customers).
- F1 balances both.
- ROC-AUC good overall (threshold-independent).
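Because ROC-AUC is threshold-independent, it's worth plotting the curve itself rather than just quoting the number; a short sketch overlaying two of our models:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots()
RocCurveDisplay.from_estimator(log_reg, X_test, y_test, name='Logistic', ax=ax)
RocCurveDisplay.from_estimator(rf, X_test, y_test, name='Random Forest', ax=ax)
ax.plot([0, 1], [0, 1], linestyle='--', label='Random guess (AUC 0.5)')
ax.legend()
ax.set_title('ROC Curves - Churn Models')
plt.show()
```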
Typical results on Telco (after tuning):
- Logistic: AUC ~0.84
- RF: AUC ~0.85–0.87 (better)
5. Cross-validation, Overfitting/Underfitting, Bias-Variance Tradeoff
Overfitting — Model memorizes train data → great train score, poor test. Underfitting — Too simple → bad on both.
Bias-Variance Tradeoff:
- High bias (underfit): Simple model misses patterns.
- High variance (overfit): Complex model fits noise.
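A concrete way to see this on our data: let an unconstrained tree grow freely and compare its train vs test scores with the depth-limited tree from section 3 (a quick sketch; exact numbers depend on your split):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# No depth limit → the tree keeps splitting until it memorizes the training set
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

for name, model in [('Unconstrained tree', deep_tree), ('Depth-5 tree', tree)]:
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: train AUC {train_auc:.3f} | test AUC {test_auc:.3f}")
# Big train/test gap = high variance (overfit); both scores low = high bias (underfit)
```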
Cross-validation (k-fold) — Better than single train-test split.
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)
print("CV ROC-AUC:", scores.mean().round(4), "±", scores.std().round(4))
```
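The same loop extends naturally to model comparison, which you'll want for the final project anyway (a sketch reusing the untuned models from section 3):

```python
# Cross-validate each baseline model on identical folds for a fair comparison
for name, model in [('Logistic', log_reg), ('Decision Tree', tree), ('Random Forest', rf)]:
    model_scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)
    print(f"{name}: {model_scores.mean():.4f} ± {model_scores.std():.4f}")
```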
Plot learning curves (train vs val error):
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    rf, X, y, cv=5, scoring='roc_auc', train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(train_sizes, train_scores.mean(axis=1), label='Train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
plt.xlabel('Training Set Size')
plt.ylabel('ROC-AUC')
plt.legend()
plt.title('Learning Curve - Random Forest')
plt.show()
# Large gap = high variance (overfitting); both curves low = high bias (underfitting)
```
6. Hyperparameter Tuning (GridSearchCV, RandomizedSearchCV)
GridSearchCV — Exhaustive (slow on big grid).
RandomizedSearchCV — Samples randomly (faster, often better).
```python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [8, 10, 12, None],
    'min_samples_split': [2, 5, 10],
    'class_weight': ['balanced', None]
}

# Grid search (slow but thorough)
grid = GridSearchCV(rf, param_grid, cv=5, scoring='roc_auc', n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV AUC:", round(grid.best_score_, 4))

# Or randomized search (faster)
rand = RandomizedSearchCV(rf, param_grid, n_iter=20, cv=5, scoring='roc_auc',
                          n_jobs=-1, random_state=42)
rand.fit(X_train, y_train)
```
In 2026: use Optuna or other Bayesian optimization for faster tuning, but GridSearchCV/RandomizedSearchCV are still interview classics.
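If you want to try that route, here's a minimal Optuna sketch, assuming you've run pip install optuna (the search ranges and trial count are illustrative, not tuned recommendations):

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Optuna proposes one candidate configuration per trial
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int('n_estimators', 100, 400),
        max_depth=trial.suggest_int('max_depth', 4, 16),
        min_samples_split=trial.suggest_int('min_samples_split', 2, 20),
        class_weight='balanced', random_state=42, n_jobs=-1)
    return cross_val_score(model, X_train, y_train, cv=3, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=25)
print("Best params:", study.best_params)
print("Best CV AUC:", round(study.best_value, 4))
```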
Final Project Tip: In your notebook:
- Compare Logistic vs Tree vs RF (with/without tuning).
- Pick best model (usually tuned RF).
- Save: import joblib; joblib.dump(best_model, 'churn_rf_model.pkl') (see the sketch after this list).
- Business story: “Model catches 82% of churners (recall), allowing targeted retention offers → potential ₹X crore saved.”
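And a tiny sketch of that save/reload step, so you can prove the pickled model still works (the file name is just a suggestion, and best_model is assumed to come from the GridSearchCV run in section 6):

```python
import joblib
from sklearn.metrics import roc_auc_score

best_model = grid.best_estimator_           # assumes the tuned grid from section 6
joblib.dump(best_model, 'churn_rf_model.pkl')

loaded = joblib.load('churn_rf_model.pkl')  # reload and sanity-check on the test set
print("Reloaded model test AUC:",
      round(roc_auc_score(y_test, loaded.predict_proba(X_test)[:, 1]), 4))
```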
That’s Chapter 8 — the core of predictive modeling!
