Chapter 9: Advanced Machine Learning
Advanced Machine Learning, explained like we’re deep into a late-night coding session in Airoli — your Jupyter notebook glowing, me walking you through each concept with real code snippets using the Telco Customer Churn dataset we’ve been building on since Chapter 7. This chapter takes us beyond the basics: we push performance with ensembles, explore hidden patterns with unsupervised methods, tame the curse of dimensionality, dip into time series (relevant for churn trends over months), and select features more intelligently.
In the 2026 Indian job market (fintech/telecom players like Jio, Airtel, PhonePe), these techniques are expected in mid-level interviews and projects — especially XGBoost/LightGBM for tabular wins, PCA/UMAP for visualization, and feature selection to explain “why churn happens”.
1. Ensemble Methods: Bagging, Boosting (XGBoost, LightGBM, CatBoost)
Bagging (Bootstrap Aggregating) — Train many models on random bootstrap subsets → average/vote their predictions. Reduces variance (e.g., Random Forest from Chapter 8 is bagging + random feature subsets); see the quick sketch after this comparison.
Boosting — Sequential: each model fixes previous errors. Focus on hard examples.
XGBoost (eXtreme Gradient Boosting) — Still king for control + speed (parallel, regularization, GPU). Great for competitions.
LightGBM (Microsoft) — Often fastest on large data (leaf-wise growth, histogram binning, lower memory).
CatBoost (Yandex) — Best out-of-box on categorical data (ordered boosting, no leakage, automatic handling).
From 2025–2026 benchmarks: LightGBM usually edges out the others on training speed for huge numerical datasets; CatBoost wins on ease of use and categorical handling; XGBoost stays reliable and highly tunable. Always benchmark on your own data!
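Boosting gets plenty of code below, so here is only a minimal sketch of plain bagging using scikit-learn’s BaggingClassifier around decision trees — the hyperparameters are illustrative, not tuned for Telco churn, and X_train/y_train are assumed from Chapter 8 (sklearn ≥1.2 calls the base-learner argument estimator; older versions use base_estimator).

# Minimal bagging sketch (illustrative settings, assumes X_train/y_train/X_test/y_test from Ch 8)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=8),  # base learner (base_estimator in sklearn < 1.2)
    n_estimators=200,   # number of bootstrap models to average
    max_samples=0.8,    # each tree sees a random 80% sample of rows (with replacement by default)
    n_jobs=-1,
    random_state=42
)
bag.fit(X_train, y_train)
print("Bagging ROC-AUC:", round(roc_auc_score(y_test, bag.predict_proba(X_test)[:, 1]), 4))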
Installs (if needed):
pip install xgboost lightgbm catboost
XGBoost on Telco Churn (building on our prepped df from Ch 7/8)
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=(len(y_train) - sum(y_train)) / sum(y_train),  # for imbalance
    random_state=42,
    n_jobs=-1,
    eval_metric='auc'
)
xgb.fit(X_train, y_train)

probs_xgb = xgb.predict_proba(X_test)[:, 1]
print("XGBoost ROC-AUC:", roc_auc_score(y_test, probs_xgb).round(4))
# Typical: 0.86–0.88 after tuning
Feature importance (business gold):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

importances = pd.Series(xgb.feature_importances_, index=X.columns).sort_values(ascending=False)[:15]
sns.barplot(x=importances, y=importances.index)
plt.title('XGBoost Top Features - Churn Drivers')
plt.show()
# Expect: Contract_Month-to-month, tenure, MonthlyCharges, fiber optic, etc.
LightGBM (faster training):
import lightgbm as lgb

lgb_model = lgb.LGBMClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=7,
    num_leaves=31,
    colsample_bytree=0.8,
    subsample=0.8,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1,
    metric='auc'
)
lgb_model.fit(X_train, y_train)
print("LightGBM ROC-AUC:", roc_auc_score(y_test, lgb_model.predict_proba(X_test)[:, 1]).round(4))
CatBoost (categorical heaven — no need for one-hot on original categoricals!):
from catboost import CatBoostClassifier

# Auto-detect categorical columns
cat_cols = [col for col in X_train.columns
            if X_train[col].dtype == 'object' or 'category' in str(X_train[col].dtype)]

cat_model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.03,
    depth=6,
    cat_features=cat_cols,
    auto_class_weights='Balanced',
    verbose=100,
    random_state=42
)
cat_model.fit(X_train, y_train)
print("CatBoost ROC-AUC:", roc_auc_score(y_test, cat_model.predict_proba(X_test)[:, 1]).round(4))
# Often best out-of-box on mixed data
Tip: Ensemble them (VotingClassifier or stacking) → a +0.5–1% AUC lift is common; see the quick blend sketch below.
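Here’s a quick probability blend as a sketch of that idea — equal weights are an assumption, not tuned; a full VotingClassifier or StackingClassifier with weights chosen on a validation set is the more rigorous route.

# Quick soft-vote: average the predicted churn probabilities of the three boosters (equal weights assumed)
from sklearn.metrics import roc_auc_score

blend = (xgb.predict_proba(X_test)[:, 1]
         + lgb_model.predict_proba(X_test)[:, 1]
         + cat_model.predict_proba(X_test)[:, 1]) / 3
print("Blended ROC-AUC:", round(roc_auc_score(y_test, blend), 4))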
2. Unsupervised Learning
No labels — find patterns.
Clustering: Group similar customers.
K-Means (centroid-based, needs k):
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale (already did most in Ch7, but re-scale for clustering)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(
    X.drop(columns=[c for c in X.columns if 'churn' in c.lower() or 'target' in c.lower()]))  # exclude the target if present

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)
df['cluster_kmeans'] = clusters
df.groupby('cluster_kmeans')['Churn'].mean()  # churn rate per cluster
Elbow method for k:
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
DBSCAN (density-based, flags noise points, no need to pick k):
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
clusters_db = db.fit_predict(X_scaled[:2000])  # subsample for speed
pd.Series(clusters_db).value_counts()  # label -1 = noise points
Hierarchical (dendrogram, good for small data):
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(X_scaled[:500], method='ward')  # subsample
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
Use clusters for segmentation: “Cluster 0: High-tenure loyal → low churn”.
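To back up that kind of label with numbers, here is a quick profiling sketch — the tenure and MonthlyCharges column names assume the Telco dataframe from Chapter 7.

# Profile each K-Means cluster (tenure/MonthlyCharges names assume the Ch 7 Telco dataframe)
profile = (df.groupby('cluster_kmeans')
             .agg(customers=('Churn', 'size'),
                  churn_rate=('Churn', 'mean'),
                  avg_tenure=('tenure', 'mean'),
                  avg_monthly=('MonthlyCharges', 'mean'))
             .round(2))
print(profile)  # e.g. a high-tenure, low churn_rate cluster = your "loyal" segment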
3. Dimensionality Reduction: PCA, t-SNE, UMAP
PCA (linear, variance-maximizing):
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep 95% variance
X_pca = pca.fit_transform(X_scaled)
print("Components needed:", pca.n_components_)
print("Explained variance:", pca.explained_variance_ratio_.cumsum())

# Visualize top 2
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['Churn'], cmap='viridis', alpha=0.6)
plt.title('PCA - Churn Separation')
plt.colorbar(label='Churn (1=yes)')
plt.show()
t-SNE (non-linear, great viz, slow on big data):
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled[:2000])  # subsample

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=df['Churn'][:2000], cmap='viridis')
plt.title('t-SNE Visualization of Churn')
plt.show()
# Clusters may show churn groups better than PCA
UMAP (faster than t-SNE, preserves global structure better):
# pip install umap-learn
import umap

umap_reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = umap_reducer.fit_transform(X_scaled[:2000])

plt.scatter(X_umap[:, 0], X_umap[:, 1], c=df['Churn'][:2000], cmap='viridis')
plt.title('UMAP - Better Global Structure')
plt.show()
UMAP is often the preferred choice in 2026 for visualization, and its low-dimensional embeddings can also double as a pre-processing step before a classifier; a minimal sketch follows.
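A minimal sketch of that pre-processing idea, assuming the X_scaled matrix from the clustering section and a 0/1 Churn column; the 10-component setting is illustrative, not tuned.

# UMAP embeddings as input features for a simple classifier (illustrative settings)
import umap
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Note: for a real pipeline, fit UMAP on the training split only to avoid mild leakage
emb = umap.UMAP(n_components=10, random_state=42).fit_transform(X_scaled)
Xe_tr, Xe_te, ye_tr, ye_te = train_test_split(
    emb, df['Churn'], test_size=0.2, random_state=42, stratify=df['Churn'])

clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(Xe_tr, ye_tr)
print("LogReg on UMAP features ROC-AUC:",
      round(roc_auc_score(ye_te, clf.predict_proba(Xe_te)[:, 1]), 4))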
4. Time Series Analysis Basics (ARIMA, Prophet Intro)
Churn can have trends (e.g., monthly churn rate).
Aggregate to monthly churn rate:
# Assume you have a date column (the raw Telco CSV doesn't; simulate one if needed)
df['join_month'] = pd.to_datetime(df['join_date'])

# Monthly churn rate
monthly_churn = df.groupby(pd.Grouper(key='join_month', freq='M'))['Churn'].mean().reset_index()
monthly_churn.columns = ['ds', 'y']  # Prophet format: ds (date), y (value)
ARIMA (classic):
from statsmodels.tsa.arima.model import ARIMA

model_arima = ARIMA(monthly_churn['y'], order=(1, 1, 1))
fit_arima = model_arima.fit()
forecast = fit_arima.forecast(steps=6)
print(forecast)
Prophet (easy, handles seasonality/holidays):
from prophet import Prophet

prophet_model = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
prophet_model.fit(monthly_churn)

future = prophet_model.make_future_dataframe(periods=12, freq='M')
forecast = prophet_model.predict(future)

prophet_model.plot(forecast)
plt.title('Prophet Forecast - Future Monthly Churn Rate')
plt.show()

prophet_model.plot_components(forecast)  # trend, seasonality
Prophet shines for business forecasting: it auto-detects changepoints and can add holiday effects (e.g., festive-season spikes); a minimal holiday sketch follows.
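A minimal sketch of the holiday idea, using Prophet's built-in country-holiday support (add_country_holidays must be called before fit); with monthly data this is mostly illustrative, since holidays matter more at daily granularity.

# Prophet with Indian public holidays added (configure before fitting)
from prophet import Prophet

m_hol = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
m_hol.add_country_holidays(country_name='IN')  # via the holidays package
m_hol.fit(monthly_churn)

future_hol = m_hol.make_future_dataframe(periods=12, freq='M')
forecast_hol = m_hol.predict(future_hol)
print(m_hol.train_holiday_names)  # which holidays landed inside the training window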
5. Feature Selection Techniques
Reduce noise, speed up, improve interpretability.
From tree models (built-in):
# XGBoost importance
xgb_importance = pd.Series(xgb.feature_importances_, index=X.columns).sort_values(ascending=False)
top_features = xgb_importance[xgb_importance > 0.01].index.tolist()  # threshold
Recursive Feature Elimination (RFE):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=15)
rfe.fit(X_train, y_train)
selected = X_train.columns[rfe.support_]
print("RFE Selected:", selected)
SelectKBest (statistical):
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)
selected_cols = X_train.columns[selector.get_support()]
Boruta (a wrapper around Random Forest — finds all relevant features, not just a top-k):
# pip install boruta
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf_boruta = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
boruta = BorutaPy(rf_boruta, n_estimators='auto', verbose=2, random_state=42)
boruta.fit(X_train.values, y_train.values)  # BorutaPy expects numpy arrays

confirmed = X_train.columns[boruta.support_]
print("Boruta Confirmed Features:", confirmed)
In practice: combine tree importance with RFE/Boruta, then retrain the final model on the 15–25 top features → often the same or better AUC with faster inference (sketch below).
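A minimal sketch of that workflow, assuming the top_features list (tree importance) and the selected index (RFE) from the snippets above — intersecting the two is one reasonable rule, not the only one.

# Retrain on the features both selection methods agreed on
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

final_feats = [f for f in top_features if f in set(selected)]
print(f"Retraining on {len(final_feats)} features")

xgb_final = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.05,
                          random_state=42, n_jobs=-1, eval_metric='auc')
xgb_final.fit(X_train[final_feats], y_train)
print("Reduced-feature ROC-AUC:",
      round(roc_auc_score(y_test, xgb_final.predict_proba(X_test[final_feats])[:, 1]), 4))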
That’s Chapter 9 — advanced tools that make models production-ready!
