Chapter 7: Exploratory Data Analysis + Feature Engineering (Combined Project Phase)
Think of this chapter as EDA + Feature Engineering explained while we sit side by side in Airoli: your laptop on one side, mine on the other, Jupyter Notebook open, chai getting cold because we're too deep in the code. This is the chapter where everything clicks: you stop treating data as "just numbers" and start reading it as a story with business meaning.
In real 2026 data science jobs (especially in India — fintech, e-commerce, startups in Mumbai/Navi Mumbai/Hyderabad), EDA + Feature Engineering is 60–80% of your time before any model touches the data. Companies don’t pay for fancy XGBoost if the features suck or you missed obvious patterns.
We’ll do this hands-on with a realistic example: Telco Customer Churn (very common in Indian telecom/fintech interviews — think Jio, Airtel, or banking apps). It’s perfect for India context: high churn due to competition, prepaid/postpaid switches, recharge patterns, etc.
Dataset link (download CSV from Kaggle): https://www.kaggle.com/datasets/blastchar/telco-customer-churn
Columns snapshot (after quick peek):
- customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, …, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn (Yes/No target)
Full EDA Workflow on a Real Dataset (Step-by-Step)
Step 1: Load & First Look
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Load
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
print(df.shape)             # (7043, 21)
df.head(8)
df.info()                   # TotalCharges is object → needs fix
df.describe(include='all')  # tenure 0–72 months, MonthlyCharges ~₹20–120 equiv
```
Immediate fixes (common in real data):
```python
# TotalCharges has ' ' for new customers → convert & fill
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(0)  # or median if you prefer
# (assignment instead of inplace fillna on a column: chained inplace is deprecated in recent pandas)

# Churn to 0/1 for modeling later
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
```
Step 2: Quality Check
```python
# Missing
df.isnull().sum()       # usually only TotalCharges had some

# Duplicates
df.duplicated().sum()   # 0 → good

# Cardinality (how many unique values per column)
df.nunique()
# customerID: 7043 → drop it later (no info)
# gender: 2, SeniorCitizen: 2, tenure: 73, etc.
```
Step 3: Univariate Analysis
- Numerical: tenure, MonthlyCharges, TotalCharges
```python
sns.histplot(df['tenure'], bins=30, kde=True)
plt.title('Tenure Distribution - Many new & long-loyal customers')
plt.show()

sns.boxplot(x=df['Churn'], y=df['MonthlyCharges'])
plt.show()
# Insight: higher monthly charges → more churn (price-sensitive market)
```
- Categorical: gender, Contract, PaymentMethod, InternetService
```python
sns.countplot(x='Contract', hue='Churn', data=df)
plt.title('Month-to-month contracts churn way more!')
plt.show()

# Overall churn rate
print(df['Churn'].mean() * 100)   # ~26.5% — imbalanced!
```
Key insights so far (markdown in your notebook):
- ~73% non-churn → imbalanced classification problem.
- Month-to-month contracts + fiber optic + electronic check payment = high-churn groups (quick verification after this list).
- Tenure: New customers (0–6 months) churn fast → survival analysis hint.
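Before jumping to Step 4, it's worth putting a number on that second bullet. A minimal sanity check (the printed figures will depend on your copy of the data):

```python
# Churn rate inside the suspected high-risk segment vs. overall
high_risk = df[(df['Contract'] == 'Month-to-month') &
               (df['InternetService'] == 'Fiber optic') &
               (df['PaymentMethod'] == 'Electronic check')]
print(f"High-risk segment churn: {high_risk['Churn'].mean():.1%}")
print(f"Overall churn:           {df['Churn'].mean():.1%}")
```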
Step 4: Bivariate / Multivariate
```python
# Correlation heatmap (numeric only)
numeric_df = df.select_dtypes(include=np.number)
corr = numeric_df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
# tenure & TotalCharges highly correlated (0.83) → multicollinearity later

# Pairplot for a quick multivariate view
sns.pairplot(df[['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']],
             hue='Churn', diag_kind='kde')
```
Step 5: Deep Dive Insights
Groupby magic:
```python
df.groupby('InternetService')['Churn'].mean().sort_values(ascending=False)
# Fiber optic: ~42% churn, DSL: ~19%, No: ~7% → fiber expensive/unreliable?

df.groupby('PaymentMethod')['MonthlyCharges'].mean()
# Electronic check pays the most → maybe less loyal?

# New feature idea already: senior + partner + dependents = family stability?
```
Feature Engineering (The Real Magic)
1. Feature Creation
```python
# Tenure groups (common in churn problems)
df['tenure_group'] = pd.cut(df['tenure'],
                            bins=[0, 6, 12, 24, 36, 48, 60, 72],
                            labels=['0-6m', '7-12m', '1-2y', '2-3y', '3-4y', '4-5y', '5+y'],
                            include_lowest=True)   # so tenure=0 lands in '0-6m', not NaN

# Avg monthly charge trend (low tenure but high charges → churn risk)
df['charges_per_month_trend'] = df['TotalCharges'] / (df['tenure'] + 1)  # +1 avoids div-by-zero

# Services bundle count (more services → stickier?)
service_cols = ['PhoneService', 'MultipleLines', 'InternetService',
                'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                'TechSupport', 'StreamingTV', 'StreamingMovies']
# 'Yes' covers most columns; InternetService uses 'DSL'/'Fiber optic'/'No' instead
df['num_services'] = (df[service_cols] == 'Yes').sum(axis=1) \
                     + (df['InternetService'] != 'No').astype(int)

# Family flag
df['has_family'] = ((df['Partner'] == 'Yes') | (df['Dependents'] == 'Yes')).astype(int)
```
2. Encoding Categorical Variables
```python
# One-hot (nominal, low cardinality)
df = pd.get_dummies(df, columns=['gender', 'Contract', 'PaymentMethod', 'InternetService'],
                    drop_first=True, dtype=int)

# Target encoding / mean encoding (high cardinality or strong target relation)
# Example: mean churn per tenure_group
tenure_churn = df.groupby('tenure_group', observed=False)['Churn'].mean().to_dict()
df['tenure_group_target_enc'] = df['tenure_group'].map(tenure_churn)

# Frequency encoding (handy when a city/state column has many levels; low cardinality here)
# NOTE: this would have to run BEFORE the one-hot above, which drops PaymentMethod
# df['PaymentMethod_freq'] = df['PaymentMethod'].map(df['PaymentMethod'].value_counts(normalize=True))
```
3. Scaling Numerical Features (for ML later — distance-based algos care)
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# NOTE: in a real pipeline, fit the scaler on the train split only to avoid leakage
scaler = StandardScaler()
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges', 'charges_per_month_trend']
df[num_cols] = scaler.fit_transform(df[num_cols])

# Or MinMax if you want a 0-1 range
# minmax = MinMaxScaler()
# df[num_cols] = minmax.fit_transform(df[num_cols])
```
4. Handling Imbalanced Data (Churn ~27%)
Options (don't apply these yet; they belong in the modeling phase, and a short sketch follows the quick check below):
- Undersample majority (random)
- Oversample minority: SMOTE (from imblearn)
- Class weight in models (easiest: LogisticRegression(class_weight='balanced'))
- Generate synthetic: ADASYN, etc.
Quick check imbalance:
```python
sns.countplot(x='Churn', data=df)
plt.title('Class Imbalance - Churn is minority')
plt.show()
```
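To make the two most common options concrete, here's a minimal sketch. It assumes imbalanced-learn is installed and that df is fully numeric (see the Final Cleaned Dataset Prep below); the split parameters are just illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn

# Assumes df is fully numeric (run the Final Cleaned Dataset Prep first)
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# Option A: class weights, no data changes (easiest starting point)
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)

# Option B: SMOTE, oversample the minority class on the TRAIN split only
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```

Whichever option you pick, resample only the training split; applying SMOTE before the split leaks synthetic near-copies of test rows into training.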
5. Handling Multicollinearity
- tenure & TotalCharges: 0.83 → drop one or use PCA later.
- Use VIF (Variance Inflation Factor) to detect:
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_num = df[num_cols].dropna()
vif_data = pd.DataFrame()
vif_data["feature"] = X_num.columns
vif_data["VIF"] = [variance_inflation_factor(X_num.values, i)
                   for i in range(X_num.shape[1])]
print(vif_data.sort_values('VIF', ascending=False))
# If VIF > 5–10 → problem → drop or combine
```
Common action: drop TotalCharges if you keep tenure + MonthlyCharges (or compress the pair with PCA, as sketched below).
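If you'd rather keep the information from both columns instead of dropping one, here's a minimal PCA sketch (the new column name tenure_charges_pc1 is just illustrative; run it after the StandardScaler step above, since PCA is scale-sensitive):

```python
from sklearn.decomposition import PCA

# Compress the correlated pair into a single component
pca = PCA(n_components=1)
df['tenure_charges_pc1'] = pca.fit_transform(df[['tenure', 'TotalCharges']])
print(pca.explained_variance_ratio_)   # share of the pair's variance retained
```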
Final Cleaned Dataset Prep
```python
# Drop useless / highly correlated columns
df.drop(['customerID', 'TotalCharges'], axis=1, inplace=True)  # example

# Handle remaining categoricals (e.g., Yes/No to 1/0)
yes_no_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling']  # ...add the remaining Yes/No columns
df[yes_no_cols] = df[yes_no_cols].replace({'Yes': 1, 'No': 0})

# One last check
df.info()
```
Wrap-up Project Tips for Your Portfolio
- Save this notebook as “Telco_Churn_EDA_Feature_Engineering.ipynb”
- Add markdown sections: Insights, Why this feature?, Business implication.
- Push to GitHub (remember Chapter 2!).
- Next: Try modeling (Logistic → RandomForest → XGBoost) and compare performance with/without your engineered features (a starter sketch follows this list).
- Bonus India twist: If you find a local dataset (UPI transactions, recharge churn), adapt — same principles.
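To seed that with/without comparison from the modeling bullet, here's a minimal baseline sketch. It assumes df is fully numeric after the final prep (any leftover categoricals like tenure_group dropped or encoded); the engineered-feature list and the metric choice are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumes df is fully numeric (run the Final Cleaned Dataset Prep first)
engineered = ['charges_per_month_trend', 'num_services', 'has_family']
X, y = df.drop('Churn', axis=1), df['Churn']

for label, cols in [('without engineered', X.columns.difference(engineered)),
                    ('with engineered   ', X.columns)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[cols], y, stratify=y, test_size=0.2, random_state=42)
    model = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)
    print(label, roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```

If the engineered features are doing their job, the second ROC-AUC should edge out the first; either way, that comparison is exactly the story to tell in your notebook's markdown.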
You now have a solid end-to-end EDA + Feature Engineering project — the exact thing recruiters love in 2026 resumes.
