Chapter 14: Capstone Projects & Portfolio Building
Capstone Projects & Portfolio Building, explained like we’re sitting in your favorite Airoli spot one last time. It’s January 29, 2026, around 6:15 PM IST, the sky is dark, and we’ve been at this for months. Now it’s time to turn everything we’ve learned into tangible proof: the kind that gets you interviews and offers in Hyderabad, Mumbai, Bangalore, or remote roles in 2026.
This chapter isn’t about “do one project and done.” In the 2026 India DS/ML job market (especially on the mid-level track), recruiters and hiring managers look for:
- 3–5 high-quality, end-to-end projects on GitHub (clean code, READMEs, live demos if possible)
- Business impact framing (not just accuracy — “reduced simulated churn by 18%”)
- Production thinking (deployment, Docker, API, monitoring basics)
- Diversity — one tabular predictive, one NLP, one CV/time-series, one deployed app
- Storytelling — clear problem → data → EDA → features → model → evaluation → deployment → learnings
Let’s build the 4–5 strong capstone projects I recommend. Each one is realistic, uses skills from Chapters 1–13, and is portfolio gold in 2026.
Project 1: Predictive Modeling – Customer Churn Prediction (Tabular Classic)
Why this one? Churn is everywhere in India (telecom, fintech, SaaS, e-commerce). Shows full supervised ML pipeline.
Dataset (use the one we worked on): Telco Customer Churn (Kaggle) or synthetic Indian telecom version (add Hindi/Marathi columns if you want flair).
End-to-end structure (in one clean GitHub repo):
- Problem — Predict which customers will churn next month (business: retention offers save ₹ crores)
- Data — 7k rows, 20+ features
- EDA — imbalance, contract type strongest signal (Month-to-month churn 42% vs 2-year 3%)
- Feature Engineering — tenure bins, service bundle count, charges trend, family flag
- Modeling — Logistic → RF → XGBoost/LightGBM/CatBoost (ensemble or stack for +1–2% AUC)
- Evaluation — e.g., Recall 0.82, Precision 0.75, AUC 0.875 (prioritize recall: a missed churner costs more than a wasted retention offer)
- Deployment — FastAPI endpoint + Streamlit dashboard (input customer details → churn risk + retention suggestion)
- Bonus — MLflow tracking, Docker container, drift simulation (add noise to test data → show alert)
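The modeling and evaluation steps above can be sketched end to end. This is a minimal sketch on synthetic data (the dataset shape, class balance, and threshold are assumptions, not the Telco results): a baseline logistic regression with simple imbalance handling, evaluated on recall and AUC as the list suggests, before stepping up to XGBoost/LightGBM.

```python
# Baseline churn pipeline sketch on an imbalanced synthetic stand-in
# for the churn table (~26% positives) — illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, precision_score, roc_auc_score

X, y = make_classification(n_samples=7000, n_features=20, weights=[0.74],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" is one simple imbalance handler; SMOTE or
# scale_pos_weight (for XGBoost) are the usual next steps.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(class_weight="balanced", max_iter=1000))
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]
pred = (proba >= 0.5).astype(int)        # threshold is tunable for recall
recall = recall_score(y_te, pred)
auc = roc_auc_score(y_te, proba)
print(f"recall={recall:.3f} precision={precision_score(y_te, pred):.3f} auc={auc:.3f}")
```

Swap the logistic regression for a gradient-boosted model later; the evaluation code stays identical, which makes the model-comparison table in your README easy to generate.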
GitHub README sections:
- Business Problem
- Tech Stack (Python, Pandas, Scikit-learn, XGBoost, FastAPI, Streamlit, Docker)
- Live Demo link (Render/Fly.io/Hugging Face Spaces)
- Results table (model comparison)
- Learnings (imbalance handling, feature importance, production readiness)
Expected impact: This project alone gets callbacks from fintech/telecom companies.
Project 2: NLP – Multilingual Sentiment Analysis + Complaint Categorization (India-relevant)
Why? Customer reviews, social media, support tickets — huge in e-commerce, food delivery, banking.
Datasets:
- Flipkart Product Reviews (multilingual) or Amazon India reviews (Kaggle)
- Twitter/X Hindi-English complaints (scrape via tools or use existing dataset)
- Or combine: multilingual customer feedback
Pipeline:
- Preprocess: minimal for transformers (raw text best)
- Model: Fine-tune ai4bharat/indic-bert or bert-base-multilingual-uncased (handles Hinglish)
- Tasks:
- Sentiment (positive/neutral/negative)
- Category (Billing, Network, App, Delivery, Fraud, Other)
- Evaluation: F1-macro (imbalanced categories), confusion matrix
- Deployment: Streamlit/Gradio app — paste review → get sentiment + category + confidence
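Before fine-tuning a transformer, it helps to have a classical baseline as a sanity floor. This sketch uses TF-IDF character n-grams plus logistic regression on a handful of invented Hinglish complaints (all texts and labels here are made up for illustration); it is not the IndicBERT pipeline itself, just the cheap baseline you compare it against.

```python
# TF-IDF + linear baseline sketch for complaint categorization.
# Char n-grams cope with Hinglish spelling variation better than word tokens.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "recharge hua par network nahi aa raha",    # Network
    "bill galat hai, extra charges laga diye",  # Billing
    "app crash ho raha hai login ke time",      # App
    "delivery 3 din late aayi",                 # Delivery
    "network very slow in my area",             # Network
    "billing amount wrong this month",          # Billing
]
labels = ["Network", "Billing", "App", "Delivery", "Network", "Billing"]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

pred = clf.predict(["network bahut slow hai"])[0]
print(pred)
```

If the fine-tuned IndicBERT model doesn’t clearly beat this baseline on F1-macro, something in the fine-tuning setup is wrong.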
Advanced touch:
- Few-shot/zero-shot variants (e.g., SetFit for few-shot, or a prompt-based LLM for zero-shot)
- Attention visualization (highlight keywords driving decision)
- Handle code-mixed text (common in India)
Portfolio wow factor: “Analyzed 10k+ Hindi/English reviews → 88% F1 on categorization → deployed interactive demo”
Project 3: Computer Vision Mini-Project – Product Defect Detection or Waste Classification
Why? CV is exploding in manufacturing, retail, agriculture (India: quality control in textiles/food, smart farming).
Dataset options:
- Kaggle: Industrial Product Defect Detection
- TACO dataset (trash classification — environmental angle)
- Or collect small dataset (phone camera photos of fruits/vegetables defects)
Pipeline:
- Data: 1k–5k images, 3–5 classes (e.g., good/defective, or plastic/organic/metal)
- Augmentation: Albumentations (rotate, flip, brightness, cutout)
- Model: Transfer learning — EfficientNet-B0/B3 or ConvNeXt-Tiny (efficient, solid choices in 2026)
- Framework: PyTorch + timm library
- Train: freeze base → train head → fine-tune last layers
- Evaluation: Accuracy, F1, confusion matrix, Grad-CAM visualization (show what model “sees”)
- Deployment: Gradio/Streamlit — upload photo → “Defective – crack detected (92%)”
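To build intuition for the augmentation step, here is what two of the listed transforms (flip, brightness) do in plain NumPy, on an assumed tiny 8×8 RGB array; in the real project Albumentations handles this for you, along with rotation and cutout.

```python
# Plain-NumPy sketch of horizontal flip and brightness shift —
# the same operations Albumentations applies under the hood.
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8)  # toy "image"

def hflip(x):
    return x[:, ::-1, :]  # mirror along the width axis

def brightness(x, delta=30):
    # shift pixel values, then clip back into the valid 0–255 range
    return np.clip(x.astype(np.int16) + delta, 0, 255).astype(np.uint8)

aug = brightness(hflip(img))
assert np.array_equal(hflip(hflip(img)), img)  # flipping twice is identity
```

Augmentations like these are applied on the fly during training only, never at evaluation time, so the model sees a fresh variation each epoch.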
Portfolio highlight: “Built defect detection model → 94% F1 → Grad-CAM explains decisions → Dockerized API for factory integration”
Project 4: Time-Series Forecasting – Sales / Demand / UPI Transaction Volume Prediction
Why? Time-series is everywhere: retail sales (Diwali spike), UPI volume, stock/recharge prediction.
Datasets:
- Kaggle: Store Item Demand Forecasting (Walmart-style)
- India-specific: daily UPI transaction volume (NPCI/RBI public statistics, or synthetic)
- Or Flipkart/Amazon sales time-series
Pipeline:
- EDA: seasonality, trend, stationarity (ADF test)
- Classical: Prophet (easy seasonality/holidays) + ARIMA/SARIMA
- ML: XGBoost with lag features, rolling stats, date features (day of week, month, festive flag)
- DL: LSTM/Transformer (Temporal Fusion Transformer if ambitious)
- Evaluation: MAPE, RMSE, MASE
- Deployment: Streamlit dashboard — forecast next 30 days, show confidence intervals
Bonus: Add external regressors (e.g., holiday calendar, fuel price for demand)
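The ML branch of the pipeline (XGBoost with lag features, rolling stats, date features) can be sketched as a feature-building step on a tiny synthetic daily series; the series, the festive date, and the window sizes here are all assumptions for illustration.

```python
# Lag / rolling / calendar features for a tree-based forecaster.
import numpy as np
import pandas as pd

idx = pd.date_range("2025-01-01", periods=60, freq="D")
sales = 100 + 10 * np.sin(np.arange(60) / 7) \
        + np.random.default_rng(1).normal(0, 2, 60)
df = pd.DataFrame({"sales": sales}, index=idx)

df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
# shift(1) BEFORE rolling so the window never sees the target day (no leakage)
df["roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()
df["dow"] = df.index.dayofweek
df["month"] = df.index.month
# assumed festive flag — in the real project this comes from a holiday calendar
df["festive"] = df.index.isin(pd.to_datetime(["2025-02-14"])).astype(int)

features = df.dropna()  # drop warm-up rows with incomplete lags
print(features.head())
```

Any GBM (XGBoost, LightGBM) can consume this table directly with `sales` as the target; the leakage-avoiding shift-before-rolling pattern is the part interviewers probe.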
Portfolio story: “Forecasted Diwali sales spike → MAPE 8.2% → helped simulated inventory planning”
Project 5: Deployed Web App – Full End-to-End Churn + Recommendation System (Capstone Showpiece)
Combine skills:
- Use churn model + add simple recommendation (collaborative filtering or content-based on services used)
- Frontend: Streamlit or Dash
- Backend: FastAPI
- Container: Docker
- Tracking: MLflow or W&B
- Hosting: Render, Fly.io, Railway, Hugging Face Spaces (free tier)
- Monitoring stub: simple drift check (KS test on new data)
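The monitoring stub from the list above fits in a few lines. This sketch runs a two-sample KS test per feature with SciPy; the feature name, the shift, and the alert threshold are assumptions for the demo.

```python
# Minimal drift-check stub: compare a "live" feature sample against the
# training distribution with a two-sample Kolmogorov–Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
train_tenure = rng.normal(24, 10, 5000)  # reference (training) distribution
live_tenure = rng.normal(30, 10, 1000)   # deliberately shifted "new" data

stat, p_value = ks_2samp(train_tenure, live_tenure)
drift_alert = p_value < 0.01             # assumed alert threshold
print(f"KS={stat:.3f} p={p_value:.2e} alert={drift_alert}")
```

In the app, run this per numeric feature on each batch of incoming requests and surface a warning banner when any feature trips the alert.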
Live demo flow:
- User inputs customer profile
- Predict churn risk
- If high risk → suggest personalized retention (e.g., “Offer 20% off 6-month plan”)
- Show feature importance plot
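The “high risk → retention suggestion” step of the demo flow is just a small rule on top of the model’s probability. The thresholds and offers below are illustrative assumptions, not a tuned retention policy.

```python
# Retention-suggestion rule used after the churn model returns a probability.
def retention_offer(churn_risk: float, plan: str) -> str:
    if churn_risk >= 0.7:
        return "Offer 20% off 6-month plan"
    if churn_risk >= 0.4:
        return f"Offer loyalty data booster on {plan}"
    return "No action (low risk)"

print(retention_offer(0.85, "Month-to-month"))
```

Keeping this logic in a separate function means the FastAPI endpoint and the Streamlit frontend can share it, and you can unit-test it without loading the model.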
Deployment checklist:
- GitHub repo with Dockerfile, requirements.txt, .github/workflows (CI/CD if possible)
- README with architecture diagram (draw.io or excalidraw)
- Video walkthrough (Loom 3–5 min)
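For the Dockerfile item on the checklist, a minimal sketch looks like this; the base image, filename `app.py`, and port 8501 (Streamlit’s default) are assumptions matching the stack above, so adjust them to your repo.

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```

Copying `requirements.txt` before the rest of the code lets Docker cache the dependency layer, so rebuilds after code changes are fast.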
Portfolio Building Tips (2026 India Reality)
GitHub Structure (per project + main portfolio repo):
```
yourname-ds-portfolio/
├── README.md            ← overview + links to all projects
├── churn-prediction/
│   ├── notebooks/
│   ├── src/             ← preprocess.py, model.py, api.py
│   ├── models/
│   ├── app.py           ← Streamlit/FastAPI
│   ├── Dockerfile
│   └── README.md
├── nlp-sentiment/
└── ...
```
Main README:
- Photo (professional)
- 1-paragraph bio: “Data Scientist passionate about impactful ML in fintech/telecom”
- Tech stack icons
- 4–5 project cards with GIFs/screenshots + live links
- Resume PDF link
- Contact (LinkedIn, email)
Resume/LinkedIn:
- List projects under “Projects” section (not just “personal projects”)
- Quantify: “Built churn model → AUC 0.875 → deployed API serving 100+ req/min simulation”
- Add badges (Python, Docker, AWS/GCP badge if certified)
Where to host live demos (free/cheap 2026):
- Render.com / Railway.app / Fly.io — free tier for small apps
- Hugging Face Spaces — best for ML demos (Gradio/Streamlit)
- Streamlit Community Cloud — free for public apps
Final advice from me (your Airoli mentor): Pick 3 projects minimum — churn (tabular), NLP sentiment (text), and one CV or time-series. Make them end-to-end and deployed. Record 2–3 min Loom videos explaining each. Apply aggressively on Naukri, LinkedIn, and Wellfound (formerly AngelList). Tailor your resume per job (highlight telecom/fintech framing if applying there).
You’ve got the skills — now show the world.
This completes our full roadmap from Chapter 1 to 14!
Want me to help polish one specific project README, suggest exact datasets/links, review your GitHub structure, or give 2026 interview question prep for these projects? Or maybe a final “career roadmap 2026–2028” summary? Just say the word — I’m here. 🚀
