Chapter 10: Deep Learning & Neural Networks
Deep Learning & Neural Networks, explained like we’re huddled over your laptop in Airoli at 5:30 PM on a January evening in 2026 — the sky outside is already dark, chai is steaming, and we’re about to go from “what is a neuron?” to building something that actually works on images or sequences. This chapter is the big leap: from scikit-learn style ML to true deep learning, where models learn hierarchical features automatically.
In 2026, deep learning is everywhere — from GenAI apps in startups to production CV in Jio/Amazon warehouses. For beginners in India, PyTorch has become the go-to for learning (dynamic, Pythonic, research-friendly, Hugging Face ecosystem). TensorFlow/Keras is still strong for quick prototypes or Google Cloud deployment, but most fresh learners start with PyTorch now. We’ll use PyTorch here for examples — it’s more intuitive for debugging and custom stuff.
1. Neural Network Fundamentals
A neural network is inspired by the brain but much simpler: layers of neurons (nodes) connected by weights.
- Input layer → raw data (e.g., flattened image pixels, customer features).
- Hidden layers → learn patterns (edges → shapes → objects in CV).
- Output layer → prediction (class probabilities, regression value).
Each neuron computes:
z = (input · weights) + bias
output = activation(z)
- Forward pass — data flows forward through the layers to compute a prediction.
- Loss — measures how wrong we are (e.g., CrossEntropyLoss for classification).
- Goal: minimize the loss by adjusting the weights.
Example intuition: For churn prediction (from our Telco data), input could be [tenure, MonthlyCharges, …] → hidden layers learn “high charges + short tenure = high churn risk” → output probability of churn.
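To make this concrete, here's a minimal sketch of such a churn network in PyTorch — the layer sizes and feature values below are made up purely for illustration:

import torch
import torch.nn as nn

# Toy forward pass: 3 input features -> 8 hidden neurons -> churn probability
# (feature values are invented just to show the shapes)
x = torch.tensor([[24.0, 70.5, 1.0]])   # [tenure, MonthlyCharges, Contract_encoded]

model = nn.Sequential(
    nn.Linear(3, 8),    # each hidden neuron computes z = x @ w + b
    nn.ReLU(),          # output = activation(z)
    nn.Linear(8, 1),
    nn.Sigmoid(),       # squash to a 0–1 churn probability
)

prob = model(x)         # forward pass
print(prob)             # an untrained guess, e.g. tensor([[0.48]])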
2. Activation Functions, Backpropagation, Optimizers
Activation functions — introduce non-linearity (without them, whole network = linear regression).
Common ones in 2026:
- ReLU (Rectified Linear Unit): f(x) = max(0, x). Fast and mostly avoids vanishing gradients. The default for hidden layers.
- Leaky ReLU / PReLU: f(x) = max(αx, x) (α small like 0.01) — fixes “dying ReLU” (neurons stuck at 0).
- Sigmoid: squashes values to 0–1 — used for binary outputs, but suffers from vanishing gradients.
- Tanh: squashes values to -1 to 1 — zero-centered, but still prone to vanishing gradients.
- Softmax (multi-class output): turns logits into probabilities.
- GELU / SwiGLU (2026 favorites in transformers): smoother, better performance in large models.
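A quick way to build intuition is to push the same handful of values through each activation — a small sketch (the input values are arbitrary):

import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])   # example pre-activations

print(F.relu(z))              # negatives clipped to 0
print(F.leaky_relu(z, 0.01))  # small slope for negatives instead of 0
print(torch.sigmoid(z))       # squashed into (0, 1)
print(torch.tanh(z))          # squashed into (-1, 1)
print(F.gelu(z))              # smooth, ReLU-like curve
print(F.softmax(z, dim=0))    # probabilities that sum to 1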
Backpropagation — the learning algorithm (chain rule magic).
- Forward pass → compute loss.
- Backward pass → compute gradients (∂loss/∂weight) for every weight.
- Update weights: weight -= learning_rate * gradient.
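Here's a minimal sketch of those three steps with PyTorch autograd, on a single toy weight and a squared-error loss:

import torch

# One weight, one data point: we want w*x to match y (so the ideal w is 4)
w = torch.tensor(2.0, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(12.0)
lr = 0.1

for step in range(3):
    y_hat = w * x                 # forward pass
    loss = (y_hat - y) ** 2       # compute loss
    loss.backward()               # backward pass: fills w.grad with d(loss)/d(w)
    with torch.no_grad():
        w -= lr * w.grad          # weight -= learning_rate * gradient
        w.grad.zero_()            # clear the gradient for the next step
    print(f"step {step}: w={w.item():.3f}, loss={loss.item():.3f}")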
Optimizers — smarter ways to update weights (beyond basic gradient descent).
- SGD (Stochastic GD) — simple, but slow/noisy.
- Momentum — adds velocity, smooths.
- Adam (Adaptive Moment Estimation) — most popular 2026 default: adaptive LR per parameter, momentum + RMSprop.
- AdamW — Adam with decoupled weight decay (the standard choice for transformers).
- Lion / Sophia — newer optimizers that converge faster in some cases; worth knowing, but not the default.
In practice: Start with Adam or AdamW (lr=1e-3 or 3e-4).
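In code, swapping optimizers is a one-line change; here's a sketch of a single training step with a tiny dummy model (the shapes and batch are invented for illustration):

import torch
import torch.nn as nn
import torch.optim as optim

# Dummy model and batch so the step below actually runs
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
x_batch = torch.randn(16, 10)
y_batch = torch.randint(0, 2, (16,))

optimizer = optim.Adam(model.parameters(), lr=1e-3)   # common default
# optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)  # transformer-style alternative

# One training step looks the same whichever optimizer you pick:
optimizer.zero_grad()                      # clear old gradients
loss = criterion(model(x_batch), y_batch)  # forward pass + loss
loss.backward()                            # backprop
optimizer.step()                           # update weights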
3. Frameworks: TensorFlow / Keras or PyTorch
PyTorch (2026 recommendation for you):
- Dynamic computation graph → define-by-run (easy debugging — you can print tensors mid-model; see the sketch after these lists).
- Feels like NumPy + autograd.
- Huge community (Hugging Face, fastai).
Keras (on TensorFlow):
- High-level, beginner-friendly (Sequential API).
- Compiles to graphs under the hood (tf.function) — can be faster on some hardware and simpler to deploy.
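To see what define-by-run feels like, here's a tiny sketch — an ordinary print() inside forward() just runs, which is exactly how you debug shapes in practice (the model itself is a throwaway example):

import torch
import torch.nn as nn

class DebugNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        print("input shape:", x.shape)   # inspect tensors mid-model while debugging
        return self.fc(x)

out = DebugNet()(torch.randn(3, 4))      # prints: input shape: torch.Size([3, 4])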
We’ll code in PyTorch — install:
pip install torch torchvision torchaudio
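A quick sanity check after installing (optional, but worth 10 seconds):

import torch

print(torch.__version__)              # confirms the install
print(torch.cuda.is_available())      # True if a CUDA GPU is visible
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)                         # CPU is fine for MNIST-scale examples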
4. CNNs for Computer Vision (Basics)
Convolutional Neural Networks excel at images (local patterns via filters/kernels).
Key layers:
- Conv2D — slides filters, detects edges/textures.
- MaxPool2D — downsamples feature maps, cuts computation, adds some translation invariance.
- BatchNorm — stabilizes training.
- Dropout — prevents overfitting.
- GlobalAvgPool / Flatten + Dense — final classification.
Simple CNN example on MNIST (handwritten digits — classic starter, 28×28 grayscale):
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST('data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('data', train=False, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

# Model
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)   # out: 32 channels
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))   # 28→14
        x = self.pool(self.relu(self.conv2(x)))   # 14→7
        x = x.view(-1, 64 * 7 * 7)                # flatten
        x = self.dropout(self.relu(self.fc1(x)))
        x = self.fc2(x)
        return x

model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train loop (simplified)
for epoch in range(5):  # usually 10–15 for >99%
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

# Test
model.eval()
correct = 0
with torch.no_grad():
    for data, target in test_loader:
        output = model(data)
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
print(f"Test Accuracy: {correct / len(test_dataset):.4f}")  # Expect ~98–99% after tuning
This CNN learns edges → curves → digit shapes automatically.
5. RNNs / LSTMs / Transformers Intro
RNN — processes sequences (time series, text) step by step, but suffers from vanishing gradients on long sequences.
LSTM — adds gates (forget, input, output) → can hold on to long-term information.
Example: Time-series churn prediction (monthly usage sequence per customer).
class ChurnLSTM(nn.Module):
    def __init__(self, input_size=5, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (batch, seq_len, features) e.g., 12 months of [tenure_delta, charges, ...]
        _, (hn, _) = self.lstm(x)
        out = self.fc(hn[-1])  # last hidden state
        return self.sigmoid(out)
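A possible way to exercise this model on dummy data — the shapes, features, and labels below are invented purely for illustration:

import torch
import torch.nn as nn
import torch.optim as optim

# Dummy batch: 32 customers, 12 months each, 5 features per month
model = ChurnLSTM(input_size=5)           # the class defined above
x = torch.randn(32, 12, 5)                # (batch, seq_len, features)
y = torch.randint(0, 2, (32, 1)).float()  # binary churn labels

criterion = nn.BCELoss()                  # pairs with the sigmoid output
optimizer = optim.Adam(model.parameters(), lr=1e-3)

probs = model(x)                          # (32, 1) churn probabilities
loss = criterion(probs, y)
loss.backward()
optimizer.step()
print(f"Dummy loss: {loss.item():.4f}")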
Transformers (2026 dominant for sequences):
- Self-attention → parallel, captures long dependencies.
- No recurrence → faster training.
- Encoder-decoder or just encoder (BERT-style).
In practice: use the Hugging Face transformers library for real work.
from transformers import BertTokenizer, BertForSequenceClassification
# Fine-tune BERT for text-based churn (e.g., customer complaints)
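A minimal sketch of what loading the tokenizer and model could look like, assuming binary churn labels (the logits are meaningless until you actually fine-tune):

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

inputs = tokenizer("My bill doubled and support never called back.",
                   return_tensors='pt', truncation=True, padding=True)
outputs = model(**inputs)
print(outputs.logits)   # raw scores for the 2 classes — fine-tune before trusting these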
6. Transfer Learning
Use a pre-trained model (trained on millions of images) → fine-tune it on your small dataset.
Huge time-saver in 2026 (small datasets common in startups).
PyTorch example — Image classification (e.g., classify shop products or defect detection):
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
from torchvision import transforms

# Load pre-trained ResNet18
model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze base layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer (assume 5 classes: shirt, pants, shoes, bag, other)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 5)

# Unfreeze last few layers for fine-tuning
for param in model.layer4.parameters():  # last residual block
    param.requires_grad = True

optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.001)

# Train on your small dataset (e.g., 500–2000 images per class)
Workflow:
- Freeze base → train classifier head (fast).
- Unfreeze top layers → lower LR (1e-4/1e-5) → fine-tune.
- Use data aug (RandomCrop, Flip, ColorJitter).
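One possible augmentation pipeline for that last step — the exact transforms and values are illustrative, not a prescription:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                 # ResNet-style models expect 224x224
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])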
Common backbones: ResNet50, EfficientNet, ConvNeXt (modern efficient CNNs), Vision Transformers (ViT).
That’s Chapter 10 — the gateway to modern AI!
Practice:
- Run the MNIST CNN (the data downloads automatically).
- Try transfer learning on a small Kaggle dataset (e.g., Cats vs Dogs subset).
- For sequences: Adapt LSTM to monthly Telco aggregates.
