Chapter 3: Mathematics & Statistics for Data Science
Mathematics & Statistics for Data Science, explained like we’re sitting in your favorite spot in Airoli — maybe near that small park with the breeze — and I’m your teacher drawing on a notebook (or whiteboard if we had one). No rushing through formulas; we’ll build intuition first, then math, then real data science examples you can picture right away. This is the “why” chapter — understand this, and later when you see gradient descent in ML or PCA in dimensionality reduction, it won’t feel like magic.
We’ll cover exactly what you listed, with 2026 relevance (things like why gradients still matter even with fancy AutoML tools, or how probability helps debug GenAI hallucinations).
1. Linear Algebra (Vectors, Matrices, Eigenvalues)
Linear algebra is the geometry of data. In data science, almost everything is multi-dimensional: a customer’s features (age, income, purchases, location) = a point in high-dimensional space. Models move/rotate/scale these points.
Vectors — Arrows with direction & length. In DS: a single data point or feature weights.
Example: Imagine customer data as a vector customer_A = [28, 45000, 3, 1] # age, income (₹), orders last year, has_credit_card (yes=1)
- Length (norm): How “far” from origin → magnitude of features.
- Dot product: Similarity measure (used in recommendation systems, cosine similarity for Netflix-style “similar users”).
In Python (NumPy — you’ll use this daily):
```python
import numpy as np

vec1 = np.array([28, 45000, 3, 1])
vec2 = np.array([30, 48000, 4, 1])

dot_product = np.dot(vec1, vec2)  # high = similar customers
cosine_sim = dot_product / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print(f"Similarity: {cosine_sim:.3f}")
```
Matrices — 2D arrays of vectors. In DS: entire dataset (rows = samples, columns = features).
Example: Your dataset matrix X (1000 customers × 4 features). Matrix multiplication = linear transformation (e.g., projecting data, neural network layers).
Key operations (a quick NumPy sketch follows this list):
- Transpose (flip rows/columns): X.T
- Inverse (if square & invertible): used to solve linear systems, e.g., the normal equations for linear regression weights (but rarely computed explicitly due to numerical issues).
- Multiplication: weights update in neural nets.
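Here's a minimal NumPy sketch of these operations on a toy 3 × 4 customer matrix. The weight matrix W and the small system A, b are made-up numbers, purely to show the mechanics:

```python
import numpy as np

# Toy dataset: 3 customers x 4 features (age, income ₹, orders, has_credit_card)
X = np.array([[28, 45000, 3, 1],
              [30, 48000, 4, 1],
              [45, 90000, 1, 0]], dtype=float)

print(X.shape, X.T.shape)         # (3, 4) -> transpose is (4, 3)

# Matrix multiplication as a linear transformation: project 4 features down to 2
# (W is an arbitrary, hypothetical weight matrix, like one layer of a neural net)
W = np.array([[ 0.01, -0.02],
              [ 1e-5,  2e-5],
              [ 0.5,   0.1 ],
              [ 1.0,  -1.0 ]])
print(X @ W)                      # (3, 4) @ (4, 2) -> (3, 2)

# Solving a square linear system: prefer np.linalg.solve over computing the inverse
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(np.linalg.solve(A, b))      # numerically safer than np.linalg.inv(A) @ b
```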
Eigenvalues & Eigenvectors — “Special directions” that a matrix only stretches or shrinks (no rotation); the eigenvalue is the stretch factor along that direction.
Why care?
- PCA (Principal Component Analysis): Finds directions of maximum variance → dimensionality reduction. Eigenvectors = principal components; eigenvalues = how much variance each explains.
- In recommendation systems (SVD/matrix factorization): decompose user-item matrix.
- Stability in deep learning (spectral analysis of weight matrices).
Real 2026 example: In fraud detection, high-dimensional transaction data → PCA reduces 100+ features to ~10 principal components while keeping ~95% of the variance → faster models, less overfitting.
Quick intuition exercise: Think of a photo (matrix of pixels). Eigenfaces in face recognition = eigenvectors of face images → capture main variations (eyes wide/narrow, smile, etc.).
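To make the PCA connection concrete, here's a small sketch that does PCA “by hand” via an eigendecomposition of the covariance matrix; the dataset is randomly generated just for illustration:

```python
import numpy as np

# Randomly generated toy data: 200 samples, 5 correlated features (illustration only)
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 2))                        # 2 "true" underlying directions
X = np.hstack([base, base @ rng.normal(size=(2, 3))])   # 5 observed, correlated features

# Centre the data, then eigendecompose its covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)         # eigh: for symmetric matrices

# Sort descending: the largest eigenvalue marks the direction of maximum variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("Variance explained:", np.round(eigenvalues / eigenvalues.sum(), 3))

# Project onto the top 2 principal components = dimensionality reduction
X_reduced = X_centered @ eigenvectors[:, :2]
print("Reduced shape:", X_reduced.shape)                # (200, 2)
```

In practice you'd call a library PCA (e.g., scikit-learn), but this is exactly what it does under the hood.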
2. Calculus Basics (Derivatives, Gradients — Needed for ML)
Calculus = how things change. In ML, we want models to “learn” by minimizing error → find where change is zero (minimum).
Derivative — Slope of a function at a point. Tells direction & speed of change.
Example: Cost function J(θ) = error as function of parameters θ (weights). We want smallest error → derivative dJ/dθ = 0 at minimum.
Simple: f(x) = x² → derivative f'(x) = 2x. At x = 3, slope = 6 (steep positive → go left to decrease). At x = 0, slope = 0 → minimum!
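You can sanity-check that slope numerically with a central finite difference; this tiny sketch compares the numerical slope against the analytic 2x:

```python
# Numerically check f'(x) = 2x for f(x) = x**2
def f(x):
    return x ** 2

def numerical_slope(func, x, h=1e-6):
    return (func(x + h) - func(x - h)) / (2 * h)   # central difference

for x in [3.0, 0.0, -2.0]:
    print(x, numerical_slope(f, x), 2 * x)         # numeric slope vs analytic 2x
```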
Gradient — Multi-variable derivative (vector of partial derivatives). Points to steepest increase → we go opposite for descent.
Gradient Descent (heart of training almost all ML/DL models in 2026):
```python
# Pseudo-code
while not converged:
    grad = compute_gradient(cost_function, weights)   # vector of partial derivatives
    weights = weights - learning_rate * grad          # step opposite to the gradient
```
Why still crucial in 2026? Even when you're fine-tuning LLMs with AdamW or LoRA, understanding gradients tells you why the loss decreases, why vanishing/exploding gradients happen, and why the learning rate matters. In GenAI debugging, gradient norms tell you whether the model is actually learning.
Chain rule: Essential for backpropagation in neural nets. If loss = f(g(h(θ))), the derivative flows backward through each function: dloss/dθ = df/dg × dg/dh × dh/dθ (each factor is the local derivative of one step).
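Here's a tiny backward-pass sketch of that idea, using made-up toy functions h, g, f (not any real network), where the local derivatives simply multiply together:

```python
# Chain rule for loss = f(g(h(theta))) with hypothetical toy functions:
# h(t) = 2t + 1, g(u) = u**2, f(v) = 3v
theta = 1.5

# Forward pass: compute and store intermediate values
u = 2 * theta + 1          # h(theta) = 4.0
v = u ** 2                 # g(u)     = 16.0
loss = 3 * v               # f(v)     = 48.0

# Backward pass: multiply local derivatives, right to left
dloss_dv = 3               # df/dv
dv_du = 2 * u              # dg/du = 2u
du_dtheta = 2              # dh/dtheta
dloss_dtheta = dloss_dv * dv_du * du_dtheta
print(dloss_dtheta)        # 3 * 8 * 2 = 48.0
```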
Intuition example: Imagine training a model to predict house prices in Navi Mumbai. Features: size, location score, age. Gradient tells: “Increase weight for size by 0.02, decrease for age by 0.01” to reduce error fastest.
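A minimal sketch of that loop, assuming a made-up toy dataset of four flats and plain batch gradient descent on mean squared error (the sizes, scores, and prices are invented for illustration, not real market data):

```python
import numpy as np

# Hypothetical toy data: [size in '000 sqft, location score /10, age in decades]
# target = price in lakh ₹
X = np.array([[0.6, 0.8, 1.2],
              [0.9, 0.7, 0.3],
              [1.2, 0.9, 2.0],
              [0.5, 0.6, 1.5]])
y = np.array([80.0, 110.0, 150.0, 60.0])

weights = np.zeros(3)
bias = 0.0
learning_rate = 0.1

for step in range(5001):
    pred = X @ weights + bias           # current predictions
    error = pred - y
    grad_w = 2 * X.T @ error / len(y)   # dMSE/dweights
    grad_b = 2 * error.mean()           # dMSE/dbias
    weights -= learning_rate * grad_w   # step opposite to the gradient
    bias -= learning_rate * grad_b
    if step % 1000 == 0:
        print(step, "MSE:", round(np.mean(error ** 2), 2))   # should drop steadily

print("weights:", np.round(weights, 2), "bias:", round(bias, 2))
```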
3. Probability (Distributions, Bayes’ Theorem, Conditional Probability)
Probability = quantifying uncertainty. Data science = dealing with noisy, incomplete world.
Basic rules
- P(A or B) = P(A) + P(B) - P(A and B)
- P(A and B) = P(A) × P(B|A) (multiplication rule)
Conditional Probability P(A|B) = “probability of A given B already happened”
Example: P(rain | clouds) = high, even if P(rain) alone = low in dry season.
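A quick way to see this with numbers; the day counts below are invented purely for illustration:

```python
# Hypothetical counts over 100 observed days
days = 100
cloudy_days = 40
rainy_days = 12
rainy_and_cloudy = 10

p_rain = rainy_days / days                            # P(rain) = 0.12
p_rain_given_clouds = rainy_and_cloudy / cloudy_days  # P(rain | clouds) = 0.25
print(p_rain, p_rain_given_clouds)                    # conditioning on clouds raises the probability
```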
Bayes’ Theorem — Flip conditional probability. Gold for updating beliefs with new evidence.
P(A|B) = [P(B|A) × P(A)] / P(B)
Real DS example — Spam filter (Naive Bayes classifiers are still used in 2026 as baselines): P(Spam | “win lottery”) = [P(“win lottery” | Spam) × P(Spam)] / P(“win lottery”). If “win lottery” shows up in 80% of spam but almost never in legitimate mail, P(Spam | “win lottery”) comes out high.
Another: Medical test. Test 99% accurate, disease rare (0.1%). Positive test → what’s real probability of disease? Bayes says often low (false positives dominate) — classic interview question.
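Here's that calculation in a few lines, assuming “99% accurate” means both sensitivity and specificity are 99% (a common interview simplification):

```python
# Bayes' theorem for the rare-disease example: 0.1% prevalence, 99% accurate test
p_disease = 0.001
p_pos_given_disease = 0.99           # sensitivity
p_pos_given_healthy = 0.01           # 1 - specificity (assumed 99% specificity)

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")   # ~0.09, i.e. about 9%
```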
Probability Distributions — Shapes that describe how likely values are.
Common ones in DS (a quick scipy check follows this list):
- Normal (Gaussian) — Bell curve. Mean μ, std dev σ. Why? Central Limit Theorem: averages/sums of many independent things → normal (heights, errors, many sensor readings). Example: IQ scores ~ Normal(100, 15). In ML: Assume errors normal in linear regression.
- Binomial — Number of successes in n fixed trials, each p success prob. Example: 100 coin flips → heads count ~ Binomial(100, 0.5). In DS: A/B test conversions (click yes/no), churn (customer leaves or not).
- Poisson — Number of events in fixed interval when events rare/independent. Example: Number of UPI fraud alerts per hour in your area. If average 2/hour → Poisson(λ=2). P(exactly 5 in hour) = e^{-2} × 2^5 / 5! In DS: Customer arrivals, defects per batch, website hits per minute.
Others you’ll meet: Exponential (time between events), Uniform (equal probability), Bernoulli (single yes/no).
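Here's a quick scipy.stats check of the examples above, plus one extra query (P(IQ > 130)) to show how you ask a distribution questions:

```python
from scipy import stats

# Poisson: average 2 fraud alerts/hour -> P(exactly 5 in an hour) = e^-2 * 2^5 / 5!
print(stats.poisson.pmf(5, mu=2))             # ~0.036

# Binomial: 100 fair coin flips -> P(exactly 50 heads)
print(stats.binom.pmf(50, n=100, p=0.5))      # ~0.080

# Normal: P(IQ > 130) for IQ ~ Normal(100, 15)
print(stats.norm.sf(130, loc=100, scale=15))  # ~0.023
```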
4. Descriptive & Inferential Statistics
Descriptive — Summarize data you have.
- Mean (average): Sensitive to outliers. Salary mean skewed by CEO pay.
- Median (middle value): Robust. Better for income data in India.
- Variance = average squared distance from mean → spread.
- Standard Deviation = sqrt(variance) → same units as data (₹).
Example: Customer spends in your Airoli shop last month: [120, 150, 8000, 200, 180]. Mean = 1730 (pulled up by the outlier), median = 180 → the real typical spend.
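In NumPy, the same summary looks like this:

```python
import numpy as np

spends = np.array([120, 150, 8000, 200, 180])

print("Mean:", spends.mean())                     # 1730.0 -- dragged up by the ₹8000 outlier
print("Median:", np.median(spends))               # 180.0  -- the typical spend
print("Std dev:", round(spends.std(ddof=1), 1))   # ~3505, same units as the data (₹)
```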
Inferential — Draw conclusions about population from sample.
- Hypothesis Testing — “Is this effect real or noise?” Null H₀: no difference (e.g., new ad = old ad conversion). Alternative H₁: there is a difference. Compute a test statistic → p-value.
- p-value — Probability of seeing data this extreme (or more extreme) if H₀ is true. p < 0.05 → reject H₀ (usually). But 2026 nuance: don’t worship 0.05 blindly; report effect size too.
- Confidence Intervals — Range where true value likely lies (e.g., “Mean conversion 4.2% ± 0.8%, 95% CI”). Better than p-value alone — shows precision.
Example: A/B test on a UPI offer. Sample A: 5.1% conversion (n=5000), Sample B: 6.3% (n=5000). Two-proportion z-test → p ≈ 0.010 → statistically significant. But the 95% CI for the lift is roughly 0.3% to 2.1% → practically small? Depends on cost.
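Here's a sketch of that test done by hand (a two-proportion z-test with a pooled standard error, plus a 95% CI for the lift); the counts are the ones implied by the rates above:

```python
import numpy as np
from scipy import stats

# A/B test: 5.1% vs 6.3% conversion with n = 5000 per group
n_a, n_b = 5000, 5000
conv_a, conv_b = 255, 315                   # 5.1% and 6.3% of 5000
p_a, p_b = conv_a / n_a, conv_b / n_b

# Two-proportion z-test with pooled standard error
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se_pool
p_value = 2 * stats.norm.sf(abs(z))         # two-sided
print(f"z = {z:.2f}, p-value = {p_value:.3f}")          # ~2.59, ~0.010

# 95% confidence interval for the lift (difference in conversion rates)
se_diff = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
lift = p_b - p_a
print(f"Lift: {lift:.1%}, 95% CI: [{lift - 1.96 * se_diff:.1%}, {lift + 1.96 * se_diff:.1%}]")
```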
That’s Chapter 3 — the math engine room! Master this → you’ll debug models better, interpret results confidently, and shine in interviews (Bayes, gradients, PCA questions are everywhere).
