Chapter 11: SciPy Statistical Significance Tests
Statistical significance tests in SciPy, the way I would explain it if we were sitting together with a Jupyter notebook open and some real data on the screen.
What people mean by “SciPy Statistical Significance Tests” = the hypothesis testing functions inside scipy.stats
These are tools that help you answer questions like:
- “Is the average height in this group really different from 170 cm?”
- “Do drug A and drug B give different recovery times?”
- “Is this coin fair, or is it biased?”
- “Are these two samples drawn from the same distribution?”
- “Do the variances differ between groups?”
Almost all of them follow this pattern:
- You give data (one or more samples)
- You get a test statistic (a number that measures how extreme the data looks under the null hypothesis)
- You get a p-value (probability of seeing data this extreme — or more extreme — if the null hypothesis is true)
- Often also: confidence intervals, effect sizes, etc.
Rule of thumb in 2026: If p < 0.05 (or your chosen α), we say the result is statistically significant (reject null hypothesis). But always report the actual p-value, effect size, and think about practical importance — never just “significant / not significant”.
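Almost every function in the table below follows that same calling pattern, so here is a minimal, made-up sketch of the workflow (the sample values are invented; in recent SciPy versions the return value is a result object with `.statistic` and `.pvalue` attributes):

```python
import numpy as np
from scipy import stats

# Invented sample: does its mean differ from 5.0?
sample = np.array([4.8, 5.3, 5.1, 4.6, 5.4, 5.0, 4.9, 5.2])

result = stats.ttest_1samp(sample, popmean=5.0)   # every test returns a statistic + p-value
print(f"statistic = {result.statistic:.3f}, p = {result.pvalue:.4f}")

alpha = 0.05
print("significant" if result.pvalue < alpha else "not significant")
```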
Most Popular Hypothesis Tests in scipy.stats (SciPy 1.17.0 — early 2026)
| Test name / purpose | Function | Null hypothesis (H₀) | When to use it (real-world example) | Parametric? | Paired? |
|---|---|---|---|---|---|
| One-sample t-test | ttest_1samp | Mean = given value | Is average IQ in class = 100? | Yes | — |
| Independent two-sample t-test | ttest_ind | Means of two groups are equal | Do men and women differ in average salary? | Yes | No |
| Paired t-test | ttest_rel | Mean difference = 0 (paired/related samples) | Before vs after treatment on same patients | Yes | Yes |
| Mann-Whitney U (rank-sum) test | mannwhitneyu | Distributions are the same (stochastically equal) | Non-normal data, compare two independent groups | No | No |
| Wilcoxon signed-rank test | wilcoxon | Median difference = 0 (paired) | Non-normal paired data | No | Yes |
| Kolmogorov-Smirnov test (1-sample or 2-sample) | kstest (or ks_2samp) | Sample follows given dist / two samples same dist | Goodness-of-fit or compare distributions | No | — |
| Normality test (D'Agostino-Pearson) | normaltest | Sample comes from normal distribution | Check normality assumption before t-test | — | — |
| Shapiro-Wilk test (normality) | shapiro | Sample comes from normal distribution | Alternative normality test (good for n < 5000) | — | — |
| Chi-square test of independence | chi2_contingency | Variables in contingency table are independent | Is gender independent of voting preference? | — | — |
| Fisher’s exact test | fisher_exact | No association in 2×2 table | Small counts in contingency table | — | — |
| Permutation test (very flexible) | permutation_test | Statistic same under random permutations | Custom statistic, no parametric assumption | No | Varies |
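Two rows from the table do not get a worked example later (chi-square independence and Fisher's exact), so here is a minimal sketch with an invented 2×2 contingency table of counts:

```python
import numpy as np
from scipy import stats

# Invented 2x2 table: rows = gender, columns = preference (yes / no)
table = np.array([[30, 20],
                  [15, 35]])

# Chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"Chi-square: chi2={chi2:.3f}, dof={dof}, p={p:.5f}")

# Fisher's exact test: preferred for small counts (2x2 tables)
odds_ratio, p_exact = stats.fisher_exact(table)
print(f"Fisher exact: OR={odds_ratio:.3f}, p={p_exact:.5f}")
```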
Let’s do real, copy-paste examples (Jupyter style)
Always start your notebook like this:
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
```
Example 1 — One-sample t-test (classic beginner test)
Question: Is the mean reaction time in this experiment significantly different from 250 ms (industry standard)?
```python
# Fake data: reaction times in ms (n=25 subjects)
np.random.seed(42)
reaction_times = np.random.normal(loc=265, scale=18, size=25)  # slightly slower

t_stat, p_value = stats.ttest_1samp(reaction_times, popmean=250)

print(f"t-statistic = {t_stat:.3f}")
print(f"p-value = {p_value:.5f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject H₀ → mean significantly differs from 250 ms")
else:
    print("Fail to reject H₀ → no significant difference")
```
→ Typical output: a p-value in the 0.000-something range (p < 0.001) → significant slowing.
You can also get a confidence interval:
```python
ci = stats.ttest_1samp(reaction_times, 250).confidence_interval(confidence_level=0.95)
print(f"95% CI for mean: {ci}")
```
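If you only care about one direction (say, "are reactions slower than 250 ms?"), the alternative parameter gives a one-sided test; a short sketch reusing the reaction_times array from above:

```python
# One-sided test: H1 = "mean reaction time is greater than 250 ms"
res = stats.ttest_1samp(reaction_times, popmean=250, alternative='greater')
print(f"one-sided p = {res.pvalue:.5f}")
```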
Example 2 — Independent t-test (ttest_ind) — most used in papers
Question: Do two teaching methods give different test scores?
```python
method_A = np.random.normal(78, 9, 30)   # traditional
method_B = np.random.normal(84, 10, 28)  # new method

# Assume equal variance (default)
t_stat, p_val = stats.ttest_ind(method_A, method_B)
print(f"t = {t_stat:.3f}, p = {p_val:.5f}")

# If variances may differ → Welch's t-test (recommended default now)
t_stat_w, p_val_w = stats.ttest_ind(method_A, method_B, equal_var=False)
print(f"Welch version: t = {t_stat_w:.3f}, p = {p_val_w:.5f}")
```
→ If p < 0.05 → evidence that the two methods differ (here, method B scores higher on average).
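Before trusting the equal-variance version, it is worth checking that assumption; a minimal sketch using Levene's test (robust to non-normality) on the same two groups:

```python
# H0: the two groups have equal variances
lev_stat, lev_p = stats.levene(method_A, method_B)
print(f"Levene: statistic={lev_stat:.3f}, p={lev_p:.5f}")
# Small p → variances likely differ → prefer the Welch version (equal_var=False)
```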
Example 3 — Paired t-test (before-after on same subjects)
```python
before = np.random.normal(72, 8, 40)
after = before + np.random.normal(6, 4, 40)   # average +6 points

t, p = stats.ttest_rel(after, before)   # note order: after - before
print(f"Paired t-test: t = {t:.3f}, p = {p:.5f}")
```
→ Very powerful, because it removes between-subject variability.
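If the paired differences are clearly non-normal, the non-parametric counterpart is the Wilcoxon signed-rank test; a minimal sketch on the same before/after arrays:

```python
# Non-parametric alternative to the paired t-test
w_stat, w_p = stats.wilcoxon(after, before)
print(f"Wilcoxon signed-rank: W={w_stat}, p={w_p:.5f}")
```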
Example 4 — Non-parametric: Mann-Whitney U (when data not normal)
```python
group1 = np.random.lognormal(mean=4, sigma=0.8, size=35)   # skewed
group2 = np.random.lognormal(mean=4.3, sigma=0.7, size=42)

stat, p = stats.mannwhitneyu(group1, group2, alternative='two-sided')
print(f"U = {stat}, p = {p:.5f}")
```
→ Good when normality assumption fails.
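If you want to drop distributional assumptions entirely, permutation_test (available in recent SciPy versions) lets you test any statistic you define; a sketch testing the difference in means of the same two skewed groups:

```python
# Custom statistic: difference in group means
def mean_diff(x, y):
    return np.mean(x) - np.mean(y)

res = stats.permutation_test((group1, group2), mean_diff,
                             permutation_type='independent',
                             n_resamples=9999, alternative='two-sided')
print(f"Permutation test: statistic={res.statistic:.3f}, p={res.pvalue:.5f}")
```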
Example 5 — Quick normality check before parametric tests
```python
data = np.random.normal(0, 1, 200)   # change to uniform → see difference

stat, p = stats.normaltest(data)
print(f"Normaltest: statistic={stat:.2f}, p={p:.5f}")

# Shapiro-Wilk (often preferred for small n)
w, p_sh = stats.shapiro(data[:5000])   # shapiro max ~5000
print(f"Shapiro-Wilk: W={w:.4f}, p={p_sh:.5f}")
```
→ If p > 0.05 → fail to reject normality (but never “prove” normality!)
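A visual complement to these tests (not a formal test itself) is a Q-Q plot; a quick sketch using probplot and the matplotlib import from the top of the notebook:

```python
# Points hugging the straight line suggest approximate normality
stats.probplot(data, dist="norm", plot=plt)
plt.title("Q-Q plot against a normal distribution")
plt.show()
```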
Teacher’s Practical Advice (2026 edition)
- Always check assumptions (normality with normaltest/shapiro, equal variance with levene or bartlett)
- Prefer Welch’s t-test (equal_var=False) unless you have strong reason to assume equal variances
- For small samples / non-normal → go non-parametric (mannwhitneyu, wilcoxon)
- Report the exact p-value, test statistic, degrees of freedom (when available), and an effect size (for a pooled two-sample t-test, Cohen's d = stats.ttest_ind(…).statistic * np.sqrt(1/n1 + 1/n2); see the sketch after this list)
- Use permutation_test when nothing else fits — very flexible
- Read the docstring! → stats.ttest_ind? in Jupyter — excellent examples
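Here is the effect-size shortcut from the list above written out as a sketch. It reuses method_A and method_B from Example 2, and note that this shortcut assumes the pooled (equal_var=True) t-test:

```python
n1, n2 = len(method_A), len(method_B)
t = stats.ttest_ind(method_A, method_B).statistic   # pooled (equal_var=True) t
cohens_d = t * np.sqrt(1/n1 + 1/n2)                 # Cohen's d for two independent groups
print(f"Cohen's d ≈ {cohens_d:.3f}")
```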
Official tutorial section (still gold in 2026): https://docs.scipy.org/doc/scipy/tutorial/stats/hypothesis_tests.html
Which test are you actually trying to run right now (or planning to)?
- Comparing two groups?
- Before-after?
- Normality check?
- Chi-square for categorical?
- Something custom?
Tell me your data/scenario and we’ll write the exact code + interpretation together. 🚀
