Chapter 14: Pandas Correlations
What actually is “correlation” in data analysis?
Correlation tells us how strongly two numeric variables move together and in which direction.
Common interpretations of the Pearson correlation coefficient (r):
| r value | What it means | Real-world example |
|---|---|---|
| +0.90 – +1.00 | Very strong positive correlation | Study hours vs exam score |
| +0.70 – +0.89 | Strong positive | House size vs house price |
| +0.40 – +0.69 | Moderate positive | Advertising budget vs sales |
| +0.10 – +0.39 | Weak positive | Number of books read vs vocabulary size |
| ~0.00 | Almost no linear relationship | Shoe size vs exam score |
| -0.10 – -0.39 | Weak negative | Price of item vs number sold |
| -0.40 – -0.69 | Moderate negative | Outdoor temperature vs heating costs |
| -0.70 – -0.89 | Strong negative | Hours worked overtime vs free time |
| -0.90 – -1.00 | Very strong negative | Distance travelled vs fuel remaining |
Very important reminders (say this sentence out loud 3 times):
- Correlation ≠ Causation
- High correlation does not mean one thing causes the other
- A third hidden variable can create strong correlation
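To see what the coefficient actually measures, Pearson's r can be computed by hand as the covariance divided by the product of the standard deviations. A minimal sketch with NumPy (the numbers here are made up purely for illustration):

```python
import numpy as np

# Two made-up variables with a rough positive relationship
x = np.array([4.0, 7.5, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
y = np.array([45.0, 52.0, 58.0, 61.0, 70.0, 74.0, 83.0, 88.0])

# Pearson r = covariance(x, y) / (std(x) * std(y))
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())

# Same thing via NumPy's built-in correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))
```

Both values agree, and because the points rise together almost in a straight line, r comes out close to +1.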
1. Realistic example dataset
Let’s create a small but realistic student dataset where we expect some interesting correlations.
```python
import pandas as pd
import numpy as np

np.random.seed(42)  # so you get the same numbers

n_students = 60

df = pd.DataFrame({
    'student_id': range(1001, 1001 + n_students),
    'study_hours_week': np.random.uniform(4, 38, n_students).round(1),
    'sleep_hours_day': np.random.uniform(4.5, 9.5, n_students).round(1),
    'attendance_%': np.random.uniform(62, 98, n_students).round(1),
    'screen_time_day': np.random.uniform(2, 11, n_students).round(1),
    'exercise_hours_week': np.random.uniform(0, 12, n_students).round(1),
    'math_score': np.nan,
    'physics_score': np.nan,
    'english_score': np.nan,
    'overall_%': np.nan
})

# Create realistic (noisy) relationships
df['math_score'] = (
    18
    + df['study_hours_week'] * 2.15
    + df['attendance_%'] * 0.78
    - df['screen_time_day'] * 1.45
    + np.random.normal(0, 7.5, n_students)
).clip(0, 100).round(1)

df['physics_score'] = (
    22
    + df['study_hours_week'] * 1.95
    + df['attendance_%'] * 0.85
    - df['screen_time_day'] * 1.1
    + df['exercise_hours_week'] * 0.6
    + np.random.normal(0, 8.2, n_students)
).clip(0, 100).round(1)

df['english_score'] = (
    35
    + df['study_hours_week'] * 1.35
    + df['attendance_%'] * 0.65
    - df['screen_time_day'] * 0.8
    + np.random.normal(0, 9.5, n_students)
).clip(0, 100).round(1)

df['overall_%'] = (
    (df['math_score'] + df['physics_score'] + df['english_score']) / 3
).round(1)

# Show first 8 rows
print(df.head(8))
```
2. The most important command: .corr()
```python
# Quick full correlation matrix (only numeric columns)
corr = df.corr(numeric_only=True)

# Nicer display
corr.round(3)
```
You will typically see something like this (values depend on random seed):
```
                     study_hours_week  sleep_hours_day  attendance_%  screen_time_day  exercise_hours_week  math_score  physics_score  english_score  overall_%
study_hours_week                1.000           -0.058         0.092           -0.067                0.034       0.887          0.851          0.734      0.862
sleep_hours_day                -0.058            1.000        -0.039            0.121               -0.012      -0.031         -0.018          0.045     -0.004
attendance_%                    0.092           -0.039         1.000           -0.081                0.078       0.702          0.741          0.623      0.712
screen_time_day                -0.067            0.121        -0.081            1.000               -0.145      -0.652         -0.589         -0.481     -0.598
exercise_hours_week             0.034           -0.012         0.078           -0.145                1.000       0.128          0.192          0.109      0.151
math_score                      0.887           -0.031         0.702           -0.652                0.128       1.000          0.904          0.792      0.947
physics_score                   0.851           -0.018         0.741           -0.589                0.192       0.904          1.000          0.821      0.936
english_score                   0.734            0.045         0.623           -0.481                0.109       0.792          0.821          1.000      0.902
overall_%                       0.862           -0.004         0.712           -0.598                0.151       0.947          0.936          0.902      1.000
```
3. How to read this table (most important part)
Look at the last few rows / columns — they show how everything relates to the scores.
Key observations:
- study_hours_week has very strong positive correlation with all scores (~0.73–0.89)
- attendance_% also has strong positive correlation (~0.62–0.74)
- screen_time_day has strong negative correlation with scores (~ -0.48 to -0.65)
- sleep_hours_day has almost zero correlation with academic performance
- exercise_hours_week has weak positive correlation (small benefit visible)
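Instead of eyeballing the matrix, you can pull these readings out programmatically. A self-contained sketch on a small made-up frame (the column names `a`, `b`, `target` are hypothetical stand-ins, not the student columns above):

```python
import pandas as pd
import numpy as np

# Toy frame: 'a' is built to track 'target' closely, 'b' is pure noise
rng = np.random.default_rng(0)
x = rng.normal(size=50)
toy = pd.DataFrame({
    'a': x + rng.normal(scale=0.1, size=50),
    'b': rng.normal(size=50),
    'target': x,
})

# Correlation of every column with the target, excluding the target itself
corr_with_target = toy.corr()['target'].drop('target')
print(corr_with_target.round(3))

# idxmax on the absolute values names the strongest relationship
print("Strongest relationship:", corr_with_target.abs().idxmax())
```

The same `['overall_%'].drop(...)` + `idxmax()` pattern works on the student matrix from section 2.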
4. Best ways to visualize correlations
```python
import seaborn as sns
import matplotlib.pyplot as plt

# --------------- Heatmap (most common style) ---------------
plt.figure(figsize=(11, 9))
sns.heatmap(
    corr.round(2),
    annot=True,           # show numbers
    fmt='.2f',
    cmap='coolwarm',      # red = positive, blue = negative
    vmin=-1, vmax=1,      # force full range
    linewidths=0.6,
    linecolor='white',
    annot_kws={"size": 10}
)
plt.title('Correlation Matrix – Student Performance Factors', fontsize=14, pad=15)
plt.tight_layout()
plt.show()
```
Alternative: smaller focused heatmap (often better)
```python
key_columns = [
    'study_hours_week', 'attendance_%', 'screen_time_day',
    'exercise_hours_week', 'math_score', 'physics_score',
    'english_score', 'overall_%'
]

plt.figure(figsize=(8, 7))
sns.heatmap(
    df[key_columns].corr().round(2),
    annot=True,
    cmap='RdBu_r',
    vmin=-1, vmax=1,
    fmt='.2f',
    linewidths=0.5
)
plt.title('Focused: What really affects marks?')
plt.show()
```
5. Correlation patterns you'll use every day
```python
# 1. Which variables are most related to overall score?
df.corr(numeric_only=True)['overall_%'].sort_values(ascending=False).round(3)

# 2. Only show strong correlations (|r| > 0.5)
strong_corr = corr[abs(corr) > 0.5].dropna(how='all', axis=0).dropna(how='all', axis=1)
print(strong_corr.round(3))

# 3. Spearman correlation (better for non-linear or ordinal data)
df.corr(method='spearman', numeric_only=True).round(3)

# 4. Correlation just between two columns
print("Study hours vs Math score:",
      df['study_hours_week'].corr(df['math_score']).round(3))
```
6. Very important warnings (repeat after me)
- Correlation does not imply causation
- Outliers can dramatically change correlation
Try this experiment:
```python
df_outlier = df.copy()
df_outlier.loc[0, 'study_hours_week'] = 180
df_outlier.loc[0, 'math_score'] = 99.9

print("Normal correlation:",
      df['study_hours_week'].corr(df['math_score']).round(3))
print("With extreme outlier:",
      df_outlier['study_hours_week'].corr(df_outlier['math_score']).round(3))
```
→ One crazy value can destroy or fake a relationship!
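Because Spearman works on ranks rather than raw values, it is far less sensitive to a single extreme point. A self-contained sketch with synthetic data (the numbers are assumed, just for illustration):

```python
import pandas as pd
import numpy as np

# A clear linear relationship: y roughly doubles x
rng = np.random.default_rng(1)
x = pd.Series(rng.uniform(0, 10, 50))
y = x * 2 + pd.Series(rng.normal(0, 1, 50))

# Inject one extreme outlier
x.iloc[0], y.iloc[0] = 500.0, 0.0

pearson = x.corr(y)                      # dragged far from the true relationship
spearman = x.corr(y, method='spearman')  # ranks blunt the outlier's effect
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

When a scatter plot shows suspicious extreme points, comparing Pearson and Spearman like this is a quick sanity check.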
7. Quick exercises for you (try them now)
- Which variable has the strongest negative correlation with overall_%?
- Is there any meaningful correlation between sleep_hours_day and marks?
- Create a correlation table only for the three subject scores + overall
- Try Spearman instead of Pearson — do the values change a lot?
If you want, paste your answers or questions — we can discuss them.
Where would you like to go next?
- Scatter plots + regression lines (visual understanding)
- How to test if correlation is statistically significant (p-value)
- Partial correlation (removing effect of third variable)
- Correlation vs Causation – classic real-world mistakes
- Correlation heatmap with conditional formatting in pandas
- When Pearson fails – non-linear relationships examples
Just tell me which topic you want to dive into next — we’ll go slowly and deeply with examples. 😊
