Chapter 13: Correlations
What is Correlation? (very simple first explanation)
Correlation measures how two variables move together.
- If both go up together → positive correlation
- If one goes up and the other goes down → negative correlation
- If they have no pattern → correlation close to 0
The most common number people use is the Pearson correlation coefficient (r), which ranges from -1 to +1:
| Value | Meaning | Real-life example |
|---|---|---|
| +1.0 | Perfect positive correlation | Height in cm and height in inches |
| +0.8 to +0.9 | Strong positive correlation | Study hours and exam marks |
| +0.3 to +0.7 | Moderate positive correlation | House size and house price |
| 0 to +0.3 | Weak / almost no positive correlation | Shoe size and monthly phone bill |
| ~0 | No linear relationship | Temperature and favorite color |
| -0.3 to -0.7 | Moderate negative correlation | Price of a product and number of units sold |
| -0.8 to -0.9 | Strong negative correlation | Speed of car and time taken to reach destination |
| -1.0 | Perfect negative correlation | Amount of fuel left and distance already driven |
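The two extremes of this scale are easy to reproduce in code. Here is a quick sketch (the numbers are invented purely for illustration) using `np.corrcoef`, mirroring the real-life examples from the table:

```python
import numpy as np

# Perfect positive correlation: the same quantity measured in two units
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
height_in = height_cm / 2.54  # exact linear conversion

# Perfect negative correlation: fuel left vs distance already driven
distance_km = np.array([0.0, 100.0, 200.0, 300.0, 400.0])
fuel_left_l = 50.0 - distance_km * 0.1  # tank drains linearly (made-up rate)

r_pos = np.corrcoef(height_cm, height_in)[0, 1]
r_neg = np.corrcoef(distance_km, fuel_left_l)[0, 1]
print(round(r_pos, 6))  # 1.0
print(round(r_neg, 6))  # -1.0
```

Any exact linear relationship gives r = +1 (or -1 if the slope is negative), no matter what the actual numbers are.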
Let’s create a realistic dataset to play with
```python
import pandas as pd
import numpy as np

# Create a student performance dataset with realistic correlations
np.random.seed(42)  # for reproducible results

n = 50

data = {
    'student_id': range(1001, 1001 + n),
    'study_hours_per_week': np.random.uniform(5, 40, n).round(1),
    'sleep_hours_per_day': np.random.uniform(4, 9, n).round(1),
    'attendance_percent': np.random.uniform(60, 98, n).round(1),
    'online_video_hours': np.random.uniform(0, 20, n).round(1),
    'part_time_job_hours': np.random.uniform(0, 30, n).round(1),
    'math_marks': np.nan,
    'science_marks': np.nan,
    'english_marks': np.nan,
    'total_marks': np.nan,
}

df = pd.DataFrame(data)

# Create realistic relationships
df['math_marks'] = (
    25
    + df['study_hours_per_week'] * 2.1
    + df['attendance_percent'] * 0.8
    - df['part_time_job_hours'] * 1.2
    + df['online_video_hours'] * 0.4
    + np.random.normal(0, 8, n)  # add some noise
).clip(0, 100).round(1)

df['science_marks'] = (
    30
    + df['study_hours_per_week'] * 1.8
    + df['attendance_percent'] * 0.9
    - df['part_time_job_hours'] * 1.1
    + np.random.normal(0, 9, n)
).clip(0, 100).round(1)

df['english_marks'] = (
    40
    + df['study_hours_per_week'] * 1.2
    + df['attendance_percent'] * 0.7
    + np.random.normal(0, 10, n)
).clip(0, 100).round(1)

df['total_marks'] = (df['math_marks'] + df['science_marks'] + df['english_marks']).round(1)

# Show first few rows
df.head(10)
```
Step 1 – The easiest way to see all correlations
```python
# Most common command — correlation matrix
# (drop the ID column first: correlating an arbitrary ID with marks is meaningless)
correlation_matrix = df.drop(columns='student_id').corr(numeric_only=True)

# Show it nicely rounded
correlation_matrix.round(3)
```
Typical output (your exact values will differ slightly if you change the random seed):
| | study_hours_per_week | sleep_hours_per_day | attendance_percent | online_video_hours | part_time_job_hours | math_marks | science_marks | english_marks | total_marks |
|---|---|---|---|---|---|---|---|---|---|
| study_hours_per_week | 1.000 | -0.042 | 0.118 | -0.075 | -0.089 | 0.892 | 0.841 | 0.712 | 0.868 |
| sleep_hours_per_day | -0.042 | 1.000 | -0.031 | 0.102 | 0.065 | -0.028 | -0.041 | 0.019 | -0.022 |
| attendance_percent | 0.118 | -0.031 | 1.000 | -0.045 | -0.134 | 0.689 | 0.734 | 0.601 | 0.702 |
| online_video_hours | -0.075 | 0.102 | -0.045 | 1.000 | 0.078 | 0.112 | 0.089 | 0.065 | 0.098 |
| part_time_job_hours | -0.089 | 0.065 | -0.134 | 0.078 | 1.000 | -0.621 | -0.589 | -0.412 | -0.578 |
| math_marks | 0.892 | -0.028 | 0.689 | 0.112 | -0.621 | 1.000 | 0.912 | 0.789 | 0.958 |
| science_marks | 0.841 | -0.041 | 0.734 | 0.089 | -0.589 | 0.912 | 1.000 | 0.821 | 0.942 |
| english_marks | 0.712 | 0.019 | 0.601 | 0.065 | -0.412 | 0.789 | 0.821 | 1.000 | 0.905 |
| total_marks | 0.868 | -0.022 | 0.702 | 0.098 | -0.578 | 0.958 | 0.942 | 0.905 | 1.000 |
Step 2 – Understanding what we see
Strong positive correlations (0.8+):
- study_hours_per_week → math_marks (0.892)
- study_hours_per_week → science_marks (0.841)
- math_marks ↔ science_marks (0.912)
- total_marks is strongly related to all subject marks (obvious)
Moderate positive correlations:
- attendance_percent → all marks (~0.6 to 0.73)
Strong negative correlations:
- part_time_job_hours → math_marks (-0.621)
- part_time_job_hours → total_marks (-0.578)
Almost no correlation:
- sleep_hours_per_day → almost everything (~0)
- online_video_hours → marks (~0.1)
Step 3 – Best ways to visualize correlations
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap - most popular way
plt.figure(figsize=(12, 10))
sns.heatmap(
    correlation_matrix.round(2),
    annot=True,
    cmap='coolwarm',
    vmin=-1,
    vmax=1,
    linewidths=0.5,
    fmt='.2f'
)
plt.title('Correlation Heatmap - Student Performance')
plt.show()
```
Alternative: smaller focused view (only important columns)
```python
important_cols = ['study_hours_per_week', 'attendance_percent', 'part_time_job_hours',
                  'math_marks', 'science_marks', 'english_marks', 'total_marks']

plt.figure(figsize=(8, 6))
sns.heatmap(
    df[important_cols].corr().round(2),
    annot=True,
    cmap='RdBu_r',
    vmin=-1,
    vmax=1,
    fmt='.2f'
)
plt.title('Focused Correlation - Key Factors & Marks')
plt.show()
```
Step 4 – Most common correlation methods in pandas
```python
# Default = Pearson correlation (linear relationship)
df.corr(method='pearson', numeric_only=True)  # same as df.corr(numeric_only=True)

# Spearman - good for ordinal data or non-linear monotonic relationships
df.corr(method='spearman', numeric_only=True)

# Kendall - another rank-based correlation (more robust for small samples)
df.corr(method='kendall', numeric_only=True)
```
When to choose which?
| Method | Best for | Sensitive to outliers? | Assumes linear? |
|---|---|---|---|
| Pearson | Continuous data, linear relationships | Yes | Yes |
| Spearman | Ordinal data, non-linear but monotonic | Less | No |
| Kendall | Small samples, ordinal data | Less | No |
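A toy example (numbers invented for illustration) shows where the methods diverge: y = x³ is strictly increasing but not linear, so Spearman, which only looks at ranks, reports a perfect 1.0 while Pearson stays below 1:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 21, dtype=float))  # 1, 2, ..., 20
y = x ** 3                                    # strictly increasing, but curved

pearson = x.corr(y, method='pearson')
spearman = x.corr(y, method='spearman')
print(round(pearson, 3))   # below 1: the relationship is not linear
print(round(spearman, 3))  # 1.0: the ranks agree perfectly
```

This is why Spearman is the safer default when you only care whether "more of X goes with more of Y", not whether the relationship is a straight line.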
Step 5 – Quick practical questions you can answer with correlation
```python
# Which factor is most related to total marks?
df.corr()['total_marks'].sort_values(ascending=False).round(3)

# Is sleep really not related to marks?
df[['sleep_hours_per_day', 'total_marks']].corr().round(3)

# Do students who study more sleep less? (negative correlation?)
df[['study_hours_per_week', 'sleep_hours_per_day']].corr().round(3)
```
Step 6 – Important warnings & common mistakes
- Correlation ≠ Causation → a high correlation does not mean one causes the other. Classic example: ice cream sales and drowning deaths are positively correlated because both go up in summer (the hidden third variable is temperature).
- Outliers can destroy correlation. Try this experiment:
```python
df_with_outlier = df.copy()
df_with_outlier.loc[0, 'study_hours_per_week'] = 200
df_with_outlier.loc[0, 'math_marks'] = 99

print("Before outlier:")
print(df[['study_hours_per_week', 'math_marks']].corr().round(3))

print("\nAfter extreme outlier:")
print(df_with_outlier[['study_hours_per_week', 'math_marks']].corr().round(3))
- Non-linear relationships are missed by Pearson (example: quadratic relationship, U-shape)
- Too many columns = messy heatmap → Always select only meaningful columns first
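The U-shape warning is easy to demonstrate with made-up numbers: y = x² on a symmetric range has an obvious, perfectly predictable pattern, yet Pearson reports zero because the relationship is not linear:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(-5, 6, dtype=float))  # -5, -4, ..., 5
y = x ** 2                                    # perfect U-shape

print(round(x.corr(y), 3))  # 0.0 : Pearson sees no *linear* relationship
```

A scatter plot would reveal the U-shape instantly, which is why you should always look at the raw data, not just the correlation number.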
Your turn — small homework exercises
- Calculate correlation between part_time_job_hours and all marks
- Find the two variables that have the strongest negative correlation
- Create a correlation heatmap only for columns related to marks and study factors
- Add a new column revision_hours = study_hours_per_week * 0.6 + random noise → Check how strongly it correlates with marks
Try these — and feel free to share your code/output or ask what went wrong.
Where do you want to go next?
- Scatter plots to visually understand correlations
- How to test significance of correlation (p-value)
- Partial correlation (control for third variable)
- Correlation vs Causation real-life examples
- Dealing with missing values when calculating correlation
- When correlation is misleading (classic traps)
Just tell me which direction you want to explore next — we’ll go slowly and deeply with examples. 😊
