Chapter 13: Correlations

What is Correlation? (very simple first explanation)

Correlation measures how two variables move together.

  • If both go up together → positive correlation
  • If one goes up and the other goes down → negative correlation
  • If they have no pattern → correlation close to 0

The most common measure is the Pearson correlation coefficient (r), which ranges from -1 to +1:

| Value | Meaning | Real-life example |
| --- | --- | --- |
| +1.0 | Perfect positive correlation | Height in cm and height in inches |
| +0.8 to +0.9 | Strong positive correlation | Study hours and exam marks |
| +0.3 to +0.7 | Moderate positive correlation | House size and house price |
| 0 to +0.3 | Weak / almost no positive correlation | Shoe size and monthly phone bill |
| ~0 | No linear relationship | Temperature and favorite color |
| -0.3 to -0.7 | Moderate negative correlation | Price of a product and number of units sold |
| -0.8 to -0.9 | Strong negative correlation | Speed of car and time taken to reach destination |
| -1.0 | Perfect negative correlation | Amount of fuel left and distance already driven |
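
To make the coefficient less of a black box, here is a minimal sketch that computes Pearson's r straight from its definition (the co-variation of the two variables divided by the product of their spreads) and checks it against NumPy. The function name `pearson_r` and the sample heights are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: sum of paired deviations, scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Illustrative heights; inches are just cm / 2.54, so the relationship is exactly linear
height_cm = np.array([150.0, 160.0, 165.0, 172.0, 181.0, 190.0])
height_in = height_cm / 2.54

r_manual = pearson_r(height_cm, height_in)
r_numpy = np.corrcoef(height_cm, height_in)[0, 1]

print(round(r_manual, 6), round(r_numpy, 6))        # the "+1.0" row of the table
print(round(pearson_r(height_cm, -height_in), 6))   # flipping the sign gives the "-1.0" row
```

Because r is built from deviations scaled by the spreads, changing units (cm to inches) does not change it. Only the direction and tightness of the straight-line pattern matter.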

Let’s create a realistic dataset to play with
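
One possible way to build such a dataset is sketched below. The column names match the correlation table shown in Step 1, but the sample size, means, weights, and noise levels are all assumptions picked for illustration, so the correlations you get will not match that table exactly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed so reruns give the same data
n = 200                           # number of students (an assumed sample size)

# Habits: drawn independently, then clipped to plausible ranges
study = rng.normal(15, 5, n).clip(0, 40)
sleep = rng.normal(7, 1, n).clip(4, 10)
attendance = rng.normal(80, 10, n).clip(40, 100)
video = rng.normal(5, 2, n).clip(0, 15)
job = rng.normal(8, 5, n).clip(0, 25)

def make_marks(noise_sd):
    """Marks rise with study and attendance, fall with job hours, plus noise."""
    raw = 20 + 2.0 * study + 0.3 * attendance - 0.8 * job + rng.normal(0, noise_sd, n)
    return raw.clip(0, 100).round(1)

df = pd.DataFrame({
    "study_hours_per_week": study.round(1),
    "sleep_hours_per_day": sleep.round(1),
    "attendance_percent": attendance.round(1),
    "online_video_hours": video.round(1),
    "part_time_job_hours": job.round(1),
    "math_marks": make_marks(5),      # least noisy subject
    "science_marks": make_marks(6),
    "english_marks": make_marks(9),   # noisiest subject
})
df["total_marks"] = df[["math_marks", "science_marks", "english_marks"]].sum(axis=1)

print(df.shape)
print(df.head())
```

Sleep and video hours are deliberately left out of the marks formula, so they should end up with correlations near 0.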


Step 1 – The easiest way to see all correlations
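
In pandas, the one-line route is `DataFrame.corr()`, which returns the full matrix of pairwise Pearson correlations by default. The sketch below rebuilds a small stand-in dataset (three columns with assumed weights) so that it runs on its own; with the full student dataset you would simply call `df.corr()` on it.

```python
import numpy as np
import pandas as pd

# Small stand-in dataset (weights and noise are assumptions for illustration)
rng = np.random.default_rng(0)
n = 200
study = rng.normal(15, 5, n)
job = rng.normal(8, 5, n)
df = pd.DataFrame({
    "study_hours_per_week": study,
    "part_time_job_hours": job,
    "math_marks": 20 + 2.0 * study - 0.8 * job + rng.normal(0, 5, n),
})

corr_matrix = df.corr()          # pairwise Pearson correlations of all columns
print(corr_matrix.round(3))      # three decimals is plenty for reading
```

The matrix is symmetric and has 1.000 on the diagonal, because every column is perfectly correlated with itself.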


Typical output (values will vary slightly because of random noise):

| | study_hours_per_week | sleep_hours_per_day | attendance_percent | online_video_hours | part_time_job_hours | math_marks | science_marks | english_marks | total_marks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| study_hours_per_week | 1.000 | -0.042 | 0.118 | -0.075 | -0.089 | 0.892 | 0.841 | 0.712 | 0.868 |
| sleep_hours_per_day | -0.042 | 1.000 | -0.031 | 0.102 | 0.065 | -0.028 | -0.041 | 0.019 | -0.022 |
| attendance_percent | 0.118 | -0.031 | 1.000 | -0.045 | -0.134 | 0.689 | 0.734 | 0.601 | 0.702 |
| online_video_hours | -0.075 | 0.102 | -0.045 | 1.000 | 0.078 | 0.112 | 0.089 | 0.065 | 0.098 |
| part_time_job_hours | -0.089 | 0.065 | -0.134 | 0.078 | 1.000 | -0.621 | -0.589 | -0.412 | -0.578 |
| math_marks | 0.892 | -0.028 | 0.689 | 0.112 | -0.621 | 1.000 | 0.912 | 0.789 | 0.958 |
| science_marks | 0.841 | -0.041 | 0.734 | 0.089 | -0.589 | 0.912 | 1.000 | 0.821 | 0.942 |
| english_marks | 0.712 | 0.019 | 0.601 | 0.065 | -0.412 | 0.789 | 0.821 | 1.000 | 0.905 |
| total_marks | 0.868 | -0.022 | 0.702 | 0.098 | -0.578 | 0.958 | 0.942 | 0.905 | 1.000 |

Step 2 – Understanding what we see

Strong positive correlations (0.8+):

  • study_hours_per_week → math_marks (0.892)
  • study_hours_per_week → science_marks (0.841)
  • math_marks ↔ science_marks (0.912)
  • total_marks is strongly related to all subject marks (expected, since the total is built from them)

Moderate positive correlations:

  • attendance_percent → all marks (~0.6 to 0.73)

Moderate negative correlations:

  • part_time_job_hours → math_marks (-0.621)
  • part_time_job_hours → total_marks (-0.578)

Almost no correlation:

  • sleep_hours_per_day → almost everything (~0)
  • online_video_hours → marks (~0.1)
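
Reading a large matrix by eye gets tiring, so it can help to flatten it into a ranked list of pairs. The sketch below is one way to do that: it keeps only the cells above the diagonal (so each pair appears once and the self-correlations are skipped) and sorts them. The stand-in dataset and its weights are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in dataset (weights and noise are assumptions)
rng = np.random.default_rng(1)
n = 300
study = rng.normal(15, 5, n)
job = rng.normal(8, 5, n)
base = 20 + 2.0 * study - 0.8 * job
df = pd.DataFrame({
    "study_hours_per_week": study,
    "part_time_job_hours": job,
    "sleep_hours_per_day": rng.normal(7, 1, n),
    "math_marks": base + rng.normal(0, 5, n),
    "science_marks": base + rng.normal(0, 6, n),
})

corr = df.corr()

# Row/column positions of the cells strictly above the diagonal
rows, cols = np.triu_indices(len(corr), k=1)
pairs = pd.Series(
    corr.to_numpy()[rows, cols],
    index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
).sort_values()

print("Most negative pairs:")
print(pairs.head(3).round(3))
print("Most positive pairs:")
print(pairs.tail(3).round(3))
```

The first entries of `pairs` are the strongest negative relationships and the last entries are the strongest positive ones.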

Step 3 – Best ways to visualize correlations
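
A heatmap is the usual picture for a correlation matrix. The sketch below uses plain matplotlib (`imshow` plus text annotations). Many people prefer seaborn's `heatmap` function with `annot=True`, which does the same job in one call if seaborn is installed. The stand-in dataset, the colour map, and the output file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; remove this line in a notebook or desktop session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small stand-in dataset (weights and noise are assumptions)
rng = np.random.default_rng(2)
n = 200
study = rng.normal(15, 5, n)
job = rng.normal(8, 5, n)
df = pd.DataFrame({
    "study_hours_per_week": study,
    "part_time_job_hours": job,
    "sleep_hours_per_day": rng.normal(7, 1, n),
    "math_marks": 20 + 2.0 * study - 0.8 * job + rng.normal(0, 5, n),
})
corr = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the colour scale to [-1, 1] so 0 always sits in the middle of the palette
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)

ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write the value inside each cell
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax, label="Pearson r")
ax.set_title("Correlation heatmap")
fig.tight_layout()
fig.savefig("correlation_heatmap.png", dpi=150)
```

Pinning `vmin=-1` and `vmax=1` matters: without it the colours rescale to whatever range your data happens to have, and two heatmaps are no longer comparable.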


Alternative: smaller focused view (only important columns)
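
One way to get a focused view is to pick a handful of columns, compute the matrix only for those, and then plot how each factor correlates with a single target. The sketch below assumes `total_marks` is the target; the chosen columns, weights, and file name are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; remove this line in a notebook or desktop session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small stand-in dataset (weights and noise are assumptions)
rng = np.random.default_rng(4)
n = 300
study = rng.normal(15, 5, n)
attendance = rng.normal(80, 10, n)
job = rng.normal(8, 5, n)
df = pd.DataFrame({
    "study_hours_per_week": study,
    "attendance_percent": attendance,
    "part_time_job_hours": job,
    "sleep_hours_per_day": rng.normal(7, 1, n),
    "online_video_hours": rng.normal(5, 2, n),
    "total_marks": 60 + 6.0 * study + 0.9 * attendance - 2.4 * job + rng.normal(0, 15, n),
})

# Keep only the columns we care about before computing anything
focus_cols = ["study_hours_per_week", "attendance_percent",
              "part_time_job_hours", "total_marks"]
focus_corr = df[focus_cols].corr()

# Correlation of each factor with the target, sorted from most negative to most positive
target_corr = focus_corr["total_marks"].drop("total_marks").sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(list(target_corr.index), target_corr.to_numpy())
ax.axvline(0, color="black", linewidth=0.8)   # reference line at r = 0
ax.set_xlim(-1, 1)
ax.set_xlabel("Pearson r with total_marks")
fig.tight_layout()
fig.savefig("focused_correlations.png", dpi=150)

print(target_corr.round(3))
```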


Step 4 – Most common correlation methods in pandas
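
`DataFrame.corr()` accepts a `method` argument with the values `"pearson"` (the default), `"spearman"`, and `"kendall"`. The sketch below compares them on a made-up relationship that always increases but is far from a straight line (y = e^x). As far as I know, pandas relies on SciPy for Kendall's tau, so the example only adds that method when SciPy can be imported.

```python
import numpy as np
import pandas as pd

# A relationship that always increases but is clearly not a straight line
x = np.linspace(0, 5, 50)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

methods = ["pearson", "spearman"]
try:
    import scipy  # noqa: F401  (needed by pandas for Kendall's tau)
    methods.append("kendall")
except ImportError:
    pass  # without SciPy, skip Kendall rather than crash

# Correlation between x and y under each method
results = {m: df.corr(method=m).loc["x", "y"] for m in methods}
for name, value in results.items():
    print(f"{name:>8}: {value:.3f}")
```

Pearson comes out clearly below 1 because the points do not lie on a straight line, while the rank-based methods report a perfect relationship because y never decreases as x increases.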


When to choose which?

| Method | Best for | Sensitive to outliers? | Assumes linear? |
| --- | --- | --- | --- |
| Pearson | Continuous data, linear relationships | Yes | Yes |
| Spearman | Ordinal data, non-linear but monotonic | Less | No |
| Kendall | Small samples, ordinal data | Less | No |

Step 5 – Quick practical questions you can answer with correlation
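
Here are three everyday questions and one way to answer each, using `Series.corr()` for a single pair and `DataFrame.corrwith()` for "everything against one target". The stand-in dataset and its weights are assumptions, and `total_marks` is assumed to be the column you care about.

```python
import numpy as np
import pandas as pd

# Small stand-in dataset (weights and noise are assumptions)
rng = np.random.default_rng(3)
n = 300
study = rng.normal(15, 5, n)
job = rng.normal(8, 5, n)
df = pd.DataFrame({
    "study_hours_per_week": study,
    "part_time_job_hours": job,
    "sleep_hours_per_day": rng.normal(7, 1, n),
    "total_marks": 60 + 6.0 * study - 2.4 * job + rng.normal(0, 15, n),
})

# Q1: How strongly are study hours and total marks related?
r_study = df["study_hours_per_week"].corr(df["total_marks"])
print(f"study vs total: {r_study:.3f}")

# Q2: Which factor has the strongest relationship (either sign) with total marks?
factors = df.drop(columns="total_marks")
r_all = factors.corrwith(df["total_marks"])
strongest = r_all.abs().idxmax()
print(f"strongest factor: {strongest} (r = {r_all[strongest]:.3f})")

# Q3: Which factors move in the opposite direction to total marks?
negative = r_all[r_all < 0].sort_values()
print(negative.round(3))
```

Taking the absolute value in Q2 is deliberate: a correlation of -0.6 is a stronger relationship than +0.3, just in the opposite direction.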


Step 6 – Important warnings & common mistakes

  1. Correlation ≠ Causation → A high correlation does not mean one variable causes the other → Example: ice cream sales and drowning deaths are positively correlated because both rise in summer (third variable: temperature)
  2. Outliers can distort correlation → A single extreme point can pull Pearson's r sharply up or down, especially in small samples, so recompute r with and without suspicious rows
  3. Non-linear relationships are missed by Pearson (example: a quadratic, U-shaped relationship can give r close to 0)
  4. Too many columns = messy heatmap → Always select only the meaningful columns first
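
Two of these warnings are easy to check for yourself. The sketch below uses made-up numbers: first it adds one extreme point to an almost perfectly linear dataset, then it measures Pearson's r on a perfect U-shape.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# --- Outlier experiment: start from a clean, nearly linear relationship ---
x = np.arange(1, 21, dtype=float)
y = 2 * x + rng.normal(0, 2, x.size)
r_clean = pd.Series(x).corr(pd.Series(y))

# Add ONE extreme point (very large x, very small y) and recompute
x_out = np.append(x, 60.0)
y_out = np.append(y, 0.0)
r_outlier = pd.Series(x_out).corr(pd.Series(y_out))

print(f"without outlier: {r_clean:.3f}")
print(f"with one outlier: {r_outlier:.3f}")

# --- Non-linear experiment: y depends on x perfectly, but in a U-shape ---
u = np.linspace(-3, 3, 61)
v = u ** 2
r_ushape = pd.Series(u).corr(pd.Series(v))

print(f"U-shape (y = x^2): {r_ushape:.3f}")
```

In the first experiment a single row out of 21 is enough to drag a near-perfect correlation down to a weak one. In the second, y is completely determined by x, yet Pearson's r is essentially 0 because the left half of the U cancels out the right half. A scatter plot would reveal both problems immediately.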

Your turn — small homework exercises

  1. Calculate correlation between part_time_job_hours and all marks
  2. Find the two variables that have the strongest negative correlation
  3. Create a correlation heatmap only for columns related to marks and study factors
  4. Add a new column revision_hours = study_hours_per_week * 0.6 + random noise → Check how strongly it correlates with marks

Try these — and feel free to share your code/output or ask what went wrong.

Where do you want to go next?

  • Scatter plots to visually understand correlations
  • How to test significance of correlation (p-value)
  • Partial correlation (control for third variable)
  • Correlation vs Causation real-life examples
  • Dealing with missing values when calculating correlation
  • When correlation is misleading (classic traps)

Just tell me which direction you want to explore next — we’ll go slowly and deeply with examples. 😊
