Chapter 14: Pandas Correlations

What actually is “correlation” in data analysis?

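Correlation tells us how strongly two numeric variables move together and in which direction. The default in pandas, Pearson's r, measures how well that relationship is described by a straight line.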

Common interpretations of the Pearson correlation coefficient (r):

r value          | What it means                  | Illustrative example
-----------------|--------------------------------|------------------------------------------
+0.90 – +1.00    | Very strong positive           | Study hours vs exam score
+0.70 – +0.89    | Strong positive                | House size vs house price
+0.40 – +0.69    | Moderate positive              | Advertising budget vs sales
+0.10 – +0.39    | Weak positive                  | Number of books read vs vocabulary size
~0.00            | Almost no linear relationship  | Shoe size vs exam score (adults)
-0.10 – -0.39    | Weak negative                  | Price of item vs number sold
-0.40 – -0.69    | Moderate negative              | Outdoor temperature vs heating bill
-0.70 – -0.89    | Strong negative                | Hours worked overtime vs free time
-0.90 – -1.00    | Very strong negative           | Distance travelled vs fuel remaining

These cut-offs are rules of thumb, not fixed thresholds, and the examples are illustrative rather than measured values.
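
For reference, the coefficient being interpreted here is usually defined as follows. The symbols are standard notation, not something taken from this chapter: x_i and y_i are the paired observations, and x̄ and ȳ are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when the two variables tend to be above (or below) their means together, and the denominator scales the result into the range -1 to +1.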

Very important reminders (say this sentence out loud 3 times):

  • Correlation ≠ Causation
  • High correlation does not mean one thing causes the other
  • A hidden third variable (a confounder) can create a strong correlation between two things that do not affect each other
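
To see the hidden-variable effect in action, here is a minimal sketch on synthetic data. Everything in it is an assumption made up for illustration: the column names, the numbers, and the idea that temperature drives both ice cream sales and sunburn cases. Neither of the two outcome columns appears in the other's formula, yet they end up strongly correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000

# Hidden driver (assumed): daily temperature in degrees Celsius
temperature = rng.normal(25, 5, n)

# Both outcomes depend on temperature only, never on each other
df = pd.DataFrame({
    "temperature": temperature,
    "ice_cream_sales": 20 + 3.0 * temperature + rng.normal(0, 8, n),
    "sunburn_cases": 5 + 1.0 * temperature + rng.normal(0, 3, n),
})

# Raw correlation looks impressive
r_raw = df["ice_cream_sales"].corr(df["sunburn_cases"])

def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after the temperature part is removed from both columns
r_adjusted = residuals(df["ice_cream_sales"], df["temperature"]).corr(
    residuals(df["sunburn_cases"], df["temperature"])
)

print(f"raw r      = {r_raw:.2f}")
print(f"adjusted r = {r_adjusted:.2f}")
```

Once the shared driver is taken out, the apparent relationship between the two outcomes largely disappears.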

1. Realistic example dataset

Let’s create a small but realistic student dataset where we expect some interesting correlations.

Python
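
The original listing is not reproduced here, so the following is only a sketch of what such a dataset could look like. The habit column names and overall_% come from this chapter; the three subject names (math_score, science_score, english_score), the sample size, the seed, and every distribution and coefficient are assumptions chosen for illustration. The exact correlations you get will therefore differ from the figures quoted later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed so reruns give the same data
n = 200                           # number of students (arbitrary)

# Habits: all distributions below are illustrative assumptions
study = rng.normal(15, 5, n).clip(0, 40)
attendance = rng.normal(80, 10, n).clip(40, 100)
screen = rng.normal(5, 2, n).clip(0, 12)
sleep = rng.normal(7, 1, n).clip(4, 10)
exercise = rng.normal(3, 1.5, n).clip(0, 10)

# Shared signal: study and attendance help, screen time hurts,
# exercise helps a little, sleep is deliberately left out
base = 40 + 1.5 * study + 0.25 * attendance - 2.0 * screen + 0.8 * exercise

def subject_score(noise_sd):
    """Shared signal plus subject-specific noise, kept within 0-100."""
    return (base + rng.normal(0, noise_sd, n)).clip(0, 100)

df = pd.DataFrame({
    "study_hours_week": study,
    "attendance_%": attendance,
    "screen_time_day": screen,
    "sleep_hours_day": sleep,
    "exercise_hours_week": exercise,
    "math_score": subject_score(6),      # hypothetical subject name
    "science_score": subject_score(6),   # hypothetical subject name
    "english_score": subject_score(6),   # hypothetical subject name
}).round(1)

# Overall percentage = mean of the three subject scores
subjects = ["math_score", "science_score", "english_score"]
df["overall_%"] = df[subjects].mean(axis=1).round(1)

print(df.head())
```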

2. The most important command: .corr()

Python
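
As a minimal, self-contained sketch of the call itself, here is .corr() on a tiny hand-typed table. The six rows and the name column are invented for illustration. Selecting the numeric columns first keeps the example working across pandas versions, since text columns cannot be correlated.

```python
import pandas as pd

# Tiny made-up table; "name" is text and must be left out of the calculation
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dina", "Eli", "Farah"],
    "study_hours_week": [5, 8, 12, 15, 20, 25],
    "screen_time_day": [8, 7, 6, 5, 3, 2],
    "overall_%": [48, 55, 63, 70, 82, 91],
})

# Keep numeric columns, then compute all pairwise Pearson correlations
corr = df.select_dtypes("number").corr()

print(corr.round(2))
```

The result is a square DataFrame with one row and one column per numeric variable, so a single value can be picked out with corr.loc["study_hours_week", "overall_%"].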

You will typically see something like this (values depend on random seed):

text

3. How to read this table (most important part)

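The table is symmetric and its diagonal is always 1.00 (every variable correlates perfectly with itself). Look at the last few rows and columns: they show how each habit relates to the scores.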

Key observations:

  • study_hours_week has strong positive correlation with all scores (~0.73–0.89)
  • attendance_% has moderate to strong positive correlation (~0.62–0.74)
  • screen_time_day has moderate negative correlation with scores (~ -0.48 to -0.65)
  • sleep_hours_day has almost zero correlation with academic performance
  • exercise_hours_week has weak positive correlation (small benefit visible)

4. Best ways to visualize correlations

Python
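
One common way to draw the full matrix is a colored grid with the coefficient written in each cell. The sketch below uses plain matplotlib; that is a choice made here, not necessarily the plotting library the original listing used. The stand-in data and the helper name plot_corr_heatmap are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data (synthetic, assumed relationships) so the sketch runs on its own
rng = np.random.default_rng(0)
study = rng.normal(15, 5, 100)
screen = rng.normal(5, 2, 100)
df = pd.DataFrame({
    "study_hours_week": study,
    "screen_time_day": screen,
    "sleep_hours_day": rng.normal(7, 1, 100),
    "overall_%": 50 + 1.5 * study - 2.0 * screen + rng.normal(0, 5, 100),
})

def plot_corr_heatmap(corr, title="Correlation matrix"):
    """Draw a correlation DataFrame as an annotated heatmap."""
    fig, ax = plt.subplots(figsize=(6, 5))
    # Fix the color scale to [-1, 1] so 0 is always the neutral color
    im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(corr.shape[1]))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(corr.shape[0]))
    ax.set_yticklabels(corr.index)
    # Write each coefficient inside its cell
    for i in range(corr.shape[0]):
        for j in range(corr.shape[1]):
            ax.text(j, i, f"{corr.iat[i, j]:.2f}", ha="center", va="center")
    fig.colorbar(im, ax=ax, label="Pearson r")
    ax.set_title(title)
    fig.tight_layout()
    return fig, ax

fig, ax = plot_corr_heatmap(df.corr())
# In a script, call plt.show() or fig.savefig("heatmap.png") to view the result
```

Pinning vmin and vmax to -1 and +1 matters: without it, the colors are scaled to whatever range happens to be in the data, which makes weak correlations look dramatic.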

Alternative: smaller focused heatmap (often better)

Python
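
A focused view usually means keeping only the part of the matrix you care about, for example habits as rows and scores as columns. The sketch below builds just that slice with pandas; the resulting table can then be passed to whatever heatmap function you prefer. The data is synthetic, and the two subject names are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in data (synthetic, assumed relationships)
rng = np.random.default_rng(3)
n = 150
study = rng.normal(15, 5, n)
screen = rng.normal(5, 2, n)
base = 50 + 1.5 * study - 2.0 * screen
df = pd.DataFrame({
    "study_hours_week": study,
    "screen_time_day": screen,
    "sleep_hours_day": rng.normal(7, 1, n),
    "math_score": base + rng.normal(0, 6, n),      # hypothetical subject name
    "science_score": base + rng.normal(0, 6, n),   # hypothetical subject name
})
df["overall_%"] = df[["math_score", "science_score"]].mean(axis=1)

habits = ["study_hours_week", "screen_time_day", "sleep_hours_day"]
scores = ["math_score", "science_score", "overall_%"]

# Rows = habits, columns = scores; drops the habit-vs-habit and
# score-vs-score cells that clutter the full matrix
focused = df.corr().loc[habits, scores]

# Order the habits from most helpful to most harmful
focused = focused.sort_values("overall_%", ascending=False)

print(focused.round(2))
```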

5. Most useful correlation patterns people use every day

Python
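
The original snippets are not shown here, so the following is a sketch of a few patterns that come up often in practice: one pair of columns, every column against a single target, and a ranked list of the strongest pairs. The data is synthetic and the 0.5 cut-off is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Stand-in data (synthetic, assumed relationships)
rng = np.random.default_rng(1)
n = 150
study = rng.normal(15, 5, n)
screen = rng.normal(5, 2, n)
df = pd.DataFrame({
    "study_hours_week": study,
    "screen_time_day": screen,
    "sleep_hours_day": rng.normal(7, 1, n),
    "overall_%": 50 + 1.5 * study - 2.0 * screen + rng.normal(0, 5, n),
})

# 1) Correlation between one pair of columns
r = df["study_hours_week"].corr(df["overall_%"])

# 2) Every other column against one target, sorted from most negative up
target = "overall_%"
by_target = df.drop(columns=target).corrwith(df[target]).sort_values()

# 3) All unique pairs, strongest first (upper triangle avoids duplicates
#    and the trivial 1.00 diagonal)
corr = df.corr()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(key=abs, ascending=False)

# 4) Keep only pairs above a chosen strength (0.5 is an arbitrary cut-off)
strong = pairs[pairs.abs() >= 0.5]

print(f"study vs overall: {r:.2f}")
print(by_target.round(2))
print(pairs.round(2))
```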

6. Very important warnings (repeat after me)

  • Correlation does not imply causation
  • Outliers can dramatically change correlation

Try this experiment:

Python
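
Here is one way to run the experiment, as a small sketch with hand-picked numbers (all values are invented for illustration). The first case starts from a perfect straight line and adds one bad point; the second starts from data with almost no relationship and adds one extreme point.

```python
import pandas as pd

x = pd.Series(range(1, 11))

# Case 1: a perfect line, then one outlier that breaks it
y_line = 2 * x
r_clean = x.corr(y_line)

x_broken = pd.concat([x, pd.Series([30])], ignore_index=True)
y_broken = pd.concat([y_line, pd.Series([0])], ignore_index=True)
r_broken = x_broken.corr(y_broken)      # falls from 1.0 to below zero

# Case 2: almost no relationship, then one outlier that fakes one
y_noise = pd.Series([5, 3, 6, 4, 5, 4, 6, 3, 5, 4])
r_none = x.corr(y_noise)

x_fake = pd.concat([x, pd.Series([100])], ignore_index=True)
y_fake = pd.concat([y_noise, pd.Series([100])], ignore_index=True)
r_fake = x_fake.corr(y_fake)            # jumps from near zero to above 0.9

print(f"perfect line:        {r_clean:.2f} -> with outlier: {r_broken:.2f}")
print(f"no relationship:     {r_none:.2f} -> with outlier: {r_fake:.2f}")
```

Both changes come from a single added row out of eleven, which is why a scatter plot is worth checking before trusting any coefficient.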

→ One crazy value can destroy or fake a relationship!

7. Quick exercises for you (try them now)

  1. Which variable has the strongest negative correlation with overall_%?
  2. Is there any meaningful correlation between sleep_hours_day and marks?
  3. Create a correlation table only for the three subject scores + overall
  4. Try Spearman instead of Pearson — do the values change a lot?
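
If you have not switched methods before, this small sketch shows the mechanics on a toy example rather than on the student data. The x and y columns are invented: y always rises with x, but along a curve instead of a straight line.

```python
import pandas as pd

# Toy data: y increases with x, but not linearly
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

# Pearson measures straight-line fit; Spearman works on ranks
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman reports a perfect relationship here because the ordering of y matches the ordering of x exactly, while Pearson is pulled below 1 by the curvature.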

If you want, paste your answers or questions — we can discuss them.

Where would you like to go next?

  • Scatter plots + regression lines (visual understanding)
  • How to test if correlation is statistically significant (p-value)
  • Partial correlation (removing effect of third variable)
  • Correlation vs Causation – classic real-world mistakes
  • Correlation heatmap with conditional formatting in pandas
  • When Pearson fails – non-linear relationships examples

Just tell me which topic you want to dive into next — we’ll go slowly and deeply with examples. 😊
