Chapter 15: Plotting
This chapter is written as if we are sitting together in front of a screen: I show every line of code, explain why we do things this way, point out the mistakes people commonly make, and show how people actually build useful plots for data analysis in 2025–2026.
Let’s go slowly and realistically.
Step 0 – Mindset before we start plotting
Good plots are not about beauty first — they are about answering a question clearly.
Before writing any .plot() code, always ask yourself:
- What question am I trying to answer?
- Who is looking at this plot? (me / team / boss / presentation / report)
- What is the most important message I want to jump out?
Common goals:
- See trend over time → line plot
- Compare categories → bar plot
- See relationship between two numbers → scatter plot
- See distribution → histogram / boxplot
- See proportions → pie (carefully!) or stacked bar
Step 1 – Prepare a realistic dataset
We’ll use a small but realistic sales dataset: 120 days of daily records with region, product, units sold, revenue, discount and customer rating.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # we'll use it later for nicer looks

# Set random seed so you get same numbers
np.random.seed(42)

dates = pd.date_range(start='2025-01-01', periods=120, freq='D')

sales = pd.DataFrame({
    'date': dates,
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=120),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones'], size=120),
    'units_sold': np.random.randint(5, 85, 120),
    'revenue': np.random.uniform(4000, 62000, 120).round(2),
    'discount_%': np.random.uniform(0, 25, 120).round(1),
    'customer_rating': np.random.uniform(2.8, 4.9, 120).round(1)
})

# Add some realistic patterns
sales['revenue'] = sales['units_sold'] * np.random.uniform(450, 1250, 120)
sales['revenue'] = sales['revenue'].round(2)

# Show first few rows
print(sales.head(8))
sales.info()
```
Step 2 – The absolute simplest plot in pandas
```python
# Quickest way: one line
sales['units_sold'].plot(title="Daily Units Sold – Raw")
plt.show()
```
What we see: a jagged line plotted against row numbers (0 to 119) instead of dates. The date is still an ordinary column, so pandas uses the default integer index for the x-axis.
Step 3 – Most useful first real plot: Time series line plot
```python
# Better: set date as index first
sales = sales.set_index('date').sort_index()

# Now plot
plt.figure(figsize=(12, 5))

sales['units_sold'].plot(
    color='teal',
    linewidth=1.8,
    marker='o',
    markersize=4,
    linestyle='-',
    title='Daily Units Sold – Jan to Apr 2025',
    xlabel='Date',
    ylabel='Units Sold'
)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
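Daily values like these are noisy, so a common next step is to overlay a rolling average to make the trend easier to read. The block below is a minimal sketch of that idea, not part of the original walkthrough. It builds its own stand-in series (the names units and weekly_avg are illustrative) so that it runs on its own, and the 7-day window is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data (assumption): 120 daily values on a DatetimeIndex,
# shaped like sales['units_sold'] once the date is the index
rng = np.random.default_rng(42)
dates = pd.date_range(start='2025-01-01', periods=120, freq='D')
units = pd.Series(rng.integers(5, 85, size=120), index=dates, name='units_sold')

# 7-day rolling mean; the first 6 values are NaN until the window fills
weekly_avg = units.rolling(window=7).mean()

fig, ax = plt.subplots(figsize=(12, 5))
units.plot(ax=ax, color='lightgray', linewidth=1.2, label='Daily')
weekly_avg.plot(ax=ax, color='teal', linewidth=2.2, label='7-day average')

ax.set_title('Daily Units Sold with 7-Day Rolling Average')
ax.set_xlabel('Date')
ax.set_ylabel('Units Sold')
ax.legend()
ax.grid(True, alpha=0.3)
fig.tight_layout()
plt.show()
```

With the chapter's own data, the same pattern would be sales['units_sold'].rolling(window=7).mean(). Drawing the raw series in a light color keeps it visible without competing with the smoothed line.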
Step 4 – Grouped line plot – compare regions
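```python
# Pivot so each region gets its own column.
# There is only one row per day, so a daily pivot leaves three of the four
# region columns empty (NaN) on any given date and the lines would not connect.
# Summing by week gives every region a value at each point.
weekly_revenue = sales.pivot_table(
    values='revenue',
    index='date',
    columns='region',
    aggfunc='sum'
).resample('W').sum()

# Pass ax explicitly: DataFrame.plot() otherwise opens its own figure,
# which would leave a figure created with plt.figure() empty.
fig, ax = plt.subplots(figsize=(14, 6))

weekly_revenue.plot(
    ax=ax,
    linewidth=2.2,
    marker='o',
    markersize=5,
    alpha=0.9
)

plt.title('Weekly Revenue by Region – Jan to Apr 2025', fontsize=14, pad=15)
plt.ylabel('Total Revenue (₹)', fontsize=12)
plt.xlabel('Date', fontsize=12)
plt.legend(title='Region', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.show()
```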
Step 5 – Bar plot – compare categories
```python
# Total revenue by product
product_sales = sales.groupby('product')['revenue'].sum().sort_values(ascending=False)

plt.figure(figsize=(10, 6))

product_sales.plot(
    kind='bar',
    color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],
    edgecolor='black',
    linewidth=1.1
)

# The data covers 1 Jan to 30 Apr 2025, so the title says so
plt.title('Total Revenue by Product – Jan to Apr 2025', fontsize=14)
plt.ylabel('Revenue (₹)', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.xticks(rotation=0, fontsize=11)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(product_sales):
    plt.text(i, v + 5000, f'₹{v:,.0f}', ha='center', fontsize=10)

plt.tight_layout()
plt.show()
```
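The manual plt.text loop above needs a hand-picked vertical offset (5000 here), which stops looking right if the totals change scale. As an alternative sketch, matplotlib 3.4 and later provide Axes.bar_label, which places the labels relative to each bar. The totals below are hypothetical placeholders rather than the chapter's real numbers, so that the block runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical totals standing in for product_sales (not the real results)
product_sales = pd.Series(
    {'Laptop': 1250000.0, 'Phone': 1100000.0, 'Tablet': 980000.0, 'Headphones': 870000.0},
    name='revenue'
)

fig, ax = plt.subplots(figsize=(10, 6))
product_sales.plot(kind='bar', ax=ax, color='#1f77b4', edgecolor='black')

# ax.containers[0] holds the bars that were just drawn;
# padding is in points, so it does not depend on the y-axis scale
labels = [f'₹{v:,.0f}' for v in product_sales]
bar_texts = ax.bar_label(ax.containers[0], labels=labels, padding=3, fontsize=10)

ax.set_ylabel('Revenue (₹)')
ax.tick_params(axis='x', rotation=0)
ax.margins(y=0.1)  # leave headroom so the top label is not clipped
fig.tight_layout()
plt.show()
```

Either approach works. The loop gives full control over position, while bar_label is less to maintain when the data changes.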
Step 6 – Scatter plot – relationship between two variables
```python
plt.figure(figsize=(10, 7))

plt.scatter(
    sales['customer_rating'],
    sales['discount_%'],
    s=sales['units_sold']*3,     # size by units sold
    alpha=0.6,
    c=sales['revenue']/1000,     # color by revenue
    cmap='viridis',
    edgecolors='black',
    linewidth=0.5
)

plt.colorbar(label='Revenue (thousands ₹)')
plt.title('Customer Rating vs Discount % – bubble size = units sold', fontsize=13)
plt.xlabel('Average Customer Rating', fontsize=12)
plt.ylabel('Discount %', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
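A scatter plot shows a relationship visually, and a correlation coefficient puts a number next to it. In this chapter's dataset, rating and discount are generated independently, so no pattern should be expected between them, while revenue is built from units sold and should correlate strongly. The sketch below illustrates that check on stand-in data generated the same way as in Step 1. The seed and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption), generated like the Step 1 columns
rng = np.random.default_rng(0)
units_sold = rng.integers(5, 85, size=120)

df = pd.DataFrame({
    'units_sold': units_sold,
    'revenue': units_sold * rng.uniform(450, 1250, size=120),  # units x random unit price
    'discount_%': rng.uniform(0, 25, size=120).round(1),
    'customer_rating': rng.uniform(2.8, 4.9, size=120).round(1),
})

# Pearson correlation: close to 0 means no linear relationship,
# close to +1 or -1 means a strong one
r_rating_discount = df['customer_rating'].corr(df['discount_%'])
r_units_revenue = df['units_sold'].corr(df['revenue'])

print(f'rating vs discount: r = {r_rating_discount:.2f}')
print(f'units vs revenue:   r = {r_units_revenue:.2f}')
```

If the number is near zero, the scatter plot will look like a shapeless cloud no matter how it is styled. That is worth knowing before spending time on colors and bubble sizes.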
Step 7 – Histogram & KDE – distribution of one variable
```python
plt.figure(figsize=(11, 5))

# Side by side
plt.subplot(1, 2, 1)
sales['customer_rating'].plot(
    kind='hist',
    bins=15,
    color='skyblue',
    edgecolor='black',
    title='Distribution of Customer Ratings'
)

plt.subplot(1, 2, 2)
sns.kdeplot(
    data=sales,
    x='customer_rating',
    fill=True,
    color='purple',
    alpha=0.4
)
plt.title('Kernel Density Estimate – Customer Ratings')

plt.tight_layout()
plt.show()
```
Step 8 – Boxplot – compare distributions across categories
```python
plt.figure(figsize=(10, 6))

sns.boxplot(
    data=sales,
    x='region',
    y='customer_rating',
    hue='product',
    palette='Set2',
    width=0.7
)

plt.title('Customer Rating Distribution by Region & Product', fontsize=13)
plt.ylabel('Customer Rating', fontsize=12)
plt.xlabel('Region', fontsize=12)
plt.legend(title='Product', bbox_to_anchor=(1.02, 1), loc='upper left')
plt.grid(True, axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
```
Step 9 – Quick reference – most common plot types in pandas
| Goal | Code example | Best when… |
|---|---|---|
| Time series trend | df['col'].plot() | Data has a datetime index |
| Compare categories | df.groupby('cat')['val'].sum().plot.bar() | Few categories (≤ 10–12) |
| Relationship between two variables | df.plot.scatter(x='a', y='b') | Looking for correlation/pattern |
| Distribution / shape | df['col'].plot.hist(bins=20) | Understanding spread and shape |
| Compare distributions | sns.boxplot(x='cat', y='num', data=df) | Several groups, want median/outliers |
| Correlation heatmap | sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm') | Many numeric variables |
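The correlation heatmap in the last row is the only entry in the table that the steps above do not demonstrate. Here is a minimal sketch on stand-in data; the column values and seed are assumptions. It passes numeric_only=True to corr() because pandas 2.x raises an error when the DataFrame contains text columns such as region.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Stand-in data (assumption): one text column plus a few numeric ones
rng = np.random.default_rng(7)
units_sold = rng.integers(5, 85, size=120)

df = pd.DataFrame({
    'region': rng.choice(['North', 'South', 'East', 'West'], size=120),
    'units_sold': units_sold,
    'revenue': units_sold * rng.uniform(450, 1250, size=120),
    'discount_%': rng.uniform(0, 25, size=120),
    'customer_rating': rng.uniform(2.8, 4.9, size=120),
})

# numeric_only=True skips the text column instead of raising an error
corr = df.corr(numeric_only=True)

fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(
    corr,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    vmin=-1, vmax=1,   # fix the color scale so 0 sits in the middle
    ax=ax
)
ax.set_title('Correlation Between Numeric Columns')
fig.tight_layout()
plt.show()
```

Fixing vmin and vmax to -1 and 1 keeps the colors comparable between different datasets. Without them, seaborn scales the colors to whatever range the data happens to have.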
Step 10 – Your turn – small practice tasks
Try these on the sales DataFrame:
- Plot monthly total revenue (hint: resample('ME'); on pandas older than 2.2 the alias is 'M')
- Create a bar plot showing average customer rating by product
- Make a scatter plot of units_sold vs revenue, colored by discount_%
- Show boxplots of revenue by region
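If the first task is hard to start, here is one possible approach, sketched on stand-in data so that it runs on its own. The series name revenue and the seed are assumptions. The try/except is only there because the month-end alias changed from 'M' to 'ME' in pandas 2.2.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data (assumption): 120 daily revenue values on a DatetimeIndex
rng = np.random.default_rng(1)
dates = pd.date_range(start='2025-01-01', periods=120, freq='D')
revenue = pd.Series(rng.uniform(4000, 62000, size=120).round(2), index=dates, name='revenue')

# Sum the daily values into one total per calendar month
try:
    monthly = revenue.resample('ME').sum()   # pandas 2.2 and later
except ValueError:
    monthly = revenue.resample('M').sum()    # older pandas versions

fig, ax = plt.subplots(figsize=(8, 5))
# Format the month labels by hand; a pandas bar plot would print full timestamps
ax.bar(monthly.index.strftime('%b %Y'), monthly.values, color='teal', edgecolor='black')
ax.set_title('Monthly Total Revenue')
ax.set_ylabel('Revenue (₹)')
ax.grid(axis='y', alpha=0.3)
fig.tight_layout()
plt.show()
```

On the chapter's DataFrame, the equivalent would be sales['revenue'].resample('ME').sum(), which works because the date was set as the index in Step 3.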
Which one would you like to try first? Or tell me what kind of plot you want to create with your own data — I’ll guide you step by step.
Where do you want to go next?
- Styling plots better (titles, legends, colors, themes)
- Subplots – multiple plots in one figure
- Saving plots (png, pdf, high resolution)
- Plotly or Seaborn advanced plots
- Common mistakes & how to avoid ugly plots
Just say the word — we’ll continue slowly and practically. 😊
