Chapter 7: Analyzing Data with pandas
Analyzing DataFrames in pandas. (We are sitting together: I'll explain slowly, show realistic examples, run small pieces of code, and tell you why we do things this way and what analysts actually do every day.)
Let’s assume we already have a DataFrame loaded and cleaned. Now the real fun begins: understanding what the data is telling us.
Starting point – our example DataFrame
Let’s work with a realistic employee / sales-like dataset:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'emp_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'name': ['Priya', 'Rahul', 'Aniket', 'Sneha', 'Vikram',
             'Meera', 'Arjun', 'Neha', 'Karan', 'Riya'],
    'dept': ['HR', 'IT', 'Sales', 'IT', 'Finance',
             'Sales', 'HR', 'Marketing', 'IT', 'Sales'],
    'city': ['Pune', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Mumbai',
             'Pune', 'Delhi', 'Bangalore', 'Chennai', 'Mumbai'],
    'join_date': ['2022-03-15', '2021-11-01', '2023-06-10', '2020-09-22',
                  '2019-05-05', '2022-01-18', '2024-02-01', '2021-08-14',
                  '2023-11-20', '2020-12-30'],
    'salary': [62000, 148000, 92000, 135000, 210000,
               88000, 68000, 105000, 142000, 98000],
    'rating': [4.1, 3.8, 4.6, 4.9, 3.2, 4.0, np.nan, 4.3, 4.7, 3.9],
    'active': [True, True, True, True, True, False, True, True, True, True]
})
df['join_date'] = pd.to_datetime(df['join_date'])
```
This is our playground for the whole session.
Phase 1 – Quick first look (things you should do in under 30 seconds)
```python
# 1. Shape & columns
print(df.shape)            # (10, 8)
print(df.columns.tolist())

# 2. First & last rows
df.head(6)
df.tail(3)

# 3. The most important single command
df.info()

# 4. Basic statistics – numeric columns
df.describe()

# 5. Statistics for non-numeric columns
df.describe(include='object')

# 6. Count of unique values per column
df.nunique()

# 7. Frequency of categories (very useful!)
df['dept'].value_counts()
df['city'].value_counts(normalize=True) * 100   # percentages
df['active'].value_counts(normalize=True) * 100
```
What we learn quickly from these:
- How many people and how many columns we have
- Any missing values? (rating has one missing value)
- Salary range: 62k – 210k
- The most common department, city, etc.
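The missing-value observation above comes from `df.info()`, but it's worth checking explicitly. Here is a minimal sketch on a trimmed-down version of our DataFrame (just three of the columns), showing the two forms analysts usually run — counts and percentages:

```python
import pandas as pd
import numpy as np

# Trimmed-down version of the chapter's DataFrame
df = pd.DataFrame({
    'name': ['Priya', 'Rahul', 'Arjun'],
    'salary': [62000, 148000, 68000],
    'rating': [4.1, 3.8, np.nan],
})

# Count of missing values per column
missing = df.isna().sum()
print(missing)   # rating shows 1, everything else 0

# Percentage of missing values per column
missing_pct = (df.isna().mean() * 100).round(1)
print(missing_pct)
```

`isna().mean()` works because a boolean mask averages to the fraction of `True` values — the same trick we use later for "percentage above a threshold".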
Phase 2 – Asking simple but powerful questions
Question 1: Who earns the most / least?
```python
# Top 5 highest salaries
df.nlargest(5, 'salary')[['name', 'salary', 'dept', 'city']]

# Bottom 3 lowest salaries
df.nsmallest(3, 'salary')[['name', 'salary', 'dept']]
```
Question 2: Average, median, min, max per group
```python
# Average salary by department
df.groupby('dept')['salary'].mean().round(0)

# Multiple statistics at once (very common pattern)
df.groupby('dept').agg(
    headcount=('emp_id', 'count'),
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
    highest=('salary', 'max'),
    lowest=('salary', 'min'),
    avg_rating=('rating', 'mean')
).round(1)
```
Very useful version with sorting:
```python
df.groupby('city').agg(
    people=('name', 'count'),
    avg_salary=('salary', 'mean'),
    best_rating=('rating', 'max')
).sort_values('avg_salary', ascending=False).round(1)
```
Question 3: How many people above / below certain thresholds?
```python
# How many earn more than 1 lakh?
print((df['salary'] > 100000).sum(), "employees")

# Percentage of people with rating >= 4.5
print((df['rating'] >= 4.5).mean() * 100, "%")

# Active employees per department
df.groupby('dept')['active'].sum()
```
Question 4: When did people join? (time-based analysis)
```python
# Year of joining
df['join_year'] = df['join_date'].dt.year
df['join_month'] = df['join_date'].dt.to_period('M')

# People joined per year
df['join_year'].value_counts().sort_index()

# People joined per year and department
df.groupby(['join_year', 'dept']).size()
```
Phase 3 – Creating useful analysis columns
```python
# 1. Salary categories
df['salary_level'] = pd.cut(
    df['salary'],
    bins=[0, 80000, 120000, 180000, np.inf],
    labels=['Junior', 'Mid', 'Senior', 'Lead'],
    include_lowest=True
)

# 2. Years of experience (as of today)
today = pd.Timestamp.now()
df['experience_years'] = ((today - df['join_date']).dt.days / 365.25).round(1)

# 3. Above or below department average
df['dept_avg_salary'] = df.groupby('dept')['salary'].transform('mean')
df['above_dept_avg'] = df['salary'] > df['dept_avg_salary']

# 4. Salary rank within department
df['salary_rank_in_dept'] = df.groupby('dept')['salary'].rank(
    ascending=False, method='min'
)
```
Now we can ask better questions:
```python
# Who is #1 in salary in their department?
df[df['salary_rank_in_dept'] == 1][['name', 'dept', 'salary', 'salary_rank_in_dept']]
```
Phase 4 – Most powerful analysis patterns analysts use daily
Pattern 1: Group + multiple aggregations + sorting
```python
summary = df.groupby('dept').agg(
    headcount=('emp_id', 'count'),
    avg_salary=('salary', 'mean'),
    median_salary=('salary', 'median'),
    highest_salary=('salary', 'max'),
    avg_rating=('rating', 'mean'),
    active_percent=('active', 'mean'),
    most_recent_join=('join_date', 'max')
).sort_values('avg_salary', ascending=False)

# Named aggregation only accepts (column, function) pairs,
# so rounding and scaling happen after the agg step
summary['avg_rating'] = summary['avg_rating'].round(2)
summary['active_percent'] = (summary['active_percent'] * 100).round(1)

print(summary)
```
Pattern 2: Crosstab (like pivot table for counting)
```python
# Department vs City – how many people
pd.crosstab(df['dept'], df['city'])

# Department vs Salary level
pd.crosstab(df['dept'], df['salary_level'], margins=True)
```
Pattern 3: Group stats with percentage share & sorting
```python
# Top cities by average salary
city_stats = df.groupby('city').agg(
    count=('name', 'count'),
    avg_salary=('salary', 'mean')
).sort_values('avg_salary', ascending=False)

city_stats['percentage'] = (city_stats['count'] / city_stats['count'].sum() * 100).round(1)
print(city_stats)
```
Pattern 4: Conditional aggregation (very powerful)
```python
# Average salary only for active employees per dept
df[df['active']].groupby('dept')['salary'].mean().round(0)

# Average rating only for people with salary > 100k
df[df['salary'] > 100000]['rating'].mean().round(2)
```
Phase 5 – Quick checklist – what good analysis almost always includes
- Shape, info, head/tail
- Missing values → df.isna().sum()
- Value counts for categories
- df.describe() for numeric columns
- Groupby + agg for at least 3–4 metrics
- Sorting — highest/lowest usually tell the story
- New columns that make interpretation easier (categories, flags, ranks, time parts)
- Crosstab or pivot_table when comparing two categories
- Percentages, not just counts
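The checklist mentions pivot_table, which we haven't shown yet: where `crosstab` counts combinations, `pivot_table` can aggregate any metric over two categories. A minimal sketch on a five-row stand-in:

```python
import pandas as pd

# Small stand-in data
df = pd.DataFrame({
    'dept':   ['HR', 'IT', 'IT', 'Sales', 'Sales'],
    'city':   ['Pune', 'Pune', 'Delhi', 'Delhi', 'Pune'],
    'salary': [62000, 148000, 135000, 92000, 88000],
})

# Average salary for every dept/city combination;
# combinations with no rows come out as NaN
table = df.pivot_table(index='dept', columns='city',
                       values='salary', aggfunc='mean')
print(table)
```

Swap `aggfunc='mean'` for `'median'`, `'max'`, or `'count'` depending on the question — with `'count'` you get the same numbers crosstab would give you.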
Your turn – small analysis exercise
Using the DataFrame above (or create a similar one):
- Find the average salary per city
- Show department with highest average rating
- Count how many people have salary > department average
- Show top 3 most recently joined employees with name, dept, join_date
- Create a table showing count + avg salary + % active per department
Try to write the code — then come back and we can compare & improve together.
Where do you want to go next?
- Deeper into pivot_table vs crosstab vs groupby
- Time-based analysis (monthly, quarterly, year-over-year)
- Visualization basics with pandas + matplotlib/seaborn
- Finding outliers and extreme values
- Correlation between numeric columns
- Conditional formatting in Jupyter (pretty tables)
Tell me which topic you want to explore next — I’ll go deep with examples. 😊
