Pandas Tutorial
Pandas – Realistic, Practical, Human-friendly Guide (2025 edition)
This tutorial is aimed at someone who is learning seriously and wants to really understand pandas — not just copy-paste.
Imagine we are sitting together with a laptop and a cup of tea or coffee, and I'm explaining step by step, the way I would explain to a friend or a junior colleague who really wants to get good at pandas.
First important sentence:
Pandas is not Excel. Pandas is not SQL. But it can do almost everything both can do — and much more — if you learn to think the pandas way.
0. Mental model – How should you picture a DataFrame?
Think of a DataFrame as:
- A table (like Excel sheet or database table)
- Every column is a Series (a named array with index)
- The index is like row labels (can be numbers 0,1,2… or dates, names, IDs…)
- The most powerful feature: you can filter, calculate, and group using the column names directly
```python
import pandas as pd
import numpy as np   # we will need it very often
```
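To make the mental model concrete, here is a tiny sketch (the column names are made up for illustration): selecting one column of a DataFrame gives you a Series, and that Series carries the row index along with it.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B'], 'marks': [70, 90]})

col = df['marks']            # a single column is a Series
print(type(col).__name__)    # Series
print(list(col.index))       # the Series keeps the row index: [0, 1]
```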
1. Creating DataFrames – the 5 ways you will actually use
Way 1 – From dictionary (cleanest & most common)
```python
students = pd.DataFrame({
    'name':   ['Ananya', 'Rahul', 'Priya', 'Arjun', 'Sneha', 'Karan'],
    'age':    [22, 19, 24, 31, 20, 27],
    'city':   ['Bangalore', 'Hyderabad', 'Pune', 'Delhi', 'Chennai', 'Mumbai'],
    'marks':  [78, 92, 65, 88, 71, 59],
    'active': [True, True, False, True, True, False]
})
print(students)
```

Output:

```text
     name  age       city  marks  active
0  Ananya   22  Bangalore     78    True
1   Rahul   19  Hyderabad     92    True
2   Priya   24       Pune     65   False
3   Arjun   31      Delhi     88    True
4   Sneha   20    Chennai     71    True
5   Karan   27     Mumbai     59   False
```
Way 2 – From list of records (when data comes from API / JSON)
```python
data = [
    {'name': 'Meera',  'age': 26, 'city': 'Kolkata',   'marks': 82},
    {'name': 'Vikram', 'age': 29, 'city': 'Jaipur',    'marks': 76},
    {'name': 'Neha',   'age': 21, 'city': 'Ahmedabad', 'marks': 94}
]
df2 = pd.DataFrame(data)
```
Way 3 – Empty DataFrame and fill later (common in loops)
```python
log = pd.DataFrame(columns=['timestamp', 'user', 'action', 'value'])

# later...
log.loc[len(log)] = ['2025-02-07 14:30', 'u123', 'login', 1]
log.loc[len(log)] = ['2025-02-07 14:32', 'u456', 'purchase', 1499]
```
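One honest caveat about this pattern: appending with `log.loc[len(log)]` gets slow on large loops, because each append copies data. The usual faster alternative is to collect plain dicts in a list and build the DataFrame once at the end. A minimal sketch (the loop body here is just a stand-in for your real work):

```python
import pandas as pd

rows = []
for i in range(3):                      # imagine this is your real loop
    rows.append({'user': f'u{i}', 'value': i * 10})

log = pd.DataFrame(rows)                # build the DataFrame once at the end
print(log.shape)                        # (3, 2)
```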
2. Looking at data – the commands you use 50 times a day
```python
students.head(3)                       # first 3 rows
students.tail(2)                       # last 2 rows
students.shape                         # (rows, columns) → (6, 5)
students.columns                       # Index(['name', 'age', 'city', 'marks', 'active'], dtype='object')
students.dtypes                        # data type of each column
students.info()                        # very informative!
students.describe()                    # numeric columns only
students.describe(include='object')    # for string columns
students['city'].value_counts()        # frequency count – super useful
```
3. Selecting data – the four main patterns
| Goal | Syntax example | Returns |
|---|---|---|
| One column | students['marks'] | Series |
| Multiple columns | students[['name','marks','city']] | DataFrame |
| Rows by position (0-based) | students.iloc[1:4] | DataFrame |
| Rows by label (if index is custom) | students.loc['A001'] | Series (one row; rarer in practice) |
| Rows by condition | students[students['marks'] >= 80] | DataFrame |
| Rows + chosen columns | students.loc[students['age'] <= 22, ['name','city']] | DataFrame |
Most common real-life filters:
```python
# Students who passed (>= 70) and are active
students[(students['marks'] >= 70) & (students['active'])]

# Students from South cities or with very high marks
students[
    students['city'].isin(['Bangalore', 'Hyderabad', 'Chennai'])
    | (students['marks'] >= 90)
]

# Name starts with A or a
students[students['name'].str.upper().str.startswith('A')]
```
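Some people find the same filters more readable with `DataFrame.query`, which lets you write the condition as a string. A small sketch on a toy frame (the data is invented just for this demo):

```python
import pandas as pd

students = pd.DataFrame({
    'name':   ['Ananya', 'Rahul', 'Priya'],
    'marks':  [78, 92, 65],
    'active': [True, True, False],
})

# Same result as students[(students['marks'] >= 70) & students['active']]
passed = students.query('marks >= 70 and active')
print(passed['name'].tolist())   # ['Ananya', 'Rahul']
```

Bracket filtering and `query` are equivalent here; `query` just saves you from repeating `students[...]` for every condition.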
4. Creating / Modifying columns (this is where pandas shines)
```python
# Simple calculation
students['marks_out_of_100'] = students['marks']         # already is
students['marks_out_of_120'] = students['marks'] * 1.2

# Boolean / categorical column
students['passed'] = students['marks'] >= 70

# Using numpy where (like Excel IF)
students['grade'] = np.where(students['marks'] >= 90, 'A+',
                    np.where(students['marks'] >= 80, 'A',
                    np.where(students['marks'] >= 70, 'B', 'C')))

# Using a function (more complex logic)
def performance(m):
    if m >= 90: return 'Excellent'
    if m >= 75: return 'Good'
    if m >= 60: return 'Average'
    return 'Needs improvement'

students['performance'] = students['marks'].apply(performance)
```
5. Sorting
```python
# Single column, descending
students.sort_values('marks', ascending=False)

# Multiple columns
students.sort_values(['city', 'marks'], ascending=[True, False])

# Keep the original order of equal values (stable sort)
students.sort_values('city', kind='stable')
```
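When all you actually want is the "top N" rows, `nlargest` / `nsmallest` are handy shortcuts for sort-then-head. A sketch with toy data:

```python
import pandas as pd

students = pd.DataFrame({
    'name':  ['Ananya', 'Rahul', 'Priya', 'Arjun'],
    'marks': [78, 92, 65, 88],
})

# Same as students.sort_values('marks', ascending=False).head(2)
top2 = students.nlargest(2, 'marks')
print(top2['name'].tolist())   # ['Rahul', 'Arjun']
```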
6. Missing values – realistic handling
```python
# Create some missing data for demonstration
students.loc[2, 'marks'] = np.nan
students.loc[5, 'age'] = np.nan

# How many missing?
students.isna().sum()

# Common realistic choices (pick ONE strategy per column, not all three):

# 1. Drop rows (only if very few are missing)
students.dropna(subset=['marks'])

# 2. Fill with a fixed value (careful!)
students['marks'] = students['marks'].fillna(50)                    # dangerous
students['age'] = students['age'].fillna(students['age'].median())

# 3. Fill with the group average (advanced but very common)
students['marks'] = students.groupby('city')['marks'] \
                            .transform(lambda x: x.fillna(x.mean()))
```
7. GroupBy – the heart of data analysis
```python
# Simplest
students.groupby('city')['marks'].mean()

# Multiple aggregations – very powerful
students.groupby('city').agg(
    count_students = ('name', 'count'),
    avg_marks      = ('marks', 'mean'),
    max_marks      = ('marks', 'max'),
    passed_percent = ('passed', lambda s: s.mean() * 100)
).round(1)

# Group by two columns
students.groupby(['city', 'grade'])['marks'].count()
```
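A quick taste of a related tool: when you group by two keys, the result often reads better as a `pivot_table` (one key becomes the rows, the other the columns). A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    'city':  ['Pune', 'Pune', 'Delhi', 'Delhi'],
    'grade': ['A', 'B', 'A', 'A'],
    'marks': [80, 70, 90, 85],
})

# Rows = city, columns = grade, cell = mean marks
table = df.pivot_table(index='city', columns='grade',
                       values='marks', aggfunc='mean')
print(table)
```

A missing city/grade combination shows up as NaN in the table, which makes gaps in the data easy to spot.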
Very common pattern – rank inside group
```python
students['rank_in_city'] = students.groupby('city')['marks'] \
                                   .rank(ascending=False, method='min')
```
8. Merging / Joining – like VLOOKUP ×100
```python
departments = pd.DataFrame({
    'city':         ['Bangalore', 'Hyderabad', 'Pune', 'Delhi'],
    'region':       ['South', 'South', 'West', 'North'],
    'avg_salary_k': [12.4, 11.1, 9.8, 14.2]
})

# Most common: left join
pd.merge(
    students,
    departments,
    on='city',
    how='left'
)
```
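After a left join, it is worth checking which rows actually found a match — exactly the kind of audit you would do after a VLOOKUP. `indicator=True` adds a `_merge` column for this. A small sketch with toy frames:

```python
import pandas as pd

students = pd.DataFrame({'name': ['Ananya', 'Karan'],
                         'city': ['Bangalore', 'Mumbai']})
departments = pd.DataFrame({'city': ['Bangalore', 'Delhi'],
                            'region': ['South', 'North']})

merged = pd.merge(students, departments, on='city',
                  how='left', indicator=True)

# '_merge' is 'both' for matched rows, 'left_only' for unmatched ones
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['name'].tolist())   # ['Karan'] – Mumbai has no department row
```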
9. Dates & Time – quick survival kit
```python
# Convert string to datetime
orders = pd.DataFrame({
    'order_date': ['2025-01-15', '2025-02-03', '2024-12-20', '2025-03-11'],
    'amount':     [3400, 1200, 8900, 4500]
})
orders['order_date'] = pd.to_datetime(orders['order_date'])

# Extract parts
orders['month']      = orders['order_date'].dt.to_period('M')
orders['day_name']   = orders['order_date'].dt.day_name()
orders['quarter']    = orders['order_date'].dt.quarter
orders['is_weekend'] = orders['order_date'].dt.dayofweek >= 5
```
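Once the dates are real datetimes, monthly totals are one groupby away — this is the bread and butter of sales-style reporting. A sketch on invented orders:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_date': pd.to_datetime(['2025-01-15', '2025-01-20', '2025-02-03']),
    'amount':     [3400, 600, 1200],
})

# Group by calendar month and sum
monthly = orders.groupby(orders['order_date'].dt.to_period('M'))['amount'].sum()
print(monthly)
# 2025-01    4000
# 2025-02    1200
```

If `order_date` were the index, `orders.resample('ME')['amount'].sum()` would give the same kind of result.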
Quick Reference – Commands you will type hundreds of times
```python
df.head()
df.tail()
df.shape
df.columns
df.dtypes
df['col']
df[['col1', 'col2']]
df[df['age'] > 25]
df.loc[df['marks'] >= 80, ['name', 'marks']]
df.sort_values('marks', ascending=False)
df['new_col'] = df['old'] * 1.1
df['category'] = np.where(condition, 'Yes', 'No')
df.groupby('city')['sales'].sum()
df.groupby('city').agg({'sales': ['sum', 'mean'], 'order': ['count']})
pd.merge(df1, df2, on='id', how='left')
df['date'] = pd.to_datetime(df['date_str'])
df['month'] = df['date'].dt.to_period('M')
df.isna().sum()
df.fillna(0)   # or df.dropna()
```
What would you like to do next?
- Practice cleaning a messy real-world CSV together
- Go deeper into GroupBy + pivot_table + crosstab
- Work with dates & time-series (very important in 2025)
- Learn merging & concatenating multiple files/tables
- Write beautiful summary reports (style + formatting)
- Common mistakes & how to debug pandas code
Just tell me which path you want to walk next — I’ll go as deep as you want! 😄
