Pandas Home
This page is written as if we are sitting together: I'm showing you my screen, explaining slowly, giving real examples, and telling you why we do things this way (not just the code).
Imagine this is our first serious Pandas session together. The goal is for you to really understand, not just copy code.
Pandas – Realistic, Practical, Human-style Full Tutorial
(2025–2026 way of thinking & writing code)
0. First important mindset change
Pandas is not Excel with Python syntax. It is a completely different way of thinking:
- Columns are the most important thing (you almost always work with whole columns)
- Index is like row labels — sometimes numbers, sometimes dates, sometimes names
- Most magic happens with vectorized operations (doing the same thing to millions of rows instantly)
So we almost never write loops like this:
```python
# The slow, row-by-row way (avoid this)
for i in range(len(df)):
    df.iloc[i, 3] = df.iloc[i, 2] * 1.1
```
We write:
```python
# The vectorized way: one operation on the whole column
df['bonus'] = df['salary'] * 1.1
```
That difference is huge.
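To make "huge" concrete, here is a rough, machine-dependent timing sketch. The exact numbers will vary on your machine, and `bonus_loop` / `bonus_vec` are just illustrative names; the point is the order-of-magnitude gap between a Python-level loop and one column-wide operation.

```python
import time

import numpy as np
import pandas as pd

# A throwaway frame with 100,000 random salaries
df = pd.DataFrame({'salary': np.random.randint(30000, 200000, 100_000)})

# Loop version: one Python-level operation per row
start = time.perf_counter()
bonus_loop = [df['salary'].iloc[i] * 1.1 for i in range(len(df))]
loop_time = time.perf_counter() - start

# Vectorized version: one operation on the whole column
start = time.perf_counter()
bonus_vec = df['salary'] * 1.1
vec_time = time.perf_counter() - start

# Same result, wildly different speed
print(f"loop: {loop_time:.3f}s   vectorized: {vec_time:.5f}s")
```

On a typical laptop the vectorized version is hundreds of times faster, because the work happens in optimized C inside NumPy instead of the Python interpreter.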
1. Let’s create our first realistic table together
```python
import pandas as pd
import numpy as np

# === Real-world-ish employee data ===
employees = pd.DataFrame({
    'emp_id':     [101, 102, 103, 104, 105, 106, 107],
    'name':       ['Priya', 'Rahul', 'Aniket', 'Sneha', 'Vikram', 'Meera', 'Arjun'],
    'department': ['HR', 'IT', 'Sales', 'IT', 'Finance', 'Sales', 'HR'],
    'city':       ['Pune', 'Hyderabad', 'Bangalore', 'Hyderabad', 'Mumbai', 'Pune', 'Delhi'],
    'join_date':  ['2022-03-15', '2021-11-01', '2023-06-10', '2020-09-22',
                   '2019-05-05', '2022-01-18', '2024-02-01'],
    'salary':     [62000, 148000, 92000, 135000, 210000, 88000, 68000],
    'active':     [True, True, True, True, True, False, True],
    'rating':     [4.1, 3.8, 4.6, 4.9, 3.2, 4.0, np.nan]
})

# Make join_date a real datetime
employees['join_date'] = pd.to_datetime(employees['join_date'])
print(employees)
```
Let’s look at it:
```text
   emp_id    name department       city  join_date  salary  active  rating
0     101   Priya         HR       Pune 2022-03-15   62000    True     4.1
1     102   Rahul         IT  Hyderabad 2021-11-01  148000    True     3.8
2     103  Aniket      Sales  Bangalore 2023-06-10   92000    True     4.6
3     104   Sneha         IT  Hyderabad 2020-09-22  135000    True     4.9
4     105  Vikram    Finance     Mumbai 2019-05-05  210000    True     3.2
5     106   Meera      Sales       Pune 2022-01-18   88000   False     4.0
6     107   Arjun         HR      Delhi 2024-02-01   68000    True     NaN
```
2. First things we always do when we open a new table
```python
# 1. Shape – how big is it?
print(employees.shape)   # (7, 8)

# 2. Column names and types
print(employees.dtypes)

# 3. Quick overview (this is the most useful command)
employees.info()

# 4. Basic statistics for numbers
employees.describe()

# 5. For categorical/string columns
employees.describe(include='object')

# 6. How many unique values per column?
employees.nunique()

# 7. Count of each value in a column (very common!)
employees['department'].value_counts()
employees['city'].value_counts(normalize=True) * 100   # percentages
```
3. Selecting data – the main patterns you will use forever

| What you want | Most common way to write it | Returns |
|---|---|---|
| One column | `employees['salary']` | Series |
| Multiple columns | `employees[['name', 'salary', 'city']]` | DataFrame |
| Rows by position (0, 1, 2, …) | `employees.iloc[0:3]` | DataFrame |
| Rows by condition | `employees[employees['salary'] > 100000]` | DataFrame |
| Rows + specific columns | `employees.loc[employees['salary'] > 100000, ['name', 'city']]` | DataFrame |
| Using index labels (rare at the beginning) | `employees.loc[101]` (if the index were `emp_id`) | Series |
Realistic examples you will write every day:
```python
# 1. All IT department employees
employees[employees['department'] == 'IT']

# 2. Salary > 1 lakh and active
employees[(employees['salary'] > 100000) & (employees['active'])]

# 3. People in Pune or Hyderabad
employees[employees['city'].isin(['Pune', 'Hyderabad'])]

# 4. Joined in year 2022 or later
employees[employees['join_date'].dt.year >= 2022]

# 5. Name contains 'a' or 'A' (case insensitive)
employees[employees['name'].str.contains('a', case=False)]

# 6. Highest salary first, only name + salary + dept
employees[['name', 'salary', 'department']].sort_values('salary', ascending=False)
```
4. Creating & changing columns (this is where pandas becomes powerful)
```python
# 1. Simple math
employees['annual_salary'] = employees['salary'] * 12

# 2. Boolean flags
employees['is_senior'] = employees['salary'] >= 120000
employees['high_performer'] = employees['rating'] >= 4.5

# 3. Using np.where (like Excel IF)
employees['salary_level'] = np.where(employees['salary'] >= 150000, 'High',
                            np.where(employees['salary'] >= 90000, 'Medium', 'Junior'))

# 4. Using apply + custom function (more complex logic)
def experience_level(date):
    years = (pd.Timestamp.now() - date).days / 365.25
    if years >= 5:
        return 'Senior'
    if years >= 2:
        return 'Mid'
    return 'Junior'

employees['exp_level'] = employees['join_date'].apply(experience_level)

# 5. Using groupby + transform (very powerful pattern)
employees['avg_salary_in_dept'] = employees.groupby('department')['salary'].transform('mean')
employees['above_dept_avg'] = employees['salary'] > employees['avg_salary_in_dept']
```
5. Missing values – what people actually do
```python
# Check missing
employees.isna().sum()

# Realistic options – pick ONE; they are alternatives, not sequential steps:

# Option A: Fill with a fixed value (common but dangerous)
employees['rating'] = employees['rating'].fillna(3.0)

# Option B: Fill with the median of each group
employees['rating'] = employees.groupby('department')['rating'].transform(
    lambda x: x.fillna(x.median()))

# Option C: Drop rows only if the column is critical
employees_clean = employees.dropna(subset=['rating'])

# Option D: Mark as a missing category (convert to object first,
# so we are not mixing strings into a float column)
employees['rating_category'] = employees['rating'].astype('object').fillna('Missing')
```
6. GroupBy – the real heart of analysis
```python
# Simplest
employees.groupby('department')['salary'].mean().round(0)

# Most useful style in 2025: named aggregation
summary = employees.groupby('department').agg(
    headcount      = ('emp_id', 'count'),
    avg_salary     = ('salary', 'mean'),
    median_salary  = ('salary', 'median'),
    highest_salary = ('salary', 'max'),
    avg_rating     = ('rating', 'mean'),
    active_count   = ('active', 'sum'),
    active_percent = ('active', 'mean')
).round(2)

print(summary)
```
Very common advanced patterns:
```python
# Rank inside each department
employees['salary_rank_in_dept'] = employees.groupby('department')['salary'].rank(ascending=False)

# Cumulative salary per department (ordered by join date)
employees = employees.sort_values('join_date')
employees['cumulative_salary'] = employees.groupby('department')['salary'].cumsum()
```
7. Merging – connecting tables (very common task)
Let’s create a second small table:
```python
departments = pd.DataFrame({
    'department': ['HR', 'IT', 'Sales', 'Finance', 'Marketing'],
    'manager':    ['Neha', 'Suresh', 'Rohan', 'Kavita', 'Amit'],
    'budget_m':   [4.2, 18.7, 9.1, 5.6, 3.8]
})

# Most common join: left join
full_data = pd.merge(
    employees,
    departments,
    on='department',
    how='left'   # keep all employees, even if dept not in departments table
)
```
8. Quick survival reference – 80% of your daily pandas life
```python
# Inspecting
df.head(8)
df.tail()
df.shape
df.columns
df.dtypes
df.info()
df.describe()

# Counting
df['col'].value_counts()
df['col'].nunique()

# Filtering and sorting
df[df['salary'] > 100000]
df[(df['city'] == 'Pune') & (df['active'])]
df[['name', 'salary', 'city']].sort_values('salary', ascending=False)

# New columns
df['new'] = df['old'] * 1.1
df['flag'] = df['col'] >= 100

# Grouping
df.groupby('city')['salary'].mean()
df.groupby('dept').agg({'salary': ['mean', 'max'], 'emp_id': 'count'})

# Joining
pd.merge(df1, df2, on='id', how='left')

# Dates
df['date'] = pd.to_datetime(df['date_str'])
df['month'] = df['date'].dt.to_period('M')
df['year'] = df['date'].dt.year
```
Now tell me — what would you like to do next?
- Clean a messy real CSV together (very common task)
- Go much deeper into GroupBy + pivot_table + crosstab
- Work seriously with dates & time series (very important)
- Merge & concatenate multiple files / tables
- Create beautiful summary tables (formatting, rounding, conditional styling)
- Talk about common mistakes & debugging pandas code
Pick one direction — I’ll go as deep as you want with examples. 😊
