Pandas Tutorial – From Zero to Confidently Useful
0. What is Pandas, really?
Pandas = Excel + SQL + Python superpowers in one library
Main things you use every day:
- DataFrame → like a table / spreadsheet
- Series → like one column (or one row sometimes)
Most common import line in the world:
```python
import pandas as pd
```
(Everyone uses pd as the alias — just do it)
1. How to create a DataFrame (4 most common ways)
```python
# Way 1: From dictionary (most common when you have clean data)
df = pd.DataFrame({
    'name': ['Ananya', 'Rahul', 'Priya', 'Vikram', 'Sneha'],
    'age': [23, 31, 19, 45, 28],
    'city': ['Bangalore', 'Hyderabad', 'Pune', 'Delhi', 'Chennai'],
    'salary': [72000, 145000, 0, 280000, 92000]
})

# Way 2: From list of dictionaries
data = [
    {'name': 'Ananya', 'age': 23, 'city': 'Bangalore'},
    {'name': 'Rahul', 'age': 31, 'city': 'Hyderabad'},
    {'name': 'Priya', 'age': 19, 'city': 'Pune'}
]
df = pd.DataFrame(data)

# Way 3: From CSV file (very common)
# df = pd.read_csv('sales_data.csv')

# Way 4: Empty and then fill
df = pd.DataFrame(columns=['id', 'product', 'price'])
df.loc[0] = [101, 'Laptop', 74999]
```
Quick check:
```python
df.head()      # first 5 rows
df.tail(3)     # last 3 rows
df.shape       # (rows, columns)
df.columns     # column names (an Index object)
df.info()      # data types + non-null counts (reveals missing values)
df.describe()  # numeric summary (count, mean, std, min, max...)
```
2. Selecting data – the 4 most important ways
| You want | Syntax | Result type |
|---|---|---|
| One column | `df['salary']` | Series |
| Multiple columns | `df[['name', 'age']]` | DataFrame |
| Rows by position | `df.iloc[0:3]` | DataFrame |
| Rows by condition | `df[df['age'] > 25]` | DataFrame |
| Rows + specific columns | `df.loc[df['age'] > 25, ['name', 'city']]` | DataFrame |
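To see the Series-vs-DataFrame distinction in action, here is a minimal runnable sketch (it rebuilds the sample `df` from section 1 so it works standalone):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ananya', 'Rahul', 'Priya', 'Vikram', 'Sneha'],
    'age': [23, 31, 19, 45, 28],
    'city': ['Bangalore', 'Hyderabad', 'Pune', 'Delhi', 'Chennai'],
    'salary': [72000, 145000, 0, 280000, 92000]
})

# Single brackets -> Series; double brackets -> DataFrame
print(type(df['salary']).__name__)         # Series
print(type(df[['name', 'age']]).__name__)  # DataFrame

# Rows by position vs rows by condition
print(df.iloc[0:3].shape)         # (3, 4)
print(df[df['age'] > 25].shape)   # (3, 4) -> Rahul, Vikram, Sneha

# Condition + specific columns in one .loc call
print(df.loc[df['age'] > 25, ['name', 'city']])
```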
Very common patterns you will write 1000 times:
```python
# People above 30 in Bangalore or Hyderabad
df[(df['age'] > 30) & (df['city'].isin(['Bangalore', 'Hyderabad']))]

# Salary above average
df[df['salary'] > df['salary'].mean()]

# Only name and salary, sorted by salary descending
df[['name', 'salary']].sort_values('salary', ascending=False)
```
3. Filtering – the cheat sheet people actually use
```python
# Equal / not equal
df[df['city'] == 'Pune']
df[df['city'] != 'Delhi']

# Greater / less
df[df['age'] >= 30]

# Multiple conditions → use & | ~ (not), with parentheses around each condition
df[(df['age'] >= 25) & (df['salary'] > 80000)]

# OR condition
df[(df['city'] == 'Pune') | (df['city'] == 'Chennai')]

# isin (very useful!)
df[df['city'].isin(['Bangalore', 'Hyderabad', 'Chennai'])]

# str methods (super powerful)
df[df['name'].str.startswith('A')]
df[df['city'].str.contains('bad', case=False)]  # Hyderabad, Ahmedabad...
```
4. Adding / Changing columns
```python
# New column (simple calculation)
df['age_next_year'] = df['age'] + 1

# Using conditions
df['is_senior'] = df['age'] >= 35

# Using np.where (like Excel IF)
import numpy as np
df['category'] = np.where(df['salary'] >= 150000, 'High',
                 np.where(df['salary'] >= 70000, 'Medium', 'Low'))

# Apply a custom function
def salary_level(s):
    if s > 200000:
        return 'Very High'
    if s > 100000:
        return 'High'
    return 'Normal'

df['salary_level'] = df['salary'].apply(salary_level)
```
5. Missing values – how everyone actually handles them
```python
# See missing count per column
df.isna().sum()

# Drop rows with any missing value
df.dropna()

# Fill missing values
df['salary'] = df['salary'].fillna(0)             # dangerous but common
df['city'] = df['city'].fillna('Unknown')
df['age'] = df['age'].fillna(df['age'].median())  # smart

# Fill forward / backward (good for time series)
# Note: fillna(method='ffill') is deprecated — use .ffill() / .bfill()
df['value'] = df['value'].ffill()
```
6. GroupBy – the real power of pandas
```python
# Average salary per city
df.groupby('city')['salary'].mean()

# Multiple aggregations
df.groupby('city').agg({
    'salary': ['mean', 'max', 'count'],
    'age': 'mean'
})

# Nicer looking result (named aggregation)
df.groupby('city').agg(
    avg_salary=('salary', 'mean'),
    max_salary=('salary', 'max'),
    people=('name', 'count')
).round(1)
```
Popular groupby patterns:
```python
# Rank within group
df['salary_rank_in_city'] = df.groupby('city')['salary'].rank(ascending=False)

# Cumulative sum within group
df['cumulative_salary'] = df.groupby('city')['salary'].cumsum()
```
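One more pattern in the same family: `transform` broadcasts a group aggregate back to every original row, which lets you compare each row to its own group. A minimal sketch (the toy salaries here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Pune', 'Pune', 'Delhi', 'Delhi'],
    'salary': [50000, 70000, 90000, 110000]
})

# transform returns one value per original row, aligned to df's index
df['city_avg'] = df.groupby('city')['salary'].transform('mean')
df['above_city_avg'] = df['salary'] > df['city_avg']
print(df)
```

Unlike `.mean()` after a groupby (which collapses to one row per group), `transform('mean')` keeps the original shape, so the result can be assigned straight back as a new column.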
7. Merging / Joining tables (like VLOOKUP / SQL JOIN)
```python
# Left join (most common)
merged = pd.merge(
    left=employees,
    right=departments,
    how='left',
    on='dept_id'
)

# Different column names on each side
pd.merge(employees, departments, left_on='department_id', right_on='id')

# Multiple keys
pd.merge(sales, targets, on=['region', 'month'], how='left')
```
Quick mnemonic:
| how= | Keeps | Like SQL |
|---|---|---|
| left | all from left | LEFT JOIN |
| right | all from right | RIGHT JOIN |
| inner | only matching | INNER JOIN |
| outer | everything + NaN where missing | FULL OUTER |
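To make the table concrete, here is a tiny runnable sketch (the `left`/`right` tables and their values are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['A', 'B', 'C']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [85, 92, 77]})

row_counts = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(left, right, on='id', how=how)
    row_counts[how] = len(merged)
    print(how, '->', len(merged), 'rows')
# left -> 3, right -> 3, inner -> 2, outer -> 4
```

Only ids 2 and 3 appear in both tables, so `inner` keeps 2 rows, while `outer` keeps all 4 ids and fills the gaps with NaN.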
8. Quick real-world mini-project example
```python
import pandas as pd

# Suppose you have sales data
df = pd.read_csv('sales.csv')

# Clean & prepare
df['order_date'] = pd.to_datetime(df['order_date'])
df['month'] = df['order_date'].dt.to_period('M')
df['profit'] = df['sales'] - df['cost']

# Analysis
summary = df.groupby(['region', 'month']).agg({
    'sales': 'sum',
    'profit': 'sum',
    'order_id': 'count'
}).rename(columns={'order_id': 'orders'})

# Top 5 profitable months overall
best_months = df.groupby('month')['profit'].sum().nlargest(5)

# Profit by region
profitable_regions = df.groupby('region')['profit'].sum().sort_values(ascending=False)

print("Top months by profit:")
print(best_months)
print("\nProfit by region:")
print(profitable_regions)
```
Quick Reference Table – 80/20 Pandas
| Task | Code example |
|---|---|
| Read CSV | `pd.read_csv('file.csv')` |
| Filter rows | `df[df['age'] > 30]` |
| Select columns | `df[['name', 'salary']]` |
| Sort | `df.sort_values('salary', ascending=False)` |
| New column | `df['bonus'] = df['salary'] * 0.1` |
| Group & aggregate | `df.groupby('city')['salary'].mean()` |
| Merge tables | `pd.merge(df1, df2, on='id', how='left')` |
| Handle missing | `df.fillna(0)` or `df.dropna()` |
| String contains | `df[df['name'].str.contains('Rah')]` |
| Date part | `df['month'] = df['date'].dt.month` |
What would you like to practice next?
- Reading & cleaning real messy CSV
- GroupBy + Pivot tables in detail
- Working with dates & time series
- Merging multiple files together
- Creating beautiful summary tables
Just tell me which direction you want to go deeper! 😄
