Chapter 1: Pandas Introduction
Pandas Introduction – What it really is and why people love it
Let me start with the most honest sentence:
Pandas is the library that made Python one of the most popular languages in the world for data work.
Before pandas caught on (around 2010–2012), people mostly used Excel, R, MATLAB, SPSS, or wrote painful loops in pure Python/NumPy. Pandas changed that by bringing Excel-like thinking, database-like power, and Python flexibility into one place.
What is Pandas, really? (non-technical explanation)
Think of pandas as:
- A super-smart Excel inside Python
- A very fast SQL table you can manipulate with Python code
- A place where whole columns can be calculated instantly (no loops needed)
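To make the "no loops needed" point concrete, here is a minimal sketch that compares a plain Python loop with a whole-column calculation. The `orders` table and its numbers are invented for this illustration only.

```python
import pandas as pd

# Hypothetical prices and quantities, made up for this illustration
orders = pd.DataFrame({
    'price': [120.0, 80.0, 45.5],
    'quantity': [2, 5, 10],
})

# Loop version: compute one row at a time
totals_loop = []
for price, quantity in zip(orders['price'], orders['quantity']):
    totals_loop.append(price * quantity)

# Column version: one expression works on the whole column at once
orders['total'] = orders['price'] * orders['quantity']

print(orders)
print(orders['total'].tolist() == totals_loop)  # True
```

Both versions give the same numbers; the column version is shorter and is usually much faster on large tables.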
The two most important objects in pandas:
| Name | Analogy | What it is |
|---|---|---|
| Series | One column in Excel | A single column of data + a label/index |
| DataFrame | Entire Excel sheet / table | Many Series side by side (with same index) |
Almost everything you do in pandas is about DataFrames.
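A small sketch of how the two objects relate, assuming a tiny made-up table with index labels 'a', 'b', 'c': taking one column out of a DataFrame gives you a Series that carries the same index.

```python
import pandas as pd

# A Series: one column of values plus an index (here the labels 'a', 'b', 'c')
ages = pd.Series([21, 19, 24], index=['a', 'b', 'c'], name='age')

# A DataFrame: several columns that share one index
people = pd.DataFrame(
    {'age': [21, 19, 24], 'marks': [82, 91, 68]},
    index=['a', 'b', 'c'],
)

# Pulling one column out of a DataFrame gives back a Series
age_column = people['age']
print(type(people).__name__)      # DataFrame
print(type(age_column).__name__)  # Series
print(age_column.tolist())        # [21, 19, 24]
print(list(age_column.index))     # ['a', 'b', 'c']
```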
1. First code – Let’s create our very first table
```python
# Almost everyone starts pandas like this
import pandas as pd
```
Now let’s make a small table of students (most common first example):
```python
# Way #1 – from a dictionary (very clean & common)
students = pd.DataFrame({
    'name':   ['Priya', 'Rahul', 'Ananya', 'Karan', 'Sneha'],
    'age':    [21, 19, 24, 22, 20],
    'city':   ['Pune', 'Hyderabad', 'Bangalore', 'Mumbai', 'Chennai'],
    'marks':  [82, 91, 68, 76, 89],
    'passed': [True, True, False, True, True]
})

# Let's see it!
print(students)
```
You will see something like this:
```
     name  age       city  marks  passed
0   Priya   21       Pune     82    True
1   Rahul   19  Hyderabad     91    True
2  Ananya   24  Bangalore     68   False
3   Karan   22     Mumbai     76    True
4   Sneha   20    Chennai     89    True
```
This is a DataFrame — our main working object.
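The dictionary style is labelled "Way #1", so here is a sketch of one other common way, offered as an illustration rather than the only alternative: building the table row by row from a list of dictionaries. Only two students and three columns are used to keep it short.

```python
import pandas as pd

# Same kind of data, written one row at a time (a list of dictionaries)
rows = [
    {'name': 'Priya', 'age': 21, 'marks': 82},
    {'name': 'Rahul', 'age': 19, 'marks': 91},
]
from_rows = pd.DataFrame(rows)

# The column-by-column dictionary style
from_columns = pd.DataFrame({
    'name':  ['Priya', 'Rahul'],
    'age':   [21, 19],
    'marks': [82, 91],
})

print(from_rows)
print(from_rows.to_dict('list') == from_columns.to_dict('list'))  # True
```

The row-by-row style tends to be handy when records arrive one at a time, for example from an API or a loop.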
2. The most important first commands you should know (day 1)
When you open any new dataset, good analysts always run these:

```python
# How big is the table?
print(students.shape)          # (5 rows, 5 columns)

# What are the column names?
print(students.columns)
# → Index(['name', 'age', 'city', 'marks', 'passed'], dtype='object')

# What kind of data is in each column?
print(students.dtypes)

# Very useful summary (types + missing values)
students.info()

# Quick statistics for numbers
students.describe()

# How many times each city appears?
students['city'].value_counts()

# How many unique cities?
students['city'].nunique()     # → 5
```
This handful of commands is what almost every pandas user runs first.
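Two small follow-ups to the inspection commands, sketched on a trimmed copy of the students table: `head(n)` shows only the first `n` rows, and because `describe()` returns a DataFrame, single statistics can be looked up from it by label.

```python
import pandas as pd

students = pd.DataFrame({
    'name':  ['Priya', 'Rahul', 'Ananya', 'Karan', 'Sneha'],
    'marks': [82, 91, 68, 76, 89],
})

# head(n) shows only the first n rows, handy for big tables
print(students.head(2))

# describe() returns a DataFrame, so single statistics can be looked up
summary = students.describe()
print(summary.loc['mean', 'marks'])  # 81.2
print(summary.loc['max', 'marks'])   # 91.0
```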
3. Selecting data – the 4 most important ways (you will use these forever)
| What you want | How most people write it | What you get |
|---|---|---|
| One column | `students['marks']` | Series |
| Several columns | `students[['name', 'marks', 'city']]` | DataFrame |
| Rows by position (0,1,2…) | `students.iloc[0:3]` | DataFrame |
| Rows that match a condition | `students[students['marks'] > 80]` | DataFrame |
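A quick way to check the "What you get" column for yourself is to run each selection on the students table and look at the type and size of the result. This is only a verification sketch using three of the columns.

```python
import pandas as pd

students = pd.DataFrame({
    'name':  ['Priya', 'Rahul', 'Ananya', 'Karan', 'Sneha'],
    'city':  ['Pune', 'Hyderabad', 'Bangalore', 'Mumbai', 'Chennai'],
    'marks': [82, 91, 68, 76, 89],
})

one_column = students['marks']
several_columns = students[['name', 'marks', 'city']]
first_three = students.iloc[0:3]             # rows at positions 0, 1, 2
high_marks = students[students['marks'] > 80]

print(type(one_column).__name__)       # Series
print(type(several_columns).__name__)  # DataFrame
print(len(first_three))                # 3
print(high_marks['name'].tolist())     # ['Priya', 'Rahul', 'Sneha']
```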
Real-life examples (copy these patterns):
```python
# Only names and marks
students[['name', 'marks']]

# Only students who passed
students[students['passed'] == True]

# Students with marks 80 or more
students[students['marks'] >= 80]

# Students from South cities
students[students['city'].isin(['Bangalore', 'Chennai', 'Hyderabad'])]

# Students who are 22 or younger AND passed
students[(students['age'] <= 22) & (students['passed'])]
```
Notice: `&` means AND, `|` means OR, `~` means NOT. Wrap each condition in parentheses when you combine them.
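The examples above only use AND, so here is a short sketch of OR and NOT on the same students data. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `<` and `>=` in Python.

```python
import pandas as pd

students = pd.DataFrame({
    'name':   ['Priya', 'Rahul', 'Ananya', 'Karan', 'Sneha'],
    'age':    [21, 19, 24, 22, 20],
    'marks':  [82, 91, 68, 76, 89],
    'passed': [True, True, False, True, True],
})

# OR: younger than 21, or marks of at least 90
young_or_top = students[(students['age'] < 21) | (students['marks'] >= 90)]
print(young_or_top['name'].tolist())  # ['Rahul', 'Sneha']

# NOT: everyone who did not pass
not_passed = students[~students['passed']]
print(not_passed['name'].tolist())    # ['Ananya']
```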
4. Creating new columns – this is where pandas feels magical
```python
# Simple calculation
students['marks_out_of_100'] = students['marks']   # already is
students['marks_next_term'] = students['marks'] + 8

# New boolean column
students['excellent'] = students['marks'] >= 90

# Using conditions (like Excel IF)
import numpy as np

students['grade'] = np.where(students['marks'] >= 90, 'A',
                    np.where(students['marks'] >= 80, 'B',
                    np.where(students['marks'] >= 70, 'C', 'D')))

# Percentage of max marks
students['percent'] = (students['marks'] / 100 * 100).round(1)
```
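Nested `np.where` calls get hard to read once there are more than two or three levels. One possible alternative, shown here as a sketch and not as the approach this chapter relies on, is `np.select`, which takes a list of conditions and a matching list of values. The variable names `conditions` and `grades` are just illustrative.

```python
import numpy as np
import pandas as pd

students = pd.DataFrame({
    'name':  ['Priya', 'Rahul', 'Ananya', 'Karan', 'Sneha'],
    'marks': [82, 91, 68, 76, 89],
})

# Conditions are checked in order, top to bottom
conditions = [
    students['marks'] >= 90,
    students['marks'] >= 80,
    students['marks'] >= 70,
]
grades = ['A', 'B', 'C']

# The first matching condition wins; rows matching none get the default
students['grade'] = np.select(conditions, grades, default='D')
print(students)
```

With these marks the grades come out as B, A, D, C, B, the same as the nested `np.where` version would give.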
5. Sorting – very common
```python
# Highest marks first
students.sort_values('marks', ascending=False)

# Sort by city, then by marks descending inside each city
students.sort_values(['city', 'marks'], ascending=[True, False])
```
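One detail that trips up beginners: `sort_values` returns a new sorted table and leaves the original untouched, and the sorted copy keeps the old index labels. A minimal sketch on the students data (the name `ranked` is just for illustration):

```python
import pandas as pd

students = pd.DataFrame({
    'name':  ['Priya', 'Rahul', 'Ananya', 'Karan', 'Sneha'],
    'marks': [82, 91, 68, 76, 89],
})

ranked = students.sort_values('marks', ascending=False)

# The original table is unchanged
print(students['name'].tolist())  # ['Priya', 'Rahul', 'Ananya', 'Karan', 'Sneha']

# The sorted copy still carries the original index labels
print(ranked.index.tolist())      # [1, 4, 0, 3, 2]

# reset_index(drop=True) renumbers the rows 0, 1, 2, ...
ranked = ranked.reset_index(drop=True)
print(ranked['name'].tolist())    # ['Rahul', 'Sneha', 'Priya', 'Karan', 'Ananya']
```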
6. Quick summary – GroupBy (the most powerful idea in pandas)
```python
# Average marks per city
students.groupby('city')['marks'].mean()

# Number of students + average marks per city
students.groupby('city').agg(
    count      = ('name', 'count'),
    avg_marks  = ('marks', 'mean'),
    best_marks = ('marks', 'max')
).round(1)
```
This pattern — groupby + agg — is used in almost every real project.
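In the students table every city appears only once, so each group has a single row. To see grouping actually combine rows, here is a sketch with a hypothetical class where cities repeat; the names and marks are invented for this illustration.

```python
import pandas as pd

# Hypothetical class where cities repeat (invented for this illustration)
classmates = pd.DataFrame({
    'name':  ['Asha', 'Dev', 'Meera', 'Nikhil', 'Tara'],
    'city':  ['Pune', 'Mumbai', 'Pune', 'Mumbai', 'Pune'],
    'marks': [80, 70, 90, 60, 85],
})

# One output row per city, with three summary columns
per_city = classmates.groupby('city').agg(
    count=('name', 'count'),
    avg_marks=('marks', 'mean'),
    best_marks=('marks', 'max'),
).round(1)

print(per_city)
```

Pune ends up with 3 students, an average of 85.0 and a best mark of 90; Mumbai with 2 students, an average of 65.0 and a best mark of 70.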
7. Very first real mini-project (what you should try today)
```python
import pandas as pd
import numpy as np

# Our small class
df = pd.DataFrame({
    'name':    ['Priya', 'Rahul', 'Ananya', 'Karan', 'Sneha', 'Vikram'],
    'maths':   [78, 92, 65, 88, 71, 59],
    'science': [82, 89, 72, 91, 68, 63],
    'city':    ['Pune', 'Hyd', 'Blr', 'Mum', 'Chn', 'Del']
})

# 1. Create total marks
df['total'] = df['maths'] + df['science']

# 2. Create average
df['average'] = df['total'] / 2

# 3. Create rank (method='min' keeps ranks whole numbers even if totals tie)
df['rank'] = df['total'].rank(ascending=False, method='min').astype(int)

# 4. Create result
df['result'] = np.where(df['average'] >= 70, 'Pass', 'Fail')

# 5. Sort by total marks descending
df = df.sort_values('total', ascending=False)

print(df)
```
Try running this short script: change the numbers, add your friends’ names, and see how it feels.
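If you change the numbers so that several students get the same total, it helps to know that `rank` handles ties according to its `method` argument. A short sketch with hypothetical totals containing a three-way tie:

```python
import pandas as pd

# Hypothetical totals with a three-way tie for first place
totals = pd.Series([150, 150, 150, 140])

# Default method='average': tied values share the mean of their positions
print(totals.rank(ascending=False).tolist())                # [2.0, 2.0, 2.0, 4.0]

# method='min': tied values all get the best (lowest) rank of the group
print(totals.rank(ascending=False, method='min').tolist())  # [1.0, 1.0, 1.0, 4.0]
```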
Summary – Your Day-1 Pandas Survival Kit
```python
import pandas as pd

df = pd.DataFrame({...})               # create table

df.head()                              # see first rows
df.shape                               # rows × columns
df.columns                             # column names
df.dtypes                              # data types
df.info()                              # overview

df['column']                           # select one column
df[['col1','col2']]                    # select many columns

df[df['marks'] > 80]                   # filter rows

df.sort_values('marks', ascending=False)

df['new_col'] = df['old'] * 2          # create new column

df.groupby('city')['marks'].mean()     # group & summarize
```
Now tell me — what would you like to do next?
- Understand Series vs DataFrame much better
- Learn how to read CSV / Excel files properly
- Practice filtering with many real examples
- Start using GroupBy seriously
- Work with missing values (NaN)
- Try your first real messy dataset together
Just say which direction feels most exciting or useful for you right now. I’ll go slowly and deeply with you! 😊
