Chapter 4: Pandas DataFrames
What is a Pandas DataFrame, really? (the most honest explanation)
A DataFrame is:
- A 2-dimensional, labeled data structure
- A collection of Series placed side by side (each column is a Series)
- Labeled on both axes: row labels (the index) and column labels (the column names)
- Think of it as: Excel sheet + database table + NumPy 2D array with labels
Most important mental model:
- Columns are the most important thing in pandas: you almost always work with whole columns at once, and calculations, filtering, and grouping are all column-oriented (see the short sketch below)
- Rows are secondary; they are identified by the index
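To make the "each column is a Series" idea concrete, here is a minimal sketch (the tiny DataFrame and its values are made up just for this illustration):

```python
import pandas as pd

# A tiny illustrative DataFrame (made-up values, just for this sketch)
df = pd.DataFrame({'marks': [78, 92, 65], 'age': [21, 19, 22]})

# Each column is a Series that shares the DataFrame's row index
print(type(df['marks']))                    # <class 'pandas.core.series.Series'>
print(df['marks'].index.equals(df.index))   # True

# Column-oriented thinking: one expression works on the whole column at once
print(df['marks'].mean())                   # 78.33...
```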
1. Creating a DataFrame — the 4 most realistic ways
Way 1 – From a dictionary (cleanest & most common)
```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Aarav', 'Diya', 'Rohan', 'Isha', 'Vihaan', 'Saanvi'],
    'age': [21, 19, 22, 20, 23, 18],
    'city': ['Delhi', 'Mumbai', 'Bangalore', 'Pune', 'Chennai', 'Kolkata'],
    'marks': [78, 92, 65, 88, 71, 84],
    'active': [True, True, False, True, True, True]
})
students
```
Typical output:
```
     name  age       city  marks  active
0   Aarav   21      Delhi     78    True
1    Diya   19     Mumbai     92    True
2   Rohan   22  Bangalore     65   False
3    Isha   20       Pune     88    True
4  Vihaan   23    Chennai     71    True
5  Saanvi   18    Kolkata     84    True
```
→ Dictionary keys become column names, values become the column data
Way 2 – From list of dictionaries (very common when data comes from JSON/API)
```python
data = [
    {'name': 'Priya', 'age': 24, 'city': 'Hyderabad', 'marks': 89},
    {'name': 'Rahul', 'age': 20, 'city': 'Pune', 'marks': 76},
    {'name': 'Neha', 'age': 22, 'city': 'Delhi', 'marks': 91}
]

df = pd.DataFrame(data)
```
Way 3 – From list of lists + column names (when data is clean arrays)
```python
matrix = [
    ['Amit', 28, 'Bangalore', 82],
    ['Sneha', 25, 'Mumbai', 95],
    ['Karan', 31, 'Chennai', 68]
]

df = pd.DataFrame(matrix, columns=['name', 'age', 'city', 'score'])
```
Way 4 – Empty DataFrame (common when building incrementally)
```python
log = pd.DataFrame(columns=['time', 'user', 'action', 'value'])

# Later you can add rows
log.loc[len(log)] = ['2025-02-07 14:30', 'u101', 'login', 1]
```
2. The most important first things you should always check
When you get any new DataFrame, good analysts do these immediately:
```python
df = students.copy()   # let's use our students table

# 1. Size
df.shape               # (6, 5) → 6 rows, 5 columns

# 2. Column names
df.columns

# 3. Data types
df.dtypes

# 4. The single most useful command
df.info()

# 5. Quick numeric overview
df.describe()

# 6. For text columns
df.describe(include='object')

# 7. First & last rows
df.head(3)
df.tail(2)

# 8. Unique values count
df.nunique()

# 9. Count each value in a column (very frequent!)
df['city'].value_counts()
```
3. Selecting data — the five main patterns (you will use these 1000×)
| Goal | Most common syntax | Returns |
|---|---|---|
| One column | df['marks'] | Series |
| Multiple columns | df[['name', 'marks', 'city']] | DataFrame |
| Rows by position | df.iloc[0:3] | DataFrame |
| Rows by condition | df[df['marks'] >= 80] | DataFrame |
| Rows + chosen columns | df.loc[df['marks'] >= 80, ['name', 'city']] | DataFrame |
Realistic everyday examples:
```python
# Only names and marks
df[['name', 'marks']]

# Students who scored 80+
df[df['marks'] >= 80]

# Young students (≤ 20) who are active
df[(df['age'] <= 20) & (df['active'])]

# Students from big cities
df[df['city'].isin(['Delhi', 'Mumbai', 'Bangalore'])]

# Top 3 scorers — only name & marks
df[['name', 'marks']].sort_values('marks', ascending=False).head(3)
```
4. Creating & modifying columns — this is pandas’ superpower
```python
# 1. Simple math
df['marks_next'] = df['marks'] + 5

# 2. Boolean flag
df['passed'] = df['marks'] >= 70

# 3. Multiple conditions with np.where (most common pattern)
import numpy as np
df['grade'] = np.where(df['marks'] >= 90, 'A',
               np.where(df['marks'] >= 80, 'B',
               np.where(df['marks'] >= 70, 'C', 'D')))

# 4. Using a custom function
def performance(m):
    if m >= 90:
        return 'Excellent'
    if m >= 75:
        return 'Good'
    return 'Needs improvement'

df['performance'] = df['marks'].apply(performance)

# 5. Using group statistics (very powerful pattern)
df['avg_marks_in_city'] = df.groupby('city')['marks'].transform('mean')
df['above_city_avg'] = df['marks'] > df['avg_marks_in_city']
```
5. Index – the row labels (very important concept)
By default → 0, 1, 2, 3…
But you can change it:
```python
# Make 'name' the index
df_indexed = df.set_index('name')

# Now you can select by name
df_indexed.loc['Diya']

# Reset back to numbers
df_indexed = df_indexed.reset_index()
```
Most common real use: dates, IDs, customer codes as index.
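For example, a date index lets you select rows by date label. A minimal sketch — the prices below are made up purely for illustration:

```python
# Dates as index: convert the column to datetime, then set it as the index
prices = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-02', '2025-01-03'],
    'close': [101.5, 102.3, 99.8]
})
prices['date'] = pd.to_datetime(prices['date'])
prices = prices.set_index('date')

# Now rows can be selected by date label
prices.loc['2025-01-02']
```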
6. Quick realistic mini-project (try this yourself)
```python
import pandas as pd
import numpy as np

sales = pd.DataFrame({
    'date': ['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [74990, 899, 2499, 12499],
    'units': [3, 15, 8, 5]
})

# 1. Total revenue per product
sales['revenue'] = sales['price'] * sales['units']

# 2. Discount flag
sales['discount'] = sales['price'] > 10000

# 3. Rank by revenue
sales['rank'] = sales['revenue'].rank(ascending=False).astype(int)

# 4. Show only high-value sales
sales[sales['revenue'] >= 50000][['product', 'revenue', 'rank']]
```
Summary Table – Your DataFrame Survival Kit
| Task | Most common way |
|---|---|
| Create from dict | pd.DataFrame({'col1': [...], 'col2': [...]}) |
| See first rows | df.head() |
| See structure | df.info() |
| Select column | df['marks'] |
| Select multiple columns | df[['name', 'marks']] |
| Filter rows | df[df['age'] > 20] |
| Filter + select columns | df.loc[df['marks'] >= 85, ['name', 'marks']] |
| New column | df['bonus'] = df['salary'] * 0.1 |
| Sort descending | df.sort_values('marks', ascending=False) |
| Change index | df.set_index('id') |
Where do you want to go next?
- How to read CSV / Excel files properly (most common next step)
- Deeper into index and loc vs iloc
- Lots of filtering examples with complex conditions
- First serious look at groupby
- Handling missing values (NaN) in DataFrames
- Sorting, ranking, dropping duplicates in detail
Just tell me what feels most useful or interesting right now — I’ll explain slowly with realistic examples.
