Chapter 9: Cleaning Empty Cells
How to clean / handle empty cells (missing values) in pandas.
Imagine we are sitting together and I’m showing you real data on my screen. We will go slowly, understand why things are missing, see what options we really have, and learn the patterns most people actually use in real projects.
1. First — what do we actually mean by “empty cells”?
In pandas, “empty” usually means one of these:
| What you see | Internal value in pandas | Name | isna() returns True? |
|---|---|---|---|
| (nothing) | NaN | Not a Number (float) | Yes |
| None | None | Python None | Yes |
| <NA> | pd.NA | pandas NA (nullable integer/string dtypes) | Yes |
| empty string '' | '' | Empty string | No |
| 'NA', 'N/A', '-', 'null', 'missing' | string | Text that means missing | No |
Important takeaway: NaN, None, and pd.NA are considered missing by pandas (isna() / isnull()). Empty strings and words like “NA” / “-” are not missing — they are normal strings.
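You can verify this in a few lines. Here is a tiny sketch with a throwaway Series (the values are just for illustration):

```python
import pandas as pd
import numpy as np

# Five kinds of "empty" — only the first two are real missing values
s = pd.Series([np.nan, None, '', 'NA', '-'])

print(s.isna().tolist())   # [True, True, False, False, False]
print(s.isna().sum())      # 2 — '' and the strings 'NA' / '-' do not count
```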
2. Let’s create a realistic messy table with different kinds of “empties”
```python
import pandas as pd
import numpy as np

data = {
    'name':   ['Priya', 'Rahul', '', 'Sneha', 'Vikram', 'Meera', np.nan, 'Arjun', 'Neha', 'Karan'],
    'age':    [28, 34, np.nan, 25, 41, np.nan, 29, 33, None, 27],
    'city':   ['Pune', 'Hyderabad', 'Bangalore', 'Pune', '', 'Mumbai', 'Delhi', 'Bangalore', 'N/A', 'Chennai'],
    'salary': [72000, 145000, 98000, np.nan, 210000, 88000, '-', 112000, 95000, 68000],
    'rating': [4.1, 3.8, 4.6, 4.9, np.nan, 4.0, 3.5, 4.2, 4.7, np.nan],
    'joined': ['2022-03-15', '2021-11-01', '', '2020-09-22', '2019-05-05', np.nan, '2023-07-10', '2024-02-01', '2021-08-14', '2022-12-30'],
    'active': [True, True, False, True, True, False, np.nan, True, True, True]
}

df = pd.DataFrame(data)

# Let's see it
df
```
3. Step 1 – Always start by finding out where the missing values are
```python
# Most important commands — run these first!
print("Missing count per column:")
print(df.isna().sum())

print("\nTotal missing cells:", df.isna().sum().sum())

print("\nRows that have at least one missing value:")
print(df[df.isna().any(axis=1)])

print("\nPercentage missing per column:")
print((df.isna().mean() * 100).round(2), "%")
```
Output for our table — notice that the empty strings and 'N/A' in name, city, and joined are not counted, exactly as section 1 promised:

```
name      1
age       3
city      0
salary    1
rating    2
joined    1
active    1
dtype: int64
```
4. The 7 most realistic ways people handle missing values
| # | Method | When people use it | Code example | Destructive? |
|---|---|---|---|---|
| 1 | Drop rows | Very few missing, row is useless without the data | df.dropna() | Yes |
| 2 | Drop columns | Column is almost all missing | df.dropna(axis=1, thresh=…) | Yes |
| 3 | Fill with fixed value | You know what missing should mean (0, 'Unknown') | df['city'] = df['city'].fillna('Unknown') | No |
| 4 | Fill with mean / median | Numeric column, missing looks random | df['salary'] = df['salary'].fillna(df['salary'].median()) | No |
| 5 | Fill with group average | Missing depends on a category (dept, city, …) | groupby + transform | No |
| 6 | Forward / backward fill | Time series, last known value is reasonable | df['rating'] = df['rating'].ffill() | No |
| 7 | Leave as is / mark explicitly | You want to keep the info that the value was missing | df['salary_missing'] = df['salary'].isna() | No |
5. Realistic cleaning walkthrough — column by column
Column: name
```python
# Missing name → replace with 'Unknown Customer'
df['name'] = df['name'].fillna('Unknown Customer')

# Also replace the empty string
df['name'] = df['name'].replace('', 'Unknown Customer')
```
Column: age
```python
# Three missing ages (two NaN, one None) — let's use the median (more robust than the mean)
median_age = df['age'].median()   # 29.0
df['age'] = df['age'].fillna(median_age)
```
Column: city
```python
# Missing & 'N/A' → both mean unknown
df['city'] = df['city'].replace(['', 'N/A', 'NA'], 'Unknown')

# or, more safely, cover NaN as well:
df['city'] = df['city'].replace(['', 'N/A', 'NA', np.nan], 'Unknown')
```
Column: salary (very common situation!)
```python
# First — fix the '-' which is not numeric
df['salary'] = df['salary'].replace('-', np.nan)

# Now make the column numeric (anything unparseable becomes NaN)
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')

# Strategy: fill with the median salary of their city (if known)
df['salary'] = df.groupby('city')['salary'].transform(
    lambda x: x.fillna(x.median())
)

# If still missing (unknown city) → global median
df['salary'] = df['salary'].fillna(df['salary'].median())
```
Column: rating
```python
# Ratings are often left missing on purpose (not rated yet)
# Popular choices:
#   A) Leave as NaN
#   B) Fill with 3.0 or the company average
#   C) Fill forward/backward if time-ordered

# Let's use the group average by city (a realistic choice)
df['rating'] = df.groupby('city')['rating'].transform(
    lambda x: x.fillna(x.mean().round(1))
)

# Any remaining missing values → overall average
df['rating'] = df['rating'].fillna(df['rating'].mean().round(1))
```
Column: joined (date)
```python
# First convert to datetime ('' and NaN both become NaT)
df['joined'] = pd.to_datetime(df['joined'], errors='coerce')

# Fill missing dates — common choices:
#   - Most frequent date
#   - Median date
#   - A fixed date (e.g. company start date)
most_common_date = df['joined'].mode()[0]
df['joined'] = df['joined'].fillna(most_common_date)
```
Column: active
```python
# Missing active → assume False (conservative)
df['active'] = df['active'].fillna(False)
```
6. Quick reference – the commands people use most
```python
# Find missing
df.isna().sum()
df.isnull().sum()     # same thing (older alias)
df.notna().sum()

# Drop
df.dropna()                              # drop rows with any missing value
df.dropna(how='all')                     # only if the whole row is missing
df.dropna(subset=['salary', 'rating'])   # only these columns matter
df.dropna(thresh=5)                      # keep rows with at least 5 non-missing values

# Fill
df.fillna(0)
df.fillna({'salary': 0, 'city': 'Unknown', 'rating': 3.0})
df['salary'] = df['salary'].fillna(df['salary'].median())

# Group-aware fill
df['salary'] = df.groupby('city')['salary'].transform(lambda x: x.fillna(x.median()))

# Forward / backward fill (time series)
# (fillna(method='ffill') is deprecated — use .ffill() / .bfill())
df['stock_price'] = df['stock_price'].ffill()   # previous value
df['stock_price'] = df['stock_price'].bfill()   # next value

# Mark missing in a new column
df['salary_was_missing'] = df['salary'].isna()
```
7. Mini practice task for you
Take this small messy series:
```python
s = pd.Series([28, np.nan, 34, '', 41, 'N/A', np.nan, 29, '-', 33])
```
Clean it so that:
- empty string '' → NaN
- 'N/A' → NaN
- '-' → NaN
- All missing values filled with median of the non-missing numbers
Try writing the code — then come back and we’ll compare.
Where do you want to go next?
- More advanced group-based imputation (KNN, regression, etc.)
- How to deal with very high percentage of missing values
- Visualizing missing values (missingno library)
- Cleaning mixed types in the same column
- Realistic strategies when you don’t know what to fill
Just tell me which direction you want to continue — I’ll keep explaining slowly with examples. 😊
