Chapter 11: Cleaning Wrong Data
Fixing “Wrong Data” in pandas
Imagine we are sitting together. I’m showing you a real dataset that came from a messy source (Excel sheets filled by different people, API responses, manual entries, etc.). We will go through it slowly, step by step, explaining why something is wrong, how to detect it, and what realistic decisions people actually make when fixing wrong data.
What do we mean by “wrong data”?
“Wrong data” usually means one or more of these problems:
- Impossible / unrealistic values (age = 999, salary = -45000, rating = 12.7 when scale is 1–5)
- Typos and copy-paste errors (city = “Banglore”, “Hydrabad”, “Punee”)
- Swapped columns / misplaced values (age written in the salary column)
- Out-of-range values (quantity = -3, discount = 150%)
- Contradictory information (status = “Delivered” but delivery_date is empty or in the future; a quick check for this is sketched right after this list)
- Obvious unit errors (weight in kg but values like 15000 → probably grams)
- Duplicate logical meaning (“Male” / “M” / “male” / “MALE” in gender column)
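Most of these can be caught mechanically. As a taste of what that looks like, here is a minimal sketch of the contradiction check, using a small hypothetical orders table (the status / delivery_date columns are invented for illustration; they are not part of the dataset below):

```python
import pandas as pd

# Hypothetical orders table, used only to illustrate the contradiction check
orders = pd.DataFrame({
    'status': ['Delivered', 'Pending', 'Delivered'],
    'delivery_date': pd.to_datetime(['2024-01-05', None, None]),
})

# Contradiction: marked Delivered but delivery date is missing or in the future
contradictory = orders[
    (orders['status'] == 'Delivered')
    & (orders['delivery_date'].isna() | (orders['delivery_date'] > pd.Timestamp.now()))
]
print(contradictory)
```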
Realistic example — messy employee / sales dataset
```python
import pandas as pd
import numpy as np

# Deliberately messy data: impossible ages, a negative salary, city typos,
# out-of-scale ratings, a future join date, and a manager_id that exists nowhere
data = {
    'emp_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'name': ['Priya Sharma', 'Rahul Kumar', 'Ananya Roy', 'Sneha Patil', 'Vikram',
             'Meera Singh', 'Arjun', 'Neha Gupta', 'Karan Mehta', 'Riya'],
    'age': [28, 999, 34, 25, 9999, 41, -5, 33, 27, 29],
    'city': ['Pune', 'Hyderabad', 'Banglore', 'Pune', 'Mumbai',
             'pune', 'Delhi', 'Bangalore', 'Hydrabad', 'Chennai'],
    'salary': [72000, 145000, -98000, 112000, 210000, 88000, 9999999, 105000, 95000, 68000],
    'rating': [4.1, 4.8, 4.6, 12.7, 3.9, 4.0, 5.5, 4.2, 4.7, 999],
    'join_date': ['2022-03-15', '2021-11-01', '2023-06-10', '2020-09-22', '2019-05-05',
                  '2022-01-18', '2024-02-01', '2021-08-14', '2023-11-20', '2025-12-01'],
    'status': ['Active', 'Active', 'Left', 'active', 'ACTIVE',
               'On Leave', 'active', 'Left', 'Active', 'Terminated'],
    'department': ['HR', 'IT', 'Sales', 'IT', 'Finance', 'Sales', 'HR', 'Marketing', 'IT', 'Sales'],
    'manager_id': [None, 101, 102, 101, 9999, 103, 101, 108, 104, 102]
}

df = pd.DataFrame(data)
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
```
This dataset contains many typical wrong-data problems.
Step-by-step realistic fixing process
1. Always start with quick detection
```python
# Basic overview
df.info()
df.describe()                    # min / max often reveal crazy values
df.describe(include='object')

# Look at extremes
print("Very high salaries:")
print(df.nlargest(5, 'salary')[['name', 'salary', 'department']])

print("\nVery low / negative salaries:")
print(df.nsmallest(5, 'salary')[['name', 'salary']])

print("\nImpossible ages:")
print(df[df['age'] > 120][['name', 'age']])
print(df[df['age'] < 18][['name', 'age']])

print("\nImpossible ratings:")
print(df[df['rating'] > 5][['name', 'rating']])
```
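describe(include='object') gives a summary, but typos like “Banglore” vs “Bangalore” usually surface fastest in a per-column value_counts() scan. A small convenience loop for that (not part of any particular library, just an idiom):

```python
# Print category frequencies of every text column to spot
# near-duplicates ('pune' vs 'Pune') and typos ('Banglore')
for col in df.select_dtypes(include='object').columns:
    print(f"\n--- {col} ---")
    print(df[col].value_counts(dropna=False))
```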
2. Fix impossible / unrealistic numeric values
Common realistic strategies
| Situation | Popular realistic fix | Code example |
|---|---|---|
| Negative salary | → 0 or median, or drop row | `df['salary'] = df['salary'].clip(lower=0)` |
| Unrealistically high value | → cap at reasonable max (e.g. 99th percentile) | `df['salary'] = df['salary'].clip(upper=df['salary'].quantile(0.99))` |
| Age > 100 or < 18 | → median age or NaN | `df['age'] = np.where(df['age'].between(18, 100), df['age'], np.nan)` |
| Rating outside 1–5 | → clip or NaN | `df['rating'] = df['rating'].clip(1, 5)` |
Example — salary, age & rating
```python
# Salary: remove negatives & crazy high values
df['salary'] = df['salary'].clip(lower=0)
df['salary'] = df['salary'].clip(upper=df['salary'].quantile(0.98))  # cap at the 98th percentile

# Age: unrealistic → set to NaN, then fill with median
df['age'] = np.where(df['age'].between(18, 100), df['age'], np.nan)
df['age'] = df['age'].fillna(df['age'].median())

# Rating: clip to 1–5 range
df['rating'] = df['rating'].clip(lower=1, upper=5)
```
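Two notes on this block. First, on a sample this small the 98th percentile sits right next to the maximum, so the 9999999 salary barely moves; always print the cap value before trusting it. Second, if fixes need to be auditable later, a common variant records which rows were touched before overwriting them. A minimal sketch of that pattern for age, run in place of the plain age fix above (the age_was_fixed column name is made up for illustration):

```python
# Variant of the age fix that keeps an audit trail of what was changed
df['age_was_fixed'] = ~df['age'].between(18, 100)   # hypothetical audit column
df.loc[df['age_was_fixed'], 'age'] = np.nan
df['age'] = df['age'].fillna(df['age'].median())

# Review exactly which rows were altered
print(df.loc[df['age_was_fixed'], ['name', 'age']])
```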
3. Fix typos & inconsistencies in categorical columns
city
```python
# Standardize city names
city_mapping = {
    'Banglore': 'Bangalore',
    'bangalore': 'Bangalore',
    'pune': 'Pune',
    'Hydrabad': 'Hyderabad',
    'hyderabad': 'Hyderabad'
}

df['city'] = df['city'].replace(city_mapping)
df['city'] = df['city'].str.title()  # first letter capital
```
status
```python
df['status'] = (
    df['status']
    .str.strip()
    .str.lower()
    .replace({
        'active': 'Active',
        'left': 'Inactive',
        'terminated': 'Inactive',
        'on leave': 'On Leave'
    })
)
```
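After a mapping like this, it is worth asserting that only the expected labels remain; any leftover value is a new typo to add to the dictionary. A minimal check, assuming Active / Inactive / On Leave is the full vocabulary:

```python
# Fail loudly if an unmapped status label slipped through the mapping
allowed = {'Active', 'Inactive', 'On Leave'}
unexpected = set(df['status'].dropna().unique()) - allowed
assert not unexpected, f"Unmapped status labels: {unexpected}"
```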
4. Fix logical inconsistencies
Example: people who joined in the future
```python
today = pd.Timestamp.now()

# Future join dates → probably a typo → set to NaT
# (Series.mask keeps the datetime dtype; np.where would silently
#  turn the column into object dtype and break .median() below)
df['join_date'] = df['join_date'].mask(df['join_date'] > today, pd.NaT)

# Fill missing join dates with median date (common choice)
df['join_date'] = df['join_date'].fillna(df['join_date'].median())
```
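The median-date fill is a pragmatic default, but it invents a value. If the join date matters downstream (tenure calculations, anniversaries), many teams instead leave the NaT in place and route those rows to manual review. A sketch of that alternative, run in place of the fillna line above:

```python
# Alternative to the median fill: keep NaT and surface the rows for review
needs_review = df[df['join_date'].isna()]
print(f"{len(needs_review)} rows need a verified join date:")
print(needs_review[['emp_id', 'name']])
```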
5. Detect & fix swapped / misplaced values (harder case)
Example: age and salary look swapped for some rows. (In a real pipeline you would run this check before fixing ranges in step 2, since clipping destroys the evidence of a swap.)
```python
# Simple heuristic: if age > 150 and salary < 100000 → probably swapped
mask_swapped = (df['age'] > 150) & (df['salary'] < 100000)

# Swap them back
df.loc[mask_swapped, ['age', 'salary']] = df.loc[mask_swapped, ['salary', 'age']].values
```
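As the decision table below says, heuristics like this deserve a manual check on small data, so print what the mask caught before running the swap line:

```python
# Eyeball the flagged rows first; only then commit the swap
print(df.loc[mask_swapped, ['name', 'age', 'salary']])
```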
6. manager_id values that don’t exist
```python
# Drop manager_id values that don't exist in emp_id
valid_managers = set(df['emp_id'])
df['manager_id'] = df['manager_id'].where(df['manager_id'].isin(valid_managers), np.nan)
```
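A closely related referential check, not planted in this dataset but cheap to run: nobody should be recorded as their own manager.

```python
# Flag rows where an employee is (impossibly) their own manager
self_managed = df[df['manager_id'] == df['emp_id']]
print(self_managed[['emp_id', 'name', 'manager_id']])
```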
7. Final validation round
```python
print("After fixing:")
print(df.describe())

print("\nCities:")
print(df['city'].value_counts())

print("\nStatus values:")
print(df['status'].value_counts())

print("\nFuture join dates:", (df['join_date'] > pd.Timestamp.now()).sum())
print("\nImpossible ages:", ((df['age'] < 18) | (df['age'] > 100)).sum())
```
Realistic decision framework – what most people actually do
| Type of wrong data | Most common action (in practice) |
|---|---|
| Negative / impossible numbers | Clip, replace with median, or set to NaN |
| Extreme outliers | Cap at 95th/99th percentile or investigate one by one |
| Typo in categories | Mapping dictionary + `str.lower()` / `str.title()` |
| Future dates / impossible dates | Set to NaT or median date |
| Swapped columns | Heuristic conditions + manual check for small data |
| Invalid foreign keys | Set to NaN or most common value |
| Contradictory rows | Business rule (e.g. status wins over date) |
Your turn — small fixing exercise
Try to fix these wrong values:
```python
# Exercise data: the wrong values are intentional
s = pd.Series([28, 999, 34, -2, 1500, 29, 9999, 33])    # ages
city = pd.Series(['Banglore', 'bangalore', 'Pune ', 'Mumbai',
                  'DELHI', 'Hydrabad', 'pune'])          # note the trailing space in 'Pune '
rating = pd.Series([4.2, 5.1, 3.9, 6.0, 2.8, 4.7, 0.5])
```
Goal:
- age → keep only values in 18–100, replace the rest with the median
- city → standardize & title case
- rating → force 1–5 range
Write your code — then we can discuss what works best in different situations.
Where do you want to continue?
- More advanced outlier detection methods
- Fixing duplicates (logical & exact)
- Dealing with inconsistent units (kg vs grams, INR vs USD)
- Fuzzy matching for typos in names/cities
- Manual review patterns for small but critical errors
Just tell me which direction you want to go next. I’ll explain slowly with realistic examples. 😊
