Chapter 11: Cleaning Wrong Data

Fixing Wrong Data” in pandas

Imagine we are sitting together. I’m showing you a real dataset that came from a messy source (Excel sheets filled by different people, API responses, manual entries, etc.). We will go through it slowly, step by step, explaining why something is wrong, how to detect it, and what realistic decisions people actually make when fixing wrong data.

What do we mean by “wrong data”?

“Wrong data” usually means one or more of these problems:

  • Impossible / unrealistic values (age = 999, salary = -45000, rating = 12.7 when scale is 1–5)
  • Typos and copy-paste errors (city = “Banglore”, “Hydrabad”, “Punee”)
  • Swapped columns / misplaced values (age written in the salary column)
  • Out-of-range values (quantity = -3, discount = 150%)
  • Contradictory information (status = “Delivered” but delivery_date is empty or in the future)
  • Obvious unit errors (weight in kg but values like 15000 → probably grams)
  • Duplicate logical meaning (“Male” / “M” / “male” / “MALE” in gender column)

Realistic example — messy employee / sales dataset

Python

This dataset contains many typical wrong-data problems.

Step-by-step realistic fixing process

1. Always start with quick detection

Python

2. Fix impossible / unrealistic numeric values

Common realistic strategies

Situation Popular realistic fix Code example
Negative salary → 0 or median / drop row df[‘salary’] = df[‘salary’].clip(lower=0)
Unrealistic high value → cap at reasonable max (99th percentile) df[‘salary’] = df[‘salary’].clip(upper=df[‘salary’].quantile(0.99))
Age > 100 or < 18 → median age or NaN df[‘age’] = np.where(df[‘age’].between(18,100), df[‘age’], np.nan)
Rating outside 1–5 → clip or NaN df[‘rating’] = df[‘rating’].clip(1,5)

Example — salary & age

Python

3. Fix typos & inconsistencies in categorical columns

city

Python

status

Python

4. Fix logical inconsistencies

Example: people who joined in the future

Python

5. Detect & fix swapped / misplaced values (harder case)

Example: age and salary look swapped for some rows

Python

6. Manager_id that doesn’t exist

Python

7. Final validation round

Python

Realistic decision framework – what most people actually do

Type of wrong data Most common action (in practice)
Negative / impossible numbers Clip, replace with median, or set to NaN
Extreme outliers Cap at 95th/99th percentile or investigate one by one
Typo in categories Mapping dictionary + str.lower() / str.title()
Future dates / impossible dates Set to NaT or median date
Swapped columns Heuristic conditions + manual check for small data
Invalid foreign keys Set to NaN or most common value
Contradictory rows Business rule (e.g. status wins over date)

Your turn — small fixing exercise

Try to fix these wrong values:

Python

Goal:

  • age → clip to 18–100, fill wrong with median
  • city → standardize & title case
  • rating → force 1–5 range

Write your code — then we can discuss what works best in different situations.

Where do you want to continue?

  • More advanced outlier detection methods
  • Fixing duplicates (logical & exact)
  • Dealing with inconsistent units (kg vs grams, INR vs USD)
  • Fuzzy matching for typos in names/cities
  • Manual review patterns for small but critical errors

Just tell me which direction you want to go next. I’ll explain slowly with realistic examples. 😊

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *