Chapter 11: Cleaning Wrong Data
Fixing “Wrong Data” in pandas
Imagine we are sitting together. I’m showing you a real dataset that came from a messy source (Excel sheets filled by different people, API responses, manual entries, etc.). We will go through it slowly, step by step, explaining why something is wrong, how to detect it, and what realistic decisions people actually make when fixing wrong data.
What do we mean by “wrong data”?
“Wrong data” usually means one or more of these problems:
- Impossible / unrealistic values (age = 999, salary = -45000, rating = 12.7 when scale is 1–5)
- Typos and copy-paste errors (city = “Banglore”, “Hydrabad”, “Punee”)
- Swapped columns / misplaced values (age written in the salary column)
- Out-of-range values (quantity = -3, discount = 150%)
- Contradictory information (status = “Delivered” but delivery_date is empty or in the future; a quick check for this is sketched right after this list)
- Obvious unit errors (weight in kg but values like 15000 → probably grams)
- Duplicate logical meaning (“Male” / “M” / “male” / “MALE” in gender column)
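Most of these can be caught mechanically. As a taste of what that looks like, here is a minimal sketch of the contradiction check, using a small hypothetical orders table (the status / delivery_date columns are invented for illustration; they are not part of the dataset below):

```python
import pandas as pd

# Hypothetical orders table, used only to illustrate the contradiction check
orders = pd.DataFrame({
    'status': ['Delivered', 'Pending', 'Delivered'],
    'delivery_date': pd.to_datetime(['2024-01-05', None, None]),
})

# Contradiction: marked Delivered but delivery date is missing or in the future
contradictory = orders[
    (orders['status'] == 'Delivered')
    & (orders['delivery_date'].isna() | (orders['delivery_date'] > pd.Timestamp.now()))
]
print(contradictory)
```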
Realistic example — messy employee / sales dataset
```python
import pandas as pd
import numpy as np

# Deliberately messy data: impossible ages, a negative salary, city typos,
# out-of-scale ratings, a future join date, and a manager_id that exists nowhere
data = {
    'emp_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'name': ['Priya Sharma', 'Rahul Kumar', 'Ananya Roy', 'Sneha Patil', 'Vikram',
             'Meera Singh', 'Arjun', 'Neha Gupta', 'Karan Mehta', 'Riya'],
    'age': [28, 999, 34, 25, 9999, 41, -5, 33, 27, 29],
    'city': ['Pune', 'Hyderabad', 'Banglore', 'Pune', 'Mumbai',
             'pune', 'Delhi', 'Bangalore', 'Hydrabad', 'Chennai'],
    'salary': [72000, 145000, -98000, 112000, 210000, 88000, 9999999, 105000, 95000, 68000],
    'rating': [4.1, 4.8, 4.6, 12.7, 3.9, 4.0, 5.5, 4.2, 4.7, 999],
    'join_date': ['2022-03-15', '2021-11-01', '2023-06-10', '2020-09-22', '2019-05-05',
                  '2022-01-18', '2024-02-01', '2021-08-14', '2023-11-20', '2025-12-01'],
    'status': ['Active', 'Active', 'Left', 'active', 'ACTIVE',
               'On Leave', 'active', 'Left', 'Active', 'Terminated'],
    'department': ['HR', 'IT', 'Sales', 'IT', 'Finance', 'Sales', 'HR', 'Marketing', 'IT', 'Sales'],
    'manager_id': [None, 101, 102, 101, 9999, 103, 101, 108, 104, 102]
}

df = pd.DataFrame(data)
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
```
This dataset contains many typical wrong-data problems.
Step-by-step realistic fixing process
1. Always start with quick detection
```python
# Basic overview
df.info()
df.describe()                    # min / max often reveal crazy values
df.describe(include='object')

# Look at extremes
print("Very high salaries:")
print(df.nlargest(5, 'salary')[['name', 'salary', 'department']])

print("\nVery low / negative salaries:")
print(df.nsmallest(5, 'salary')[['name', 'salary']])

print("\nImpossible ages:")
print(df[df['age'] > 120][['name', 'age']])
print(df[df['age'] < 18][['name', 'age']])

print("\nImpossible ratings:")
print(df[df['rating'] > 5][['name', 'rating']])
```
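describe(include='object') gives a summary, but typos like “Banglore” vs “Bangalore” usually surface fastest in a per-column value_counts() scan. A small convenience loop for that (not part of any particular library, just an idiom):

```python
# Print category frequencies of every text column to spot
# near-duplicates ('pune' vs 'Pune') and typos ('Banglore')
for col in df.select_dtypes(include='object').columns:
    print(f"\n--- {col} ---")
    print(df[col].value_counts(dropna=False))
```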
2. Fix impossible / unrealistic numeric values
Common realistic strategies
| Situation | Popular realistic fix | Code example |
|---|---|---|
| Negative salary | → 0 or median, or drop row | `df['salary'] = df['salary'].clip(lower=0)` |
| Unrealistically high value | → cap at reasonable max (e.g. 99th percentile) | `df['salary'] = df['salary'].clip(upper=df['salary'].quantile(0.99))` |
| Age > 100 or < 18 | → median age or NaN | `df['age'] = np.where(df['age'].between(18, 100), df['age'], np.nan)` |
| Rating outside 1–5 | → clip or NaN | `df['rating'] = df['rating'].clip(1, 5)` |
Example — salary, age & rating
```python
# Salary: remove negatives & crazy high values
df['salary'] = df['salary'].clip(lower=0)
df['salary'] = df['salary'].clip(upper=df['salary'].quantile(0.98))  # cap at the 98th percentile

# Age: unrealistic → set to NaN, then fill with median
df['age'] = np.where(df['age'].between(18, 100), df['age'], np.nan)
df['age'] = df['age'].fillna(df['age'].median())

# Rating: clip to 1–5 range
df['rating'] = df['rating'].clip(lower=1, upper=5)
```
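Two notes on this block. First, on a sample this small the 98th percentile sits right next to the maximum, so the 9999999 salary barely moves; always print the cap value before trusting it. Second, if fixes need to be auditable later, a common variant records which rows were touched before overwriting them. A minimal sketch of that pattern for age, run in place of the plain age fix above (the age_was_fixed column name is made up for illustration):

```python
# Variant of the age fix that keeps an audit trail of what was changed
df['age_was_fixed'] = ~df['age'].between(18, 100)   # hypothetical audit column
df.loc[df['age_was_fixed'], 'age'] = np.nan
df['age'] = df['age'].fillna(df['age'].median())

# Review exactly which rows were altered
print(df.loc[df['age_was_fixed'], ['name', 'age']])
```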
3. Fix typos & inconsistencies in categorical columns
city
```python
# Standardize city names
city_mapping = {
    'Banglore': 'Bangalore',
    'bangalore': 'Bangalore',
    'pune': 'Pune',
    'Hydrabad': 'Hyderabad',
    'hyderabad': 'Hyderabad'
}

df['city'] = df['city'].replace(city_mapping)
df['city'] = df['city'].str.title()  # first letter capital
```
status
```python
df['status'] = (
    df['status']
    .str.strip()
    .str.lower()
    .replace({
        'active': 'Active',
        'left': 'Inactive',
        'terminated': 'Inactive',
        'on leave': 'On Leave'
    })
)
```
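After a mapping like this, it is worth asserting that only the expected labels remain; any leftover value is a new typo to add to the dictionary. A minimal check, assuming Active / Inactive / On Leave is the full vocabulary:

```python
# Fail loudly if an unmapped status label slipped through the mapping
allowed = {'Active', 'Inactive', 'On Leave'}
unexpected = set(df['status'].dropna().unique()) - allowed
assert not unexpected, f"Unmapped status labels: {unexpected}"
```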
4. Fix logical inconsistencies
Example: people who joined in the future
```python
today = pd.Timestamp.now()

# Future join dates → probably a typo → set to NaT
# (Series.mask keeps the datetime dtype; np.where would silently
#  turn the column into object dtype and break .median() below)
df['join_date'] = df['join_date'].mask(df['join_date'] > today, pd.NaT)

# Fill missing join dates with median date (common choice)
df['join_date'] = df['join_date'].fillna(df['join_date'].median())
```
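The median-date fill is a pragmatic default, but it invents a value. If the join date matters downstream (tenure calculations, anniversaries), many teams instead leave the NaT in place and route those rows to manual review. A sketch of that alternative, run in place of the fillna line above:

```python
# Alternative to the median fill: keep NaT and surface the rows for review
needs_review = df[df['join_date'].isna()]
print(f"{len(needs_review)} rows need a verified join date:")
print(needs_review[['emp_id', 'name']])
```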
5. Detect & fix swapped / misplaced values (harder case)
Example: age and salary look swapped for some rows. (In a real pipeline you would run this check before fixing ranges in step 2, since clipping destroys the evidence of a swap.)
```python
# Simple heuristic: if age > 150 and salary < 100000 → probably swapped
mask_swapped = (df['age'] > 150) & (df['salary'] < 100000)

# Swap them back
df.loc[mask_swapped, ['age', 'salary']] = df.loc[mask_swapped, ['salary', 'age']].values
```
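As the decision table below says, heuristics like this deserve a manual check on small data, so print what the mask caught before running the swap line:

```python
# Eyeball the flagged rows first; only then commit the swap
print(df.loc[mask_swapped, ['name', 'age', 'salary']])
```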
6. manager_id values that don’t exist
```python
# Drop manager_id values that don't exist in emp_id
valid_managers = set(df['emp_id'])
df['manager_id'] = df['manager_id'].where(df['manager_id'].isin(valid_managers), np.nan)
```
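A closely related referential check, not planted in this dataset but cheap to run: nobody should be recorded as their own manager.

```python
# Flag rows where an employee is (impossibly) their own manager
self_managed = df[df['manager_id'] == df['emp_id']]
print(self_managed[['emp_id', 'name', 'manager_id']])
```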
7. Final validation round
```python
print("After fixing:")
print(df.describe())

print("\nCities:")
print(df['city'].value_counts())

print("\nStatus values:")
print(df['status'].value_counts())

print("\nFuture join dates:", (df['join_date'] > pd.Timestamp.now()).sum())
print("\nImpossible ages:", ((df['age'] < 18) | (df['age'] > 100)).sum())
```
Realistic decision framework – what most people actually do
| Type of wrong data | Most common action (in practice) |
|---|---|
| Negative / impossible numbers | Clip, replace with median, or set to NaN |
| Extreme outliers | Cap at 95th/99th percentile or investigate one by one |
| Typo in categories | Mapping dictionary + `str.lower()` / `str.title()` |
| Future dates / impossible dates | Set to NaT or median date |
| Swapped columns | Heuristic conditions + manual check for small data |
| Invalid foreign keys | Set to NaN or most common value |
| Contradictory rows | Business rule (e.g. status wins over date) |
Your turn — small fixing exercise
Try to fix these wrong values:
```python
# Exercise data: the wrong values are intentional
s = pd.Series([28, 999, 34, -2, 1500, 29, 9999, 33])    # ages
city = pd.Series(['Banglore', 'bangalore', 'Pune ', 'Mumbai',
                  'DELHI', 'Hydrabad', 'pune'])          # note the trailing space in 'Pune '
rating = pd.Series([4.2, 5.1, 3.9, 6.0, 2.8, 4.7, 0.5])
```
Goal:
- age → keep only values in 18–100, replace the rest with the median
- city → standardize & title case
- rating → force 1–5 range
Write your code — then we can discuss what works best in different situations.
Where do you want to continue?
- More advanced outlier detection methods
- Fixing duplicates (logical & exact)
- Dealing with inconsistent units (kg vs grams, INR vs USD)
- Fuzzy matching for typos in names/cities
- Manual review patterns for small but critical errors
Just tell me which direction you want to go next. I’ll explain slowly with realistic examples. 😊
