Chapter 8: Cleaning Data

Cleaning Data in pandas (Imagine we are sitting together — I’m showing you real messy data, explaining what usually goes wrong, why we clean in certain order, and how people actually clean datasets in real projects.)

Cleaning data is usually 60–80% of real data work. The goal is not to make data “perfect”, but to make it usable and trustworthy for analysis.

Our example – a messy, realistic dataset

Let’s create a very typical “dirty” dataset that you’ll see in real life (sales / customer data):

Python

This table has almost every common mess you will meet:

  • Inconsistent capitalization
  • Extra spaces
  • Missing values in different formats (”, ‘N/A’, np.nan)
  • Different date formats
  • Currency symbols & commas in numbers
  • Inconsistent phone number formats
  • Trailing/leading spaces in categories
  • Inconsistent status values

Let’s clean it step by step — in the order most people actually do it.

Step-by-step realistic cleaning workflow

1. First look — always start here

Python

2. Fix column names (very early step)

Python

3. Handle missing values — understand first, then decide

Python

Common realistic decisions:

Python

4. Clean strings — the biggest source of mess

Python

5. Clean numbers — remove currency, commas, convert to proper type

Python

6. Clean phone numbers (very common task)

Python

7. Fix dates — one of the hardest parts

Python

8. Final checks after cleaning

Python

Realistic cleaning order most people follow

  1. Fix column names
  2. Look at missing values → decide strategy per column
  3. Strip spaces everywhere (strings)
  4. Standardize case & replace known typos
  5. Convert numbers (remove symbols, commas)
  6. Convert dates (mixed formats → pain)
  7. Standardize categories (status, city, product type…)
  8. Create derived columns (total_value, year, month…)
  9. Final missing value treatment
  10. Check dtypes again
  11. Look at head/tail/value_counts again

Your turn — mini cleaning exercise

Take this tiny messy row:

text

Write code to clean it to:

  • customerid → ‘C015’
  • name → ‘Amit Kumar’
  • city → ‘Bengaluru’
  • amount → 2999.0 (float)
  • orderdate → datetime
  • status → ‘Delivered’

Try it — then we can compare approaches.

Where do you want to go next?

  • Cleaning very messy dates in detail
  • Dealing with duplicates (find & remove)
  • Handling outliers realistically
  • String cleaning patterns (regex examples)
  • Category standardization with mapping & fuzzy matching
  • Cleaning real CSV together (you provide or I give messy one)

Tell me what you want to focus on — I’ll go deep with examples. 😊

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *