Chapter 10: Cleaning Wrong Format

cleaning data that is in the wrong format in pandas.

Imagine we are sitting together with a laptop. I’m showing you a real messy dataset, explaining why these problems appear, how to recognize them quickly, what realistic choices we have, and how people actually clean “wrong format” data in everyday work.

What do we mean by “wrong format”?

“Wrong format” usually means one (or several) of these situations:

  • Numbers are stored as text → “₹1,499.00” or “1 250” or “N/A”
  • Dates are strings in many different formats → “15-03-2024”, “2024/04/22”, “March 5th 2024”, “05/03/24”
  • Percentages are text → “78.5%”, “82 %”, “0.785”
  • Phone numbers / IDs have inconsistent separators → “+91-98765 43210”, “9876543210”, “(987) 654-3210”
  • Booleans are written as text → “Yes”/”No”, “Y”/”N”, “True”/”true”/”TRUE”
  • Categories have typos, extra spaces, mixed case → “electronics “, “Electronics”, “ELECTRONICS”, “electro”
  • Mixed types in one column → some numbers, some text, some missing markers

These are very common — especially when data comes from Excel, CSV exports, web forms, APIs, or manual entry.

Realistic example — messy sales / customer data

Python

Let’s clean it step by step — exactly the way people do it in real projects.

Step-by-step cleaning of wrong formats

1. Quick inspection — always start here

Python

Typical suspects: amount, discount_pct, quantity, order_date, payment_mode, status

2. Clean numeric columns stored as text

amount

Python

discount_pct

Python

quantity — mixed words and numbers

Python

3. Fix dates — one of the hardest & most common problems

Python

4. Clean categorical / text columns

Python

5. Final quick validation

Python

Realistic cleaning order for wrong-format columns

  1. Look — df.info(), df.head(), select_dtypes(‘object’)
  2. String cleanup first — strip(), lower()/title(), replace known typos
  3. Numeric columns — remove symbols, commas, currency, then pd.to_numeric(…, errors=’coerce’)
  4. Dates — try pd.to_datetime(errors=’coerce’) → handle failures with custom parser or format list
  5. Categories — standardize case, remove spaces, map synonyms
  6. Percentages — remove % → divide by 100
  7. Check again — info(), describe(), value_counts()

Small practice task for you

Clean these values into proper types:

Python

Try to turn them into:

  • float for amount & percentage
  • datetime for date

Show your code — then we can discuss the best approach together.

Where would you like to go next?

  • More advanced date parsing tricks & libraries
  • Dealing with very inconsistent categories (fuzzy matching)
  • Cleaning phone numbers & postal codes properly
  • Converting mixed-type columns safely
  • Automating format detection (when you have many columns)

Tell me what you want to focus on — I’ll explain with many examples. 😊

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *