Chapter 10: Cleaning Wrong Format
cleaning data that is in the wrong format in pandas.
Imagine we are sitting together with a laptop. I’m showing you a real messy dataset, explaining why these problems appear, how to recognize them quickly, what realistic choices we have, and how people actually clean “wrong format” data in everyday work.
What do we mean by “wrong format”?
“Wrong format” usually means one (or several) of these situations:
- Numbers are stored as text → “₹1,499.00” or “1 250” or “N/A”
- Dates are strings in many different formats → “15-03-2024”, “2024/04/22”, “March 5th 2024”, “05/03/24”
- Percentages are text → “78.5%”, “82 %”, “0.785”
- Phone numbers / IDs have inconsistent separators → “+91-98765 43210”, “9876543210”, “(987) 654-3210”
- Booleans are written as text → “Yes”/”No”, “Y”/”N”, “True”/”true”/”TRUE”
- Categories have typos, extra spaces, mixed case → “electronics “, “Electronics”, “ELECTRONICS”, “electro”
- Mixed types in one column → some numbers, some text, some missing markers
These are very common — especially when data comes from Excel, CSV exports, web forms, APIs, or manual entry.
Realistic example — messy sales / customer data
|
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
import pandas as pd import numpy as np messy = { 'order_id': ['ORD-001', 'ORD-002', 'ORD-003', 'ORD-004', 'ORD-005', 'ORD-006', 'ORD-007', 'ORD-008'], 'customer': ['Priya Sharma', 'rahul k', 'ANANYA ROY ', 'Sneha', 'Vikram', 'Meera Singh', 'arjun', np.nan], 'order_date': ['15-03-2024', '2024/04/22', '22-06-24', '2024-07-05', 'August 8 2024', '08/09/2024', 'Sep 12, 2024', '15-Oct-24'], 'amount': ['₹1,499.00', '899', '2,499.50', '1299', '₹3,499', '599.00 INR', 'N/A', '2 999'], 'discount_pct': ['10%', '5.5 %', '0', '15', '20%', '', '8.5%', 'N/A'], 'payment_mode': ['UPI', 'upi', 'CARD', 'Cash on Delivery', 'UPI ', 'Netbanking', 'card', 'UPI'], 'quantity': ['3', 'five', '2', '1', '4', '10', '2', '6'], 'status': ['Delivered', 'delivered', 'Pending ', 'Shipped', 'cancelled', 'Delivered', 'Delivered ', 'Returned'] } df = pd.DataFrame(messy) |
Let’s clean it step by step — exactly the way people do it in real projects.
Step-by-step cleaning of wrong formats
1. Quick inspection — always start here
|
0 1 2 3 4 5 6 7 8 9 10 11 |
df.info() df.head(8) df.tail(3) # Look for object columns — they are suspicious print(df.select_dtypes(include='object').columns.tolist()) |
Typical suspects: amount, discount_pct, quantity, order_date, payment_mode, status
2. Clean numeric columns stored as text
amount
|
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
# Goal: turn into proper float df['amount_clean'] = ( df['amount'] .astype(str) # make sure everything is string .str.replace(r'[₹$,INR\s]', '', regex=True) # remove currency & spaces .str.replace(',', '', regex=False) # remove thousand separators .replace(['N/A', 'NA', '-', ''], np.nan) # mark missing .astype(float) # now convert ) # Check result df[['amount', 'amount_clean']] |
discount_pct
|
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
df['discount_pct_clean'] = ( df['discount_pct'] .astype(str) .str.replace(r'[%\s]', '', regex=True) # remove % and spaces .replace(['', 'N/A'], np.nan) .astype(float) / 100 # convert to 0.10 instead of 10 ) df['discount_pct_clean'] = df['discount_pct_clean'].fillna(0) # business rule |
quantity — mixed words and numbers
|
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
# Simple mapping for words (you can extend it) word_to_num = { 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10 } df['quantity_clean'] = ( df['quantity'] .astype(str) .str.lower() .map(word_to_num) # replace words .fillna(df['quantity']) # keep numbers that were already correct .astype(str) .str.replace(r'[^\d.]', '', regex=True) # remove anything that's not digit or . .replace('', np.nan) .astype(float) .fillna(1) # assume 1 if missing .astype(int) ) |
3. Fix dates — one of the hardest & most common problems
|
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
# Try automatic parsing first (often works partially) df['order_date_clean'] = pd.to_datetime(df['order_date'], errors='coerce') # See which failed failed = df[df['order_date_clean'].isna()] print(failed[['order_date', 'order_date_clean']]) # Handle remaining problematic formats manually from datetime import datetime def parse_mixed_date(date_str): if pd.isna(date_str): return pd.NaT date_str = str(date_str).strip() # Try common formats for fmt in [ '%d-%m-%Y', '%Y/%m/%d', '%d-%m-%y', '%Y-%m-%d', '%b %d %Y', '%d %b %Y', '%d-%b-%y', '%d-%b-%Y', '%B %d, %Y', '%d %b, %Y', '%d-%b-%Y', '%d/%m/%Y' ]: try: return datetime.strptime(date_str, fmt) except ValueError: continue return pd.NaT # Apply custom parser to failed rows mask_failed = df['order_date_clean'].isna() df.loc[mask_failed, 'order_date_clean'] = ( df.loc[mask_failed, 'order_date'].apply(parse_mixed_date) ) # Final safety net df['order_date_clean'] = pd.to_datetime(df['order_date_clean'], errors='coerce') |
4. Clean categorical / text columns
|
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
# payment_mode df['payment_mode'] = ( df['payment_mode'] .str.strip() .str.lower() .replace({ 'upi': 'UPI', 'card': 'Card', 'cash on delivery': 'COD', 'netbanking': 'Netbanking' }) ) # status df['status'] = ( df['status'] .str.strip() .str.capitalize() .replace({ 'Delivered ': 'Delivered', 'Pending ': 'Pending' }) ) |
5. Final quick validation
|
0 1 2 3 4 5 6 7 8 9 10 11 |
print("After cleaning:") df.info() print("\nDate parsing success rate:", df['order_date_clean'].notna().mean().round(3) * 100, "%") print("\nNumeric columns:") print(df.select_dtypes(include=['float64','int64']).describe()) |
Realistic cleaning order for wrong-format columns
- Look — df.info(), df.head(), select_dtypes(‘object’)
- String cleanup first — strip(), lower()/title(), replace known typos
- Numeric columns — remove symbols, commas, currency, then pd.to_numeric(…, errors=’coerce’)
- Dates — try pd.to_datetime(errors=’coerce’) → handle failures with custom parser or format list
- Categories — standardize case, remove spaces, map synonyms
- Percentages — remove % → divide by 100
- Check again — info(), describe(), value_counts()
Small practice task for you
Clean these values into proper types:
|
0 1 2 3 4 5 6 7 8 |
s_amount = pd.Series(['₹4,299', '899.50', 'N/A', '1 250', '2999 INR', '']) s_date = pd.Series(['12/05/2025', '2025-06-03', 'June 15 2025', '03-07-25', 'invalid']) s_pct = pd.Series(['12.5%', '5 %', '0', '18.75', '']) |
Try to turn them into:
- float for amount & percentage
- datetime for date
Show your code — then we can discuss the best approach together.
Where would you like to go next?
- More advanced date parsing tricks & libraries
- Dealing with very inconsistent categories (fuzzy matching)
- Cleaning phone numbers & postal codes properly
- Converting mixed-type columns safely
- Automating format detection (when you have many columns)
Tell me what you want to focus on — I’ll explain with many examples. 😊
