Chapter 8: Cleaning Data
Cleaning Data in pandas (Imagine we are sitting together — I’m showing you real messy data, explaining what usually goes wrong, why we clean in a certain order, and how people actually clean datasets in real projects.)
Cleaning data is usually 60–80% of real data work. The goal is not to make data “perfect”, but to make it usable and trustworthy for analysis.
Our example – a messy, realistic dataset
Let’s create a very typical “dirty” dataset that you’ll see in real life (sales / customer data):
```python
import pandas as pd
import numpy as np

raw_data = {
    'CustomerID': ['C001', 'C002', 'C003', 'C004', 'C005',
                   'C006', 'C007', 'C008', 'C009', 'C010'],
    'Name': ['Priya Sharma', 'rahul kumar', 'ANANYA ROY ', 'Sneha ', 'Vikram Mehta',
             'Meera ', 'Arjun Singh', 'Neha Gupta', np.nan, 'Riya Patel'],
    'Email': ['priya.sharma@gmail.com', 'rahul.kumar@yahoo.com', 'ananya.roy@outlook.com',
              '', 'vikram.mehta@company.in', 'meera123@gmail', 'arjun@work.com',
              'neha.gupta@gmail.com', 'unknown@xyz', np.nan],
    'Phone': ['+91-9876543210', '98765 43210', '919876543211', '987-654-3210',
              '09876543210', '', '91 98765 43210', '9876543210', 'N/A', '987654321'],
    'City': ['Bangalore', 'hyderabad', 'Pune', 'bangalore ', 'Mumbai',
             'Pune', 'Delhi', 'Bangalore', 'Hyderabad', 'chennai'],
    'OrderDate': ['15-03-2024', '2024/04/22', '2024-05-10', '22/06/2024', '2024-07-05',
                  '08-08-2024', 'Sep 12 2024', '2024-10-01', '2024-11-15', np.nan],
    'Amount': ['₹1,499', '899.00', '2499', '1,299', '₹3,499.50',
               '599', '₹2,999', '1499.99', 'N/A', '799'],
    'Quantity': [3, 5, 2, 1, 4, 10, 2, 6, np.nan, 8],
    'Category': ['Electronics', 'Home', 'electronics', 'Clothing', 'electronics ',
                 'Home & Kitchen', 'electronics', 'Clothing', 'Home', 'Electronics'],
    'Status': ['Delivered', 'delivered', 'Pending', 'Delivered ', 'Cancelled',
               'Delivered', 'delivered', 'Shipped', 'Delivered', 'Pending ']
}

df = pd.DataFrame(raw_data)
```
This table has almost every common mess you will meet:
- Inconsistent capitalization
- Extra spaces
- Missing values in different formats ('', 'N/A', np.nan)
- Different date formats
- Currency symbols & commas in numbers
- Inconsistent phone number formats
- Trailing/leading spaces in categories
- Inconsistent status values
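Before fixing anything, it helps to quantify the mess. A minimal sketch (the tiny sample and the `whitespace_issues` helper are illustrative, not part of the dataset above) that counts values carrying stray whitespace per text column:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Bangalore', 'bangalore ', 'Pune', 'Hyderabad'],
    'Status': ['Delivered', 'delivered', 'Pending ', 'Delivered'],
})

def whitespace_issues(frame):
    """Count values with leading/trailing spaces, per text column."""
    return {
        col: int((frame[col].astype(str) != frame[col].astype(str).str.strip()).sum())
        for col in frame.select_dtypes(include='object').columns
    }

print(whitespace_issues(df))  # {'City': 1, 'Status': 1}
```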
Let’s clean it step by step — in the order most people actually do it.
Step-by-step realistic cleaning workflow
1. First look — always start here
```python
print(df.shape)   # how many rows and columns?
df.info()         # dtypes and non-null counts
df.head(10)       # eyeball the actual values
df.isna().sum()   # missing values per column
```
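Two more checks worth folding into the first look: fully duplicated rows and per-column cardinality. A small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C002'],
    'City': ['Pune', 'Mumbai', 'Mumbai'],
})

dup_rows = int(df.duplicated().sum())  # rows identical in every column
n_unique = df.nunique()                # distinct values per column
print(dup_rows, n_unique.to_dict())    # 1 {'CustomerID': 2, 'City': 2}
```

A column with suspiciously high cardinality (like `city` having 8 values for 5 real cities) is usually a sign of case/spacing variants.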
2. Fix column names (very early step)
```python
# Make all lowercase, replace spaces with _, remove special characters
df.columns = (
    df.columns
    .str.lower()
    .str.strip()
    .str.replace(r'\s+', '_', regex=True)
    .str.replace(r'[^\w]', '', regex=True)   # drop anything not alphanumeric or _
)

print(df.columns.tolist())
# → ['customerid', 'name', 'email', 'phone', 'city',
#    'orderdate', 'amount', 'quantity', 'category', 'status']
```
3. Handle missing values — understand first, then decide
```python
# Show rows with any missing value
df[df.isna().any(axis=1)]

# Count missing per column
df.isna().sum()
```
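One habit worth building: look at missing values as percentages, and remember that `isna()` only sees real NaN — sentinel strings like `''` and `'N/A'` are invisible to it. A sketch with toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'email': ['a@x.com', np.nan, '', 'N/A'],   # only ONE of these is a real NaN
    'quantity': [3, np.nan, 2, 1],
})

# Percentage missing per column — '' and 'N/A' do NOT count yet
pct_missing = df.isna().mean().mul(100).round(1)
print(pct_missing.to_dict())  # {'email': 25.0, 'quantity': 25.0}
```

So `email` looks 25% missing here, but 75% of it is actually unusable — which is why the string sentinels get handled in the next steps.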
Common realistic decisions:
```python
# Drop rows where critical columns are missing
df = df.dropna(subset=['customerid', 'orderdate'])

# Fill email & phone with placeholder
df['email'] = df['email'].fillna('unknown@email.com')
df['phone'] = df['phone'].fillna('not_provided')

# Fill quantity with 1 (business rule)
df['quantity'] = df['quantity'].fillna(1)

# For name — fill with 'Unknown Customer'
df['name'] = df['name'].fillna('Unknown Customer')
```
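Careful: `fillna()` only catches real NaN. In our dataset the empty email `''` and the phone `'N/A'` are ordinary strings, so they slip straight through. One way to handle that (the sentinel list here is an assumption — extend it to match your data) is to normalize sentinels to NaN first:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'email': ['a@x.com', '', 'N/A'],
    'phone': ['123', 'N/A', ''],
})

# Convert sentinel strings to real NaN first — otherwise fillna() skips them
df = df.replace(['', 'N/A', 'n/a', 'unknown'], np.nan)

df['email'] = df['email'].fillna('unknown@email.com')
df['phone'] = df['phone'].fillna('not_provided')
print(df['email'].tolist())  # ['a@x.com', 'unknown@email.com', 'unknown@email.com']
```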
4. Clean strings — the biggest source of mess
```python
# Strip extra spaces
df['name'] = df['name'].str.strip()
df['city'] = df['city'].str.strip()
df['category'] = df['category'].str.strip()
df['status'] = df['status'].str.strip()

# Standardize capitalization
df['name'] = df['name'].str.title()        # 'rahul kumar' → 'Rahul Kumar'
df['city'] = df['city'].str.title()
df['category'] = df['category'].str.title()
df['status'] = df['status'].str.capitalize()

# Fix known inconsistencies — only real renames are needed here;
# strip + title/capitalize already collapsed the spacing and case variants
df['city'] = df['city'].replace({'Bangalore': 'Bengaluru'})
```
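A quick way to confirm the standardization worked is `value_counts()` before and after — the case/space variants should collapse into a single bucket:

```python
import pandas as pd

s = pd.Series(['Electronics', 'electronics', 'electronics ', 'Home & Kitchen'])

# strip + title-case collapses the three Electronics variants into one
cleaned = s.str.strip().str.title()
print(cleaned.value_counts().to_dict())  # {'Electronics': 3, 'Home & Kitchen': 1}
```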
5. Clean numbers — remove currency, commas, convert to proper type
```python
# Remove ₹ and commas, then convert to float
df['amount'] = (
    df['amount']
    .replace(['N/A', 'unknown'], np.nan)   # handle text placeholders first
    .astype(str)                           # NaN becomes the string 'nan',
    .str.replace('₹', '', regex=False)     # which astype(float) still parses as NaN
    .str.replace(',', '', regex=False)
    .astype(float)
)

# Now it's numeric — we can do math
df['total_value'] = df['amount'] * df['quantity']
```
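An alternative sketch using `pd.to_numeric`, which coerces anything it can't parse to NaN instead of raising — handy when you don't know every placeholder in advance:

```python
import pandas as pd

amounts = pd.Series(['₹1,499', '899.00', 'N/A', '₹3,499.50'])

# Strip everything except digits and the decimal point,
# then let to_numeric turn leftovers (like 'N/A' → '') into NaN
cleaned = pd.to_numeric(
    amounts.str.replace(r'[^\d.]', '', regex=True),
    errors='coerce',
)
print(cleaned.tolist())  # [1499.0, 899.0, nan, 3499.5]
```

The trade-off: `errors='coerce'` silently hides new bad values, so it pairs well with an `isna().sum()` check afterwards.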
6. Clean phone numbers (very common task)
```python
# Keep only digits
df['phone_clean'] = (
    df['phone']
    .astype(str)
    .str.replace(r'[^0-9]', '', regex=True)   # remove everything except digits
)

# Optional: prepend the 91 country code when we have a bare 10-digit number
mask = (df['phone_clean'].str.len() == 10) & (~df['phone_clean'].str.startswith('0'))
df.loc[mask, 'phone_clean'] = '91' + df['phone_clean']

# Keeping the original column and adding a cleaned version is usually safer
# than overwriting the raw data
```
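Once only digits remain, a cheap validity check is just the length — 10 digits, or 12 with the `91` prefix (an assumption based on Indian mobile numbers; adjust for your market):

```python
import pandas as pd

phones = pd.Series(['+91-9876543210', '98765 43210', 'N/A', '987654321'])

digits = phones.str.replace(r'\D', '', regex=True)   # \D = any non-digit

# Flag numbers with a plausible length; everything else needs follow-up
valid = digits.str.len().isin([10, 12])
print(valid.tolist())  # [True, True, False, False]
```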
7. Fix dates — one of the hardest parts
```python
# Keep the original strings around — we'll need them for format-specific retries
df['orderdate_original'] = df['orderdate'].astype(str)

# Try automatic parsing first (often fails with mixed formats)
df['orderdate'] = pd.to_datetime(df['orderdate'], errors='coerce')

# Check which rows failed
print(df[df['orderdate'].isna()])

# Retry the known formats on just the rows that failed
date_formats = {
    r'\d{2}-\d{2}-\d{4}': '%d-%m-%Y',
    r'\d{4}/\d{2}/\d{2}': '%Y/%m/%d',
    r'\d{4}-\d{2}-\d{2}': '%Y-%m-%d',
    r'\d{2}/\d{2}/\d{4}': '%d/%m/%Y',
    r'[A-Za-z]{3}\s\d{2}\s\d{4}': '%b %d %Y',
}

for pattern, fmt in date_formats.items():
    mask = df['orderdate'].isna() & df['orderdate_original'].str.match(pattern, na=False)
    df.loc[mask, 'orderdate'] = pd.to_datetime(
        df.loc[mask, 'orderdate_original'], format=fmt, errors='coerce'
    )

# Final check — the column should now be a proper datetime dtype
df['orderdate'] = pd.to_datetime(df['orderdate'], errors='coerce')
```
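When the format list gets unwieldy, another option is per-element parsing with `dateutil` (already a pandas dependency), which handles most human-readable formats; anything unparseable becomes NaT:

```python
import pandas as pd
from dateutil import parser

dates = pd.Series(['15-03-2024', '2024/04/22', 'Sep 12 2024', 'garbage'])

def parse_any(value):
    """Parse one date string with dateutil; return NaT on failure."""
    try:
        return parser.parse(value, dayfirst=True)  # 15-03 read as 15 March
    except (ValueError, TypeError):
        return pd.NaT

parsed = dates.apply(parse_any)
print(parsed.isna().tolist())  # [False, False, False, True]
```

On pandas ≥ 2.0 you can often get a similar effect in one call with `pd.to_datetime(dates, format='mixed', dayfirst=True, errors='coerce')`.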
8. Final checks after cleaning
```python
df.info()
df.describe()
df['city'].value_counts()
df['status'].value_counts()
df.isna().sum()
df.head(12)
```
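Beyond eyeballing, a few `assert` statements make the final check repeatable — rerun them whenever the cleaning script changes. A sketch (the allowed status list is an assumption from this dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'amount': [1499.0, 899.0],
    'quantity': [3, 5],
    'status': ['Delivered', 'Pending'],
})

# Cheap guard-rails: fail loudly if a cleaning step regressed
assert df['amount'].dtype == 'float64'
assert df['amount'].ge(0).all()
assert df['status'].isin(['Delivered', 'Pending', 'Shipped', 'Cancelled']).all()
print('all checks passed')
```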
Realistic cleaning order most people follow
- Fix column names
- Look at missing values → decide strategy per column
- Strip spaces everywhere (strings)
- Standardize case & replace known typos
- Convert numbers (remove symbols, commas)
- Convert dates (mixed formats → pain)
- Standardize categories (status, city, product type…)
- Create derived columns (total_value, year, month…)
- Final missing value treatment
- Check dtypes again
- Look at head/tail/value_counts again
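The whole order above can be wrapped into one function, which makes the cleaning reproducible and testable. A simplified sketch covering a few of the steps (`clean_sales` and the toy columns are illustrative names, not from the dataset code above):

```python
import pandas as pd
import numpy as np

def clean_sales(df):
    """One possible end-to-end ordering of the steps above (sketch)."""
    df = df.copy()
    df.columns = df.columns.str.lower().str.strip()        # 1. column names
    df = df.replace(['', 'N/A'], np.nan)                   # 2. sentinels → NaN
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].str.strip()                      # 3. whitespace
    df['city'] = df['city'].str.title()                    # 4. case
    df['amount'] = pd.to_numeric(                          # 5. numbers
        df['amount'].str.replace(r'[^\d.]', '', regex=True),
        errors='coerce',
    )
    return df

messy = pd.DataFrame({
    'City': ['bangalore ', 'Pune'],
    'Amount': ['₹1,499', 'N/A'],
})
result = clean_sales(messy)
print(result)
```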
Your turn — mini cleaning exercise
Take this tiny messy row:
```
CustomerID: ' c015 '
Name:       ' amit KUMAR '
City:       'bangalore '
Amount:     '₹ 2,999.00 '
OrderDate:  '12-SEP-2024'
Status:     ' delivered'
```
Write code to clean it to:
- customerid → ‘C015’
- name → ‘Amit Kumar’
- city → ‘Bengaluru’
- amount → 2999.0 (float)
- orderdate → datetime
- status → ‘Delivered’
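One possible solution, if you want to compare after trying it yourself (the Bengaluru mapping and the `%d-%b-%Y` date format come from the steps above):

```python
import pandas as pd

row = pd.DataFrame({
    'CustomerID': [' c015 '],
    'Name': [' amit KUMAR '],
    'City': ['bangalore '],
    'Amount': ['₹ 2,999.00 '],
    'OrderDate': ['12-SEP-2024'],
    'Status': [' delivered'],
})

row.columns = row.columns.str.lower()
row['customerid'] = row['customerid'].str.strip().str.upper()
row['name'] = row['name'].str.strip().str.title()
row['city'] = row['city'].str.strip().str.title().replace({'Bangalore': 'Bengaluru'})
row['amount'] = row['amount'].str.replace(r'[^\d.]', '', regex=True).astype(float)
# title() turns '12-SEP-2024' into '12-Sep-2024' so %b matches cleanly
row['orderdate'] = pd.to_datetime(row['orderdate'].str.strip().str.title(),
                                  format='%d-%b-%Y')
row['status'] = row['status'].str.strip().str.capitalize()
print(row.iloc[0].to_dict())
```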
Try it — then we can compare approaches.
Where do you want to go next?
- Cleaning very messy dates in detail
- Dealing with duplicates (find & remove)
- Handling outliers realistically
- String cleaning patterns (regex examples)
- Category standardization with mapping & fuzzy matching
- Cleaning a real CSV together (you provide one, or I’ll give you a messy one)
Tell me what you want to focus on — I’ll go deep with examples. 😊
