Chapter 12: Removing Duplicates
Removing duplicates in pandas (Imagine we are sitting together with a laptop — I’m showing you the code, explaining what “duplicate” really means in practice, why it happens, when to remove them, when NOT to remove them, and how people actually handle duplicates in real projects.)
Why duplicates appear in real data
Duplicates are one of the most common data quality problems. They usually come from:
- Multiple data sources merged together
- The same person/order/event entered twice by mistake
- Data appended over time without checking for existing records
- API calls that returned overlapping results
- Copy-paste errors in Excel
- Joins that went wrong (especially cross joins or missing keys)
Important: Not every duplicate row is necessarily bad — sometimes rows look identical but actually represent different events (e.g. same customer placing two identical orders on the same day).
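Since bad joins are such a common source of duplicates, it is worth knowing that pandas can catch them at merge time with the `validate` argument. A minimal sketch (the two small DataFrames here are made up for illustration):

```python
import pandas as pd

orders = pd.DataFrame({'order_id': [1, 2, 3], 'amount': [100, 200, 300]})
# The lookup table accidentally contains order_id 2 twice
lookup = pd.DataFrame({'order_id': [1, 2, 2, 3], 'region': ['N', 'S', 'S', 'E']})

# validate='many_to_one' makes pandas raise pandas.errors.MergeError
# instead of silently multiplying rows when the right-hand keys repeat
try:
    merged = orders.merge(lookup, on='order_id', validate='many_to_one')
except pd.errors.MergeError as e:
    print("Join would duplicate rows:", e)
```

Using `validate` at every merge is a cheap habit that surfaces duplication at the source instead of three steps later.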
1. Let’s create a realistic example with different kinds of duplicates
```python
import pandas as pd
import numpy as np

data = {
    'order_id':      [1001, 1002, 1001, 1003, 1004, 1005, 1006, 1004, 1007, 1008],
    'customer_id':   ['C023', 'C045', 'C023', 'C067', 'C089',
                      'C023', 'C045', 'C089', 'C112', 'C023'],
    'customer_name': ['Priya S', 'Rahul K', 'Priya Sharma', 'Aniket R', 'Sneha P',
                      'Priya Sharma', 'Rahul Kumar', 'Sneha Patil', 'Meera V', 'Priya S'],
    'order_date':    ['2025-01-15', '2025-01-16', '2025-01-15', '2025-01-17', '2025-01-18',
                      '2025-01-15', '2025-01-16', '2025-01-18', '2025-01-19', '2025-01-15'],
    'amount':        [1499, 899, 1499, 2499, 3499, 1499, 899, 3499, 1299, 1499],
    'product':       ['Mouse', 'Keyboard', 'Mouse', 'Monitor', 'Laptop',
                      'Mouse', 'Keyboard', 'Laptop', 'Headphones', 'Mouse'],
    'status':        ['Delivered', 'Pending', 'Delivered', 'Shipped', 'Delivered',
                      'Delivered', 'Pending', 'Delivered', 'Shipped', 'Delivered'],
    'source':        ['website', 'app', 'website', 'website', 'app',
                      'website', 'app', 'app', 'website', 'website'],
}

df = pd.DataFrame(data)
df['order_date'] = pd.to_datetime(df['order_date'])

print("Original shape:", df.shape)
df
```
2. The most important question first: What counts as a duplicate for your analysis?
You almost always need to decide:
- Exact duplicate → all columns are identical
- Duplicate based on key columns → same order_id, or same customer_id + order_date + amount + product
- Duplicate customer → same customer_id (even if names are spelled differently)
- Near-duplicate → almost same, but small differences (spaces, capitalization)
3. Method 1 – Finding exact duplicates (all columns identical)
```python
# Show duplicate rows (including the first occurrence)
df[df.duplicated(keep=False)]
```

```python
# Count how many rows are exact duplicates
print("Number of exact duplicate rows:", df.duplicated().sum())

# Show only the duplicated ones (first occurrence kept)
df[df.duplicated()]
```
4. Method 2 – Removing exact duplicates
```python
# Most common: keep the first occurrence, drop the rest
df_clean = df.drop_duplicates()

# Alternative: keep the last occurrence (useful for latest data)
df_clean = df.drop_duplicates(keep='last')

# Drop all duplicates: remove even the first occurrence
df_clean = df.drop_duplicates(keep=False)

print("After removing exact duplicates:", df_clean.shape)
```
5. Method 3 – Removing duplicates based on specific columns (most common in real work)
```python
# Same order_id should appear only once
df_clean = df.drop_duplicates(subset=['order_id'], keep='first')

# Same customer + same date + same product + same amount = duplicate order
df_clean = df.drop_duplicates(
    subset=['customer_id', 'order_date', 'product', 'amount'],
    keep='last'  # keep the most recent entry
)

print(df_clean.shape)
```
6. Method 4 – Case-insensitive & space-insensitive duplicates
```python
# Normalize before checking
df['customer_name_lower'] = df['customer_name'].str.lower().str.strip()

# Now drop duplicates considering the normalized name
df_clean = df.drop_duplicates(
    subset=['customer_id', 'customer_name_lower', 'order_date'],
    keep='first'
)

# You can drop the helper column afterward
df_clean = df_clean.drop(columns=['customer_name_lower'])
```
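If names can also differ by extra spaces in the middle, you can collapse runs of internal whitespace as well. This is a small extension of the same normalization idea (the regex is an assumption about what "messy" means in your data):

```python
import pandas as pd

# Three spellings of the same name, differing only in case and spacing
names = pd.Series(['Priya  Sharma', 'priya sharma ', 'PRIYA SHARMA'])

normalized = (
    names.str.lower()                            # ignore case
         .str.strip()                            # trim leading/trailing spaces
         .str.replace(r'\s+', ' ', regex=True)   # collapse inner whitespace runs
)

print(normalized.nunique())  # prints 1 — all three collapse to 'priya sharma'
```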
7. Method 5 – Showing duplicates with context (very useful for investigation)
```python
# Show all rows that share a duplicate order_id
duplicate_order_ids = df[df.duplicated(subset=['order_id'], keep=False)]
duplicate_order_ids.sort_values('order_id')
```

```python
# Show customers who ordered the same thing multiple times on the same day
suspect = df[df.duplicated(
    subset=['customer_id', 'order_date', 'product'],
    keep=False
)].sort_values(['customer_id', 'order_date'])

print("Possible repeated orders:")
suspect
```
8. Real-world patterns people actually use
```python
# Pattern 1: Remove exact duplicates first, then key-based
df = df.drop_duplicates()                     # exact
df = df.drop_duplicates(subset=['order_id'])  # business key

# Pattern 2: Keep the preferred row per key
# 'app' sorts before 'website', so keep='first' prefers the 'app' row
df = df.sort_values('source')
df = df.drop_duplicates(subset=['order_id'], keep='first')

# Pattern 3: Mark duplicates instead of removing them
df['is_duplicate'] = df.duplicated(subset=['order_id'], keep=False)
df['duplicate_count'] = df.groupby('order_id')['order_id'].transform('count')

# Pattern 4: Aggregate instead of dropping
df_summary = df.groupby('order_id').agg({
    'customer_id': 'first',
    'amount': 'sum',
    'product': 'first',
    'status': 'last',
    'order_date': 'first'
}).reset_index()
```
9. Quick reference – commands you will use most often
```python
# Check
df.duplicated()                         # boolean Series
df.duplicated().sum()                   # count
df.duplicated(subset=['col1', 'col2'])  # based on columns
df.duplicated(keep=False)               # mark all copies

# Remove
df.drop_duplicates()                    # exact, keep first
df.drop_duplicates(keep='last')         # keep last
df.drop_duplicates(subset=['col1'])     # based on columns
df.drop_duplicates(keep=False)          # remove all copies

# Useful helpers
df.duplicated(subset=..., keep=False).sum()       # how many rows are involved in duplication
df[df.duplicated(..., keep=False)].sort_values(...)  # inspect them
```
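One case the table above does not cover: duplicates in the index rather than in the columns. `duplicated` exists on the index too, and the usual pattern is boolean masking (the toy Series below is just for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

# Index.duplicated works the same way as DataFrame.duplicated
print(s.index.duplicated())  # [False False  True]

# Keep the first row for each index label
deduped = s[~s.index.duplicated(keep='first')]
print(deduped)
```

This comes up often after concatenating frames with `pd.concat`, which happily produces repeated index labels.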
10. Mini practice task for you
Using the DataFrame above:
- Find all rows where the same customer_id ordered the same product more than once
- Remove duplicates keeping the most recent order (order_date latest)
- After cleaning, how many unique orders remain?
Try writing the code — then come back and we can compare approaches.
Where would you like to go next?
- Dealing with near-duplicates (fuzzy matching, typos in names)
- Finding logical duplicates (same customer, same amount, close dates…)
- Keeping first occurrence vs last occurrence vs most complete row
- Deduplicating very large datasets efficiently
- When you should NOT remove duplicates (and what to do instead)
Just tell me which direction feels most useful — I’ll explain slowly with examples. 😊
