Chapter 12: Removing Duplicates

Removing duplicates in pandas. Imagine we are sitting together with a laptop: I'm showing you the code, explaining what "duplicate" really means in practice, why duplicates happen, when to remove them, when NOT to remove them, and how people actually handle them in real projects.

Why duplicates appear in real data

Duplicates are one of the most common data quality problems. They usually come from:

  • Multiple data sources merged together
  • The same person/order/event entered twice by mistake
  • Data appended over time without checking for existing records
  • API calls that returned overlapping results
  • Copy-paste errors in Excel
  • Joins that went wrong (especially cross joins or missing keys)

Important: Not every duplicate row is necessarily bad — sometimes rows look identical but actually represent different events (e.g. same customer placing two identical orders on the same day).

1. Let’s create a realistic example with different kinds of duplicates

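A minimal sketch of such a table, with several kinds of duplicates deliberately baked in (all IDs, names, and values here are invented for illustration):

```python
import pandas as pd

# An invented orders table with duplicates deliberately baked in:
# - order 102 appears twice with ALL columns identical (exact duplicate)
# - order 105 appears twice, differing only in name casing/spacing (near-duplicate)
# - customer 3 ordered the same product twice on different dates (may be legitimate!)
df = pd.DataFrame({
    "order_id":      [101, 102, 102, 103, 104, 105, 105, 106],
    "customer_id":   [1, 2, 2, 3, 1, 4, 4, 3],
    "customer_name": ["Alice", "Bob", "Bob", "carol", "Alice", "Dan ", "dan", "Carol"],
    "product":       ["Widget", "Gadget", "Gadget", "Widget",
                      "Widget", "Gizmo", "Gizmo", "Widget"],
    "amount":        [20.0, 35.0, 35.0, 20.0, 20.0, 50.0, 50.0, 20.0],
    "order_date":    pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06",
                                     "2024-01-07", "2024-01-08", "2024-01-09",
                                     "2024-01-09", "2024-01-10"]),
})
print(df)
```

Eight rows, but only six real orders — and as we'll see, each kind of duplicate needs a different treatment.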

2. The most important question first: What counts as a duplicate for your analysis?

You almost always need to decide:

  • Exact duplicate → all columns are identical
  • Duplicate based on key columns → same order_id, or same customer_id + order_date + amount + product
  • Duplicate customer → same customer_id (even if names are spelled differently)
  • Near-duplicate → almost same, but small differences (spaces, capitalization)

3. Method 1 – Finding exact duplicates (all columns identical)

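A sketch using the invented example table from step 1 (rebuilt inside the snippet so it runs on its own). `duplicated()` gives you a boolean mask; `keep=False` is the variant you want when inspecting:

```python
import pandas as pd

# The invented example table, rebuilt so this snippet runs standalone
df = pd.DataFrame({
    "order_id":      [101, 102, 102, 103, 104, 105, 105, 106],
    "customer_id":   [1, 2, 2, 3, 1, 4, 4, 3],
    "customer_name": ["Alice", "Bob", "Bob", "carol", "Alice", "Dan ", "dan", "Carol"],
    "product":       ["Widget", "Gadget", "Gadget", "Widget",
                      "Widget", "Gizmo", "Gizmo", "Widget"],
    "amount":        [20.0, 35.0, 35.0, 20.0, 20.0, 50.0, 50.0, 20.0],
    "order_date":    pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06",
                                     "2024-01-07", "2024-01-08", "2024-01-09",
                                     "2024-01-09", "2024-01-10"]),
})

# duplicated() returns a boolean Series: True for rows that repeat an EARLIER row
mask = df.duplicated()          # keep='first' by default: the first copy is NOT flagged
print("exact duplicate rows:", mask.sum())

# keep=False flags EVERY copy, which is what you want for investigation
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```

Note that the near-duplicate of order 105 is NOT flagged here: `"Dan "` and `"dan"` are different strings, so the rows are not byte-for-byte identical.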

4. Method 2 – Removing exact duplicates

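Dropping exact duplicates is one call. A sketch on the same invented table (rebuilt so the snippet runs standalone):

```python
import pandas as pd

# The invented example table, rebuilt so this snippet runs standalone
df = pd.DataFrame({
    "order_id":      [101, 102, 102, 103, 104, 105, 105, 106],
    "customer_id":   [1, 2, 2, 3, 1, 4, 4, 3],
    "customer_name": ["Alice", "Bob", "Bob", "carol", "Alice", "Dan ", "dan", "Carol"],
    "product":       ["Widget", "Gadget", "Gadget", "Widget",
                      "Widget", "Gizmo", "Gizmo", "Widget"],
    "amount":        [20.0, 35.0, 35.0, 20.0, 20.0, 50.0, 50.0, 20.0],
    "order_date":    pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06",
                                     "2024-01-07", "2024-01-08", "2024-01-09",
                                     "2024-01-09", "2024-01-10"]),
})

# drop_duplicates() keeps the first copy of each fully identical row
deduped = df.drop_duplicates()                      # keep='first' by default
print(len(df), "->", len(deduped))                  # 8 -> 7

# reset_index gives the cleaned frame a contiguous index again
deduped = df.drop_duplicates().reset_index(drop=True)
```

`drop_duplicates` returns a new DataFrame; the original `df` is untouched unless you reassign it.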

5. Method 3 – Removing duplicates based on specific columns (most common in real work)

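In real work you usually deduplicate on a business key via the `subset` parameter, not on every column. A sketch on the invented table (rebuilt so it runs standalone):

```python
import pandas as pd

# The invented example table, rebuilt so this snippet runs standalone
df = pd.DataFrame({
    "order_id":      [101, 102, 102, 103, 104, 105, 105, 106],
    "customer_id":   [1, 2, 2, 3, 1, 4, 4, 3],
    "customer_name": ["Alice", "Bob", "Bob", "carol", "Alice", "Dan ", "dan", "Carol"],
    "product":       ["Widget", "Gadget", "Gadget", "Widget",
                      "Widget", "Gizmo", "Gizmo", "Widget"],
    "amount":        [20.0, 35.0, 35.0, 20.0, 20.0, 50.0, 50.0, 20.0],
    "order_date":    pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06",
                                     "2024-01-07", "2024-01-08", "2024-01-09",
                                     "2024-01-09", "2024-01-10"]),
})

# Treat rows as duplicates when the business KEY matches, even if other columns differ.
# This catches BOTH copies of order 105, which exact matching missed.
by_order_id = df.drop_duplicates(subset=["order_id"])
print(len(by_order_id))            # 6 rows: repeats of orders 102 and 105 are gone

# Composite key: same customer, date, amount, and product = same real-world order
key = ["customer_id", "order_date", "amount", "product"]
by_key = df.drop_duplicates(subset=key, keep="first")

# keep='last' keeps the most recently loaded copy of each key instead
by_key_last = df.drop_duplicates(subset=key, keep="last")
```

Which `keep` you choose matters: `'first'` and `'last'` keep different copies when the non-key columns disagree, so decide deliberately which record should survive.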

6. Method 4 – Case-insensitive & space-insensitive duplicates

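One common approach (a sketch, not the only one): normalize the text columns first, decide which rows are duplicates on the normalized view, then keep the original values in the result. Using the invented table, where `"Dan "` and `"dan"` should count as the same person:

```python
import pandas as pd

# The invented example table, rebuilt so this snippet runs standalone
df = pd.DataFrame({
    "order_id":      [101, 102, 102, 103, 104, 105, 105, 106],
    "customer_id":   [1, 2, 2, 3, 1, 4, 4, 3],
    "customer_name": ["Alice", "Bob", "Bob", "carol", "Alice", "Dan ", "dan", "Carol"],
    "product":       ["Widget", "Gadget", "Gadget", "Widget",
                      "Widget", "Gizmo", "Gizmo", "Widget"],
    "amount":        [20.0, 35.0, 35.0, 20.0, 20.0, 50.0, 50.0, 20.0],
    "order_date":    pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06",
                                     "2024-01-07", "2024-01-08", "2024-01-09",
                                     "2024-01-09", "2024-01-10"]),
})

# Normalize text before comparing: strip surrounding spaces, lowercase
norm = df.assign(customer_name=df["customer_name"].str.strip().str.lower())

# On the normalized view, 'dan' now matches 'Dan ' on every column,
# so both copies of order 105 are flagged alongside both copies of 102
dupe_mask = norm.duplicated(keep=False)
print(df[dupe_mask])

# Drop based on the normalized view, but keep the ORIGINAL values in the result
cleaned = df[~norm.duplicated()]
print(len(cleaned))                # 6
```

This catches casing and whitespace variants; genuine typos ("Jon" vs "John") need fuzzy matching, which is a separate topic.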

7. Method 5 – Showing duplicates with context (very useful for investigation)

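Before deleting anything, look at the duplicates side by side. A sketch on the invented table: attach a copy count to every row via `groupby(...).transform`, then pull out and sort the offending groups:

```python
import pandas as pd

# The invented example table, rebuilt so this snippet runs standalone
df = pd.DataFrame({
    "order_id":      [101, 102, 102, 103, 104, 105, 105, 106],
    "customer_id":   [1, 2, 2, 3, 1, 4, 4, 3],
    "customer_name": ["Alice", "Bob", "Bob", "carol", "Alice", "Dan ", "dan", "Carol"],
    "product":       ["Widget", "Gadget", "Gadget", "Widget",
                      "Widget", "Gizmo", "Gizmo", "Widget"],
    "amount":        [20.0, 35.0, 35.0, 20.0, 20.0, 50.0, 50.0, 20.0],
    "order_date":    pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06",
                                     "2024-01-07", "2024-01-08", "2024-01-09",
                                     "2024-01-09", "2024-01-10"]),
})

key = ["order_id"]

# Attach the number of copies of each key to every row
df["n_copies"] = df.groupby(key)["customer_id"].transform("size")

# Show only the keys with more than one copy, sorted so copies sit together
dups = df[df["n_copies"] > 1].sort_values(key + ["order_date"])
print(dups)

# A quick frequency table of the offending keys
print(df[df.duplicated(subset=key, keep=False)].value_counts(subset=key))
```

Seeing the copies next to each other is how you notice that order 102 is a true repeat while the two rows of order 105 differ only in name formatting.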

8. Real-world patterns people actually use

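A few patterns you will see in real pipelines, sketched on the invented table. The "completeness score" tie-break in pattern 3 is a hypothetical helper, not a built-in:

```python
import pandas as pd

# The invented example table, rebuilt so this snippet runs standalone
df = pd.DataFrame({
    "order_id":      [101, 102, 102, 103, 104, 105, 105, 106],
    "customer_id":   [1, 2, 2, 3, 1, 4, 4, 3],
    "customer_name": ["Alice", "Bob", "Bob", "carol", "Alice", "Dan ", "dan", "Carol"],
    "product":       ["Widget", "Gadget", "Gadget", "Widget",
                      "Widget", "Gizmo", "Gizmo", "Widget"],
    "amount":        [20.0, 35.0, 35.0, 20.0, 20.0, 50.0, 50.0, 20.0],
    "order_date":    pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06",
                                     "2024-01-07", "2024-01-08", "2024-01-09",
                                     "2024-01-09", "2024-01-10"]),
})

# Pattern 1: keep the most recent record per key (sort first, then keep='last')
latest = (df.sort_values("order_date")
            .drop_duplicates(subset=["customer_id", "product"], keep="last"))

# Pattern 2: log how many rows were removed, so the cleanup is auditable
before, after = len(df), len(latest)
print(f"removed {before - after} rows ({before} -> {after})")

# Pattern 3 (hypothetical tie-break): keep the most complete row per key by
# scoring each row on non-missing values, sorting, then keeping the first
score = df.notna().sum(axis=1)
most_complete = (df.assign(_score=score)
                   .sort_values("_score", ascending=False, kind="stable")
                   .drop_duplicates(subset=["order_id"])
                   .drop(columns="_score"))
```

The sort-then-drop idiom in patterns 1 and 3 is the workhorse: `drop_duplicates` keeps rows in their current order, so you control which copy survives by sorting first.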

9. Quick reference – commands you will use most often

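A cheat sheet of the calls covered above, run against the same invented table:

```python
import pandas as pd

# The invented example table, rebuilt so the cheat sheet runs standalone
df = pd.DataFrame({
    "order_id":      [101, 102, 102, 103, 104, 105, 105, 106],
    "customer_id":   [1, 2, 2, 3, 1, 4, 4, 3],
    "customer_name": ["Alice", "Bob", "Bob", "carol", "Alice", "Dan ", "dan", "Carol"],
    "product":       ["Widget", "Gadget", "Gadget", "Widget",
                      "Widget", "Gizmo", "Gizmo", "Widget"],
    "amount":        [20.0, 35.0, 35.0, 20.0, 20.0, 50.0, 50.0, 20.0],
    "order_date":    pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-06",
                                     "2024-01-07", "2024-01-08", "2024-01-09",
                                     "2024-01-09", "2024-01-10"]),
})

df.duplicated()                          # mask: True for repeats of earlier rows
df.duplicated(keep=False)                # mask: True for EVERY copy
df.duplicated(subset=["order_id"])       # judge duplicates on key columns only
df.drop_duplicates()                     # drop exact repeats, keep the first
df.drop_duplicates(subset=["order_id"], keep="last")   # keep last copy per key

n_exact = int(df.duplicated().sum())                   # rows an exact dedup would drop
n_by_key = len(df.drop_duplicates(subset=["order_id"]))
print(n_exact, n_by_key)                               # 1 6
```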

10. Mini practice task for you

Using the DataFrame above:

  1. Find all rows where the same customer_id ordered the same product more than once
  2. Remove duplicates keeping the most recent order (order_date latest)
  3. After cleaning, how many unique orders remain?

Try writing the code — then come back and we can compare approaches.

Where would you like to go next?

  • Dealing with near-duplicates (fuzzy matching, typos in names)
  • Finding logical duplicates (same customer, same amount, close dates…)
  • Keeping first occurrence vs last occurrence vs most complete row
  • Deduplicating very large datasets efficiently
  • When you should NOT remove duplicates (and what to do instead)

Just tell me which direction feels most useful — I’ll explain slowly with examples. 😊
