
8 min read

Data Engineering · Python · Pandas · Data Cleaning

Handling Duplicate Rows in Pandas — Identify, Remove & Export Clean Data

Library: pandas

Methods: duplicated() · drop_duplicates() · reset_index()

Output: Cleaned CSV

Stack: Python · pandas · df.duplicated() · drop_duplicates() · reset_index() · to_csv()

Duplicate rows are one of the most common data quality issues — and one of the most damaging to model accuracy and analysis reliability. Pandas gives you precise tools to detect, inspect, and remove duplicates with a single line of code. This guide walks through the full pipeline: load, detect, choose a strategy, clean, and export.

Context

Why duplicates matter

Duplicate rows skew aggregations, inflate record counts, bias ML model training, and produce misleading visualizations. A sales total that counts the same transaction twice, a classifier trained on repeated samples — both produce results that look correct but aren’t. Clean data is the foundation everything else is built on.
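The sales example above can be shown in miniature. This is a hypothetical toy frame, with the first two rows representing the same transaction recorded twice:

```python
import pandas as pd

# Toy sales data: rows 0 and 1 are the same transaction recorded twice
sales = pd.DataFrame({
    'order_id': [101, 101, 102],
    'amount':   [100.0, 100.0, 50.0],
})

total_raw   = sales['amount'].sum()                    # 250.0, inflated by the duplicate
total_clean = sales.drop_duplicates()['amount'].sum()  # 150.0, the true total
print(total_raw, total_clean)
```

The raw total looks plausible on its own, which is exactly why duplicates are dangerous: nothing about the number 250.0 signals that it is wrong.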

keep='first'

Keep First

Drop all duplicates except the first occurrence. The original record is preserved. Most common default choice.

keep='last'

Keep Last

Drop all duplicates except the last occurrence. Useful when later records represent updated values.

keep=False

Drop All

Remove every instance of a duplicated row — including the first. Use when any duplicated record is invalid.
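The three strategies can be compared side by side on a minimal frame (a hypothetical three-row example, where the value 1 appears twice):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2]})

print(df.drop_duplicates(keep='first')['id'].tolist())  # [1, 2]: rows 0 and 2 survive
print(df.drop_duplicates(keep='last')['id'].tolist())   # [1, 2]: rows 1 and 2 survive
print(df.drop_duplicates(keep=False)['id'].tolist())    # [2]: only the never-duplicated row
```

Note that keep='first' and keep='last' return the same values here but different original rows, which matters when other columns (timestamps, status flags) differ between occurrences.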


Pipeline

The four-step deduplication pipeline

1

Load

Read the raw CSV into a DataFrame with pd.read_csv().

2

Detect

Use df.duplicated() to identify and inspect all duplicate rows before touching the data.

3

Remove

Call drop_duplicates(keep=...) with your chosen strategy. Reset the index for a clean sequential result.

4

Export

Write the cleaned DataFrame back to CSV with to_csv() for downstream use.


Step 1

Load your dataset

Start by reading your data into a pandas DataFrame. pd.read_csv() is the standard entry point for flat files. From here, all deduplication operations work on the in-memory DataFrame — your source file is never modified.

load_data.py
import pandas as pd

# Read the raw dataset into a DataFrame
df = pd.read_csv('your_data_file.csv')

# Quick shape check before cleaning
print(f"Rows: {df.shape[0]:,}  |  Columns: {df.shape[1]}")
Other sources: The same deduplication logic applies regardless of how you load your data. Use pd.read_excel() for XLSX, pd.read_parquet() for Parquet, or query a database with pd.read_sql() — all return a DataFrame you can clean the same way.

Step 2

Detect & inspect duplicates

df.duplicated() returns a boolean Series — True for every row that is a duplicate of an earlier row. Always inspect before you remove — understanding what the duplicates look like helps you choose the right strategy.

detect_duplicates.py
# Boolean mask: True for every row that is a duplicate
duplicate_mask = df.duplicated()

# How many duplicates exist?
print(f"Duplicate rows found: {duplicate_mask.sum():,}")

# Inspect the duplicate rows themselves
duplicates = df[df.duplicated()]
print(duplicates)

# See ALL occurrences of duplicated rows (including originals)
all_dupes = df[df.duplicated(keep=False)]
print(all_dupes.sort_values(by=df.columns.tolist()))
Subset duplicates: By default duplicated() checks all columns. To flag rows that are duplicates only on specific columns (e.g. same customer_id): df.duplicated(subset=['customer_id']). This is useful for finding logical duplicates even when other columns differ.
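The difference between a full-row check and a subset check is easy to see on a small hypothetical frame, where two rows share a customer_id but have different order totals:

```python
import pandas as pd

# Toy frame: same customer_id on rows 0 and 1, different order_total
df = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C2'],
    'order_total': [10.0, 25.0, 40.0],
})

# Full-row check finds nothing: the rows differ in order_total
print(df.duplicated().sum())                        # 0

# Subset check flags the second C1 row as a logical duplicate
print(df.duplicated(subset=['customer_id']).sum())  # 1
```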

Step 3

Remove duplicates — three strategies

drop_duplicates() returns a new DataFrame by default — the original is untouched. The keep parameter controls which occurrence survives. After removing, reset_index(drop=True) gives you a clean sequential index starting from 0.

remove_duplicates.py
# ── Strategy 1: keep the FIRST occurrence (default) ──
df_keep_first = df.drop_duplicates(keep='first')

# ── Strategy 2: keep the LAST occurrence ──
#    useful when later rows represent updated/corrected records
df_keep_last = df.drop_duplicates(keep='last')

# ── Strategy 3: drop ALL occurrences of any duplicated row ──
#    use when any repeated row is invalid data
df_drop_all = df.drop_duplicates(keep=False)

# ── Subset: deduplicate only on specific columns ──
df_subset = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='first')

# ── Reset the index after removal (clean 0-based index) ──
df_cleaned = df_keep_first.reset_index(drop=True)

# Confirm rows removed
print(f"Before: {len(df):,}  |  After: {len(df_cleaned):,}  |  Removed: {len(df) - len(df_cleaned):,}")
inplace vs assignment: drop_duplicates(inplace=True) modifies the DataFrame in place and returns None. Prefer the assignment pattern (df_cleaned = df.drop_duplicates()) — it preserves the original for comparison and makes your code easier to debug.
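The None-return trap described above is worth seeing once, since it silently breaks method chaining (a small sketch on toy data):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2]})

# inplace=True mutates df and returns None, a common chaining trap
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2: df itself was modified

# Preferred: reassignment keeps the original intact for comparison
df2 = pd.DataFrame({'x': [1, 1, 2]})
df2_cleaned = df2.drop_duplicates()
print(len(df2), len(df2_cleaned))  # 3 2
```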

Step 4

Export the clean data

Write the deduplicated DataFrame back to a CSV. Setting index=False prevents pandas from writing the row index as an extra column — your downstream consumers will thank you.

export.py
# Export to CSV — index=False keeps the file clean
df_cleaned.to_csv('cleaned_data.csv', index=False)

print("Cleaned data exported to cleaned_data.csv")

# Optional: also export to Parquet for better performance at scale
# (requires a Parquet engine such as pyarrow or fastparquet)
df_cleaned.to_parquet('cleaned_data.parquet', index=False)

Complete Reference

Full deduplication script

Everything in one place — load, detect, remove (keep first), reset index, and export.

deduplicate.py — full script
import pandas as pd

# ── 1. Load ─────────────────────────────────────────────
df = pd.read_csv('your_data_file.csv')
print(f"Loaded {len(df):,} rows")

# ── 2. Detect ────────────────────────────────────────────
duplicates = df[df.duplicated()]
print(f"Duplicate rows found: {len(duplicates):,}")
print(duplicates)

# ── 3a. Keep last occurrence of each duplicate row ───────
df_cleaned = df.drop_duplicates(keep='last')

# ── 3b. Keep first occurrence (swap in if preferred) ─────
# df_cleaned = df.drop_duplicates(keep='first')

# ── 3c. Reset the index to a clean 0-based sequence ──────
df_cleaned = df_cleaned.reset_index(drop=True)

print(f"Rows after cleaning: {len(df_cleaned):,}")

# ── 4. Export ─────────────────────────────────────────────
df_cleaned.to_csv('cleaned_data.csv', index=False)
print("Exported to cleaned_data.csv")

Interview Prep

Cheat sheet — quick definitions to remember

Define
What does df.duplicated() return?
A boolean Series the same length as the DataFrame — True for every row that is a duplicate of a previously seen row, False otherwise. The first occurrence is marked False by default.
Boolean Series · True = duplicate · first = False by default
Compare
keep='first' vs keep='last' vs keep=False
first — keeps the first occurrence, drops all subsequent duplicates. last — keeps the final occurrence, useful for updated records. False — drops every occurrence of any duplicated row, leaving only rows that were unique to begin with.
first = keep original · last = keep latest · False = drop all copies
Explain
What does the subset parameter do?
By default, duplicated() and drop_duplicates() compare all columns. The subset parameter restricts the comparison to specific columns — for example subset=['customer_id'] finds rows with the same customer ID even if other columns differ.
Default = all columns · subset = logical dedup
Gotcha
Why call reset_index(drop=True) after deduplication?
After dropping rows, the DataFrame retains the original row indices — you’d have gaps like 0, 1, 4, 7 instead of 0, 1, 2, 3. reset_index(drop=True) renumbers from 0 continuously. drop=True prevents the old index from being added as a column.
Index gaps after drop · reset_index fixes gaps · drop=True prevents extra column
Gotcha
inplace=True vs reassignment — which is preferred?
Prefer reassignment (df_cleaned = df.drop_duplicates()) — it preserves the original DataFrame for comparison and makes pipelines easier to debug. inplace=True modifies the object and returns None, which can cause confusion when chaining operations. Many pandas best-practice guides now recommend avoiding inplace.
Reassignment = safer · inplace returns None
Best Practice
How do you handle duplicates in a production data pipeline?
Three layers: (1) Detect and log before removing — store duplicate counts as data quality metrics. (2) Deduplicate at ingestion, not at query time — clean once, use many times. (3) Add a unique constraint in your database or Delta Lake table to prevent duplicates from re-entering at source.
Log before removing · Clean at ingestion · DB unique constraints
Use Case
When should you NOT remove duplicates?
When the repeated rows represent legitimate repeated events — a customer placing the same order twice on different days, a sensor reading the same value consecutively, or audit log entries. Always validate with domain knowledge before dropping. Use subset to deduplicate on business keys, not entire rows.
Repeated events = valid · Use subset= for business keys
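The index-gap gotcha from the cheat sheet is easy to reproduce on a toy frame with two duplicate pairs:

```python
import pandas as pd

df = pd.DataFrame({'v': [1, 1, 2, 2, 3]})
deduped = df.drop_duplicates()

print(deduped.index.tolist())                         # [0, 2, 4]: gaps remain after the drop
print(deduped.reset_index(drop=True).index.tolist())  # [0, 1, 2]: clean sequential index
```

Positional access like deduped.iloc[1] still works with a gapped index, but label-based access and joins can behave unexpectedly, which is why resetting the index after a drop is the safer habit.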
