Data Engineering · Python · Pandas · Data Cleaning
Handling Duplicate Rows in Pandas — Identify, Remove & Export Clean Data
Duplicate rows are one of the most common data quality issues — and one of the most damaging to model accuracy and analysis reliability. Pandas gives you precise tools to detect, inspect, and remove duplicates with a single line of code. This guide walks through the full pipeline: load, detect, choose a strategy, clean, and export.
Context
Why duplicates matter
Duplicate rows skew aggregations, inflate record counts, bias ML model training, and produce misleading visualizations. A sales total that counts the same transaction twice, a classifier trained on repeated samples — both produce results that look correct but aren’t. Clean data is the foundation everything else is built on.
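To make this concrete, here is a minimal sketch with a hypothetical four-row sales table, where one transaction was ingested twice and inflates the total:

```python
import pandas as pd

# Toy sales data: transaction 101 was accidentally ingested twice
sales = pd.DataFrame({
    "transaction_id": [101, 102, 101, 103],
    "amount": [50.0, 30.0, 50.0, 20.0],
})

print(sales["amount"].sum())                    # inflated total: 150.0
print(sales.drop_duplicates()["amount"].sum())  # true total: 100.0
```

The inflated total looks perfectly plausible on its own, which is exactly why duplicates are dangerous: nothing about the output signals that a record was counted twice.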
Keep First
Drop all duplicates except the first occurrence. The original record is preserved. Most common default choice.
Keep Last
Drop all duplicates except the last occurrence. Useful when later records represent updated values.
Drop All
Remove every instance of a duplicated row — including the first. Use when any duplicated record is invalid.
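A quick sketch of the three strategies side by side, on a hypothetical six-row frame, just to show how the surviving row counts differ:

```python
import pandas as pd

# id 1 appears twice, id 3 three times; id 2 is never duplicated
df = pd.DataFrame({
    "id":    [1, 1, 2, 3, 3, 3],
    "value": ["a", "a", "b", "c", "c", "c"],
})

print(len(df.drop_duplicates(keep='first')))  # 3: first occurrence of each row survives
print(len(df.drop_duplicates(keep='last')))   # 3: last occurrence of each row survives
print(len(df.drop_duplicates(keep=False)))    # 1: only the never-duplicated row (id 2)
```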
Pipeline
The four-step deduplication pipeline
Load
Read the raw CSV into a DataFrame with pd.read_csv().
Detect
Use df.duplicated() to identify and inspect all duplicate rows before touching the data.
Remove
Call drop_duplicates(keep=...) with your chosen strategy. Reset the index for a clean sequential result.
Export
Write the cleaned DataFrame back to CSV with to_csv() for downstream use.
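The four steps above could be wrapped in one small helper; `deduplicate_csv` and its arguments are hypothetical names for illustration, not part of any standard API:

```python
import pandas as pd

def deduplicate_csv(src: str, dst: str, keep: str = "first") -> pd.DataFrame:
    """Load a CSV, report duplicates, remove them, and export the clean result."""
    df = pd.read_csv(src)                       # 1. Load
    n_dupes = int(df.duplicated().sum())        # 2. Detect
    print(f"Duplicate rows found: {n_dupes:,}")
    cleaned = df.drop_duplicates(keep=keep)     # 3. Remove
    cleaned = cleaned.reset_index(drop=True)
    cleaned.to_csv(dst, index=False)            # 4. Export
    return cleaned
```

Each of these steps is covered in detail below.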
Step 1
Load your dataset
Start by reading your data into a pandas DataFrame. pd.read_csv() is the standard entry point for flat files. From here, all deduplication operations work on the in-memory DataFrame — your source file is never modified.
import pandas as pd
# Read the raw dataset into a DataFrame
df = pd.read_csv('your_data_file.csv')
# Quick shape check before cleaning
print(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")

Tip: pd.read_excel() for XLSX, pd.read_parquet() for Parquet, or query a database with pd.read_sql() — all return a DataFrame you can clean the same way.
Step 2
Detect & inspect duplicates
df.duplicated() returns a boolean Series — True for every row that is a duplicate of an earlier row. Always inspect before you remove — understanding what the duplicates look like helps you choose the right strategy.
# Boolean mask: True for every row that is a duplicate
duplicate_mask = df.duplicated()
# How many duplicates exist?
print(f"Duplicate rows found: {duplicate_mask.sum():,}")
# Inspect the duplicate rows themselves
duplicates = df[df.duplicated()]
print(duplicates)
# See ALL occurrences of duplicated rows (including originals)
all_dupes = df[df.duplicated(keep=False)]
print(all_dupes.sort_values(by=df.columns.tolist()))

Note: duplicated() checks all columns. To flag rows that are duplicates only on specific columns (e.g. same customer_id): df.duplicated(subset=['customer_id']). This is useful for finding logical duplicates even when other columns differ.
Step 3
Remove duplicates — three strategies
drop_duplicates() returns a new DataFrame by default — the original is untouched. The keep parameter controls which occurrence survives. After removing, reset_index(drop=True) gives you a clean sequential index starting from 0.
# ── Strategy 1: keep the FIRST occurrence (default) ──
df_keep_first = df.drop_duplicates(keep='first')
# ── Strategy 2: keep the LAST occurrence ──
# useful when later rows represent updated/corrected records
df_keep_last = df.drop_duplicates(keep='last')
# ── Strategy 3: drop ALL occurrences of any duplicated row ──
# use when any repeated row is invalid data
df_drop_all = df.drop_duplicates(keep=False)
# ── Subset: deduplicate only on specific columns ──
df_subset = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='first')
# ── Reset the index after removal (clean 0-based index) ──
df_cleaned = df_keep_first.reset_index(drop=True)
# Confirm rows removed
print(f"Before: {len(df):,} | After: {len(df_cleaned):,} | Removed: {len(df) - len(df_cleaned):,}")

Note: drop_duplicates(inplace=True) modifies the DataFrame in place and returns None. Prefer the assignment pattern (df_cleaned = df.drop_duplicates()) — it preserves the original for comparison and makes your code easier to debug.
Step 4
Export the clean data
Write the deduplicated DataFrame back to a CSV. Setting index=False prevents pandas from writing the row index as an extra column — your downstream consumers will thank you.
# Export to CSV — index=False keeps the file clean
df_cleaned.to_csv('cleaned_data.csv', index=False)
print("Cleaned data exported to cleaned_data.csv")
# Optional: also export to Parquet for better performance at scale
df_cleaned.to_parquet('cleaned_data.parquet', index=False)

Complete Reference
Full deduplication script
Everything in one place — load, detect, remove (keep last, with keep first as a swap-in), reset index, and export.
import pandas as pd
# ── 1. Load ─────────────────────────────────────────────
df = pd.read_csv('your_data_file.csv')
print(f"Loaded {len(df):,} rows")
# ── 2. Detect ────────────────────────────────────────────
duplicates = df[df.duplicated()]
print(f"Duplicate rows found: {len(duplicates):,}")
print(duplicates)
# ── 3a. Keep last occurrence of each duplicate row ───────
df_cleaned = df.drop_duplicates(keep='last')
# ── 3b. Keep first occurrence (swap in if preferred) ─────
# df_cleaned = df.drop_duplicates(keep='first')
# ── 3c. Reset the index to a clean 0-based sequence ──────
df_cleaned = df_cleaned.reset_index(drop=True)
print(f"Rows after cleaning: {len(df_cleaned):,}")
# ── 4. Export ─────────────────────────────────────────────
df_cleaned.to_csv('cleaned_data.csv', index=False)
print("Exported to cleaned_data.csv")

Interview Prep
Cheat sheet — quick definitions to remember
What does df.duplicated() return?
A boolean Series: True for every row that is a duplicate of a previously seen row, False otherwise. The first occurrence is marked False by default.
keep='first' vs keep='last' vs keep=False
keep='first' retains the first occurrence of each duplicated row, keep='last' retains the last, and keep=False removes every occurrence of any duplicated row.
What does the subset parameter do?
duplicated() and drop_duplicates() compare all columns. The subset parameter restricts the comparison to specific columns — for example subset=['customer_id'] finds rows with the same customer ID even if other columns differ.
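A minimal sketch of the difference, using a hypothetical orders frame in which no full row repeats but a customer_id does:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "note": ["first order", "repeat contact", "first order"],
})

# No full row repeats, so the default check finds nothing...
print(orders.duplicated().sum())                        # 0
# ...but restricting to the business key flags the repeat
print(orders.duplicated(subset=["customer_id"]).sum())  # 1
```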
Why call reset_index(drop=True) after deduplication?
Dropping rows leaves gaps in the index; reset_index(drop=True) renumbers from 0 continuously. drop=True prevents the old index from being added as a column.
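A tiny illustration of the gap that dropping a row leaves behind, on a toy three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2]})
cleaned = df.drop_duplicates()

print(list(cleaned.index))                         # [0, 2]: gap where row 1 was dropped
print(list(cleaned.reset_index(drop=True).index))  # [0, 1]: clean sequential index
```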
inplace=True vs reassignment — which is preferred?
Prefer reassignment (df_cleaned = df.drop_duplicates()) — it preserves the original DataFrame for comparison and makes pipelines easier to debug. inplace=True modifies the object and returns None, which can cause confusion when chaining operations. Many pandas best-practice guides now recommend avoiding inplace.
How do you handle duplicates in a production data pipeline?
Detect and log duplicate counts before removing anything, make deduplication an explicit, reviewable pipeline step, and use subset to deduplicate on business keys, not entire rows.
When should you NOT remove duplicates?
When repeated rows are legitimate data: for example event logs or transaction streams where identical records represent distinct real-world events. Removing them would silently delete valid observations.