<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>gen-ai Archives - Vijay Gokarn</title>
	<atom:link href="https://vijay-gokarn.com/tag/gen-ai/feed/" rel="self" type="application/rss+xml" />
	<link>https://vijay-gokarn.com/tag/gen-ai/</link>
	<description>&#34;Ignite Curiosity. Fuel the Future.&#34;</description>
	<lastBuildDate>Sun, 19 Apr 2026 03:33:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/vijay-gokarn.com/wp-content/uploads/2023/09/cropped-ideogram.jpeg?fit=32%2C32&#038;ssl=1</url>
	<title>gen-ai Archives - Vijay Gokarn</title>
	<link>https://vijay-gokarn.com/tag/gen-ai/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">230943525</site>	<item>
		<title>Pandas Remove Duplicates</title>
		<link>https://vijay-gokarn.com/pandas-remove-duplicates/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=pandas-remove-duplicates</link>
		
		<dc:creator><![CDATA[Vijay Gokarn]]></dc:creator>
		<pubDate>Tue, 09 Jul 2024 11:12:55 +0000</pubDate>
				<category><![CDATA[ai-agents]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[generative-ai]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[data-analysis]]></category>
		<category><![CDATA[gen-ai]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">https://vijay-gokarn.com/?p=119</guid>

					<description><![CDATA[<p>Data Engineering · Python · Pandas · Data Cleaning. Handling Duplicate Rows in Pandas — Identify, Remove &#038; Export Clean Data. Duplicate rows are one of the most common data quality issues — and one of the most damaging to [&#8230;]</p>
<p>The post <a href="https://vijay-gokarn.com/pandas-remove-duplicates/">Pandas Remove Duplicates</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Cormorant+Garamond:ital,wght@0,300;0,400;0,600;1,300;1,400&#038;family=DM+Sans:wght@300;400;500&#038;family=DM+Mono:wght@400&#038;display=swap" rel="stylesheet">

<style>
.vg8 {
  --ink: #0e0e0e; --paper: #f7f4ef; --paper-dark: #ede9e1;
  --teal: #0f6e56; --teal-light: #1d9e75; --teal-muted: #e1f5ee;
  --amber: #ba7517; --amber-light: #fac775; --amber-muted: #faeeda;
  --charcoal: #2c2c2a; --muted: #888780;
  --border: rgba(14,14,14,0.12); --border-strong: rgba(14,14,14,0.25);
  --code-bg: #161b22; --code-header: #2d333b; --code-border: rgba(255,255,255,0.06);
  font-family: 'DM Sans', sans-serif; font-weight: 300;
  color: var(--ink); background: var(--paper); line-height: 1.75; font-size: 16px; overflow-x: hidden;
}
.vg8 *, .vg8 *::before, .vg8 *::after { box-sizing: border-box; margin: 0; padding: 0; }

/* HERO */
.vg8-hero { background: #0d1117; padding: 5rem 4rem 4rem; position: relative; overflow: hidden; }
.vg8-hero::before {
  content: '⊕'; font-family: 'Cormorant Garamond', serif; font-size: 22rem;
  font-weight: 300; color: rgba(255,255,255,0.025); position: absolute;
  right: 1rem; bottom: -5rem; line-height: 1; pointer-events: none;
}
.vg8-hero-inner { position: relative; z-index: 1; max-width: 900px; }
.vg8-eyebrow { font-size: 0.68rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; margin-bottom: 1.25rem; display: flex; align-items: center; gap: 0.75rem; }
.vg8-eyebrow::before { content: ''; display: inline-block; width: 1.5rem; height: 1px; background: var(--teal-light); }
.vg8-hero h1 { font-family: 'Cormorant Garamond', serif; font-size: clamp(2.2rem, 5vw, 3.8rem); font-weight: 300; line-height: 1.1; color: var(--paper); letter-spacing: -0.02em; margin-bottom: 1.5rem; max-width: 28ch; }
.vg8-hero h1 em { font-style: italic; color: var(--amber-light); }
.vg8-meta-row { display: flex; gap: 2rem; flex-wrap: wrap; }
.vg8-meta { font-size: 0.7rem; letter-spacing: 0.1em; text-transform: uppercase; color: rgba(247,244,239,0.35); }
.vg8-meta span { color: rgba(247,244,239,0.7); margin-left: 0.4rem; }

/* STACK BAND */
.vg8-stack-band { background: var(--teal); padding: 1.1rem 4rem; display: flex; gap: 0.75rem; flex-wrap: wrap; align-items: center; }
.vg8-stack-label { font-size: 0.63rem; letter-spacing: 0.18em; text-transform: uppercase; color: rgba(255,255,255,0.6); font-weight: 400; margin-right: 0.4rem; }
.vg8-stack-pill { font-size: 0.7rem; letter-spacing: 0.05em; padding: 0.28rem 0.85rem; background: rgba(255,255,255,0.12); color: #fff; border: 0.5px solid rgba(255,255,255,0.2); }

/* INTRO */
.vg8-intro { background: var(--teal-muted); padding: 2.5rem 4rem; border-left: 4px solid var(--teal); }
.vg8-intro p { font-size: 1.05rem; line-height: 1.85; color: var(--charcoal); font-weight: 300; max-width: 80ch; }
.vg8-intro strong { color: var(--teal); font-weight: 500; }

/* BODY */
.vg8-body { max-width: 900px; margin: 0 auto; padding: 4rem; }
.vg8-step { margin-bottom: 3.5rem; }
.vg8-step-label { font-size: 0.63rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-step-label::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg8-step h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.4rem, 3vw, 2rem); font-weight: 300; line-height: 1.2; color: var(--ink); margin-bottom: 1rem; }
.vg8-step h2 em { font-style: italic; color: var(--teal); }
.vg8-step p { font-size: 0.93rem; line-height: 1.9; color: var(--charcoal); font-weight: 300; margin-bottom: 1rem; }
.vg8-step p strong { color: var(--ink); font-weight: 500; }
.vg8-divider { border: none; border-top: 0.5px solid var(--border); margin: 3rem 0; }
.vg8-ic { font-family: 'DM Mono', monospace; font-size: 0.82rem; background: rgba(14,14,14,0.07); padding: 0.1rem 0.4rem; color: var(--ink); }

/* CALLOUT */
.vg8-callout { background: var(--paper-dark); border-left: 3px solid var(--amber); padding: 1.25rem 1.5rem; margin: 1.25rem 0; font-size: 0.87rem; line-height: 1.8; color: var(--charcoal); }
.vg8-callout strong { color: var(--amber); font-weight: 500; }
.vg8-callout.teal { border-color: var(--teal); }
.vg8-callout.teal strong { color: var(--teal); }

/* STRATEGY CARDS */
.vg8-strategy-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1.25rem; margin: 1.5rem 0; }
.vg8-strategy-card { background: var(--paper); border: 0.5px solid var(--border-strong); padding: 1.5rem; position: relative; }
.vg8-strategy-card::before { content: ''; position: absolute; top: 0; left: 0; width: 100%; height: 4px; }
.vg8-strategy-card:nth-child(1)::before { background: var(--muted); }
.vg8-strategy-card:nth-child(2)::before { background: var(--amber); }
.vg8-strategy-card:nth-child(3)::before { background: var(--teal); }
.vg8-strategy-card .vg8-strat-tag { font-family: 'DM Mono', monospace; font-size: 0.65rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.5rem; display: block; }
.vg8-strategy-card:nth-child(2) .vg8-strat-tag { color: var(--amber); }
.vg8-strategy-card:nth-child(3) .vg8-strat-tag { color: var(--teal); }
.vg8-strategy-card h3 { font-family: 'Cormorant Garamond', serif; font-size: 1.15rem; font-weight: 400; color: var(--ink); margin-bottom: 0.4rem; }
.vg8-strategy-card p { font-size: 0.82rem; line-height: 1.7; color: var(--charcoal); font-weight: 300; }

/* PIPELINE */
.vg8-pipeline { display: flex; flex-direction: column; gap: 0; margin: 1.5rem 0; }
.vg8-pipeline-step { display: grid; grid-template-columns: 52px 1fr; gap: 1.5rem; padding: 1.25rem 0; border-top: 0.5px solid var(--border); align-items: start; }
.vg8-pipeline-step:last-child { border-bottom: 0.5px solid var(--border); }
.vg8-pipeline-num { width: 36px; height: 36px; background: var(--teal); display: flex; align-items: center; justify-content: center; font-family: 'Cormorant Garamond', serif; font-size: 1.1rem; font-weight: 300; color: var(--paper); flex-shrink: 0; }
.vg8-pipeline-body h4 { font-family: 'Cormorant Garamond', serif; font-size: 1.1rem; font-weight: 400; color: var(--ink); margin-bottom: 0.3rem; }
.vg8-pipeline-body p { font-size: 0.83rem; line-height: 1.7; color: var(--charcoal); font-weight: 300; }

/* CODE BLOCKS */
.vg8-code-wrap { margin: 1.25rem 0; border: 0.5px solid var(--code-border); overflow: hidden; }
.vg8-code-header { background: var(--code-header); padding: 0.6rem 1.25rem; display: flex; justify-content: space-between; align-items: center; border-bottom: 0.5px solid var(--code-border); }
.vg8-code-filename { font-family: 'DM Mono', monospace; font-size: 0.68rem; color: rgba(247,244,239,0.45); letter-spacing: 0.04em; }
.vg8-code-lang { font-size: 0.6rem; letter-spacing: 0.14em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; }
.vg8-code-body { background: var(--code-bg); padding: 1.5rem; overflow-x: auto; }
.vg8-code-body pre { margin: 0; }
.vg8-code-body code { font-family: 'DM Mono', monospace; font-size: 0.82rem; line-height: 1.85; color: #e6edf3; white-space: pre; display: block; }
/* tokens */
.t8-k { color: #ff7b72; }
.t8-s { color: #a5d6ff; }
.t8-c { color: #8b949e; font-style: italic; }
.t8-f { color: #d2a8ff; }
.t8-n { color: #79c0ff; }
.t8-v { color: #ffa657; }
.t8-b { color: var(--amber-light); }

/* FULL SCRIPT SECTION */
.vg8-full-section { background: var(--paper-dark); padding: 4rem; }
.vg8-full-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-full-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg8-full-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.4rem); font-weight: 300; color: var(--ink); margin-bottom: 0.75rem; }
.vg8-full-section > h2 em { font-style: italic; color: var(--teal); }
.vg8-full-section > p { font-size: 0.9rem; color: var(--charcoal); font-weight: 300; line-height: 1.8; margin-bottom: 2rem; max-width: 70ch; }

/* INTERVIEW */
.vg8-interview-section { background: var(--ink); padding: 4rem; }
.vg8-interview-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--amber-light); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-interview-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--amber-light); }
.vg8-interview-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.4rem); font-weight: 300; color: var(--paper); margin-bottom: 2.5rem; }
.vg8-interview-section > h2 em { font-style: italic; color: var(--amber-light); }
.vg8-qa-list { display: flex; flex-direction: column; }
.vg8-qa-item { display: grid; grid-template-columns: 1fr 1.4fr; gap: 2rem; padding: 1.5rem 0; border-top: 0.5px solid rgba(247,244,239,0.1); align-items: start; }
.vg8-qa-item:last-child { border-bottom: 0.5px solid rgba(247,244,239,0.1); }
.vg8-qa-q { font-family: 'Cormorant Garamond', serif; font-size: 1.05rem; font-weight: 400; color: var(--paper); line-height: 1.4; }
.vg8-q-badge { font-family: 'DM Mono', monospace; font-size: 0.58rem; letter-spacing: 0.1em; text-transform: uppercase; background: var(--teal); color: var(--paper); padding: 0.15rem 0.5rem; margin-bottom: 0.5rem; display: inline-block; }
.vg8-qa-a { font-size: 0.83rem; line-height: 1.8; color: rgba(247,244,239,0.65); font-weight: 300; }
.vg8-qa-a strong { color: var(--amber-light); font-weight: 400; }
.vg8-qa-a code { font-family: 'DM Mono', monospace; font-size: 0.77rem; background: rgba(247,244,239,0.08); padding: 0.1rem 0.35rem; color: var(--paper); }
.vg8-pills { display: flex; flex-wrap: wrap; gap: 0.5rem; margin-top: 0.75rem; }
.vg8-pill { font-size: 0.67rem; letter-spacing: 0.06em; padding: 0.25rem 0.75rem; border: 0.5px solid rgba(247,244,239,0.15); color: rgba(247,244,239,0.5); }
.vg8-pill.t { border-color: var(--teal-light); color: var(--teal-light); }
.vg8-pill.a { border-color: var(--amber-light); color: var(--amber-light); }

/* FOOTER */
.vg8-footer { background: #0d1117; padding: 3rem 4rem; display: flex; justify-content: space-between; align-items: center; flex-wrap: wrap; gap: 1.5rem; border-top: 0.5px solid rgba(247,244,239,0.06); }
.vg8-footer p { font-size: 0.82rem; color: rgba(247,244,239,0.35); font-weight: 300; }
.vg8-footer p strong { color: rgba(247,244,239,0.65); font-weight: 400; }
.vg8-footer-links { display: flex; gap: 1rem; }
.vg8-btn { display: inline-block; padding: 0.65rem 1.75rem; font-size: 0.7rem; letter-spacing: 0.12em; text-transform: uppercase; text-decoration: none; font-weight: 400; }
.vg8-btn.primary { background: var(--teal); color: var(--paper); }
.vg8-btn.ghost { background: transparent; color: rgba(247,244,239,0.55); border: 0.5px solid rgba(247,244,239,0.2); }

/* REVEAL */
.vg8-reveal { opacity: 0; transform: translateY(20px); transition: opacity 0.55s ease, transform 0.55s ease; }
.vg8-reveal.vg8-vis { opacity: 1; transform: translateY(0); }
.vg8-d1 { transition-delay: 0.1s; } .vg8-d2 { transition-delay: 0.2s; } .vg8-d3 { transition-delay: 0.3s; }
</style>

<div class="vg8">

<!-- HERO -->
<div class="vg8-hero">
  <div class="vg8-hero-inner">
    <p class="vg8-eyebrow">Data Engineering · Python · Pandas · Data Cleaning</p>
    <h1>Handling Duplicate Rows in Pandas — <em>Identify, Remove &#038; Export Clean Data</em></h1>
    <div class="vg8-meta-row">
      <p class="vg8-meta">Library<span>pandas</span></p>
      <p class="vg8-meta">Methods<span>duplicated() · drop_duplicates() · reset_index()</span></p>
      <p class="vg8-meta">Output<span>Cleaned CSV</span></p>
    </div>
  </div>
</div>

<!-- STACK BAND -->
<div class="vg8-stack-band">
  <span class="vg8-stack-label">Stack</span>
  <span class="vg8-stack-pill">Python</span>
  <span class="vg8-stack-pill">pandas</span>
  <span class="vg8-stack-pill">df.duplicated()</span>
  <span class="vg8-stack-pill">drop_duplicates()</span>
  <span class="vg8-stack-pill">reset_index()</span>
  <span class="vg8-stack-pill">to_csv()</span>
</div>

<!-- INTRO -->
<div class="vg8-intro">
  <p>Duplicate rows are one of the most common data quality issues — and one of the most damaging to model accuracy and analysis reliability. <strong>Pandas</strong> gives you precise tools to detect, inspect, and remove duplicates with a single line of code. This guide walks through the full pipeline: load, detect, choose a strategy, clean, and export.</p>
</div>

<!-- BODY -->
<div class="vg8-body">

  <!-- WHY IT MATTERS -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Context</p>
    <h2>Why duplicates <em>matter</em></h2>
    <p>Duplicate rows skew aggregations, inflate record counts, bias ML model training, and produce misleading visualizations. A sales total that counts the same transaction twice, a classifier trained on repeated samples — both produce results that look correct but aren&#8217;t. <strong>Clean data is the foundation everything else is built on.</strong></p>
    <div class="vg8-strategy-grid">
      <div class="vg8-strategy-card vg8-reveal vg8-d1">
        <span class="vg8-strat-tag">keep=&#8217;first&#8217;</span>
        <h3>Keep First</h3>
        <p>Drop all duplicates <em>except</em> the first occurrence. The original record is preserved. Most common default choice.</p>
      </div>
      <div class="vg8-strategy-card vg8-reveal vg8-d2">
        <span class="vg8-strat-tag">keep=&#8217;last&#8217;</span>
        <h3>Keep Last</h3>
        <p>Drop all duplicates <em>except</em> the last occurrence. Useful when later records represent updated values.</p>
      </div>
      <div class="vg8-strategy-card vg8-reveal vg8-d3">
        <span class="vg8-strat-tag">keep=False</span>
        <h3>Drop All</h3>
        <p>Remove every instance of a duplicated row — including the first. Use when any duplicated record is invalid.</p>
      </div>
    </div>
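    <p>As a quick illustration, here is a minimal sketch of all three strategies on a hypothetical toy frame (the <code class="vg8-ic">key</code>/<code class="vg8-ic">value</code> columns are invented for the example):</p>

```python
import pandas as pd

# Toy frame: the "b" row appears twice as an exact duplicate
df = pd.DataFrame({"key": ["a", "b", "b", "c"], "value": [1, 2, 2, 3]})

print(len(df.drop_duplicates(keep="first")))  # 3 (the second "b" row is dropped)
print(len(df.drop_duplicates(keep="last")))   # 3 (the first "b" row is dropped)
print(len(df.drop_duplicates(keep=False)))    # 2 (both "b" rows are dropped)
```

    <p>Note that <code class="vg8-ic">keep='first'</code> and <code class="vg8-ic">keep='last'</code> leave the same row count; they differ only in <em>which</em> occurrence survives, which matters when later rows carry updated values.</p>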
  </div>

  <hr class="vg8-divider">

  <!-- PIPELINE OVERVIEW -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Pipeline</p>
    <h2>The four-step <em>deduplication pipeline</em></h2>
    <div class="vg8-pipeline">
      <div class="vg8-pipeline-step vg8-reveal">
        <div class="vg8-pipeline-num">1</div>
        <div class="vg8-pipeline-body"><h4>Load</h4><p>Read the raw CSV into a DataFrame with <code class="vg8-ic">pd.read_csv()</code>.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d1">
        <div class="vg8-pipeline-num">2</div>
        <div class="vg8-pipeline-body"><h4>Detect</h4><p>Use <code class="vg8-ic">df.duplicated()</code> to identify and inspect all duplicate rows before touching the data.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d2">
        <div class="vg8-pipeline-num">3</div>
        <div class="vg8-pipeline-body"><h4>Remove</h4><p>Call <code class="vg8-ic">drop_duplicates(keep=...)</code> with your chosen strategy. Reset the index for a clean sequential result.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d3">
        <div class="vg8-pipeline-num">4</div>
        <div class="vg8-pipeline-body"><h4>Export</h4><p>Write the cleaned DataFrame back to CSV with <code class="vg8-ic">to_csv()</code> for downstream use.</p></div>
      </div>
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 1 — LOAD -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 1</p>
    <h2>Load <em>your dataset</em></h2>
    <p>Start by reading your data into a pandas DataFrame. <code class="vg8-ic">pd.read_csv()</code> is the standard entry point for flat files. From here, all deduplication operations work on the in-memory DataFrame — your source file is never modified.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">load_data.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-k">import</span> pandas <span class="t8-k">as</span> pd

<span class="t8-c"># Read the raw dataset into a DataFrame</span>
df = pd.<span class="t8-f">read_csv</span>(<span class="t8-s">'your_data_file.csv'</span>)

<span class="t8-c"># Quick shape check before cleaning</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Rows: {df.shape[0]:,}  |  Columns: {df.shape[1]}"</span>)</code></pre></div>
    </div>
    <div class="vg8-callout teal">
      <strong>Other sources:</strong> The same deduplication logic applies regardless of how you load your data. Use <code class="vg8-ic">pd.read_excel()</code> for XLSX, <code class="vg8-ic">pd.read_parquet()</code> for Parquet, or query a database with <code class="vg8-ic">pd.read_sql()</code> — all return a DataFrame you can clean the same way.
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 2 — DETECT -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 2</p>
    <h2>Detect <em>&#038; inspect duplicates</em></h2>
    <p><code class="vg8-ic">df.duplicated()</code> returns a boolean Series — <code class="vg8-ic">True</code> for every row that is a duplicate of an earlier row. Always <strong>inspect before you remove</strong> — understanding what the duplicates look like helps you choose the right strategy.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">detect_duplicates.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># Boolean mask: True for every row that is a duplicate</span>
duplicate_mask = df.<span class="t8-f">duplicated</span>()

<span class="t8-c"># How many duplicates exist?</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Duplicate rows found: {duplicate_mask.sum():,}"</span>)

<span class="t8-c"># Inspect the duplicate rows themselves</span>
duplicates = df[df.<span class="t8-f">duplicated</span>()]
<span class="t8-f">print</span>(duplicates)

<span class="t8-c"># See ALL occurrences of duplicated rows (including originals)</span>
all_dupes = df[df.<span class="t8-f">duplicated</span>(keep=<span class="t8-b">False</span>)]
<span class="t8-f">print</span>(all_dupes.<span class="t8-f">sort_values</span>(by=df.columns.<span class="t8-f">tolist</span>()))</code></pre></div>
    </div>
    <div class="vg8-callout">
      <strong>Subset duplicates:</strong> By default <code class="vg8-ic">duplicated()</code> checks all columns. To flag rows that are duplicates only on specific columns (e.g. same customer_id): <code class="vg8-ic">df.duplicated(subset=['customer_id'])</code>. This is useful for finding logical duplicates even when other columns differ.
    </div>
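    <p>A small sketch of the subset idea, on a hypothetical orders frame: the full-row check misses a logical duplicate that a subset check catches.</p>

```python
import pandas as pd

# Hypothetical orders frame: customer "c1" appears twice with different amounts
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c1", "c3"],
    "amount": [10.0, 20.0, 15.0, 30.0],
})

# Full-row check finds nothing: the two "c1" rows differ on "amount"
print(df.duplicated().sum())                        # 0
# Subset check flags the second "c1" row as a logical duplicate
print(df.duplicated(subset=["customer_id"]).sum())  # 1
# value_counts shows which keys repeat, useful before deciding a strategy
print(df["customer_id"].value_counts())
```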
  </div>

  <hr class="vg8-divider">

  <!-- STEP 3 — REMOVE -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 3</p>
    <h2>Remove duplicates — <em>three strategies</em></h2>
    <p><code class="vg8-ic">drop_duplicates()</code> returns a new DataFrame by default — the original is untouched. The <code class="vg8-ic">keep</code> parameter controls which occurrence survives. After removing, <code class="vg8-ic">reset_index(drop=True)</code> gives you a clean sequential index starting from 0.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">remove_duplicates.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># ── Strategy 1: keep the FIRST occurrence (default) ──</span>
df_keep_first = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'first'</span>)

<span class="t8-c"># ── Strategy 2: keep the LAST occurrence ──</span>
<span class="t8-c">#    useful when later rows represent updated/corrected records</span>
df_keep_last = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'last'</span>)

<span class="t8-c"># ── Strategy 3: drop ALL occurrences of any duplicated row ──</span>
<span class="t8-c">#    use when any repeated row is invalid data</span>
df_drop_all = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-b">False</span>)

<span class="t8-c"># ── Subset: deduplicate only on specific columns ──</span>
df_subset = df.<span class="t8-f">drop_duplicates</span>(subset=[<span class="t8-s">'customer_id'</span>, <span class="t8-s">'order_date'</span>], keep=<span class="t8-s">'first'</span>)

<span class="t8-c"># ── Reset the index after removal (clean 0-based index) ──</span>
df_cleaned = df_keep_first.<span class="t8-f">reset_index</span>(drop=<span class="t8-b">True</span>)

<span class="t8-c"># Confirm rows removed</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Before: {len(df):,}  |  After: {len(df_cleaned):,}  |  Removed: {len(df) - len(df_cleaned):,}"</span>)</code></pre></div>
    </div>
    <div class="vg8-callout teal">
      <strong>inplace vs assignment:</strong> <code class="vg8-ic">drop_duplicates(inplace=True)</code> modifies the DataFrame in place and returns <code class="vg8-ic">None</code>. Prefer the assignment pattern (<code class="vg8-ic">df_cleaned = df.drop_duplicates()</code>) — it preserves the original for comparison and makes your code easier to debug.
    </div>
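    <p>A minimal sketch of the pitfall described in the callout, on toy data invented for the example:</p>

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2]})

# Bug pattern: inplace=True mutates df AND returns None,
# so assigning the result leaves you holding None
result = df.drop_duplicates(inplace=True)
print(result)   # None
print(len(df))  # 2 (df itself was modified)

# Preferred pattern: reassignment keeps the original for comparison
df2 = pd.DataFrame({"x": [1, 1, 2]})
cleaned = df2.drop_duplicates()
print(len(df2), len(cleaned))  # 3 2
```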
  </div>

  <hr class="vg8-divider">

  <!-- STEP 4 — EXPORT -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 4</p>
    <h2>Export <em>the clean data</em></h2>
    <p>Write the deduplicated DataFrame back to a CSV. Setting <code class="vg8-ic">index=False</code> prevents pandas from writing the row index as an extra column — your downstream consumers will thank you.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">export.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># Export to CSV — index=False keeps the file clean</span>
df_cleaned.<span class="t8-f">to_csv</span>(<span class="t8-s">'cleaned_data.csv'</span>, index=<span class="t8-b">False</span>)

<span class="t8-f">print</span>(<span class="t8-s">"Cleaned data exported to cleaned_data.csv"</span>)

<span class="t8-c"># Optional: also export to Parquet for better performance at scale</span>
df_cleaned.<span class="t8-f">to_parquet</span>(<span class="t8-s">'cleaned_data.parquet'</span>, index=<span class="t8-b">False</span>)</code></pre></div>
    </div>
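    <p>As an optional safeguard, you can read the exported file back and confirm it is duplicate-free before handing it downstream. A minimal sketch, using a hypothetical stand-in frame and the same <code class="vg8-ic">cleaned_data.csv</code> filename:</p>

```python
import pandas as pd

# Stand-in for the cleaned frame from the steps above (hypothetical values)
df_cleaned = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df_cleaned.to_csv("cleaned_data.csv", index=False)

# Round-trip check: re-read the export and verify it is duplicate-free and complete
check = pd.read_csv("cleaned_data.csv")
assert check.duplicated().sum() == 0
assert len(check) == len(df_cleaned)
print("Round-trip check passed")
```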
  </div>

</div><!-- /vg8-body -->

<!-- FULL SCRIPT -->
<div class="vg8-full-section">
  <p class="vg8-full-eyebrow">Complete Reference</p>
  <h2>Full deduplication <em>script</em></h2>
  <p>Everything in one place — load, detect, remove (<code class="vg8-ic">keep='last'</code> here, with <code class="vg8-ic">keep='first'</code> as a swap-in), reset index, and export.</p>
  <div class="vg8-code-wrap vg8-reveal">
    <div class="vg8-code-header"><span class="vg8-code-filename">deduplicate.py — full script</span><span class="vg8-code-lang">Python</span></div>
    <div class="vg8-code-body"><pre><code><span class="t8-k">import</span> pandas <span class="t8-k">as</span> pd

<span class="t8-c"># ── 1. Load ─────────────────────────────────────────────</span>
df = pd.<span class="t8-f">read_csv</span>(<span class="t8-s">'your_data_file.csv'</span>)
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Loaded {len(df):,} rows"</span>)

<span class="t8-c"># ── 2. Detect ────────────────────────────────────────────</span>
duplicates = df[df.<span class="t8-f">duplicated</span>()]
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Duplicate rows found: {len(duplicates):,}"</span>)
<span class="t8-f">print</span>(duplicates)

<span class="t8-c"># ── 3a. Keep last occurrence of each duplicate row ───────</span>
df_cleaned = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'last'</span>)

<span class="t8-c"># ── 3b. Keep first occurrence (swap in if preferred) ─────</span>
<span class="t8-c"># df_cleaned = df.drop_duplicates(keep='first')</span>

<span class="t8-c"># ── 3c. Reset the index to a clean 0-based sequence ──────</span>
df_cleaned = df_cleaned.<span class="t8-f">reset_index</span>(drop=<span class="t8-b">True</span>)

<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Rows after cleaning: {len(df_cleaned):,}"</span>)

<span class="t8-c"># ── 4. Export ─────────────────────────────────────────────</span>
df_cleaned.<span class="t8-f">to_csv</span>(<span class="t8-s">'cleaned_data.csv'</span>, index=<span class="t8-b">False</span>)
<span class="t8-f">print</span>(<span class="t8-s">"Exported to cleaned_data.csv"</span>)</code></pre></div>
  </div>
</div>

<!-- INTERVIEW CHEAT SHEET -->
<div class="vg8-interview-section">
  <p class="vg8-interview-eyebrow">Interview Prep</p>
  <h2>Cheat sheet — <em>quick definitions to remember</em></h2>
  <div class="vg8-qa-list">

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Define</span><br>What does <code>df.duplicated()</code> return?</div>
      <div class="vg8-qa-a">A <strong>boolean Series</strong> the same length as the DataFrame — <code>True</code> for every row that is a duplicate of a previously seen row, <code>False</code> otherwise. The first occurrence is marked <code>False</code> by default.
        <div class="vg8-pills"><span class="vg8-pill t">Boolean Series</span><span class="vg8-pill">True = duplicate</span><span class="vg8-pill a">First = False by default</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Compare</span><br>keep=&#8217;first&#8217; vs keep=&#8217;last&#8217; vs keep=False</div>
      <div class="vg8-qa-a"><strong>first</strong> — keeps the first occurrence, drops all subsequent duplicates. <strong>last</strong> — keeps the final occurrence, useful for updated records. <strong>False</strong> — drops every occurrence of any duplicated row, leaving only rows that were unique to begin with.
        <div class="vg8-pills"><span class="vg8-pill t">first = keep original</span><span class="vg8-pill a">last = keep latest</span><span class="vg8-pill">False = drop all copies</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Explain</span><br>What does the <code>subset</code> parameter do?</div>
      <div class="vg8-qa-a">By default, <code>duplicated()</code> and <code>drop_duplicates()</code> compare <strong>all columns</strong>. The <code>subset</code> parameter restricts the comparison to specific columns — for example <code>subset=['customer_id']</code> finds rows with the same customer ID even if other columns differ.
        <div class="vg8-pills"><span class="vg8-pill t">Default = all columns</span><span class="vg8-pill">subset = logical dedup</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Gotcha</span><br>Why call <code>reset_index(drop=True)</code> after deduplication?</div>
      <div class="vg8-qa-a">After dropping rows, the DataFrame retains the <strong>original row indices</strong> — you&#8217;d have gaps like 0, 1, 4, 7 instead of 0, 1, 2, 3. <code>reset_index(drop=True)</code> renumbers from 0 continuously. <code>drop=True</code> prevents the old index from being added as a column.
        <div class="vg8-pills"><span class="vg8-pill a">Index gaps after drop</span><span class="vg8-pill t">reset_index fixes gaps</span><span class="vg8-pill">drop=True prevents extra col</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Gotcha</span><br>inplace=True vs reassignment — which is preferred?</div>
      <div class="vg8-qa-a">Prefer <strong>reassignment</strong> (<code>df_cleaned = df.drop_duplicates()</code>) — it preserves the original DataFrame for comparison and makes pipelines easier to debug. <code>inplace=True</code> modifies the object and returns <code>None</code>, which can cause confusion when chaining operations. Many pandas best-practice guides now recommend avoiding inplace.
        <div class="vg8-pills"><span class="vg8-pill t">Reassignment = safer</span><span class="vg8-pill a">inplace returns None</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Best Practice</span><br>How do you handle duplicates in a production data pipeline?</div>
      <div class="vg8-qa-a"><strong>Three layers:</strong> (1) <strong>Detect and log</strong> before removing — store duplicate counts as data quality metrics. (2) <strong>Deduplicate at ingestion</strong>, not at query time — clean once, use many times. (3) Add a <strong>unique constraint</strong> in your database or Delta Lake table to prevent duplicates from re-entering at source.
        <div class="vg8-pills"><span class="vg8-pill t">Log before removing</span><span class="vg8-pill t">Clean at ingestion</span><span class="vg8-pill a">DB unique constraints</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Use Case</span><br>When should you NOT remove duplicates?</div>
      <div class="vg8-qa-a">When the repeated rows represent <strong>legitimate repeated events</strong> — a customer placing the same order twice on different days, a sensor reading the same value consecutively, or audit log entries. Always validate with domain knowledge before dropping. Use <code>subset</code> to deduplicate on business keys, not entire rows.
        <div class="vg8-pills"><span class="vg8-pill a">Repeated events = valid</span><span class="vg8-pill t">Use subset= for business keys</span></div>
      </div>
    </div>

  </div>
</div>

<!-- FOOTER -->
<div class="vg8-footer">
  <p><strong>GenAI Mastery Series</strong> — vijay-gokarn.com · Vijay Gokarn</p>
  <div class="vg8-footer-links">
    <a href="https://github.com/vijaygokarn130" class="vg8-btn ghost">GitHub ↗</a>
    <a href="https://vijay-gokarn.com" class="vg8-btn primary">Back to Blog ↗</a>
  </div>
</div>

</div><!-- /vg8 -->

<script>
(function(){
  var obs = new IntersectionObserver(function(e){
    e.forEach(function(x){ if(x.isIntersecting) x.target.classList.add('vg8-vis'); });
  }, {threshold: 0.08});
  document.querySelectorAll('.vg8-reveal').forEach(function(el){ obs.observe(el); });
})();
</script>
<p>The post <a href="https://vijay-gokarn.com/pandas-remove-duplicates/">Pandas Remove Duplicates</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">119</post-id>	</item>
	</channel>
</rss>
