<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>pandas Archives - Vijay Gokarn</title>
	<atom:link href="https://vijay-gokarn.com/tag/pandas/feed/" rel="self" type="application/rss+xml" />
	<link>https://vijay-gokarn.com/tag/pandas/</link>
	<description>&#34;Ignite Curiosity. Fuel the Future.&#34;</description>
	<lastBuildDate>Sun, 19 Apr 2026 03:33:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/vijay-gokarn.com/wp-content/uploads/2023/09/cropped-ideogram.jpeg?fit=32%2C32&#038;ssl=1</url>
	<title>pandas Archives - Vijay Gokarn</title>
	<link>https://vijay-gokarn.com/tag/pandas/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">230943525</site>	<item>
		<title>From Amazon Reviews to Numbers: A Hands-On Tour of One-Hot, Bag of Words, and TF-IDF</title>
		<link>https://vijay-gokarn.com/from-amazon-reviews-to-numbers-a-hands-on-tour-of-one-hot-bag-of-words-and-tf-idf/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=from-amazon-reviews-to-numbers-a-hands-on-tour-of-one-hot-bag-of-words-and-tf-idf</link>
		
		<dc:creator><![CDATA[Vijay Gokarn]]></dc:creator>
		<pubDate>Sat, 11 Apr 2026 15:02:33 +0000</pubDate>
				<category><![CDATA[generative-ai]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[data-analysis]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">https://vijay-gokarn.com/?p=263</guid>

					<description><![CDATA[<p>NLP · Machine Learning · Text Feature Engineering — From Amazon Reviews to Numbers: A Hands-On Tour of One-Hot, Bag of Words, and TF-IDF. Corpus: 128 real reviews · Techniques: OHE · BoW · TF-IDF · Stack: Python · sklearn · BeautifulSoup · Source: GitHub ↗. How I took 128 real Amazon product reviews and turned them into features a machine-learning model can [&#8230;]</p>
<p>The post <a href="https://vijay-gokarn.com/from-amazon-reviews-to-numbers-a-hands-on-tour-of-one-hot-bag-of-words-and-tf-idf/">From Amazon Reviews to Numbers: A Hands-On Tour of One-Hot, Bag of Words, and TF-IDF</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Cormorant+Garamond:ital,wght@0,300;0,400;0,600;1,300;1,400&#038;family=DM+Sans:wght@300;400;500&#038;family=DM+Mono:wght@400&#038;display=swap" rel="stylesheet">

<style>
  .vg-blog-wrap {
    --ink: #0e0e0e;
    --paper: #f7f4ef;
    --paper-dark: #ede9e1;
    --teal: #0f6e56;
    --teal-light: #1d9e75;
    --teal-muted: #e1f5ee;
    --amber: #ba7517;
    --amber-light: #fac775;
    --amber-muted: #faeeda;
    --charcoal: #2c2c2a;
    --muted: #888780;
    --border: rgba(14,14,14,0.12);
    --border-strong: rgba(14,14,14,0.25);
    font-family: 'DM Sans', sans-serif;
    font-weight: 300;
    color: var(--ink);
    background: var(--paper);
    line-height: 1.75;
    font-size: 16px;
    overflow-x: hidden;
  }
  .vg-blog-wrap *, .vg-blog-wrap *::before, .vg-blog-wrap *::after {
    box-sizing: border-box; margin: 0; padding: 0;
  }

  /* ── HERO ── */
  .vg-post-hero {
    background: var(--ink);
    padding: 5rem 4rem 4rem;
    position: relative;
    overflow: hidden;
  }
  .vg-post-hero::after {
    content: '';
    position: absolute;
    bottom: 0; right: 0;
    width: 40%;
    height: 100%;
    background: rgba(15,110,86,0.12);
    clip-path: polygon(20% 0%, 100% 0%, 100% 100%, 0% 100%);
  }
  .vg-post-hero-inner { position: relative; z-index: 1; max-width: 860px; }
  .vg-post-eyebrow {
    font-size: 0.7rem;
    letter-spacing: 0.22em;
    text-transform: uppercase;
    color: var(--teal-light);
    font-weight: 500;
    margin-bottom: 1.25rem;
    display: flex;
    align-items: center;
    gap: 0.75rem;
  }
  .vg-post-eyebrow::before {
    content: '';
    display: inline-block;
    width: 1.5rem; height: 1px;
    background: var(--teal-light);
  }
  .vg-post-title {
    font-family: 'Cormorant Garamond', serif;
    font-size: clamp(2.2rem, 5vw, 3.8rem);
    font-weight: 300;
    line-height: 1.1;
    color: var(--paper);
    letter-spacing: -0.02em;
    margin-bottom: 1.5rem;
    max-width: 22ch;
  }
  .vg-post-title em { font-style: italic; color: var(--amber-light); }
  .vg-post-meta {
    display: flex;
    gap: 2rem;
    flex-wrap: wrap;
  }
  .vg-meta-item {
    font-size: 0.72rem;
    letter-spacing: 0.1em;
    text-transform: uppercase;
    color: rgba(247,244,239,0.4);
  }
  .vg-meta-item span { color: rgba(247,244,239,0.75); margin-left: 0.4rem; }

  /* ── INTRO BAND ── */
  .vg-intro-band {
    background: var(--teal-muted);
    padding: 2.5rem 4rem;
    border-left: 4px solid var(--teal);
  }
  .vg-intro-band p {
    font-size: 1.05rem;
    line-height: 1.85;
    color: var(--charcoal);
    font-weight: 300;
    max-width: 80ch;
  }
  .vg-intro-band strong { color: var(--teal); font-weight: 500; }

  /* ── BODY LAYOUT ── */
  .vg-post-body {
    max-width: 860px;
    margin: 0 auto;
    padding: 4rem 4rem;
  }

  /* ── SECTION HEADERS ── */
  .vg-step {
    margin-bottom: 3.5rem;
  }
  .vg-step-label {
    font-size: 0.65rem;
    letter-spacing: 0.22em;
    text-transform: uppercase;
    color: var(--teal);
    font-weight: 500;
    margin-bottom: 0.5rem;
    display: flex;
    align-items: center;
    gap: 0.6rem;
  }
  .vg-step-label::before {
    content: '';
    display: inline-block;
    width: 1.25rem; height: 1px;
    background: var(--teal);
  }
  .vg-step h2 {
    font-family: 'Cormorant Garamond', serif;
    font-size: clamp(1.5rem, 3vw, 2.1rem);
    font-weight: 300;
    line-height: 1.2;
    color: var(--ink);
    margin-bottom: 1.25rem;
  }
  .vg-step h2 em { font-style: italic; color: var(--teal); }
  .vg-step p {
    font-size: 0.94rem;
    line-height: 1.9;
    color: var(--charcoal);
    font-weight: 300;
    margin-bottom: 1rem;
  }
  .vg-step p strong { color: var(--ink); font-weight: 500; }

  /* ── CALLOUT / TIP BOXES ── */
  .vg-callout {
    background: var(--paper-dark);
    border-left: 3px solid var(--amber);
    padding: 1.25rem 1.5rem;
    margin: 1.5rem 0;
    font-size: 0.88rem;
    line-height: 1.8;
    color: var(--charcoal);
  }
  .vg-callout strong { color: var(--amber); font-weight: 500; }
  .vg-callout code {
    font-family: 'DM Mono', monospace;
    font-size: 0.82rem;
    background: rgba(14,14,14,0.06);
    padding: 0.1rem 0.4rem;
    color: var(--ink);
  }

  /* ── TECHNIQUE CARDS ── */
  .vg-technique-grid {
    display: grid;
    grid-template-columns: repeat(3, 1fr);
    gap: 1.25rem;
    margin: 2rem 0;
  }
  .vg-technique-card {
    background: var(--paper);
    border: 0.5px solid var(--border-strong);
    padding: 1.5rem;
    position: relative;
  }
  .vg-technique-card::before {
    content: '';
    position: absolute;
    top: 0; left: 0;
    width: 100%; height: 3px;
  }
  .vg-technique-card.ohe::before { background: var(--muted); }
  .vg-technique-card.bow::before { background: var(--amber); }
  .vg-technique-card.tfidf::before { background: var(--teal); }
  .vg-technique-card h3 {
    font-family: 'Cormorant Garamond', serif;
    font-size: 1.2rem;
    font-weight: 400;
    color: var(--ink);
    margin-bottom: 0.4rem;
  }
  .vg-technique-card .vg-abbr {
    font-family: 'DM Mono', monospace;
    font-size: 0.68rem;
    color: var(--muted);
    letter-spacing: 0.1em;
    margin-bottom: 0.75rem;
    display: block;
  }
  .vg-technique-card p {
    font-size: 0.82rem;
    line-height: 1.7;
    color: var(--charcoal);
    margin-bottom: 0.75rem !important;
  }
  .vg-technique-card .vg-weakness {
    font-size: 0.75rem;
    color: var(--muted);
    border-top: 0.5px solid var(--border);
    padding-top: 0.6rem;
    margin-top: 0.5rem;
    font-style: italic;
  }

  /* ── FORMULA BLOCK ── */
  .vg-formula {
    background: var(--ink);
    padding: 1.5rem 2rem;
    margin: 1.5rem 0;
    font-family: 'DM Mono', monospace;
    font-size: 0.9rem;
    color: var(--amber-light);
    letter-spacing: 0.04em;
    overflow-x: auto;
    white-space: nowrap;
  }
  .vg-formula .vg-formula-label {
    font-family: 'DM Sans', sans-serif;
    font-size: 0.65rem;
    letter-spacing: 0.18em;
    text-transform: uppercase;
    color: rgba(247,244,239,0.3);
    margin-bottom: 0.5rem;
    display: block;
    white-space: normal;
  }

  /* ── STAT ROW ── */
  .vg-stat-row {
    display: grid;
    grid-template-columns: repeat(3, 1fr);
    gap: 1rem;
    margin: 2rem 0;
  }
  .vg-stat-box {
    background: var(--paper-dark);
    border: 0.5px solid var(--border);
    padding: 1.25rem;
    text-align: center;
  }
  .vg-stat-box .vg-stat-n {
    font-family: 'Cormorant Garamond', serif;
    font-size: 2.2rem;
    font-weight: 300;
    line-height: 1;
    color: var(--teal);
    letter-spacing: -0.02em;
  }
  .vg-stat-box .vg-stat-l {
    font-size: 0.68rem;
    letter-spacing: 0.12em;
    text-transform: uppercase;
    color: var(--muted);
    margin-top: 0.35rem;
  }

  /* ── COMPARISON TABLE ── */
  .vg-table-wrap { overflow-x: auto; margin: 1.5rem 0; }
  .vg-table {
    width: 100%;
    border-collapse: collapse;
    font-size: 0.83rem;
  }
  .vg-table th {
    background: var(--ink);
    color: var(--paper);
    font-family: 'DM Sans', sans-serif;
    font-weight: 400;
    font-size: 0.68rem;
    letter-spacing: 0.14em;
    text-transform: uppercase;
    padding: 0.75rem 1rem;
    text-align: left;
  }
  .vg-table td {
    padding: 0.7rem 1rem;
    border-bottom: 0.5px solid var(--border);
    color: var(--charcoal);
    vertical-align: top;
    line-height: 1.55;
  }
  .vg-table tr:nth-child(even) td { background: var(--paper-dark); }
  .vg-table .vg-chip {
    display: inline-block;
    font-size: 0.65rem;
    letter-spacing: 0.08em;
    padding: 0.2rem 0.55rem;
    border-radius: 2px;
    font-weight: 400;
  }
  .vg-chip-green { background: var(--teal-muted); color: var(--teal); }
  .vg-chip-amber { background: var(--amber-muted); color: var(--amber); }
  .vg-chip-gray  { background: var(--paper-dark); color: var(--muted); border: 0.5px solid var(--border); }

  /* ── DIVIDER ── */
  .vg-divider {
    border: none;
    border-top: 0.5px solid var(--border);
    margin: 3rem 0;
  }

  /* ── KEY TAKEAWAYS ── */
  .vg-takeaways-section {
    background: var(--ink);
    padding: 4rem;
  }
  .vg-takeaways-section .vg-section-eyebrow {
    font-size: 0.68rem;
    letter-spacing: 0.22em;
    text-transform: uppercase;
    color: var(--amber-light);
    font-weight: 500;
    margin-bottom: 0.5rem;
    display: flex;
    align-items: center;
    gap: 0.6rem;
  }
  .vg-takeaways-section .vg-section-eyebrow::before {
    content: '';
    display: inline-block;
    width: 1.25rem; height: 1px;
    background: var(--amber-light);
  }
  .vg-takeaways-section h2 {
    font-family: 'Cormorant Garamond', serif;
    font-size: clamp(1.6rem, 3vw, 2.4rem);
    font-weight: 300;
    color: var(--paper);
    margin-bottom: 2.5rem;
  }
  .vg-takeaways-section h2 em { font-style: italic; color: var(--amber-light); }
  .vg-takeaways-grid {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 1.25rem;
  }
  .vg-takeaway-card {
    border: 0.5px solid rgba(247,244,239,0.12);
    padding: 1.5rem;
    position: relative;
  }
  .vg-takeaway-card::before {
    content: attr(data-num);
    font-family: 'Cormorant Garamond', serif;
    font-size: 3rem;
    font-weight: 300;
    color: rgba(250,199,117,0.15);
    position: absolute;
    top: 0.5rem; right: 1rem;
    line-height: 1;
  }
  .vg-takeaway-card h4 {
    font-family: 'Cormorant Garamond', serif;
    font-size: 1.1rem;
    font-weight: 400;
    color: var(--amber-light);
    margin-bottom: 0.5rem;
  }
  .vg-takeaway-card p {
    font-size: 0.82rem;
    line-height: 1.75;
    color: rgba(247,244,239,0.65);
    font-weight: 300;
  }

  /* ── INTERVIEW CHEAT SHEET ── */
  .vg-interview-section {
    background: var(--teal-muted);
    padding: 4rem;
  }
  .vg-interview-section .vg-section-eyebrow {
    font-size: 0.68rem;
    letter-spacing: 0.22em;
    text-transform: uppercase;
    color: var(--teal);
    font-weight: 500;
    margin-bottom: 0.5rem;
    display: flex;
    align-items: center;
    gap: 0.6rem;
  }
  .vg-interview-section .vg-section-eyebrow::before {
    content: '';
    display: inline-block;
    width: 1.25rem; height: 1px;
    background: var(--teal);
  }
  .vg-interview-section h2 {
    font-family: 'Cormorant Garamond', serif;
    font-size: clamp(1.6rem, 3vw, 2.4rem);
    font-weight: 300;
    color: var(--ink);
    margin-bottom: 2.5rem;
  }
  .vg-interview-section h2 em { font-style: italic; color: var(--teal); }
  .vg-qa-list { display: flex; flex-direction: column; gap: 0; }
  .vg-qa-item {
    border-top: 0.5px solid rgba(14,14,14,0.12);
    padding: 1.5rem 0;
    display: grid;
    grid-template-columns: 1fr 1.4fr;
    gap: 2rem;
    align-items: start;
  }
  .vg-qa-item:last-child { border-bottom: 0.5px solid rgba(14,14,14,0.12); }
  .vg-qa-q {
    font-family: 'Cormorant Garamond', serif;
    font-size: 1.05rem;
    font-weight: 400;
    color: var(--ink);
    line-height: 1.4;
  }
  .vg-qa-q .vg-q-badge {
    font-family: 'DM Mono', monospace;
    font-size: 0.6rem;
    letter-spacing: 0.1em;
    text-transform: uppercase;
    background: var(--teal);
    color: var(--paper);
    padding: 0.15rem 0.5rem;
    margin-bottom: 0.5rem;
    display: inline-block;
  }
  .vg-qa-a {
    font-size: 0.85rem;
    line-height: 1.75;
    color: var(--charcoal);
    font-weight: 300;
  }
  .vg-qa-a strong { color: var(--teal); font-weight: 500; }
  .vg-qa-a code {
    font-family: 'DM Mono', monospace;
    font-size: 0.78rem;
    background: rgba(14,14,14,0.07);
    padding: 0.1rem 0.35rem;
    color: var(--ink);
  }

  /* ── MEMORY PILLS ── */
  .vg-memory-row {
    display: flex;
    flex-wrap: wrap;
    gap: 0.6rem;
    margin-top: 0.75rem;
  }
  .vg-memory-pill {
    font-size: 0.7rem;
    letter-spacing: 0.06em;
    padding: 0.3rem 0.85rem;
    background: var(--paper);
    border: 0.5px solid var(--border-strong);
    color: var(--charcoal);
    font-weight: 400;
  }
  .vg-memory-pill.teal { border-color: var(--teal); color: var(--teal); background: var(--teal-muted); }
  .vg-memory-pill.amber { border-color: var(--amber); color: var(--amber); background: var(--amber-muted); }

  /* ── FOOTER CTA ── */
  .vg-post-footer {
    background: var(--paper-dark);
    padding: 3rem 4rem;
    display: flex;
    justify-content: space-between;
    align-items: center;
    flex-wrap: wrap;
    gap: 1.5rem;
    border-top: 0.5px solid var(--border);
  }
  .vg-post-footer p {
    font-size: 0.85rem;
    color: var(--muted);
    font-weight: 300;
  }
  .vg-post-footer p strong { color: var(--ink); font-weight: 400; }
  .vg-source-link {
    display: inline-block;
    padding: 0.65rem 1.75rem;
    background: var(--ink);
    color: var(--paper);
    font-size: 0.72rem;
    letter-spacing: 0.12em;
    text-transform: uppercase;
    text-decoration: none;
    font-weight: 400;
    transition: background 0.2s;
  }
  .vg-source-link:hover { background: var(--teal); }

  /* ── SCROLL REVEAL ── */
  .vg-reveal {
    opacity: 0;
    transform: translateY(20px);
    transition: opacity 0.55s ease, transform 0.55s ease;
  }
  .vg-reveal.vg-visible { opacity: 1; transform: translateY(0); }
  .vg-d1 { transition-delay: 0.1s; }
  .vg-d2 { transition-delay: 0.2s; }
  .vg-d3 { transition-delay: 0.3s; }
</style>

<div class="vg-blog-wrap">

  <!-- HERO -->
  <div class="vg-post-hero">
    <div class="vg-post-hero-inner">
      <p class="vg-post-eyebrow">NLP · Machine Learning · Text Feature Engineering</p>
      <h1 class="vg-post-title">From Amazon Reviews to Numbers: A Hands-On Tour of <em>One-Hot, Bag of Words, and TF-IDF</em></h1>
      <div class="vg-post-meta">
        <p class="vg-meta-item">Corpus<span>128 real reviews</span></p>
        <p class="vg-meta-item">Techniques<span>OHE · BoW · TF-IDF</span></p>
        <p class="vg-meta-item">Stack<span>Python · sklearn · BeautifulSoup</span></p>
        <p class="vg-meta-item">Source<span>GitHub ↗</span></p>
      </div>
    </div>
  </div>

  <!-- INTRO BAND -->
  <div class="vg-intro-band">
    <p>How I took <strong>128 real Amazon product reviews</strong> and turned them into features a machine-learning model can actually chew on — and what I learned about where these classical techniques still shine in 2026.</p>
  </div>

  <!-- BODY -->
  <div class="vg-post-body">

    <!-- WHY CLASSICAL -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Context</p>
      <h2>Why bother with &#8220;classical&#8221; text features <em>at all?</em></h2>
      <p>If you have been anywhere near an LLM in the last two years, you have probably heard that &#8220;embeddings solved text.&#8221; They did — for a lot of problems. But if you are building a spam filter with 100k labelled examples, a BM25-powered search box, a cold-start classifier for a brand-new product line, or a compliance-audited system where a human needs to understand why the model fired — then Bag of Words and TF-IDF are still in the toolbox.</p>
      <p>They are <strong>fast, deterministic, interpretable,</strong> and an honest baseline you should always beat before reaching for a neural model.</p>
    </div>

    <hr class="vg-divider">

    <!-- DATA -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 1</p>
      <h2>Get real data — <em>not toy sentences</em></h2>
      <p>Every blog post on TF-IDF uses the same three cooked-up sentences about cats and dogs. I wanted the messiness of real user-generated content, so I wrote a BeautifulSoup scraper across ~20 popular ASINs — Echo Dots, AirPods Pro, Kindles, an Apple Watch, a Ninja blender, a PS5 controller, a Nespresso machine, and so on.</p>
      <div class="vg-stat-row">
        <div class="vg-stat-box vg-reveal vg-d1">
          <div class="vg-stat-n">128</div>
          <div class="vg-stat-l">Real Reviews</div>
        </div>
        <div class="vg-stat-box vg-reveal vg-d2">
          <div class="vg-stat-n">14</div>
          <div class="vg-stat-l">Products</div>
        </div>
        <div class="vg-stat-box vg-reveal vg-d3">
          <div class="vg-stat-n">3,461</div>
          <div class="vg-stat-l">Unique Tokens</div>
        </div>
      </div>
      <div class="vg-callout">
        <strong>Scraper gotchas:</strong> Set a real <code>User-Agent</code> header or Amazon returns a stripped page. Anchor on <code>[data-hook="review-body"]</code> inside <code>celwidget</code> blocks — not the <code>div[data-hook="review"]</code> wrapper on the dedicated reviews page. A few reviews came back in Spanish and Arabic — a lovely reminder that real data never matches the shape your slides promised.
      </div>
    </div>

    <hr class="vg-divider">

    <!-- CLEANING -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 2</p>
      <h2>Clean the text — <em>the boring part that matters most</em></h2>
      <p>A review like &#8220;I LOVE it!!! Sound is 🔥. Read more&#8221; is not something a counting-based model can work with. Each cleaning step kills a specific kind of noise:</p>
      <div class="vg-table-wrap">
        <table class="vg-table">
          <thead><tr><th>Step</th><th>What it kills</th><th>Why it matters</th></tr></thead>
          <tbody>
            <tr><td>Lowercase</td><td>LOVE vs love</td><td>Avoids vocabulary duplicates</td></tr>
            <tr><td>Drop &#8220;Read more&#8221;</td><td>Amazon truncation marker</td><td>Otherwise becomes one of the most frequent tokens</td></tr>
            <tr><td>Strip punctuation / digits</td><td>!!!, $199</td><td>They rarely help classical models</td></tr>
            <tr><td>Tokenize</td><td>—</td><td>Gives you units to count</td></tr>
            <tr><td>Remove stopwords</td><td>the, and, is</td><td>Appear in every document → no signal</td></tr>
            <tr><td>Lemmatize</td><td>speakers → speaker</td><td>Tightens the vocabulary</td></tr>
          </tbody>
        </table>
      </div>
      <p>After processing: <strong>11,138 tokens</strong> spanning a <strong>3,461-word vocabulary</strong>. Top words were exactly the product-review clichés you would expect — use, one, like, great, noise, sound, quality — a perfect sanity check.</p>
    </div>

    <hr class="vg-divider">

    <!-- THREE ENCODINGS -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 3</p>
      <h2>Three ways to turn text into <em>numbers</em></h2>
      <div class="vg-technique-grid">
        <div class="vg-technique-card ohe vg-reveal vg-d1">
          <h3>One-Hot Encoding</h3>
          <span class="vg-abbr">OHE · Binary presence</span>
          <p>For each review, build a binary vector over the whole vocabulary: 1 if the word appears, 0 otherwise. Simplest thing that works, easiest to explain to a non-technical stakeholder.</p>
          <p class="vg-weakness">⚠ Throws away frequency — &#8220;amazing&#8221; once and ten times look identical.</p>
        </div>
        <div class="vg-technique-card bow vg-reveal vg-d2">
          <h3>Bag of Words</h3>
          <span class="vg-abbr">BoW · CountVectorizer</span>
          <p>Same vector shape, but store actual counts. A review that hammers on &#8220;sound&#8221; three times ranks differently from one that drops the word once. Frequency-aware.</p>
          <p class="vg-weakness">⚠ Still order-blind — &#8220;not good, very bad&#8221; ≈ &#8220;good, not very bad&#8221;.</p>
        </div>
        <div class="vg-technique-card tfidf vg-reveal vg-d3">
          <h3>TF-IDF</h3>
          <span class="vg-abbr">TfidfVectorizer · The trick</span>
          <p>Take the BoW count and divide by how common the word is across the whole corpus. Generic words like &#8220;good&#8221; get pushed toward zero. Rare, distinctive words like &#8220;cancellation&#8221; stay loud.</p>
          <p class="vg-weakness">✓ Best signal for downstream classifiers.</p>
        </div>
      </div>
      <div class="vg-formula">
        <span class="vg-formula-label">TF-IDF Formula</span>
        tfidf(t, d) = tf(t, d) · log( N / (1 + df(t)) )
      </div>
      <p>In my corpus, the highest-IDF words were exactly the long-tail product features that appeared in just one review. The lowest-IDF words were the generic review vocabulary. That is the <strong>whole story of TF-IDF in one experiment.</strong></p>
    </div>

    <hr class="vg-divider">

    <!-- AHA MOMENT -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 4</p>
      <h2>The &#8220;aha&#8221; moment — <em>one review, three lenses</em></h2>
      <p>Encode the same review three times and print the top-weighted tokens:</p>
      <div class="vg-callout">
        <strong>OHE</strong> just lists every unique word in the review. No ranking.<br><br>
        <strong>BoW</strong> surfaces the most repeated words — almost always filler like <code>one</code>, <code>like</code>, <code>use</code>.<br><br>
        <strong>TF-IDF</strong> surfaces the words <em>this</em> review says that few others do. That is exactly what a downstream classifier wants to see.<br><br>
        Once you have seen this side-by-side even once, you stop reaching for plain BoW unless you have a very specific reason. (Naive Bayes is one — its underlying math prefers raw counts.)
      </div>
    </div>

    <hr class="vg-divider">

    <!-- SPARSITY -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 5</p>
      <h2>Sparsity — <em>the thing nobody warns you about</em></h2>
      <p>Every one of my three matrices came out <strong>~98.15% zeros.</strong> That is normal — reviews are short, vocabularies are long, and most words do not appear in most documents. Two huge practical implications:</p>
      <div class="vg-callout">
        <strong>Never store these dense.</strong> A 1-million-document × 200k-vocab corpus is a 200-billion-cell matrix. It must live in CSR or equivalent compressed form.<br><br>
        <strong>Classical pipelines do not scale forever.</strong> Once you are in the tens-of-millions-of-documents range, even sparse storage becomes painful — which is one reason industry moved to dense embedding pipelines for web-scale retrieval.
      </div>
    </div>

    <hr class="vg-divider">

    <!-- CLASSIFIER -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 6</p>
      <h2>A mini sentiment classifier — <em>and a class imbalance lesson</em></h2>
      <p>4–5 star = positive, 1–2 star = negative, 3-star dropped. Two models per feature set: Logistic Regression with <code>class_weight="balanced"</code> and Multinomial Naive Bayes.</p>
      <p>Headline accuracy looks great — <strong>~97% on the test split.</strong> But the test split has 31 positives and 1 negative. The interesting metric is recall on the negative class, and with only five one-star reviews in the whole corpus, no model is going to learn that cleanly. Amazon surfaces highly-rated reviews first, so any pipeline that scrapes top-of-page reviews inherits the same lopsided distribution.</p>
      <div class="vg-callout">
        <strong>TF-IDF</strong> gives Logistic Regression a small, consistent edge by silencing filler words.<br><br>
        <strong>Naive Bayes</strong> prefers raw BoW counts — rescaling with IDF can actually hurt it.<br><br>
        <strong>Never trust a single accuracy number on imbalanced data.</strong> Always print per-class precision/recall.
      </div>
    </div>

    <hr class="vg-divider">

    <!-- WHERE IT BREAKS -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 7</p>
      <h2>Where these techniques break — <em>and where they still win</em></h2>
      <div class="vg-table-wrap">
        <table class="vg-table">
          <thead><tr><th>Scenario</th><th>BoW / TF-IDF</th><th>Embeddings</th></tr></thead>
          <tbody>
            <tr>
              <td>Semantic similarity<br><em style="font-size:0.78rem;color:var(--muted)">&#8220;audio excellent&#8221; vs &#8220;sound great&#8221;</em></td>
              <td><span class="vg-chip vg-chip-gray">Zero shared tokens → fails</span></td>
              <td><span class="vg-chip vg-chip-green">Maps synonyms close ✓</span></td>
            </tr>
            <tr>
              <td>Negation<br><em style="font-size:0.78rem;color:var(--muted)">&#8220;battery lasts&#8221; vs &#8220;battery dies&#8221;</em></td>
              <td><span class="vg-chip vg-chip-gray">Near-identical vectors → fails</span></td>
              <td><span class="vg-chip vg-chip-green">Directional context ✓</span></td>
            </tr>
            <tr>
              <td>Interpretability</td>
              <td><span class="vg-chip vg-chip-green">Each feature is a word ✓</span></td>
              <td><span class="vg-chip vg-chip-amber">1024-dim black box</span></td>
            </tr>
            <tr>
              <td>Training speed</td>
              <td><span class="vg-chip vg-chip-green">Millions of docs, minutes, laptop ✓</span></td>
              <td><span class="vg-chip vg-chip-amber">GPU required at scale</span></td>
            </tr>
            <tr>
              <td>Exact keyword / ID retrieval</td>
              <td><span class="vg-chip vg-chip-green">BM25 still wins ✓</span></td>
              <td><span class="vg-chip vg-chip-amber">Can miss rare tokens</span></td>
            </tr>
            <tr>
              <td>Cold start (zero labels)</td>
              <td><span class="vg-chip vg-chip-green">Cosine sim on day one ✓</span></td>
              <td><span class="vg-chip vg-chip-amber">Needs fine-tuning data</span></td>
            </tr>
          </tbody>
        </table>
      </div>
    </div>

  </div><!-- /vg-post-body -->

  <!-- KEY TAKEAWAYS -->
  <div class="vg-takeaways-section">
    <p class="vg-section-eyebrow">Summary</p>
    <h2>Key <em>takeaways</em></h2>
    <div class="vg-takeaways-grid">
      <div class="vg-takeaway-card vg-reveal" data-num="01">
        <h4>Preprocessing is 80% of the game</h4>
        <p>Before you touch any encoder, understand exactly what &#8220;a token&#8221; means in your corpus. Lowercase, stopwords, lemmatization — each step has a specific purpose.</p>
      </div>
      <div class="vg-takeaway-card vg-reveal vg-d1" data-num="02">
        <h4>Always inspect a single document&#8217;s top features</h4>
        <p>It is the fastest way to develop intuition about what your encoding is actually rewarding. Print OHE vs BoW vs TF-IDF side-by-side at least once.</p>
      </div>
      <div class="vg-takeaway-card vg-reveal vg-d2" data-num="03">
        <h4>Watch sparsity and class imbalance</h4>
        <p>Both will bite you long before modelling choices do. Use CSR storage. Never trust a single accuracy number on skewed data — always check per-class recall.</p>
      </div>
      <div class="vg-takeaway-card vg-reveal vg-d3" data-num="04">
        <h4>Know why you would pick the classical tool</h4>
        <p>If your answer is only &#8220;because it is in every tutorial&#8221;, reach for an embedding model. If your answer is &#8220;interpretability and speed&#8221; — BoW/TF-IDF are still excellent choices.</p>
      </div>
    </div>
  </div>

  <!-- INTERVIEW CHEAT SHEET -->
  <div class="vg-interview-section">
    <p class="vg-section-eyebrow">Interview Prep</p>
    <h2>Cheat sheet — <em>quick definitions to remember</em></h2>
    <div class="vg-qa-list">

      <div class="vg-qa-item vg-reveal">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Define</span><br>
          What is One-Hot Encoding in NLP?
        </div>
        <div class="vg-qa-a">
          <strong>Binary presence vector</strong> over the vocabulary. 1 if the word appears in the document, 0 otherwise. No frequency, no order. Size = vocabulary length.
          <div class="vg-memory-row">
            <span class="vg-memory-pill">Binary: 0 or 1</span>
            <span class="vg-memory-pill">Ignores frequency</span>
            <span class="vg-memory-pill amber">Simplest encoder</span>
          </div>
        </div>
      </div>

      <div class="vg-qa-item vg-reveal vg-d1">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Define</span><br>
          What is Bag of Words?
        </div>
        <div class="vg-qa-a">
          <strong>Word count vector</strong> over the vocabulary. Stores how many times each word appears. Frequency-aware but order-blind — treats a document as an unordered bag of tokens.
          <div class="vg-memory-row">
            <span class="vg-memory-pill">Counts, not binary</span>
            <span class="vg-memory-pill">Order-blind</span>
            <span class="vg-memory-pill amber">CountVectorizer in sklearn</span>
          </div>
        </div>
      </div>

      <div class="vg-qa-item vg-reveal vg-d2">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Define</span><br>
          What is TF-IDF and why does it outperform BoW?
        </div>
        <div class="vg-qa-a">
          <strong>Term Frequency × Inverse Document Frequency.</strong> Scales BoW counts down for words that appear in many documents. Words like &#8220;good&#8221; that are everywhere get suppressed; rare words that are distinctive get amplified. Formula: <code>tf(t,d) · log(N / (1 + df(t)))</code>
          <div class="vg-memory-row">
            <span class="vg-memory-pill teal">Rewards rarity</span>
            <span class="vg-memory-pill teal">Penalises ubiquity</span>
            <span class="vg-memory-pill">TfidfVectorizer</span>
          </div>
        </div>
      </div>

      <div class="vg-qa-item vg-reveal">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Compare</span><br>
          When would you use BoW over TF-IDF?
        </div>
        <div class="vg-qa-a">
          Use raw BoW counts with <strong>Naive Bayes</strong> — its probability estimates are count-based; IDF rescaling can hurt it. Otherwise, TF-IDF almost always gives a better signal for classifiers.
          <div class="vg-memory-row">
            <span class="vg-memory-pill amber">Naive Bayes → BoW</span>
            <span class="vg-memory-pill teal">Logistic Regression → TF-IDF</span>
          </div>
        </div>
      </div>

      <div class="vg-qa-item vg-reveal vg-d1">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Gotcha</span><br>
          What is sparsity and why does it matter?
        </div>
        <div class="vg-qa-a">
          A BoW/TF-IDF matrix is typically <strong>95–99% zeros</strong> because documents are short and vocabularies are large. Always store in <strong>sparse format (CSR)</strong> — a dense matrix of 1M docs × 200k vocab = 200B cells, which won&#8217;t fit in RAM.
          <div class="vg-memory-row">
            <span class="vg-memory-pill">98% zeros = normal</span>
            <span class="vg-memory-pill amber">Always use CSR format</span>
          </div>
        </div>
      </div>

      <div class="vg-qa-item vg-reveal vg-d2">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Weakness</span><br>
          What can&#8217;t BoW/TF-IDF do that embeddings can?
        </div>
        <div class="vg-qa-a">
          They are <strong>lexical, not semantic.</strong> &#8220;Audio is excellent&#8221; and &#8220;sound is great&#8221; share zero tokens → zero similarity. &#8220;Battery lasts&#8221; and &#8220;battery dies&#8221; share most tokens → high similarity. Embeddings fix both by mapping meaning, not just words.
          <div class="vg-memory-row">
            <span class="vg-memory-pill">No synonyms</span>
            <span class="vg-memory-pill">No negation</span>
            <span class="vg-memory-pill teal">Use embeddings for semantics</span>
          </div>
        </div>
      </div>

      <div class="vg-qa-item vg-reveal">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Use Case</span><br>
          When do classical methods still win in 2026?
        </div>
        <div class="vg-qa-a">
          <strong>4 scenarios where BoW/TF-IDF beat neural alternatives:</strong> (1) exact-match / keyword search — BM25 still outperforms embeddings for identifier queries; (2) interpretability requirements; (3) training speed at millions of documents on a laptop; (4) cold-start with zero labelled data.
          <div class="vg-memory-row">
            <span class="vg-memory-pill teal">BM25 search</span>
            <span class="vg-memory-pill teal">Interpretability</span>
            <span class="vg-memory-pill teal">Cold start</span>
            <span class="vg-memory-pill teal">Speed</span>
          </div>
        </div>
      </div>

    </div>
  </div>

  <!-- FOOTER CTA -->
  <div class="vg-post-footer">
    <p>Full pipeline — scraper to classifier — in the <strong>GenAI Mastery Series</strong> source repo.</p>
    <a href="https://github.com/vijaygokarn130/ml-classic-concepts" class="vg-source-link" target="_blank" rel="noopener noreferrer">View Source on GitHub ↗</a>
  </div>

</div><!-- /vg-blog-wrap -->

<script>
(function(){
  // Scroll-reveal: elements carrying .vg-reveal start at opacity:0 and gain
  // .vg-visible when they enter the viewport (threshold 8% visible).
  var targets = document.querySelectorAll('.vg-reveal');
  // Graceful degradation: without IntersectionObserver the observer code would
  // throw and every .vg-reveal element would stay permanently invisible, so
  // reveal everything immediately instead.
  if (!('IntersectionObserver' in window)) {
    targets.forEach(function(el){ el.classList.add('vg-visible'); });
    return;
  }
  var obs = new IntersectionObserver(function(entries){
    entries.forEach(function(entry){
      if (entry.isIntersecting) {
        entry.target.classList.add('vg-visible');
        obs.unobserve(entry.target); // one-shot reveal; no need to keep observing
      }
    });
  }, {threshold: 0.08});
  targets.forEach(function(el){ obs.observe(el); });
})();
</script>
<p>The post <a href="https://vijay-gokarn.com/from-amazon-reviews-to-numbers-a-hands-on-tour-of-one-hot-bag-of-words-and-tf-idf/">From Amazon Reviews to Numbers: A Hands-On Tour of One-Hot, Bag of Words, and TF-IDF</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">263</post-id>	</item>
		<item>
		<title>Analyzing Wikipedia Articles with Langchain and OpenAI in Databricks</title>
		<link>https://vijay-gokarn.com/analyzing-wikipedia-articles-with-langchain-and-openai-in-databricks/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=analyzing-wikipedia-articles-with-langchain-and-openai-in-databricks</link>
		
		<dc:creator><![CDATA[Vijay Gokarn]]></dc:creator>
		<pubDate>Tue, 16 Jul 2024 10:59:28 +0000</pubDate>
				<category><![CDATA[ai-agents]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[generative-ai]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">https://vijay-gokarn.com/?p=140</guid>

					<description><![CDATA[<p>GenAI Mastery Series · NLP · Databricks · LangChain Categorizing Wikipedia at Scale with OpenAI, LangChain &#038; Databricks Datasetwikimedia/wikipedia · 10,000 articles ModelChatOpenAI (GPT-4) Output50-category JSON classifier Stack Databricks Notebook LangChain Core langchain_openai HuggingFace Datasets ChatPromptTemplate Batch Inference JSON Parsing A complete walkthrough of a large-scale text classification pipeline built inside a Databricks notebook — [&#8230;]</p>
<p>The post <a href="https://vijay-gokarn.com/analyzing-wikipedia-articles-with-langchain-and-openai-in-databricks/">Analyzing Wikipedia Articles with Langchain and OpenAI in Databricks</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Cormorant+Garamond:ital,wght@0,300;0,400;0,600;1,300;1,400&#038;family=DM+Sans:wght@300;400;500&#038;family=DM+Mono:wght@400&#038;display=swap" rel="stylesheet">

<style>
.vg6 {
  --ink: #0e0e0e; --paper: #f7f4ef; --paper-dark: #ede9e1;
  --teal: #0f6e56; --teal-light: #1d9e75; --teal-muted: #e1f5ee;
  --amber: #ba7517; --amber-light: #fac775; --amber-muted: #faeeda;
  --charcoal: #2c2c2a; --muted: #888780;
  --border: rgba(14,14,14,0.12); --border-strong: rgba(14,14,14,0.25);
  --db-red: #e8353a; --db-dark: #1b1f23;
  font-family: 'DM Sans', sans-serif; font-weight: 300;
  color: var(--ink); background: var(--paper); line-height: 1.75; font-size: 16px; overflow-x: hidden;
}
.vg6 *, .vg6 *::before, .vg6 *::after { box-sizing: border-box; margin: 0; padding: 0; }

/* HERO */
.vg6-hero { background: var(--db-dark); padding: 5rem 4rem 4rem; position: relative; overflow: hidden; }
.vg6-hero::before {
  content: '{ }'; font-family: 'Cormorant Garamond', serif; font-size: 18rem;
  font-weight: 300; color: rgba(255,255,255,0.025); position: absolute;
  right: 0rem; bottom: -4rem; line-height: 1; pointer-events: none; letter-spacing: -0.05em;
}
.vg6-hero-inner { position: relative; z-index: 1; max-width: 900px; }
.vg6-eyebrow { font-size: 0.68rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; margin-bottom: 1.25rem; display: flex; align-items: center; gap: 0.75rem; }
.vg6-eyebrow::before { content: ''; display: inline-block; width: 1.5rem; height: 1px; background: var(--teal-light); }
.vg6-hero h1 { font-family: 'Cormorant Garamond', serif; font-size: clamp(2.2rem, 5vw, 3.8rem); font-weight: 300; line-height: 1.1; color: var(--paper); letter-spacing: -0.02em; margin-bottom: 1.5rem; max-width: 28ch; }
.vg6-hero h1 em { font-style: italic; color: var(--amber-light); }
.vg6-meta-row { display: flex; gap: 2rem; flex-wrap: wrap; }
.vg6-meta { font-size: 0.7rem; letter-spacing: 0.1em; text-transform: uppercase; color: rgba(247,244,239,0.35); }
.vg6-meta span { color: rgba(247,244,239,0.7); margin-left: 0.4rem; }

/* STACK BAND */
.vg6-stack-band { background: var(--db-red); padding: 1.1rem 4rem; display: flex; gap: 0.75rem; flex-wrap: wrap; align-items: center; }
.vg6-stack-label { font-size: 0.63rem; letter-spacing: 0.18em; text-transform: uppercase; color: rgba(255,255,255,0.6); font-weight: 400; margin-right: 0.4rem; }
.vg6-stack-pill { font-size: 0.7rem; letter-spacing: 0.05em; padding: 0.28rem 0.85rem; background: rgba(255,255,255,0.12); color: #fff; border: 0.5px solid rgba(255,255,255,0.2); }

/* INTRO */
.vg6-intro { background: var(--teal-muted); padding: 2.5rem 4rem; border-left: 4px solid var(--teal); }
.vg6-intro p { font-size: 1.05rem; line-height: 1.85; color: var(--charcoal); font-weight: 300; max-width: 80ch; }
.vg6-intro strong { color: var(--teal); font-weight: 500; }

/* PREREQS */
.vg6-prereqs { background: var(--paper-dark); padding: 2rem 4rem; display: flex; gap: 2rem; flex-wrap: wrap; align-items: center; border-bottom: 0.5px solid var(--border); }
.vg6-prereq-label { font-size: 0.63rem; letter-spacing: 0.18em; text-transform: uppercase; color: var(--muted); font-weight: 500; flex-shrink: 0; }
.vg6-prereq-chips { display: flex; gap: 0.6rem; flex-wrap: wrap; }
.vg6-prereq-chip { font-size: 0.72rem; padding: 0.3rem 0.9rem; border: 0.5px solid var(--border-strong); color: var(--charcoal); background: var(--paper); display: flex; align-items: center; gap: 0.4rem; }
.vg6-prereq-chip::before { content: '✓'; color: var(--teal); font-size: 0.65rem; font-weight: 600; }

/* BODY */
.vg6-body { max-width: 900px; margin: 0 auto; padding: 4rem; }
.vg6-step { margin-bottom: 3.5rem; }
.vg6-step-label { font-size: 0.63rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg6-step-label::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg6-step h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.4rem, 3vw, 2rem); font-weight: 300; line-height: 1.2; color: var(--ink); margin-bottom: 1rem; }
.vg6-step h2 em { font-style: italic; color: var(--teal); }
.vg6-step p { font-size: 0.93rem; line-height: 1.9; color: var(--charcoal); font-weight: 300; margin-bottom: 1rem; }
.vg6-step p strong { color: var(--ink); font-weight: 500; }
.vg6-divider { border: none; border-top: 0.5px solid var(--border); margin: 3rem 0; }

/* CALLOUT */
.vg6-callout { background: var(--paper-dark); border-left: 3px solid var(--amber); padding: 1.25rem 1.5rem; margin: 1.25rem 0; font-size: 0.87rem; line-height: 1.8; color: var(--charcoal); }
.vg6-callout strong { color: var(--amber); font-weight: 500; }
.vg6-callout.teal { border-color: var(--teal); }
.vg6-callout.teal strong { color: var(--teal); }

/* ── CODE BLOCKS ── */
.vg6-code-wrap { margin: 1.25rem 0; border: 0.5px solid rgba(255,255,255,0.06); overflow: hidden; }
.vg6-code-header { background: #2d333b; padding: 0.6rem 1.25rem; display: flex; justify-content: space-between; align-items: center; border-bottom: 0.5px solid rgba(255,255,255,0.06); }
.vg6-code-filename { font-family: 'DM Mono', monospace; font-size: 0.68rem; color: rgba(247,244,239,0.45); letter-spacing: 0.04em; }
.vg6-code-lang { font-size: 0.6rem; letter-spacing: 0.14em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; }
.vg6-code-body { background: var(--db-dark); padding: 1.5rem; overflow-x: auto; }
.vg6-code-body pre { margin: 0; }
.vg6-code-body code { font-family: 'DM Mono', monospace; font-size: 0.82rem; line-height: 1.85; color: #e6edf3; white-space: pre; display: block; }

/* Syntax token colours */
.vg6-k  { color: #ff7b72; }   /* keyword: import, def, for, if */
.vg6-s  { color: #a5d6ff; }   /* string */
.vg6-c  { color: #8b949e; font-style: italic; } /* comment */
.vg6-f  { color: #d2a8ff; }   /* function / class call */
.vg6-n  { color: var(--amber-light); } /* number / constant */
.vg6-v  { color: #79c0ff; }   /* variable name */
.vg6-p  { color: #e6edf3; }   /* punctuation */
.vg6-m  { color: var(--teal-light); } /* magic / decorator */

/* ARCHITECTURE DIAGRAM */
.vg6-arch { display: flex; align-items: center; gap: 0; margin: 1.5rem 0; flex-wrap: wrap; }
.vg6-arch-box { background: var(--paper); border: 0.5px solid var(--border-strong); padding: 0.75rem 1.1rem; text-align: center; flex: 1; min-width: 100px; }
.vg6-arch-box .vg6-arch-icon { font-size: 1.25rem; margin-bottom: 0.25rem; }
.vg6-arch-box h5 { font-family: 'Cormorant Garamond', serif; font-size: 0.95rem; font-weight: 400; color: var(--ink); margin-bottom: 0.15rem; }
.vg6-arch-box p { font-size: 0.68rem; color: var(--muted); line-height: 1.4; font-weight: 300; }
.vg6-arch-box.highlight { background: var(--teal); border-color: var(--teal); }
.vg6-arch-box.highlight h5 { color: var(--paper); }
.vg6-arch-box.highlight p { color: rgba(247,244,239,0.6); }
.vg6-arch-arrow { font-size: 1rem; color: var(--muted); padding: 0 0.3rem; flex-shrink: 0; }

/* OUTPUT CARD */
.vg6-output-section { background: var(--db-dark); padding: 4rem; }
.vg6-output-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--amber-light); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg6-output-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--amber-light); }
.vg6-output-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.2rem); font-weight: 300; color: var(--paper); margin-bottom: 2rem; }
.vg6-output-section > h2 em { font-style: italic; color: var(--amber-light); }
.vg6-output-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 1px; background: rgba(247,244,239,0.06); border: 0.5px solid rgba(247,244,239,0.06); margin-bottom: 2rem; }
.vg6-output-stat { background: var(--db-dark); padding: 1.5rem; }
.vg6-output-stat-n { font-family: 'Cormorant Garamond', serif; font-size: 2.2rem; font-weight: 300; color: var(--teal-light); line-height: 1; margin-bottom: 0.3rem; letter-spacing: -0.02em; }
.vg6-output-stat-l { font-size: 0.65rem; letter-spacing: 0.12em; text-transform: uppercase; color: rgba(247,244,239,0.35); }
.vg6-json-card { background: #161b22; border: 0.5px solid rgba(247,244,239,0.08); overflow: hidden; }
.vg6-json-header { background: #2d333b; padding: 0.55rem 1.25rem; display: flex; justify-content: space-between; }
.vg6-json-header span { font-family: 'DM Mono', monospace; font-size: 0.65rem; color: rgba(247,244,239,0.4); }
.vg6-json-header .vg6-json-tag { color: var(--teal-light); }
.vg6-json-body { padding: 1.5rem; font-family: 'DM Mono', monospace; font-size: 0.83rem; line-height: 1.9; color: #e6edf3; }
.vg6-json-key { color: var(--amber-light); }
.vg6-json-val-str { color: #a5d6ff; }
.vg6-json-val-num { color: #79c0ff; }
.vg6-json-punct { color: rgba(247,244,239,0.4); }

/* CATEGORIES CLOUD */
.vg6-categories { display: flex; flex-wrap: wrap; gap: 0.5rem; margin: 1.5rem 0; }
.vg6-cat-pill { font-size: 0.7rem; letter-spacing: 0.05em; padding: 0.3rem 0.85rem; border: 0.5px solid rgba(247,244,239,0.12); color: rgba(247,244,239,0.55); }
.vg6-cat-pill.active { border-color: var(--teal-light); color: var(--teal-light); background: rgba(29,158,117,0.1); }

/* INTERVIEW */
.vg6-interview-section { background: var(--teal-muted); padding: 4rem; }
.vg6-interview-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg6-interview-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg6-interview-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.4rem); font-weight: 300; color: var(--ink); margin-bottom: 2.5rem; }
.vg6-interview-section > h2 em { font-style: italic; color: var(--teal); }
.vg6-qa-list { display: flex; flex-direction: column; }
.vg6-qa-item { display: grid; grid-template-columns: 1fr 1.4fr; gap: 2rem; padding: 1.5rem 0; border-top: 0.5px solid rgba(14,14,14,0.1); align-items: start; }
.vg6-qa-item:last-child { border-bottom: 0.5px solid rgba(14,14,14,0.1); }
.vg6-qa-q { font-family: 'Cormorant Garamond', serif; font-size: 1.05rem; font-weight: 400; color: var(--ink); line-height: 1.4; }
.vg6-q-badge { font-family: 'DM Mono', monospace; font-size: 0.58rem; letter-spacing: 0.1em; text-transform: uppercase; background: var(--teal); color: var(--paper); padding: 0.15rem 0.5rem; margin-bottom: 0.5rem; display: inline-block; }
.vg6-qa-a { font-size: 0.83rem; line-height: 1.8; color: var(--charcoal); font-weight: 300; }
.vg6-qa-a strong { color: var(--teal); font-weight: 500; }
.vg6-qa-a code { font-family: 'DM Mono', monospace; font-size: 0.77rem; background: rgba(14,14,14,0.07); padding: 0.1rem 0.35rem; color: var(--ink); }
.vg6-pills { display: flex; flex-wrap: wrap; gap: 0.5rem; margin-top: 0.75rem; }
.vg6-pill { font-size: 0.67rem; letter-spacing: 0.06em; padding: 0.25rem 0.75rem; border: 0.5px solid var(--border-strong); color: var(--charcoal); }
.vg6-pill.t { border-color: var(--teal); color: var(--teal); background: var(--teal-muted); }
.vg6-pill.a { border-color: var(--amber); color: var(--amber); background: var(--amber-muted); }

/* FOOTER */
.vg6-footer { background: var(--db-dark); padding: 3rem 4rem; display: flex; justify-content: space-between; align-items: center; flex-wrap: wrap; gap: 1.5rem; border-top: 0.5px solid rgba(247,244,239,0.06); }
.vg6-footer p { font-size: 0.82rem; color: rgba(247,244,239,0.35); font-weight: 300; }
.vg6-footer p strong { color: rgba(247,244,239,0.65); font-weight: 400; }
.vg6-footer-links { display: flex; gap: 1rem; }
.vg6-footer-btn { display: inline-block; padding: 0.65rem 1.75rem; font-size: 0.7rem; letter-spacing: 0.12em; text-transform: uppercase; text-decoration: none; font-weight: 400; }
.vg6-footer-btn.primary { background: var(--teal); color: var(--paper); }
.vg6-footer-btn.ghost { background: transparent; color: rgba(247,244,239,0.55); border: 0.5px solid rgba(247,244,239,0.2); }

/* REVEAL */
.vg6-reveal { opacity: 0; transform: translateY(20px); transition: opacity 0.55s ease, transform 0.55s ease; }
.vg6-reveal.vg6-vis { opacity: 1; transform: translateY(0); }
.vg6-d1 { transition-delay: 0.1s; } .vg6-d2 { transition-delay: 0.2s; }
</style>

<div class="vg6">

<!-- HERO -->
<div class="vg6-hero">
  <div class="vg6-hero-inner">
    <p class="vg6-eyebrow">GenAI Mastery Series · NLP · Databricks · LangChain</p>
    <h1>Categorizing Wikipedia at Scale with <em>OpenAI, LangChain &#038; Databricks</em></h1>
    <div class="vg6-meta-row">
      <p class="vg6-meta">Dataset<span>wikimedia/wikipedia · 10,000 articles</span></p>
      <p class="vg6-meta">Model<span>ChatOpenAI (GPT-4)</span></p>
      <p class="vg6-meta">Output<span>50-category JSON classifier</span></p>
    </div>
  </div>
</div>

<!-- STACK BAND -->
<div class="vg6-stack-band">
  <span class="vg6-stack-label">Stack</span>
  <span class="vg6-stack-pill">Databricks Notebook</span>
  <span class="vg6-stack-pill">LangChain Core</span>
  <span class="vg6-stack-pill">langchain_openai</span>
  <span class="vg6-stack-pill">HuggingFace Datasets</span>
  <span class="vg6-stack-pill">ChatPromptTemplate</span>
  <span class="vg6-stack-pill">Batch Inference</span>
  <span class="vg6-stack-pill">JSON Parsing</span>
</div>

<!-- INTRO -->
<div class="vg6-intro">
  <p>A complete walkthrough of a <strong>large-scale text classification pipeline</strong> built inside a Databricks notebook — from loading 10,000 Wikipedia articles to batch-classifying a 1,000-article subset into 50 categories using OpenAI&#8217;s language model via LangChain. Every step includes the real working code.</p>
</div>

<!-- PREREQS -->
<div class="vg6-prereqs">
  <span class="vg6-prereq-label">Prerequisites</span>
  <div class="vg6-prereq-chips">
    <span class="vg6-prereq-chip">Databricks Account</span>
    <span class="vg6-prereq-chip">Python (basic)</span>
    <span class="vg6-prereq-chip">OpenAI API Key</span>
    <span class="vg6-prereq-chip">HuggingFace Access</span>
  </div>
</div>

<!-- BODY -->
<div class="vg6-body">

  <!-- ARCHITECTURE -->
  <div class="vg6-step vg6-reveal">
    <p class="vg6-step-label">Overview</p>
    <h2>Pipeline <em>architecture</em></h2>
    <p>The full pipeline runs end-to-end inside a single Databricks notebook. Wikipedia articles are loaded from HuggingFace, cleaned to first-line summaries, batched, and sent to an OpenAI chat model (GPT-4, or the <code>ChatOpenAI</code> default) via LangChain&#8217;s chain interface. Responses are parsed from JSON into a DataFrame.</p>
    <div class="vg6-arch">
      <div class="vg6-arch-box"><div class="vg6-arch-icon">📦</div><h5>HuggingFace</h5><p>wikimedia/wikipedia dataset</p></div>
      <div class="vg6-arch-arrow">→</div>
      <div class="vg6-arch-box"><div class="vg6-arch-icon">✂️</div><h5>Clean</h5><p>First-line extraction</p></div>
      <div class="vg6-arch-arrow">→</div>
      <div class="vg6-arch-box highlight"><div class="vg6-arch-icon" style="color:var(--paper)">⛓</div><h5>LangChain</h5><p>Prompt + ChatOpenAI</p></div>
      <div class="vg6-arch-arrow">→</div>
      <div class="vg6-arch-box"><div class="vg6-arch-icon">🔄</div><h5>Batch (8)</h5><p>Rate-limit safe</p></div>
      <div class="vg6-arch-arrow">→</div>
      <div class="vg6-arch-box"><div class="vg6-arch-icon">📊</div><h5>DataFrame</h5><p>id + category</p></div>
    </div>
  </div>

  <hr class="vg6-divider">

  <!-- STEP 1 -->
  <div class="vg6-step vg6-reveal">
    <p class="vg6-step-label">Step 1</p>
    <h2>Install <em>required packages</em></h2>
    <p>In a Databricks notebook, use <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">%pip</code> magic commands to install packages into the cluster. The <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">%restart_python</code> command refreshes the interpreter to pick up the new packages without restarting the whole cluster.</p>
    <div class="vg6-code-wrap">
      <div class="vg6-code-header"><span class="vg6-code-filename">Databricks Notebook — Cell 1</span><span class="vg6-code-lang">Python / Magic</span></div>
      <div class="vg6-code-body"><code><span class="vg6-m">%pip install</span> langchain_openai
<span class="vg6-m">%pip install</span> <span class="vg6-v">--upgrade</span> langchain_core langchain_openai

<span class="vg6-m">%restart_python</span></code></div>
    </div>
  </div>

  <hr class="vg6-divider">

  <!-- STEP 2 -->
  <div class="vg6-step vg6-reveal">
    <p class="vg6-step-label">Step 2</p>
    <h2>Import <em>libraries</em></h2>
    <p>Standard Python utilities (<code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">json</code>, <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">time</code>, <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">os</code>) combined with <strong>LangChain</strong> for the LLM interface, <strong>HuggingFace Datasets</strong> for Wikipedia data loading, and <strong>tqdm</strong> for progress visibility during batch processing.</p>
    <div class="vg6-code-wrap">
      <div class="vg6-code-header"><span class="vg6-code-filename">Databricks Notebook — Cell 2</span><span class="vg6-code-lang">Python</span></div>
      <div class="vg6-code-body"><code><span class="vg6-k">import</span> json
<span class="vg6-k">import</span> time
<span class="vg6-k">import</span> os
<span class="vg6-k">import</span> getpass
<span class="vg6-k">import</span> pandas <span class="vg6-k">as</span> pd

<span class="vg6-k">from</span> datasets <span class="vg6-k">import</span> Dataset, load_dataset
<span class="vg6-k">from</span> tqdm <span class="vg6-k">import</span> tqdm
<span class="vg6-k">from</span> langchain_core.prompts <span class="vg6-k">import</span> ChatPromptTemplate
<span class="vg6-k">from</span> langchain_openai <span class="vg6-k">import</span> ChatOpenAI</code></div>
    </div>
  </div>

  <hr class="vg6-divider">

  <!-- STEP 3 -->
  <div class="vg6-step vg6-reveal">
    <p class="vg6-step-label">Step 3</p>
    <h2>Load &#038; clean <em>the dataset</em></h2>
    <p>The HuggingFace <strong>wikimedia/wikipedia</strong> dataset is massive — we take a 10,000 article slice from the English November 2023 snapshot. The cleaning step extracts only the first line of each article (the summary sentence), which is sufficient for category classification and drastically reduces token usage.</p>
    <div class="vg6-code-wrap">
      <div class="vg6-code-header"><span class="vg6-code-filename">Databricks Notebook — Cell 3</span><span class="vg6-code-lang">Python</span></div>
      <div class="vg6-code-body"><code><span class="vg6-c"># Load the Wikipedia English dataset (Nov 2023 snapshot)</span>
dataset = <span class="vg6-f">load_dataset</span>(<span class="vg6-s">"wikimedia/wikipedia"</span>, <span class="vg6-s">"20231101.en"</span>)

<span class="vg6-c"># Take a 10k article sample</span>
<span class="vg6-v">NUM_SAMPLES</span> = <span class="vg6-n">10000</span>
articles = dataset[<span class="vg6-s">"train"</span>][:<span class="vg6-v">NUM_SAMPLES</span>][<span class="vg6-s">"text"</span>]
ids      = dataset[<span class="vg6-s">"train"</span>][:<span class="vg6-v">NUM_SAMPLES</span>][<span class="vg6-s">"id"</span>]

<span class="vg6-c"># Clean: keep only the first line (article summary) to reduce tokens</span>
articles = [x.<span class="vg6-f">split</span>(<span class="vg6-s">"\n"</span>)[<span class="vg6-n">0</span>] <span class="vg6-k">for</span> x <span class="vg6-k">in</span> articles]

<span class="vg6-c"># Sanity check</span>
<span class="vg6-f">print</span>(<span class="vg6-f">len</span>(articles))   <span class="vg6-c"># → 10000</span>
<span class="vg6-f">print</span>(articles[<span class="vg6-n">99</span>])    <span class="vg6-c"># inspect a sample article</span></code></div>
    </div>
    <div class="vg6-callout teal">
      <strong>Why first line only?</strong> Wikipedia article summaries are dense and self-contained. Using the full article would cost ~10–50x more tokens per classification with minimal accuracy gain. At 10k articles × avg 150 tokens = ~1.5M input tokens — already significant. First-line only brings that to ~200k tokens.
    </div>
  </div>

  <hr class="vg6-divider">

  <!-- STEP 4 -->
  <div class="vg6-step vg6-reveal">
    <p class="vg6-step-label">Step 4</p>
    <h2>Configure <em>OpenAI + LangChain</em></h2>
    <p>Use <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">getpass</code> to securely prompt for the API key without echoing it to the notebook output. Then initialize <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">ChatOpenAI</code> — LangChain&#8217;s wrapper around the OpenAI Chat Completions API.</p>
    <div class="vg6-code-wrap">
      <div class="vg6-code-header"><span class="vg6-code-filename">Databricks Notebook — Cell 4 &#038; 5</span><span class="vg6-code-lang">Python</span></div>
      <div class="vg6-code-body"><code><span class="vg6-c"># Securely enter API key (won't echo to notebook output)</span>
os.environ[<span class="vg6-s">"OPENAI_API_KEY"</span>] = getpass.<span class="vg6-f">getpass</span>(<span class="vg6-s">"Enter your OpenAI API key: "</span>)

<span class="vg6-c"># Initialize the LangChain ChatOpenAI wrapper</span>
llm = <span class="vg6-f">ChatOpenAI</span>()
<span class="vg6-f">print</span>(llm.model_name)  <span class="vg6-c"># → "gpt-3.5-turbo" (default) or your configured model</span></code></div>
    </div>
  </div>

  <hr class="vg6-divider">

  <!-- STEP 5 -->
  <div class="vg6-step vg6-reveal">
    <p class="vg6-step-label">Step 5 — Core Logic</p>
    <h2>Define the <em>prompt template</em></h2>
    <p>The <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">ChatPromptTemplate</code> structures the conversation: a system message sets the classification task with all 50 categories, and the human message carries the article payload. The <strong>double curly braces</strong> <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">&#123;&#123; &#125;&#125;</code> in the JSON schema escape the literal braces so LangChain doesn&#8217;t treat them as template variables.</p>
    <div class="vg6-code-wrap">
      <div class="vg6-code-header"><span class="vg6-code-filename">Databricks Notebook — Cell 6</span><span class="vg6-code-lang">Python</span></div>
      <div class="vg6-code-body"><code>prompt = ChatPromptTemplate.<span class="vg6-f">from_messages</span>([
    (<span class="vg6-s">"system"</span>, <span class="vg6-s">"""Your task is to assess the article and categorize it
into one of the following predefined categories:

'History', 'Geography', 'Science', 'Technology', 'Mathematics',
'Literature', 'Art', 'Music', 'Film', 'Television', 'Sports',
'Politics', 'Philosophy', 'Religion', 'Sociology', 'Psychology',
'Economics', 'Business', 'Medicine', 'Biology', 'Chemistry',
'Physics', 'Astronomy', 'Environmental Science', 'Engineering',
'Computer Science', 'Linguistics', 'Anthropology', 'Archaeology',
'Education', 'Law', 'Military', 'Architecture', 'Fashion',
'Cuisine', 'Travel', 'Mythology', 'Folklore', 'Biography',
'Social Issues', 'Human Rights', 'Technology Ethics',
'Climate Change', 'Conservation', 'Urban Studies', 'Demographics',
'Journalism', 'Cryptocurrency', 'Artificial Intelligence'

Output ONLY a JSON object — no extra text:
{{
    "id": string,
    "category": string
}}"""</span>),
    (<span class="vg6-s">"human"</span>, <span class="vg6-s">"{input}"</span>)
])</code></div>
    </div>
    <div class="vg6-callout">
      <strong>Prompt engineering note:</strong> Listing all valid categories explicitly in the system prompt constrains the model to valid outputs — reducing hallucinated or free-form category names. The strict JSON output instruction combined with downstream <code>json.loads()</code> parsing creates a simple but robust structured output pipeline.
    </div>
  </div>

  <hr class="vg6-divider">

  <!-- STEP 6 -->
  <div class="vg6-step vg6-reveal">
    <p class="vg6-step-label">Step 6</p>
    <h2>Build the chain &#038; <em>test it</em></h2>
    <p>LangChain&#8217;s pipe operator <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">|</code> composes the prompt template and the LLM into a reusable chain. One call to <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">.invoke()</code> with a single article validates the whole setup before committing to batch processing.</p>
    <div class="vg6-code-wrap">
      <div class="vg6-code-header"><span class="vg6-code-filename">Databricks Notebook — Cell 7</span><span class="vg6-code-lang">Python</span></div>
      <div class="vg6-code-body"><code><span class="vg6-c"># Compose prompt → llm into a reusable chain</span>
chain = prompt | llm

<span class="vg6-c"># Test with article[0] before running the full batch</span>
content  = json.<span class="vg6-f">dumps</span>({<span class="vg6-s">"id"</span>: ids[<span class="vg6-n">0</span>], <span class="vg6-s">"article"</span>: articles[<span class="vg6-n">0</span>]})
response = chain.<span class="vg6-f">invoke</span>(content)
<span class="vg6-f">print</span>(response.content)
<span class="vg6-c"># → {"id": "1", "category": "History"}</span></code></div>
    </div>
  </div>

  <hr class="vg6-divider">

  <!-- STEP 7 -->
  <div class="vg6-step vg6-reveal">
    <p class="vg6-step-label">Step 7 — Core Loop</p>
    <h2>Batch processing <em>with rate-limit handling</em></h2>
    <p>Processing 1,000 articles one-by-one would quickly hit OpenAI&#8217;s requests-per-minute limit. The solution: accumulate inputs into batches of 8 and call <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">.batch()</code> with a <strong>1.5-second sleep</strong> between each batch. <code style="font-family:'DM Mono',monospace;font-size:0.82rem;background:rgba(14,14,14,0.07);padding:0.1rem 0.35rem">tqdm</code> wraps the loop to give live progress in the notebook.</p>
    <div class="vg6-code-wrap">
      <div class="vg6-code-header"><span class="vg6-code-filename">Databricks Notebook — Cell 8</span><span class="vg6-code-lang">Python</span></div>
      <div class="vg6-code-body"><code>results    = []
<span class="vg6-v">BATCH_SIZE</span> = <span class="vg6-n">8</span>
inputs     = []

<span class="vg6-k">for</span> index, article <span class="vg6-k">in</span> <span class="vg6-f">tqdm</span>(<span class="vg6-f">enumerate</span>(articles[:<span class="vg6-n">1000</span>])):

    inputs.<span class="vg6-f">append</span>(
        json.<span class="vg6-f">dumps</span>({<span class="vg6-s">"id"</span>: ids[index], <span class="vg6-s">"article"</span>: articles[index]})
    )

    <span class="vg6-k">if</span> <span class="vg6-f">len</span>(inputs) == <span class="vg6-v">BATCH_SIZE</span>:
        time.<span class="vg6-f">sleep</span>(<span class="vg6-n">1.5</span>)            <span class="vg6-c"># respect rate limits</span>
        response = chain.<span class="vg6-f">batch</span>(inputs)
        results += response
        inputs   = []               <span class="vg6-c"># reset buffer</span>

<span class="vg6-c"># Flush any remaining articles in the last partial batch</span>
<span class="vg6-k">if</span> inputs:
    response = chain.<span class="vg6-f">batch</span>(inputs)
    results += response</code></div>
    </div>
    <div class="vg6-callout">
      <strong>Rate limit strategy:</strong> Batch size 8 with a 1.5s sleep = one batch of 8 requests every ~1.5 seconds ≈ 5 requests/sec (~320 requests/min). For the free OpenAI tier (3 RPM), reduce batch size to 1 and increase sleep to 20s. For production use, implement exponential backoff with <code>tenacity</code>.
    </div>
  </div>

  <hr class="vg6-divider">

  <!-- STEP 8 -->
  <div class="vg6-step vg6-reveal">
    <p class="vg6-step-label">Step 8</p>
    <h2>Parse results into <em>a DataFrame</em></h2>
    <p>Not every LLM response will be valid JSON — network hiccups, model refusals, and malformed outputs all happen at scale. The pattern below separates successful parses from failures so you can inspect and retry the failures without losing the successful results.</p>
    <div class="vg6-code-wrap">
      <div class="vg6-code-header"><span class="vg6-code-filename">Databricks Notebook — Cell 9</span><span class="vg6-code-lang">Python</span></div>
      <div class="vg6-code-body"><code>success = []
failure = []

<span class="vg6-k">for</span> output <span class="vg6-k">in</span> results:
    content = output.content
    <span class="vg6-k">try</span>:
        content = json.<span class="vg6-f">loads</span>(content)
        success.<span class="vg6-f">append</span>(content)
    <span class="vg6-k">except</span> ValueError <span class="vg6-k">as</span> e:
        failure.<span class="vg6-f">append</span>(content)  <span class="vg6-c"># keep for retry / inspection</span>

<span class="vg6-f">print</span>(<span class="vg6-f">f</span><span class="vg6-s">"Success: {len(success)} | Failure: {len(failure)}"</span>)

<span class="vg6-c"># Convert to DataFrame for analysis / export</span>
df = pd.<span class="vg6-f">DataFrame</span>(success)
df.<span class="vg6-f">head</span>(<span class="vg6-n">10</span>)</code></div>
    </div>
  </div>

</div><!-- /vg6-body -->

<!-- OUTPUT SECTION -->
<div class="vg6-output-section">
  <p class="vg6-output-eyebrow">Sample Output</p>
  <h2>What the <em>pipeline produces</em></h2>
  <div class="vg6-output-grid vg6-reveal">
    <div class="vg6-output-stat"><div class="vg6-output-stat-n">10k</div><div class="vg6-output-stat-l">Articles Loaded</div></div>
    <div class="vg6-output-stat"><div class="vg6-output-stat-n">1k</div><div class="vg6-output-stat-l">Articles Classified</div></div>
    <div class="vg6-output-stat"><div class="vg6-output-stat-n">50</div><div class="vg6-output-stat-l">Categories</div></div>
    <div class="vg6-output-stat"><div class="vg6-output-stat-n">8</div><div class="vg6-output-stat-l">Batch Size</div></div>
  </div>
  <div class="vg6-json-card vg6-reveal vg6-d1">
    <div class="vg6-json-header">
      <span>LLM Response — Single Article</span>
      <span class="vg6-json-tag">JSON output</span>
    </div>
    <div class="vg6-json-body">
<span class="vg6-json-punct">[</span>
  <span class="vg6-json-punct">{</span>
    <span class="vg6-json-key">&#8220;id&#8221;</span><span class="vg6-json-punct">:</span> <span class="vg6-json-val-str">&#8220;1&#8221;</span><span class="vg6-json-punct">,</span>
    <span class="vg6-json-key">&#8220;category&#8221;</span><span class="vg6-json-punct">:</span> <span class="vg6-json-val-str">&#8220;History&#8221;</span>
  <span class="vg6-json-punct">},</span>
  <span class="vg6-json-punct">{</span>
    <span class="vg6-json-key">&#8220;id&#8221;</span><span class="vg6-json-punct">:</span> <span class="vg6-json-val-str">&#8220;4&#8221;</span><span class="vg6-json-punct">,</span>
    <span class="vg6-json-key">&#8220;category&#8221;</span><span class="vg6-json-punct">:</span> <span class="vg6-json-val-str">&#8220;Computer Science&#8221;</span>
  <span class="vg6-json-punct">},</span>
  <span class="vg6-json-punct">{</span>
    <span class="vg6-json-key">&#8220;id&#8221;</span><span class="vg6-json-punct">:</span> <span class="vg6-json-val-str">&#8220;7&#8221;</span><span class="vg6-json-punct">,</span>
    <span class="vg6-json-key">&#8220;category&#8221;</span><span class="vg6-json-punct">:</span> <span class="vg6-json-val-str">&#8220;Biology&#8221;</span>
  <span class="vg6-json-punct">}</span>
<span class="vg6-json-punct">]</span>
    </div>
  </div>
  <p style="font-size:0.82rem;color:rgba(247,244,239,0.45);margin-top:1.5rem;font-weight:300;">All 50 available classification categories:</p>
  <div class="vg6-categories vg6-reveal vg6-d2">
    <span class="vg6-cat-pill active">History</span><span class="vg6-cat-pill">Geography</span><span class="vg6-cat-pill active">Science</span><span class="vg6-cat-pill">Technology</span><span class="vg6-cat-pill">Mathematics</span><span class="vg6-cat-pill">Literature</span><span class="vg6-cat-pill">Art</span><span class="vg6-cat-pill">Music</span><span class="vg6-cat-pill">Film</span><span class="vg6-cat-pill">Television</span><span class="vg6-cat-pill">Sports</span><span class="vg6-cat-pill active">Politics</span><span class="vg6-cat-pill">Philosophy</span><span class="vg6-cat-pill">Religion</span><span class="vg6-cat-pill">Sociology</span><span class="vg6-cat-pill">Psychology</span><span class="vg6-cat-pill">Economics</span><span class="vg6-cat-pill">Business</span><span class="vg6-cat-pill">Medicine</span><span class="vg6-cat-pill active">Biology</span><span class="vg6-cat-pill">Chemistry</span><span class="vg6-cat-pill">Physics</span><span class="vg6-cat-pill">Astronomy</span><span class="vg6-cat-pill">Environmental Science</span><span class="vg6-cat-pill">Engineering</span><span class="vg6-cat-pill active">Computer Science</span><span class="vg6-cat-pill">Linguistics</span><span class="vg6-cat-pill">Anthropology</span><span class="vg6-cat-pill">Archaeology</span><span class="vg6-cat-pill">Education</span><span class="vg6-cat-pill">Law</span><span class="vg6-cat-pill">Military</span><span class="vg6-cat-pill">Architecture</span><span class="vg6-cat-pill">Fashion</span><span class="vg6-cat-pill">Cuisine</span><span class="vg6-cat-pill">Travel</span><span class="vg6-cat-pill">Mythology</span><span class="vg6-cat-pill">Folklore</span><span class="vg6-cat-pill">Biography</span><span class="vg6-cat-pill">Social Issues</span><span class="vg6-cat-pill">Human Rights</span><span class="vg6-cat-pill active">Artificial Intelligence</span><span class="vg6-cat-pill">Cryptocurrency</span><span class="vg6-cat-pill">Climate Change</span><span 
class="vg6-cat-pill">Conservation</span><span class="vg6-cat-pill">Urban Studies</span><span class="vg6-cat-pill">Journalism</span><span class="vg6-cat-pill">Technology Ethics</span><span class="vg6-cat-pill">Demographics</span>
  </div>
</div>

<!-- INTERVIEW CHEAT SHEET -->
<div class="vg6-interview-section">
  <p class="vg6-interview-eyebrow">Interview Prep</p>
  <h2>Cheat sheet — <em>quick definitions to remember</em></h2>
  <div class="vg6-qa-list">

    <div class="vg6-qa-item vg6-reveal">
      <div class="vg6-qa-q"><span class="vg6-q-badge">Define</span><br>What is LangChain and what problem does it solve?</div>
      <div class="vg6-qa-a"><strong>A framework for composing LLM-powered applications</strong> from modular building blocks — prompts, models, chains, memory, tools, and agents. It solves the orchestration problem: how do you connect a prompt template to an LLM, parse the output, and chain multiple steps together cleanly?
        <div class="vg6-pills"><span class="vg6-pill t">Prompt + LLM + Output</span><span class="vg6-pill t">Composable chains</span><span class="vg6-pill">Pipe operator |</span></div>
      </div>
    </div>

    <div class="vg6-qa-item vg6-reveal vg6-d1">
      <div class="vg6-qa-q"><span class="vg6-q-badge">Explain</span><br>What is a ChatPromptTemplate?</div>
      <div class="vg6-qa-a">A <strong>reusable message template</strong> that structures the conversation for a chat model. Defines the system role (task instructions) and the human turn (variable input). The <code>{input}</code> placeholder gets filled at runtime. Separating instructions from data is a core prompt engineering best practice.
        <div class="vg6-pills"><span class="vg6-pill t">System = instructions</span><span class="vg6-pill t">Human = data</span><span class="vg6-pill">{input} placeholder</span></div>
      </div>
    </div>

    <div class="vg6-qa-item vg6-reveal">
      <div class="vg6-qa-q"><span class="vg6-q-badge">Explain</span><br>Why use <code>.batch()</code> instead of looping <code>.invoke()</code>?</div>
      <div class="vg6-qa-a"><code>.batch()</code> sends multiple requests <strong>concurrently</strong> using a thread pool under the hood (the async variant, <code>.abatch()</code>, uses asyncio), while <code>.invoke()</code> is sequential. For 8 articles, batch is roughly 8x faster. The sleep between batches manages rate limits — you get concurrency within a batch, pacing across batches.
        <div class="vg6-pills"><span class="vg6-pill t">Concurrent within batch</span><span class="vg6-pill a">Sleep between batches</span><span class="vg6-pill">8x throughput gain</span></div>
      </div>
    </div>

    <div class="vg6-qa-item vg6-reveal vg6-d1">
      <div class="vg6-qa-q"><span class="vg6-q-badge">Gotcha</span><br>Why separate success and failure lists instead of crashing on parse error?</div>
      <div class="vg6-qa-a">At 1,000+ LLM calls, <strong>some will fail</strong> — network timeouts, content policy refusals, or models that occasionally output extra text before the JSON. A try/except pattern collects failures without losing the successful results. Failures can be inspected and retried separately.
        <div class="vg6-pills"><span class="vg6-pill a">Never crash on parse error</span><span class="vg6-pill">Inspect failures separately</span><span class="vg6-pill t">Retry pattern</span></div>
      </div>
    </div>

    <div class="vg6-qa-item vg6-reveal">
      <div class="vg6-qa-q"><span class="vg6-q-badge">Best Practice</span><br>How do you get reliable structured JSON from an LLM?</div>
      <div class="vg6-qa-a">Three layers: <strong>(1) Constrain in the prompt</strong> — list valid values, specify exact schema, say &#8220;output ONLY JSON&#8221;. <strong>(2) Use LangChain&#8217;s output parsers</strong> (<code>JsonOutputParser</code>) for automatic parsing and retry. <strong>(3) Validate with Pydantic</strong> — define a model and parse the JSON through it to catch type errors.
        <div class="vg6-pills"><span class="vg6-pill t">Constrain schema in prompt</span><span class="vg6-pill t">JsonOutputParser</span><span class="vg6-pill a">Pydantic validation</span></div>
      </div>
    </div>

    <div class="vg6-qa-item vg6-reveal vg6-d1">
      <div class="vg6-qa-q"><span class="vg6-q-badge">Explain</span><br>Why use Databricks for this pipeline?</div>
      <div class="vg6-qa-a">Databricks provides a <strong>managed Spark + Python environment</strong> that scales horizontally. For 10k–10M articles, you can parallelize across a cluster using Spark UDFs or <code>pandas_udf</code>. It also integrates with Delta Lake for storing results, MLflow for experiment tracking, and Unity Catalog for data governance.
        <div class="vg6-pills"><span class="vg6-pill t">Horizontal scale</span><span class="vg6-pill t">Delta Lake storage</span><span class="vg6-pill">MLflow tracking</span></div>
      </div>
    </div>

    <div class="vg6-qa-item vg6-reveal">
      <div class="vg6-qa-q"><span class="vg6-q-badge">Improve</span><br>How would you scale this to 10 million articles?</div>
      <div class="vg6-qa-a"><strong>Three upgrades:</strong> (1) Wrap the chain call in a <strong>Spark pandas_udf</strong> so it runs in parallel across the cluster. (2) Replace <code>time.sleep()</code> with <strong>exponential backoff</strong> via <code>tenacity</code>. (3) Use <strong>LangChain&#8217;s async batch</strong> with <code>chain.abatch()</code> and asyncio for maximum concurrency per node.
        <div class="vg6-pills"><span class="vg6-pill t">Spark pandas_udf</span><span class="vg6-pill t">chain.abatch()</span><span class="vg6-pill a">tenacity backoff</span></div>
      </div>
    </div>

  </div>
</div>

<!-- FOOTER -->
<div class="vg6-footer">
  <p><strong>GenAI Mastery Series</strong> — vijay-gokarn.com · Vijay Gokarn</p>
  <div class="vg6-footer-links">
    <a href="https://github.com/vijaygokarn130" class="vg6-footer-btn ghost">GitHub ↗</a>
    <a href="https://vijay-gokarn.com" class="vg6-footer-btn primary">Back to Blog ↗</a>
  </div>
</div>

</div><!-- /vg6 -->

<script>
// Reveal-on-scroll: tag each .vg6-reveal element with .vg6-vis once it
// scrolls into view (8% visibility threshold), so the CSS transition fires.
(function () {
  var revealObserver = new IntersectionObserver(function (entries) {
    for (var i = 0; i < entries.length; i++) {
      if (entries[i].isIntersecting) {
        entries[i].target.classList.add('vg6-vis');
      }
    }
  }, { threshold: 0.08 });

  var targets = document.querySelectorAll('.vg6-reveal');
  for (var j = 0; j < targets.length; j++) {
    revealObserver.observe(targets[j]);
  }
})();
</script>
<p>The post <a href="https://vijay-gokarn.com/analyzing-wikipedia-articles-with-langchain-and-openai-in-databricks/">Analyzing Wikipedia Articles with Langchain and OpenAI in Databricks</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">140</post-id>	</item>
		<item>
		<title>Pandas Remove Duplicates</title>
		<link>https://vijay-gokarn.com/pandas-remove-duplicates/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=pandas-remove-duplicates</link>
		
		<dc:creator><![CDATA[Vijay Gokarn]]></dc:creator>
		<pubDate>Tue, 09 Jul 2024 11:12:55 +0000</pubDate>
				<category><![CDATA[ai-agents]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[generative-ai]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[data-analysis]]></category>
		<category><![CDATA[gen-ai]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">https://vijay-gokarn.com/?p=119</guid>

					<description><![CDATA[<p>Data Engineering · Python · Pandas · Data Cleaning Handling Duplicate Rows in Pandas — Identify, Remove &#038; Export Clean Data Librarypandas Methodsduplicated() · drop_duplicates() · reset_index() OutputCleaned CSV Stack Python pandas df.duplicated() drop_duplicates() reset_index() to_csv() Duplicate rows are one of the most common data quality issues — and one of the most damaging to [&#8230;]</p>
<p>The post <a href="https://vijay-gokarn.com/pandas-remove-duplicates/">Pandas Remove Duplicates</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Cormorant+Garamond:ital,wght@0,300;0,400;0,600;1,300;1,400&#038;family=DM+Sans:wght@300;400;500&#038;family=DM+Mono:wght@400&#038;display=swap" rel="stylesheet">

<style>
.vg8 {
  --ink: #0e0e0e; --paper: #f7f4ef; --paper-dark: #ede9e1;
  --teal: #0f6e56; --teal-light: #1d9e75; --teal-muted: #e1f5ee;
  --amber: #ba7517; --amber-light: #fac775; --amber-muted: #faeeda;
  --charcoal: #2c2c2a; --muted: #888780;
  --border: rgba(14,14,14,0.12); --border-strong: rgba(14,14,14,0.25);
  --code-bg: #161b22; --code-header: #2d333b; --code-border: rgba(255,255,255,0.06);
  font-family: 'DM Sans', sans-serif; font-weight: 300;
  color: var(--ink); background: var(--paper); line-height: 1.75; font-size: 16px; overflow-x: hidden;
}
.vg8 *, .vg8 *::before, .vg8 *::after { box-sizing: border-box; margin: 0; padding: 0; }

/* HERO */
.vg8-hero { background: #0d1117; padding: 5rem 4rem 4rem; position: relative; overflow: hidden; }
.vg8-hero::before {
  content: '⊕'; font-family: 'Cormorant Garamond', serif; font-size: 22rem;
  font-weight: 300; color: rgba(255,255,255,0.025); position: absolute;
  right: 1rem; bottom: -5rem; line-height: 1; pointer-events: none;
}
.vg8-hero-inner { position: relative; z-index: 1; max-width: 900px; }
.vg8-eyebrow { font-size: 0.68rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; margin-bottom: 1.25rem; display: flex; align-items: center; gap: 0.75rem; }
.vg8-eyebrow::before { content: ''; display: inline-block; width: 1.5rem; height: 1px; background: var(--teal-light); }
.vg8-hero h1 { font-family: 'Cormorant Garamond', serif; font-size: clamp(2.2rem, 5vw, 3.8rem); font-weight: 300; line-height: 1.1; color: var(--paper); letter-spacing: -0.02em; margin-bottom: 1.5rem; max-width: 28ch; }
.vg8-hero h1 em { font-style: italic; color: var(--amber-light); }
.vg8-meta-row { display: flex; gap: 2rem; flex-wrap: wrap; }
.vg8-meta { font-size: 0.7rem; letter-spacing: 0.1em; text-transform: uppercase; color: rgba(247,244,239,0.35); }
.vg8-meta span { color: rgba(247,244,239,0.7); margin-left: 0.4rem; }

/* STACK BAND */
.vg8-stack-band { background: var(--teal); padding: 1.1rem 4rem; display: flex; gap: 0.75rem; flex-wrap: wrap; align-items: center; }
.vg8-stack-label { font-size: 0.63rem; letter-spacing: 0.18em; text-transform: uppercase; color: rgba(255,255,255,0.6); font-weight: 400; margin-right: 0.4rem; }
.vg8-stack-pill { font-size: 0.7rem; letter-spacing: 0.05em; padding: 0.28rem 0.85rem; background: rgba(255,255,255,0.12); color: #fff; border: 0.5px solid rgba(255,255,255,0.2); }

/* INTRO */
.vg8-intro { background: var(--teal-muted); padding: 2.5rem 4rem; border-left: 4px solid var(--teal); }
.vg8-intro p { font-size: 1.05rem; line-height: 1.85; color: var(--charcoal); font-weight: 300; max-width: 80ch; }
.vg8-intro strong { color: var(--teal); font-weight: 500; }

/* BODY */
.vg8-body { max-width: 900px; margin: 0 auto; padding: 4rem; }
.vg8-step { margin-bottom: 3.5rem; }
.vg8-step-label { font-size: 0.63rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-step-label::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg8-step h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.4rem, 3vw, 2rem); font-weight: 300; line-height: 1.2; color: var(--ink); margin-bottom: 1rem; }
.vg8-step h2 em { font-style: italic; color: var(--teal); }
.vg8-step p { font-size: 0.93rem; line-height: 1.9; color: var(--charcoal); font-weight: 300; margin-bottom: 1rem; }
.vg8-step p strong { color: var(--ink); font-weight: 500; }
.vg8-divider { border: none; border-top: 0.5px solid var(--border); margin: 3rem 0; }
.vg8-ic { font-family: 'DM Mono', monospace; font-size: 0.82rem; background: rgba(14,14,14,0.07); padding: 0.1rem 0.4rem; color: var(--ink); }

/* CALLOUT */
.vg8-callout { background: var(--paper-dark); border-left: 3px solid var(--amber); padding: 1.25rem 1.5rem; margin: 1.25rem 0; font-size: 0.87rem; line-height: 1.8; color: var(--charcoal); }
.vg8-callout strong { color: var(--amber); font-weight: 500; }
.vg8-callout.teal { border-color: var(--teal); }
.vg8-callout.teal strong { color: var(--teal); }

/* STRATEGY CARDS */
.vg8-strategy-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1.25rem; margin: 1.5rem 0; }
.vg8-strategy-card { background: var(--paper); border: 0.5px solid var(--border-strong); padding: 1.5rem; position: relative; }
.vg8-strategy-card::before { content: ''; position: absolute; top: 0; left: 0; width: 100%; height: 4px; }
.vg8-strategy-card:nth-child(1)::before { background: var(--muted); }
.vg8-strategy-card:nth-child(2)::before { background: var(--amber); }
.vg8-strategy-card:nth-child(3)::before { background: var(--teal); }
.vg8-strategy-card .vg8-strat-tag { font-family: 'DM Mono', monospace; font-size: 0.65rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.5rem; display: block; }
.vg8-strategy-card:nth-child(2) .vg8-strat-tag { color: var(--amber); }
.vg8-strategy-card:nth-child(3) .vg8-strat-tag { color: var(--teal); }
.vg8-strategy-card h3 { font-family: 'Cormorant Garamond', serif; font-size: 1.15rem; font-weight: 400; color: var(--ink); margin-bottom: 0.4rem; }
.vg8-strategy-card p { font-size: 0.82rem; line-height: 1.7; color: var(--charcoal); font-weight: 300; }

/* PIPELINE */
.vg8-pipeline { display: flex; flex-direction: column; gap: 0; margin: 1.5rem 0; }
.vg8-pipeline-step { display: grid; grid-template-columns: 52px 1fr; gap: 1.5rem; padding: 1.25rem 0; border-top: 0.5px solid var(--border); align-items: start; }
.vg8-pipeline-step:last-child { border-bottom: 0.5px solid var(--border); }
.vg8-pipeline-num { width: 36px; height: 36px; background: var(--teal); display: flex; align-items: center; justify-content: center; font-family: 'Cormorant Garamond', serif; font-size: 1.1rem; font-weight: 300; color: var(--paper); flex-shrink: 0; }
.vg8-pipeline-body h4 { font-family: 'Cormorant Garamond', serif; font-size: 1.1rem; font-weight: 400; color: var(--ink); margin-bottom: 0.3rem; }
.vg8-pipeline-body p { font-size: 0.83rem; line-height: 1.7; color: var(--charcoal); font-weight: 300; }

/* CODE BLOCKS */
.vg8-code-wrap { margin: 1.25rem 0; border: 0.5px solid var(--code-border); overflow: hidden; }
.vg8-code-header { background: var(--code-header); padding: 0.6rem 1.25rem; display: flex; justify-content: space-between; align-items: center; border-bottom: 0.5px solid var(--code-border); }
.vg8-code-filename { font-family: 'DM Mono', monospace; font-size: 0.68rem; color: rgba(247,244,239,0.45); letter-spacing: 0.04em; }
.vg8-code-lang { font-size: 0.6rem; letter-spacing: 0.14em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; }
.vg8-code-body { background: var(--code-bg); padding: 1.5rem; overflow-x: auto; }
.vg8-code-body pre { margin: 0; }
.vg8-code-body code { font-family: 'DM Mono', monospace; font-size: 0.82rem; line-height: 1.85; color: #e6edf3; white-space: pre; display: block; }
/* tokens */
.t8-k { color: #ff7b72; }
.t8-s { color: #a5d6ff; }
.t8-c { color: #8b949e; font-style: italic; }
.t8-f { color: #d2a8ff; }
.t8-n { color: #79c0ff; }
.t8-v { color: #ffa657; }
.t8-b { color: var(--amber-light); }

/* FULL SCRIPT SECTION */
.vg8-full-section { background: var(--paper-dark); padding: 4rem; }
.vg8-full-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-full-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg8-full-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.4rem); font-weight: 300; color: var(--ink); margin-bottom: 0.75rem; }
.vg8-full-section > h2 em { font-style: italic; color: var(--teal); }
.vg8-full-section > p { font-size: 0.9rem; color: var(--charcoal); font-weight: 300; line-height: 1.8; margin-bottom: 2rem; max-width: 70ch; }

/* INTERVIEW */
.vg8-interview-section { background: var(--ink); padding: 4rem; }
.vg8-interview-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--amber-light); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-interview-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--amber-light); }
.vg8-interview-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.4rem); font-weight: 300; color: var(--paper); margin-bottom: 2.5rem; }
.vg8-interview-section > h2 em { font-style: italic; color: var(--amber-light); }
.vg8-qa-list { display: flex; flex-direction: column; }
.vg8-qa-item { display: grid; grid-template-columns: 1fr 1.4fr; gap: 2rem; padding: 1.5rem 0; border-top: 0.5px solid rgba(247,244,239,0.1); align-items: start; }
.vg8-qa-item:last-child { border-bottom: 0.5px solid rgba(247,244,239,0.1); }
.vg8-qa-q { font-family: 'Cormorant Garamond', serif; font-size: 1.05rem; font-weight: 400; color: var(--paper); line-height: 1.4; }
.vg8-q-badge { font-family: 'DM Mono', monospace; font-size: 0.58rem; letter-spacing: 0.1em; text-transform: uppercase; background: var(--teal); color: var(--paper); padding: 0.15rem 0.5rem; margin-bottom: 0.5rem; display: inline-block; }
.vg8-qa-a { font-size: 0.83rem; line-height: 1.8; color: rgba(247,244,239,0.65); font-weight: 300; }
.vg8-qa-a strong { color: var(--amber-light); font-weight: 400; }
.vg8-qa-a code { font-family: 'DM Mono', monospace; font-size: 0.77rem; background: rgba(247,244,239,0.08); padding: 0.1rem 0.35rem; color: var(--paper); }
.vg8-pills { display: flex; flex-wrap: wrap; gap: 0.5rem; margin-top: 0.75rem; }
.vg8-pill { font-size: 0.67rem; letter-spacing: 0.06em; padding: 0.25rem 0.75rem; border: 0.5px solid rgba(247,244,239,0.15); color: rgba(247,244,239,0.5); }
.vg8-pill.t { border-color: var(--teal-light); color: var(--teal-light); }
.vg8-pill.a { border-color: var(--amber-light); color: var(--amber-light); }

/* FOOTER */
.vg8-footer { background: #0d1117; padding: 3rem 4rem; display: flex; justify-content: space-between; align-items: center; flex-wrap: wrap; gap: 1.5rem; border-top: 0.5px solid rgba(247,244,239,0.06); }
.vg8-footer p { font-size: 0.82rem; color: rgba(247,244,239,0.35); font-weight: 300; }
.vg8-footer p strong { color: rgba(247,244,239,0.65); font-weight: 400; }
.vg8-footer-links { display: flex; gap: 1rem; }
.vg8-btn { display: inline-block; padding: 0.65rem 1.75rem; font-size: 0.7rem; letter-spacing: 0.12em; text-transform: uppercase; text-decoration: none; font-weight: 400; }
.vg8-btn.primary { background: var(--teal); color: var(--paper); }
.vg8-btn.ghost { background: transparent; color: rgba(247,244,239,0.55); border: 0.5px solid rgba(247,244,239,0.2); }

/* REVEAL */
.vg8-reveal { opacity: 0; transform: translateY(20px); transition: opacity 0.55s ease, transform 0.55s ease; }
.vg8-reveal.vg8-vis { opacity: 1; transform: translateY(0); }
.vg8-d1 { transition-delay: 0.1s; } .vg8-d2 { transition-delay: 0.2s; } .vg8-d3 { transition-delay: 0.3s; }
</style>

<div class="vg8">

<!-- HERO -->
<div class="vg8-hero">
  <div class="vg8-hero-inner">
    <p class="vg8-eyebrow">Data Engineering · Python · Pandas · Data Cleaning</p>
    <h1>Handling Duplicate Rows in Pandas — <em>Identify, Remove &#038; Export Clean Data</em></h1>
    <div class="vg8-meta-row">
      <p class="vg8-meta">Library<span>pandas</span></p>
      <p class="vg8-meta">Methods<span>duplicated() · drop_duplicates() · reset_index()</span></p>
      <p class="vg8-meta">Output<span>Cleaned CSV</span></p>
    </div>
  </div>
</div>

<!-- STACK BAND -->
<div class="vg8-stack-band">
  <span class="vg8-stack-label">Stack</span>
  <span class="vg8-stack-pill">Python</span>
  <span class="vg8-stack-pill">pandas</span>
  <span class="vg8-stack-pill">df.duplicated()</span>
  <span class="vg8-stack-pill">drop_duplicates()</span>
  <span class="vg8-stack-pill">reset_index()</span>
  <span class="vg8-stack-pill">to_csv()</span>
</div>

<!-- INTRO -->
<div class="vg8-intro">
  <p>Duplicate rows are one of the most common data quality issues — and one of the most damaging to model accuracy and analysis reliability. <strong>Pandas</strong> gives you precise tools to detect, inspect, and remove duplicates with a single line of code. This guide walks through the full pipeline: load, detect, choose a strategy, clean, and export.</p>
</div>

<!-- BODY -->
<div class="vg8-body">

  <!-- WHY IT MATTERS -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Context</p>
    <h2>Why duplicates <em>matter</em></h2>
    <p>Duplicate rows skew aggregations, inflate record counts, bias ML model training, and produce misleading visualizations. A sales total that counts the same transaction twice, a classifier trained on repeated samples — both produce results that look correct but aren&#8217;t. <strong>Clean data is the foundation everything else is built on.</strong></p>
    <div class="vg8-strategy-grid">
      <div class="vg8-strategy-card vg8-reveal vg8-d1">
        <span class="vg8-strat-tag">keep=&#8217;first&#8217;</span>
        <h3>Keep First</h3>
        <p>Drop all duplicates <em>except</em> the first occurrence. The original record is preserved. Most common default choice.</p>
      </div>
      <div class="vg8-strategy-card vg8-reveal vg8-d2">
        <span class="vg8-strat-tag">keep=&#8217;last&#8217;</span>
        <h3>Keep Last</h3>
        <p>Drop all duplicates <em>except</em> the last occurrence. Useful when later records represent updated values.</p>
      </div>
      <div class="vg8-strategy-card vg8-reveal vg8-d3">
        <span class="vg8-strat-tag">keep=False</span>
        <h3>Drop All</h3>
        <p>Remove every instance of a duplicated row — including the first. Use when any duplicated record is invalid.</p>
      </div>
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- PIPELINE OVERVIEW -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Pipeline</p>
    <h2>The four-step <em>deduplication pipeline</em></h2>
    <div class="vg8-pipeline">
      <div class="vg8-pipeline-step vg8-reveal">
        <div class="vg8-pipeline-num">1</div>
        <div class="vg8-pipeline-body"><h4>Load</h4><p>Read the raw CSV into a DataFrame with <code class="vg8-ic">pd.read_csv()</code>.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d1">
        <div class="vg8-pipeline-num">2</div>
        <div class="vg8-pipeline-body"><h4>Detect</h4><p>Use <code class="vg8-ic">df.duplicated()</code> to identify and inspect all duplicate rows before touching the data.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d2">
        <div class="vg8-pipeline-num">3</div>
        <div class="vg8-pipeline-body"><h4>Remove</h4><p>Call <code class="vg8-ic">drop_duplicates(keep=...)</code> with your chosen strategy. Reset the index for a clean sequential result.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d3">
        <div class="vg8-pipeline-num">4</div>
        <div class="vg8-pipeline-body"><h4>Export</h4><p>Write the cleaned DataFrame back to CSV with <code class="vg8-ic">to_csv()</code> for downstream use.</p></div>
      </div>
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 1 — LOAD -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 1</p>
    <h2>Load <em>your dataset</em></h2>
    <p>Start by reading your data into a pandas DataFrame. <code class="vg8-ic">pd.read_csv()</code> is the standard entry point for flat files. From here, all deduplication operations work on the in-memory DataFrame — your source file is never modified.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">load_data.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-k">import</span> pandas <span class="t8-k">as</span> pd

<span class="t8-c"># Read the raw dataset into a DataFrame</span>
df = pd.<span class="t8-f">read_csv</span>(<span class="t8-s">'your_data_file.csv'</span>)

<span class="t8-c"># Quick shape check before cleaning</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Rows: {df.shape[0]:,}  |  Columns: {df.shape[1]}"</span>)</code></pre></div>
    </div>
    <div class="vg8-callout teal">
      <strong>Other sources:</strong> The same deduplication logic applies regardless of how you load your data. Use <code class="vg8-ic">pd.read_excel()</code> for XLSX, <code class="vg8-ic">pd.read_parquet()</code> for Parquet, or query a database with <code class="vg8-ic">pd.read_sql()</code> — all return a DataFrame you can clean the same way.
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 2 — DETECT -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 2</p>
    <h2>Detect <em>&#038; inspect duplicates</em></h2>
    <p><code class="vg8-ic">df.duplicated()</code> returns a boolean Series — <code class="vg8-ic">True</code> for every row that is a duplicate of an earlier row. Always <strong>inspect before you remove</strong> — understanding what the duplicates look like helps you choose the right strategy.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">detect_duplicates.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># Boolean mask: True for every row that is a duplicate</span>
duplicate_mask = df.<span class="t8-f">duplicated</span>()

<span class="t8-c"># How many duplicates exist?</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Duplicate rows found: {duplicate_mask.sum():,}"</span>)

<span class="t8-c"># Inspect the duplicate rows themselves</span>
duplicates = df[df.<span class="t8-f">duplicated</span>()]
<span class="t8-f">print</span>(duplicates)

<span class="t8-c"># See ALL occurrences of duplicated rows (including originals)</span>
all_dupes = df[df.<span class="t8-f">duplicated</span>(keep=<span class="t8-b">False</span>)]
<span class="t8-f">print</span>(all_dupes.<span class="t8-f">sort_values</span>(by=df.columns.<span class="t8-f">tolist</span>()))</code></pre></div>
    </div>
    <div class="vg8-callout">
      <strong>Subset duplicates:</strong> By default <code class="vg8-ic">duplicated()</code> checks all columns. To flag rows that are duplicates only on specific columns (e.g. same customer_id): <code class="vg8-ic">df.duplicated(subset=['customer_id'])</code>. This is useful for finding logical duplicates even when other columns differ.
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 3 — REMOVE -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 3</p>
    <h2>Remove duplicates — <em>three strategies</em></h2>
    <p><code class="vg8-ic">drop_duplicates()</code> returns a new DataFrame by default — the original is untouched. The <code class="vg8-ic">keep</code> parameter controls which occurrence survives. After removing, <code class="vg8-ic">reset_index(drop=True)</code> gives you a clean sequential index starting from 0.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">remove_duplicates.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># ── Strategy 1: keep the FIRST occurrence (default) ──</span>
df_keep_first = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'first'</span>)

<span class="t8-c"># ── Strategy 2: keep the LAST occurrence ──</span>
<span class="t8-c">#    useful when later rows represent updated/corrected records</span>
df_keep_last = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'last'</span>)

<span class="t8-c"># ── Strategy 3: drop ALL occurrences of any duplicated row ──</span>
<span class="t8-c">#    use when any repeated row is invalid data</span>
df_drop_all = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-b">False</span>)

<span class="t8-c"># ── Subset: deduplicate only on specific columns ──</span>
df_subset = df.<span class="t8-f">drop_duplicates</span>(subset=[<span class="t8-s">'customer_id'</span>, <span class="t8-s">'order_date'</span>], keep=<span class="t8-s">'first'</span>)

<span class="t8-c"># ── Reset the index after removal (clean 0-based index) ──</span>
df_cleaned = df_keep_first.<span class="t8-f">reset_index</span>(drop=<span class="t8-b">True</span>, inplace=<span class="t8-b">False</span>)

<span class="t8-c"># Confirm rows removed</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Before: {len(df):,}  |  After: {len(df_cleaned):,}  |  Removed: {len(df) - len(df_cleaned):,}"</span>)</code></pre></div>
    </div>
    <div class="vg8-callout teal">
      <strong>inplace vs assignment:</strong> <code class="vg8-ic">drop_duplicates(inplace=True)</code> modifies the DataFrame in place and returns <code class="vg8-ic">None</code>. Prefer the assignment pattern (<code class="vg8-ic">df_cleaned = df.drop_duplicates()</code>) — it preserves the original for comparison and makes your code easier to debug.
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 4 — EXPORT -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 4</p>
    <h2>Export <em>the clean data</em></h2>
    <p>Write the deduplicated DataFrame back to a CSV. Setting <code class="vg8-ic">index=False</code> prevents pandas from writing the row index as an extra column — your downstream consumers will thank you.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">export.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># Export to CSV — index=False keeps the file clean</span>
df_cleaned.<span class="t8-f">to_csv</span>(<span class="t8-s">'cleaned_data.csv'</span>, index=<span class="t8-b">False</span>)

<span class="t8-f">print</span>(<span class="t8-s">"Cleaned data exported to cleaned_data.csv"</span>)

<span class="t8-c"># Optional: also export to Parquet for better performance at scale</span>
df_cleaned.<span class="t8-f">to_parquet</span>(<span class="t8-s">'cleaned_data.parquet'</span>, index=<span class="t8-b">False</span>)</code></pre></div>
    </div>
  </div>

</div><!-- /vg8-body -->

<!-- FULL SCRIPT -->
<div class="vg8-full-section">
  <p class="vg8-full-eyebrow">Complete Reference</p>
  <h2>Full deduplication <em>script</em></h2>
  <p>Everything in one place — load, detect, remove (keep last), reset index, and export.</p>
  <div class="vg8-code-wrap vg8-reveal">
    <div class="vg8-code-header"><span class="vg8-code-filename">deduplicate.py — full script</span><span class="vg8-code-lang">Python</span></div>
    <div class="vg8-code-body"><pre><code><span class="t8-k">import</span> pandas <span class="t8-k">as</span> pd

<span class="t8-c"># ── 1. Load ─────────────────────────────────────────────</span>
df = pd.<span class="t8-f">read_csv</span>(<span class="t8-s">'your_data_file.csv'</span>)
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Loaded {len(df):,} rows"</span>)

<span class="t8-c"># ── 2. Detect ────────────────────────────────────────────</span>
duplicates = df[df.<span class="t8-f">duplicated</span>()]
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Duplicate rows found: {len(duplicates):,}"</span>)
<span class="t8-f">print</span>(duplicates)

<span class="t8-c"># ── 3a. Keep last occurrence of each duplicate row ───────</span>
df_cleaned = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'last'</span>)

<span class="t8-c"># ── 3b. Keep first occurrence (swap in if preferred) ─────</span>
<span class="t8-c"># df_cleaned = df.drop_duplicates(keep='first')</span>

<span class="t8-c"># ── 3c. Reset the index to a clean 0-based sequence ──────</span>
df_cleaned.<span class="t8-f">reset_index</span>(drop=<span class="t8-b">True</span>, inplace=<span class="t8-b">True</span>)

<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Rows after cleaning: {len(df_cleaned):,}"</span>)

<span class="t8-c"># ── 4. Export ─────────────────────────────────────────────</span>
df_cleaned.<span class="t8-f">to_csv</span>(<span class="t8-s">'cleaned_data.csv'</span>, index=<span class="t8-b">False</span>)
<span class="t8-f">print</span>(<span class="t8-s">"Exported to cleaned_data.csv"</span>)</code></pre></div>
  </div>
</div>

<!-- INTERVIEW CHEAT SHEET -->
<div class="vg8-interview-section">
  <p class="vg8-interview-eyebrow">Interview Prep</p>
  <h2>Cheat sheet — <em>quick definitions to remember</em></h2>
  <div class="vg8-qa-list">

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Define</span><br>What does <code>df.duplicated()</code> return?</div>
      <div class="vg8-qa-a">A <strong>boolean Series</strong> the same length as the DataFrame — <code>True</code> for every row that is a duplicate of a previously seen row, <code>False</code> otherwise. The first occurrence is marked <code>False</code> by default.
        <div class="vg8-pills"><span class="vg8-pill t">Boolean Series</span><span class="vg8-pill">True = duplicate</span><span class="vg8-pill a">First = False by default</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Compare</span><br>keep=&#8217;first&#8217; vs keep=&#8217;last&#8217; vs keep=False</div>
      <div class="vg8-qa-a"><strong>first</strong> — keeps the first occurrence, drops all subsequent duplicates. <strong>last</strong> — keeps the final occurrence, useful for updated records. <strong>False</strong> — drops every occurrence of any duplicated row, leaving only rows that were unique to begin with.
        <div class="vg8-pills"><span class="vg8-pill t">first = keep original</span><span class="vg8-pill a">last = keep latest</span><span class="vg8-pill">False = drop all copies</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Explain</span><br>What does the <code>subset</code> parameter do?</div>
      <div class="vg8-qa-a">By default, <code>duplicated()</code> and <code>drop_duplicates()</code> compare <strong>all columns</strong>. The <code>subset</code> parameter restricts the comparison to specific columns — for example <code>subset=['customer_id']</code> finds rows with the same customer ID even if other columns differ.
        <div class="vg8-pills"><span class="vg8-pill t">Default = all columns</span><span class="vg8-pill">subset = logical dedup</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Gotcha</span><br>Why call <code>reset_index(drop=True)</code> after deduplication?</div>
      <div class="vg8-qa-a">After dropping rows, the DataFrame retains the <strong>original row indices</strong> — you&#8217;d have gaps like 0, 1, 4, 7 instead of 0, 1, 2, 3. <code>reset_index(drop=True)</code> renumbers from 0 continuously. <code>drop=True</code> prevents the old index from being added as a column.
        <div class="vg8-pills"><span class="vg8-pill a">Index gaps after drop</span><span class="vg8-pill t">reset_index fixes gaps</span><span class="vg8-pill">drop=True prevents extra col</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Gotcha</span><br>inplace=True vs reassignment — which is preferred?</div>
      <div class="vg8-qa-a">Prefer <strong>reassignment</strong> (<code>df_cleaned = df.drop_duplicates()</code>) — it preserves the original DataFrame for comparison and makes pipelines easier to debug. <code>inplace=True</code> modifies the object and returns <code>None</code>, which can cause confusion when chaining operations. Many pandas best-practice guides now recommend avoiding inplace.
        <div class="vg8-pills"><span class="vg8-pill t">Reassignment = safer</span><span class="vg8-pill a">inplace returns None</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Best Practice</span><br>How do you handle duplicates in a production data pipeline?</div>
      <div class="vg8-qa-a"><strong>Three layers:</strong> (1) <strong>Detect and log</strong> before removing — store duplicate counts as data quality metrics. (2) <strong>Deduplicate at ingestion</strong>, not at query time — clean once, use many times. (3) Add a <strong>unique constraint</strong> in your database or Delta Lake table to prevent duplicates from re-entering at source.
        <div class="vg8-pills"><span class="vg8-pill t">Log before removing</span><span class="vg8-pill t">Clean at ingestion</span><span class="vg8-pill a">DB unique constraints</span></div>
      </div>
    </div>

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Use Case</span><br>When should you NOT remove duplicates?</div>
      <div class="vg8-qa-a">When the repeated rows represent <strong>legitimate repeated events</strong> — a customer placing the same order twice on different days, a sensor reading the same value consecutively, or audit log entries. Always validate with domain knowledge before dropping. Use <code>subset</code> to deduplicate on business keys, not entire rows.
        <div class="vg8-pills"><span class="vg8-pill a">Repeated events = valid</span><span class="vg8-pill t">Use subset= for business keys</span></div>
      </div>
    </div>

  </div>
</div>

<!-- FOOTER -->
<div class="vg8-footer">
  <p><strong>GenAI Mastery Series</strong> — vijay-gokarn.com · Vijay Gokarn</p>
  <div class="vg8-footer-links">
    <a href="https://github.com/vijaygokarn130" class="vg8-btn ghost">GitHub ↗</a>
    <a href="https://vijay-gokarn.com" class="vg8-btn primary">Back to Blog ↗</a>
  </div>
</div>

</div><!-- /vg8 -->

<script>
(function(){
  // Reveal-on-scroll: add the 'vg8-vis' class to each '.vg8-reveal'
  // element once roughly 8% of it enters the viewport.
  var revealObserver = new IntersectionObserver(function(entries){
    entries.forEach(function(entry){
      if(entry.isIntersecting){
        entry.target.classList.add('vg8-vis');
      }
    });
  }, {threshold: 0.08});
  var targets = document.querySelectorAll('.vg8-reveal');
  targets.forEach(function(target){ revealObserver.observe(target); });
})();
</script>
<p>The post <a href="https://vijay-gokarn.com/pandas-remove-duplicates/">Pandas Remove Duplicates</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">119</post-id>	</item>
	</channel>
</rss>
