<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>data-analysis Archives - Vijay Gokarn</title>
	<atom:link href="https://vijay-gokarn.com/tag/data-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>https://vijay-gokarn.com/tag/data-analysis/</link>
	<description>&#34;Ignite Curiosity. Fuel the Future.&#34;</description>
	<lastBuildDate>Sun, 19 Apr 2026 03:33:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/vijay-gokarn.com/wp-content/uploads/2023/09/cropped-ideogram.jpeg?fit=32%2C32&#038;ssl=1</url>
	<title>data-analysis Archives - Vijay Gokarn</title>
	<link>https://vijay-gokarn.com/tag/data-analysis/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">230943525</site>	<item>
		<title>From Amazon Reviews to Numbers: A Hands-On Tour of One-Hot, Bag of Words, and TF-IDF</title>
		<link>https://vijay-gokarn.com/from-amazon-reviews-to-numbers-a-hands-on-tour-of-one-hot-bag-of-words-and-tf-idf/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=from-amazon-reviews-to-numbers-a-hands-on-tour-of-one-hot-bag-of-words-and-tf-idf</link>
		
		<dc:creator><![CDATA[Vijay Gokarn]]></dc:creator>
		<pubDate>Sat, 11 Apr 2026 15:02:33 +0000</pubDate>
				<category><![CDATA[generative-ai]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[data-analysis]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">https://vijay-gokarn.com/?p=263</guid>

					<description><![CDATA[<p>NLP · Machine Learning · Text Feature Engineering · From Amazon Reviews to Numbers: A Hands-On Tour of One-Hot, Bag of Words, and TF-IDF · Corpus: 128 real reviews · Techniques: OHE · BoW · TF-IDF · Stack: Python · sklearn · BeautifulSoup · Source: GitHub ↗ · How I took 128 real Amazon product reviews and turned them into features a machine-learning model can [&#8230;]</p>
<p>The post <a href="https://vijay-gokarn.com/from-amazon-reviews-to-numbers-a-hands-on-tour-of-one-hot-bag-of-words-and-tf-idf/">From Amazon Reviews to Numbers: A Hands-On Tour of One-Hot, Bag of Words, and TF-IDF</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Cormorant+Garamond:ital,wght@0,300;0,400;0,600;1,300;1,400&#038;family=DM+Sans:wght@300;400;500&#038;family=DM+Mono:wght@400&#038;display=swap" rel="stylesheet">

<style>
  .vg-blog-wrap {
    --ink: #0e0e0e;
    --paper: #f7f4ef;
    --paper-dark: #ede9e1;
    --teal: #0f6e56;
    --teal-light: #1d9e75;
    --teal-muted: #e1f5ee;
    --amber: #ba7517;
    --amber-light: #fac775;
    --amber-muted: #faeeda;
    --charcoal: #2c2c2a;
    --muted: #888780;
    --border: rgba(14,14,14,0.12);
    --border-strong: rgba(14,14,14,0.25);
    font-family: 'DM Sans', sans-serif;
    font-weight: 300;
    color: var(--ink);
    background: var(--paper);
    line-height: 1.75;
    font-size: 16px;
    overflow-x: hidden;
  }
  .vg-blog-wrap *, .vg-blog-wrap *::before, .vg-blog-wrap *::after {
    box-sizing: border-box; margin: 0; padding: 0;
  }

  /* ── HERO ── */
  .vg-post-hero {
    background: var(--ink);
    padding: 5rem 4rem 4rem;
    position: relative;
    overflow: hidden;
  }
  .vg-post-hero::after {
    content: '';
    position: absolute;
    bottom: 0; right: 0;
    width: 40%;
    height: 100%;
    background: rgba(15,110,86,0.12);
    clip-path: polygon(20% 0%, 100% 0%, 100% 100%, 0% 100%);
  }
  .vg-post-hero-inner { position: relative; z-index: 1; max-width: 860px; }
  .vg-post-eyebrow {
    font-size: 0.7rem;
    letter-spacing: 0.22em;
    text-transform: uppercase;
    color: var(--teal-light);
    font-weight: 500;
    margin-bottom: 1.25rem;
    display: flex;
    align-items: center;
    gap: 0.75rem;
  }
  .vg-post-eyebrow::before {
    content: '';
    display: inline-block;
    width: 1.5rem; height: 1px;
    background: var(--teal-light);
  }
  .vg-post-title {
    font-family: 'Cormorant Garamond', serif;
    font-size: clamp(2.2rem, 5vw, 3.8rem);
    font-weight: 300;
    line-height: 1.1;
    color: var(--paper);
    letter-spacing: -0.02em;
    margin-bottom: 1.5rem;
    max-width: 22ch;
  }
  .vg-post-title em { font-style: italic; color: var(--amber-light); }
  .vg-post-meta {
    display: flex;
    gap: 2rem;
    flex-wrap: wrap;
  }
  .vg-meta-item {
    font-size: 0.72rem;
    letter-spacing: 0.1em;
    text-transform: uppercase;
    color: rgba(247,244,239,0.4);
  }
  .vg-meta-item span { color: rgba(247,244,239,0.75); margin-left: 0.4rem; }

  /* ── INTRO BAND ── */
  .vg-intro-band {
    background: var(--teal-muted);
    padding: 2.5rem 4rem;
    border-left: 4px solid var(--teal);
  }
  .vg-intro-band p {
    font-size: 1.05rem;
    line-height: 1.85;
    color: var(--charcoal);
    font-weight: 300;
    max-width: 80ch;
  }
  .vg-intro-band strong { color: var(--teal); font-weight: 500; }

  /* ── BODY LAYOUT ── */
  .vg-post-body {
    max-width: 860px;
    margin: 0 auto;
    padding: 4rem 4rem;
  }

  /* ── SECTION HEADERS ── */
  .vg-step {
    margin-bottom: 3.5rem;
  }
  .vg-step-label {
    font-size: 0.65rem;
    letter-spacing: 0.22em;
    text-transform: uppercase;
    color: var(--teal);
    font-weight: 500;
    margin-bottom: 0.5rem;
    display: flex;
    align-items: center;
    gap: 0.6rem;
  }
  .vg-step-label::before {
    content: '';
    display: inline-block;
    width: 1.25rem; height: 1px;
    background: var(--teal);
  }
  .vg-step h2 {
    font-family: 'Cormorant Garamond', serif;
    font-size: clamp(1.5rem, 3vw, 2.1rem);
    font-weight: 300;
    line-height: 1.2;
    color: var(--ink);
    margin-bottom: 1.25rem;
  }
  .vg-step h2 em { font-style: italic; color: var(--teal); }
  .vg-step p {
    font-size: 0.94rem;
    line-height: 1.9;
    color: var(--charcoal);
    font-weight: 300;
    margin-bottom: 1rem;
  }
  .vg-step p strong { color: var(--ink); font-weight: 500; }

  /* ── CALLOUT / TIP BOXES ── */
  .vg-callout {
    background: var(--paper-dark);
    border-left: 3px solid var(--amber);
    padding: 1.25rem 1.5rem;
    margin: 1.5rem 0;
    font-size: 0.88rem;
    line-height: 1.8;
    color: var(--charcoal);
  }
  .vg-callout strong { color: var(--amber); font-weight: 500; }
  .vg-callout code {
    font-family: 'DM Mono', monospace;
    font-size: 0.82rem;
    background: rgba(14,14,14,0.06);
    padding: 0.1rem 0.4rem;
    color: var(--ink);
  }

  /* ── TECHNIQUE CARDS ── */
  .vg-technique-grid {
    display: grid;
    grid-template-columns: repeat(3, 1fr);
    gap: 1.25rem;
    margin: 2rem 0;
  }
  .vg-technique-card {
    background: var(--paper);
    border: 0.5px solid var(--border-strong);
    padding: 1.5rem;
    position: relative;
  }
  .vg-technique-card::before {
    content: '';
    position: absolute;
    top: 0; left: 0;
    width: 100%; height: 3px;
  }
  .vg-technique-card.ohe::before { background: var(--muted); }
  .vg-technique-card.bow::before { background: var(--amber); }
  .vg-technique-card.tfidf::before { background: var(--teal); }
  .vg-technique-card h3 {
    font-family: 'Cormorant Garamond', serif;
    font-size: 1.2rem;
    font-weight: 400;
    color: var(--ink);
    margin-bottom: 0.4rem;
  }
  .vg-technique-card .vg-abbr {
    font-family: 'DM Mono', monospace;
    font-size: 0.68rem;
    color: var(--muted);
    letter-spacing: 0.1em;
    margin-bottom: 0.75rem;
    display: block;
  }
  .vg-technique-card p {
    font-size: 0.82rem;
    line-height: 1.7;
    color: var(--charcoal);
    margin-bottom: 0.75rem !important;
  }
  .vg-technique-card .vg-weakness {
    font-size: 0.75rem;
    color: var(--muted);
    border-top: 0.5px solid var(--border);
    padding-top: 0.6rem;
    margin-top: 0.5rem;
    font-style: italic;
  }

  /* ── FORMULA BLOCK ── */
  .vg-formula {
    background: var(--ink);
    padding: 1.5rem 2rem;
    margin: 1.5rem 0;
    font-family: 'DM Mono', monospace;
    font-size: 0.9rem;
    color: var(--amber-light);
    letter-spacing: 0.04em;
    overflow-x: auto;
    white-space: nowrap;
  }
  .vg-formula .vg-formula-label {
    font-family: 'DM Sans', sans-serif;
    font-size: 0.65rem;
    letter-spacing: 0.18em;
    text-transform: uppercase;
    color: rgba(247,244,239,0.3);
    margin-bottom: 0.5rem;
    display: block;
    white-space: normal;
  }

  /* ── STAT ROW ── */
  .vg-stat-row {
    display: grid;
    grid-template-columns: repeat(3, 1fr);
    gap: 1rem;
    margin: 2rem 0;
  }
  .vg-stat-box {
    background: var(--paper-dark);
    border: 0.5px solid var(--border);
    padding: 1.25rem;
    text-align: center;
  }
  .vg-stat-box .vg-stat-n {
    font-family: 'Cormorant Garamond', serif;
    font-size: 2.2rem;
    font-weight: 300;
    line-height: 1;
    color: var(--teal);
    letter-spacing: -0.02em;
  }
  .vg-stat-box .vg-stat-l {
    font-size: 0.68rem;
    letter-spacing: 0.12em;
    text-transform: uppercase;
    color: var(--muted);
    margin-top: 0.35rem;
  }

  /* ── COMPARISON TABLE ── */
  .vg-table-wrap { overflow-x: auto; margin: 1.5rem 0; }
  .vg-table {
    width: 100%;
    border-collapse: collapse;
    font-size: 0.83rem;
  }
  .vg-table th {
    background: var(--ink);
    color: var(--paper);
    font-family: 'DM Sans', sans-serif;
    font-weight: 400;
    font-size: 0.68rem;
    letter-spacing: 0.14em;
    text-transform: uppercase;
    padding: 0.75rem 1rem;
    text-align: left;
  }
  .vg-table td {
    padding: 0.7rem 1rem;
    border-bottom: 0.5px solid var(--border);
    color: var(--charcoal);
    vertical-align: top;
    line-height: 1.55;
  }
  .vg-table tr:nth-child(even) td { background: var(--paper-dark); }
  .vg-table .vg-chip {
    display: inline-block;
    font-size: 0.65rem;
    letter-spacing: 0.08em;
    padding: 0.2rem 0.55rem;
    border-radius: 2px;
    font-weight: 400;
  }
  .vg-chip-green { background: var(--teal-muted); color: var(--teal); }
  .vg-chip-amber { background: var(--amber-muted); color: var(--amber); }
  .vg-chip-gray  { background: var(--paper-dark); color: var(--muted); border: 0.5px solid var(--border); }

  /* ── DIVIDER ── */
  .vg-divider {
    border: none;
    border-top: 0.5px solid var(--border);
    margin: 3rem 0;
  }

  /* ── KEY TAKEAWAYS ── */
  .vg-takeaways-section {
    background: var(--ink);
    padding: 4rem;
  }
  .vg-takeaways-section .vg-section-eyebrow {
    font-size: 0.68rem;
    letter-spacing: 0.22em;
    text-transform: uppercase;
    color: var(--amber-light);
    font-weight: 500;
    margin-bottom: 0.5rem;
    display: flex;
    align-items: center;
    gap: 0.6rem;
  }
  .vg-takeaways-section .vg-section-eyebrow::before {
    content: '';
    display: inline-block;
    width: 1.25rem; height: 1px;
    background: var(--amber-light);
  }
  .vg-takeaways-section h2 {
    font-family: 'Cormorant Garamond', serif;
    font-size: clamp(1.6rem, 3vw, 2.4rem);
    font-weight: 300;
    color: var(--paper);
    margin-bottom: 2.5rem;
  }
  .vg-takeaways-section h2 em { font-style: italic; color: var(--amber-light); }
  .vg-takeaways-grid {
    display: grid;
    grid-template-columns: 1fr 1fr;
    gap: 1.25rem;
  }
  .vg-takeaway-card {
    border: 0.5px solid rgba(247,244,239,0.12);
    padding: 1.5rem;
    position: relative;
  }
  .vg-takeaway-card::before {
    content: attr(data-num);
    font-family: 'Cormorant Garamond', serif;
    font-size: 3rem;
    font-weight: 300;
    color: rgba(250,199,117,0.15);
    position: absolute;
    top: 0.5rem; right: 1rem;
    line-height: 1;
  }
  .vg-takeaway-card h4 {
    font-family: 'Cormorant Garamond', serif;
    font-size: 1.1rem;
    font-weight: 400;
    color: var(--amber-light);
    margin-bottom: 0.5rem;
  }
  .vg-takeaway-card p {
    font-size: 0.82rem;
    line-height: 1.75;
    color: rgba(247,244,239,0.65);
    font-weight: 300;
  }

  /* ── INTERVIEW CHEAT SHEET ── */
  .vg-interview-section {
    background: var(--teal-muted);
    padding: 4rem;
  }
  .vg-interview-section .vg-section-eyebrow {
    font-size: 0.68rem;
    letter-spacing: 0.22em;
    text-transform: uppercase;
    color: var(--teal);
    font-weight: 500;
    margin-bottom: 0.5rem;
    display: flex;
    align-items: center;
    gap: 0.6rem;
  }
  .vg-interview-section .vg-section-eyebrow::before {
    content: '';
    display: inline-block;
    width: 1.25rem; height: 1px;
    background: var(--teal);
  }
  .vg-interview-section h2 {
    font-family: 'Cormorant Garamond', serif;
    font-size: clamp(1.6rem, 3vw, 2.4rem);
    font-weight: 300;
    color: var(--ink);
    margin-bottom: 2.5rem;
  }
  .vg-interview-section h2 em { font-style: italic; color: var(--teal); }
  .vg-qa-list { display: flex; flex-direction: column; gap: 0; }
  .vg-qa-item {
    border-top: 0.5px solid rgba(14,14,14,0.12);
    padding: 1.5rem 0;
    display: grid;
    grid-template-columns: 1fr 1.4fr;
    gap: 2rem;
    align-items: start;
  }
  .vg-qa-item:last-child { border-bottom: 0.5px solid rgba(14,14,14,0.12); }
  .vg-qa-q {
    font-family: 'Cormorant Garamond', serif;
    font-size: 1.05rem;
    font-weight: 400;
    color: var(--ink);
    line-height: 1.4;
  }
  .vg-qa-q .vg-q-badge {
    font-family: 'DM Mono', monospace;
    font-size: 0.6rem;
    letter-spacing: 0.1em;
    text-transform: uppercase;
    background: var(--teal);
    color: var(--paper);
    padding: 0.15rem 0.5rem;
    margin-bottom: 0.5rem;
    display: inline-block;
  }
  .vg-qa-a {
    font-size: 0.85rem;
    line-height: 1.75;
    color: var(--charcoal);
    font-weight: 300;
  }
  .vg-qa-a strong { color: var(--teal); font-weight: 500; }
  .vg-qa-a code {
    font-family: 'DM Mono', monospace;
    font-size: 0.78rem;
    background: rgba(14,14,14,0.07);
    padding: 0.1rem 0.35rem;
    color: var(--ink);
  }

  /* ── MEMORY PILLS ── */
  .vg-memory-row {
    display: flex;
    flex-wrap: wrap;
    gap: 0.6rem;
    margin-top: 0.75rem;
  }
  .vg-memory-pill {
    font-size: 0.7rem;
    letter-spacing: 0.06em;
    padding: 0.3rem 0.85rem;
    background: var(--paper);
    border: 0.5px solid var(--border-strong);
    color: var(--charcoal);
    font-weight: 400;
  }
  .vg-memory-pill.teal { border-color: var(--teal); color: var(--teal); background: var(--teal-muted); }
  .vg-memory-pill.amber { border-color: var(--amber); color: var(--amber); background: var(--amber-muted); }

  /* ── FOOTER CTA ── */
  .vg-post-footer {
    background: var(--paper-dark);
    padding: 3rem 4rem;
    display: flex;
    justify-content: space-between;
    align-items: center;
    flex-wrap: wrap;
    gap: 1.5rem;
    border-top: 0.5px solid var(--border);
  }
  .vg-post-footer p {
    font-size: 0.85rem;
    color: var(--muted);
    font-weight: 300;
  }
  .vg-post-footer p strong { color: var(--ink); font-weight: 400; }
  .vg-source-link {
    display: inline-block;
    padding: 0.65rem 1.75rem;
    background: var(--ink);
    color: var(--paper);
    font-size: 0.72rem;
    letter-spacing: 0.12em;
    text-transform: uppercase;
    text-decoration: none;
    font-weight: 400;
    transition: background 0.2s;
  }
  .vg-source-link:hover { background: var(--teal); }

  /* ── SCROLL REVEAL ── */
  .vg-reveal {
    opacity: 0;
    transform: translateY(20px);
    transition: opacity 0.55s ease, transform 0.55s ease;
  }
  .vg-reveal.vg-visible { opacity: 1; transform: translateY(0); }
  .vg-d1 { transition-delay: 0.1s; }
  .vg-d2 { transition-delay: 0.2s; }
  .vg-d3 { transition-delay: 0.3s; }
</style>

<div class="vg-blog-wrap">

  <!-- HERO -->
  <div class="vg-post-hero">
    <div class="vg-post-hero-inner">
      <p class="vg-post-eyebrow">NLP · Machine Learning · Text Feature Engineering</p>
      <h1 class="vg-post-title">From Amazon Reviews to Numbers: A Hands-On Tour of <em>One-Hot, Bag of Words, and TF-IDF</em></h1>
      <div class="vg-post-meta">
        <p class="vg-meta-item">Corpus<span>128 real reviews</span></p>
        <p class="vg-meta-item">Techniques<span>OHE · BoW · TF-IDF</span></p>
        <p class="vg-meta-item">Stack<span>Python · sklearn · BeautifulSoup</span></p>
        <p class="vg-meta-item">Source<span>GitHub ↗</span></p>
      </div>
    </div>
  </div>

  <!-- INTRO BAND -->
  <div class="vg-intro-band">
    <p>How I took <strong>128 real Amazon product reviews</strong> and turned them into features a machine-learning model can actually chew on — and what I learned about where these classical techniques still shine in 2026.</p>
  </div>

  <!-- BODY -->
  <div class="vg-post-body">

    <!-- WHY CLASSICAL -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Context</p>
      <h2>Why bother with &#8220;classical&#8221; text features <em>at all?</em></h2>
      <p>If you have been anywhere near an LLM in the last two years, you have probably heard that &#8220;embeddings solved text.&#8221; They did — for a lot of problems. But if you are building a spam filter with 100k labelled examples, a BM25-powered search box, a cold-start classifier for a brand-new product line, or a compliance-audited system where a human needs to understand why the model fired — then Bag of Words and TF-IDF are still in the toolbox.</p>
      <p>They are <strong>fast, deterministic, interpretable,</strong> and an honest baseline you should always beat before reaching for a neural model.</p>
    </div>

    <hr class="vg-divider">

    <!-- DATA -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 1</p>
      <h2>Get real data — <em>not toy sentences</em></h2>
      <p>Every blog post on TF-IDF uses the same three cooked-up sentences about cats and dogs. I wanted the messiness of real user-generated content, so I wrote a BeautifulSoup scraper and pointed it at ~20 popular ASINs — Echo Dots, AirPods Pro, Kindles, an Apple Watch, a Ninja blender, a PS5 controller, a Nespresso machine, and so on.</p>
      <div class="vg-stat-row">
        <div class="vg-stat-box vg-reveal vg-d1">
          <div class="vg-stat-n">128</div>
          <div class="vg-stat-l">Real Reviews</div>
        </div>
        <div class="vg-stat-box vg-reveal vg-d2">
          <div class="vg-stat-n">14</div>
          <div class="vg-stat-l">Products</div>
        </div>
        <div class="vg-stat-box vg-reveal vg-d3">
          <div class="vg-stat-n">3,461</div>
          <div class="vg-stat-l">Unique Tokens</div>
        </div>
      </div>
      <div class="vg-callout">
        <strong>Scraper gotchas:</strong> Set a real <code>User-Agent</code> header or Amazon returns a stripped page. Anchor on <code>[data-hook="review-body"]</code> inside <code>celwidget</code> blocks — not the <code>div[data-hook="review"]</code> wrapper on the dedicated reviews page. A few reviews came back in Spanish and Arabic — a lovely reminder that real data never matches the shape your slides promised.
      </div>
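      <p>As a minimal sketch of the extraction step (the HTML below is a hand-made stand-in for a fetched page, since a real fetch needs the <code>User-Agent</code> trick above and Amazon's markup changes often):</p>

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched product page; the real scraper retrieves pages
# with requests plus a browser User-Agent header, as noted above.
sample_html = """
<div class="celwidget">
  <span data-hook="review-body"><span>Great sound quality!</span></span>
</div>
<div class="celwidget">
  <span data-hook="review-body"><span>Battery dies fast.</span></span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Anchor on the review-body hook, per the gotcha above.
reviews = [el.get_text(strip=True)
           for el in soup.select('[data-hook="review-body"]')]
print(reviews)  # ['Great sound quality!', 'Battery dies fast.']
```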
    </div>

    <hr class="vg-divider">

    <!-- CLEANING -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 2</p>
      <h2>Clean the text — <em>the boring part that matters most</em></h2>
      <p>A review like &#8220;I LOVE it!!! Sound is 🔥. Read more&#8221; is not something a counting-based model can work with. Each cleaning step kills a specific kind of noise:</p>
      <div class="vg-table-wrap">
        <table class="vg-table">
          <thead><tr><th>Step</th><th>What it kills</th><th>Why it matters</th></tr></thead>
          <tbody>
            <tr><td>Lowercase</td><td>LOVE vs love</td><td>Avoids vocabulary duplicates</td></tr>
            <tr><td>Drop &#8220;Read more&#8221;</td><td>Amazon truncation marker</td><td>Otherwise becomes one of the most frequent tokens</td></tr>
            <tr><td>Strip punctuation / digits</td><td>!!!, $199</td><td>They rarely help classical models</td></tr>
            <tr><td>Tokenize</td><td>—</td><td>Gives you units to count</td></tr>
            <tr><td>Remove stopwords</td><td>the, and, is</td><td>Appear in every document → no signal</td></tr>
            <tr><td>Lemmatize</td><td>speakers → speaker</td><td>Tightens the vocabulary</td></tr>
          </tbody>
        </table>
      </div>
      <p>After processing: <strong>11,138 tokens</strong> spanning a <strong>3,461-word vocabulary</strong>. Top words were exactly the product-review clichés you would expect — use, one, like, great, noise, sound, quality — a perfect sanity check.</p>
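      <p>The table above as a stdlib-only sketch (the stopword list here is a tiny illustrative subset, and the lemmatization step is omitted to keep the snippet dependency-free):</p>

```python
import re

# Tiny illustrative stopword subset; the real pipeline uses a full list
# and adds lemmatization on top of these steps.
STOPWORDS = {"the", "and", "is", "it", "i", "a", "to", "of"}

def clean(review: str) -> list[str]:
    text = review.lower()                  # LOVE -> love
    text = text.replace("read more", "")   # drop Amazon's truncation marker
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation, digits, emoji
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(clean("I LOVE it!!! Sound is 🔥. Read more"))  # ['love', 'sound']
```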
    </div>

    <hr class="vg-divider">

    <!-- THREE ENCODINGS -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 3</p>
      <h2>Three ways to turn text into <em>numbers</em></h2>
      <div class="vg-technique-grid">
        <div class="vg-technique-card ohe vg-reveal vg-d1">
          <h3>One-Hot Encoding</h3>
          <span class="vg-abbr">OHE · Binary presence</span>
          <p>For each review, build a binary vector over the whole vocabulary: 1 if the word appears, 0 otherwise. Simplest thing that works, easiest to explain to a non-technical stakeholder.</p>
          <p class="vg-weakness">⚠ Throws away frequency — &#8220;amazing&#8221; once and ten times look identical.</p>
        </div>
        <div class="vg-technique-card bow vg-reveal vg-d2">
          <h3>Bag of Words</h3>
          <span class="vg-abbr">BoW · CountVectorizer</span>
          <p>Same vector shape, but store actual counts. A review that hammers on &#8220;sound&#8221; three times ranks differently from one that drops the word once. Frequency-aware.</p>
          <p class="vg-weakness">⚠ Still order-blind — &#8220;not good, very bad&#8221; ≈ &#8220;good, not very bad&#8221;.</p>
        </div>
        <div class="vg-technique-card tfidf vg-reveal vg-d3">
          <h3>TF-IDF</h3>
          <span class="vg-abbr">TfidfVectorizer · The trick</span>
          <p>Take the BoW count and divide by how common the word is across the whole corpus. Generic words like &#8220;good&#8221; get pushed toward zero. Rare, distinctive words like &#8220;cancellation&#8221; stay loud.</p>
          <p class="vg-weakness">✓ Best signal for downstream classifiers.</p>
        </div>
      </div>
      <div class="vg-formula">
        <span class="vg-formula-label">TF-IDF Formula</span>
        tfidf(t, d) = tf(t, d) · log( N / (1 + df(t)) )
      </div>
      <p>In my corpus, the highest-IDF words were exactly the long-tail product features that appeared in just one review. The lowest-IDF words were the generic review vocabulary. That is the <strong>whole story of TF-IDF in one experiment.</strong></p>
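      <p>All three encoders, plus the formula above, fit in a few lines on a toy three-document corpus (hypothetical tokens for illustration; note that sklearn's <code>TfidfVectorizer</code> uses a smoothed variant of the formula, so its exact weights differ slightly):</p>

```python
import math
from collections import Counter

docs = [["sound", "great", "great"],
        ["battery", "great"],
        ["sound", "cancellation"]]
vocab = sorted({t for d in docs for t in d})        # shared vocabulary
N = len(docs)
df = {t: sum(t in d for d in docs) for t in vocab}  # document frequency

def ohe(doc):
    return [1 if t in doc else 0 for t in vocab]    # binary presence

def bow(doc):
    c = Counter(doc)
    return [c[t] for t in vocab]                    # raw counts

def tfidf(doc):
    c = Counter(doc)
    # tfidf(t, d) = tf(t, d) * log(N / (1 + df(t))), exactly as above
    return [c[t] * math.log(N / (1 + df[t])) for t in vocab]

# "great" appears in 2 of 3 docs, so log(3/3) = 0 wipes it out;
# "cancellation" appears in only 1, so it keeps a positive weight.
print(tfidf(["sound", "cancellation"]))
```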
    </div>

    <hr class="vg-divider">

    <!-- AHA MOMENT -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 4</p>
      <h2>The &#8220;aha&#8221; moment — <em>one review, three lenses</em></h2>
      <p>Encode the same review three times and print the top-weighted tokens:</p>
      <div class="vg-callout">
        <strong>OHE</strong> just lists every unique word in the review. No ranking.<br><br>
        <strong>BoW</strong> surfaces the most repeated words — almost always filler like <code>one</code>, <code>like</code>, <code>use</code>.<br><br>
        <strong>TF-IDF</strong> surfaces the words <em>this</em> review says that few others do. That is exactly what a downstream classifier wants to see.<br><br>
        Once you have seen this side-by-side even once, you stop reaching for plain BoW unless you have a very specific reason. (Naive Bayes is one — its underlying math prefers raw counts.)
      </div>
    </div>

    <hr class="vg-divider">

    <!-- SPARSITY -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 5</p>
      <h2>Sparsity — <em>the thing nobody warns you about</em></h2>
      <p>Every one of my three matrices came out <strong>~98.15% zero.</strong> That is normal — reviews are short, vocabularies are long, and most words do not appear in most documents. Two huge practical implications:</p>
      <div class="vg-callout">
        <strong>Never store these dense.</strong> A 1-million-document × 200k-vocab corpus is a 200-billion-cell matrix. It must live in CSR or equivalent compressed form.<br><br>
        <strong>Classical pipelines do not scale forever.</strong> Once you are in the tens-of-millions-of-documents range, even sparse storage becomes painful — which is one reason industry moved to dense embedding pipelines for web-scale retrieval.
      </div>
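      <p>Measuring sparsity is a one-liner once the matrix exists; sklearn's vectorizers already return a scipy CSR matrix (toy documents here, so the sparsity is lower than the ~98% above):</p>

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great sound quality", "battery dies fast",
        "great battery life", "noise cancellation superb"]
X = CountVectorizer().fit_transform(docs)  # scipy CSR sparse matrix

# nnz counts stored (non-zero) cells; everything else is implicit zeros.
n_cells = X.shape[0] * X.shape[1]
sparsity = 1 - X.nnz / n_cells
print(f"{sparsity:.0%} zeros")  # 70% zeros, even on 4 tiny docs
```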
    </div>

    <hr class="vg-divider">

    <!-- CLASSIFIER -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 6</p>
      <h2>A mini sentiment classifier — <em>and a class imbalance lesson</em></h2>
      <p>4–5 star = positive, 1–2 star = negative, 3-star dropped. Two models per feature set: Logistic Regression with <code>class_weight="balanced"</code> and Multinomial Naive Bayes.</p>
      <p>Headline accuracy looks great — <strong>~97% on the test split.</strong> But the test split has 31 positives and 1 negative. The interesting metric is recall on the negative class, and with only five one-star reviews in the whole corpus, no model is going to learn that cleanly. Amazon surfaces highly rated reviews first, so any pipeline that scrapes top-of-page reviews inherits the same lopsided distribution.</p>
      <div class="vg-callout">
        <strong>TF-IDF</strong> gives Logistic Regression a small, consistent edge by silencing filler words.<br><br>
        <strong>Naive Bayes</strong> prefers raw BoW counts — rescaling with IDF can actually hurt it.<br><br>
        <strong>Never trust a single accuracy number on imbalanced data.</strong> Always print per-class precision/recall.
      </div>
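      <p>The pipeline and the per-class habit, sketched on a tiny hand-labeled stand-in corpus (not the scraped data; 1 = positive, 0 = negative):</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# Hand-labeled stand-in; the real split was 31 positives vs 1 negative,
# which is exactly why the per-class report below matters.
texts = ["love the sound quality", "great battery life", "excellent speaker",
         "works great every day", "terrible battery dies fast", "awful sound"]
labels = [1, 1, 1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(),
                    LogisticRegression(class_weight="balanced"))
clf.fit(texts, labels)
preds = clf.predict(texts)

# Never stop at accuracy: print precision/recall per class.
print(classification_report(labels, preds, zero_division=0))
```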
    </div>

    <hr class="vg-divider">

    <!-- WHERE IT BREAKS -->
    <div class="vg-step vg-reveal">
      <p class="vg-step-label">Step 7</p>
      <h2>Where these techniques break — <em>and where they still win</em></h2>
      <div class="vg-table-wrap">
        <table class="vg-table">
          <thead><tr><th>Scenario</th><th>BoW / TF-IDF</th><th>Embeddings</th></tr></thead>
          <tbody>
            <tr>
              <td>Semantic similarity<br><em style="font-size:0.78rem;color:var(--muted)">&#8220;audio excellent&#8221; vs &#8220;sound great&#8221;</em></td>
              <td><span class="vg-chip vg-chip-gray">Zero shared tokens → fails</span></td>
              <td><span class="vg-chip vg-chip-green">Maps synonyms close ✓</span></td>
            </tr>
            <tr>
              <td>Negation<br><em style="font-size:0.78rem;color:var(--muted)">&#8220;battery lasts&#8221; vs &#8220;battery dies&#8221;</em></td>
              <td><span class="vg-chip vg-chip-gray">Near-identical vectors → fails</span></td>
              <td><span class="vg-chip vg-chip-green">Directional context ✓</span></td>
            </tr>
            <tr>
              <td>Interpretability</td>
              <td><span class="vg-chip vg-chip-green">Each feature is a word ✓</span></td>
              <td><span class="vg-chip vg-chip-amber">1024-dim black box</span></td>
            </tr>
            <tr>
              <td>Training speed</td>
              <td><span class="vg-chip vg-chip-green">Millions of docs, minutes, laptop ✓</span></td>
              <td><span class="vg-chip vg-chip-amber">GPU required at scale</span></td>
            </tr>
            <tr>
              <td>Exact keyword / ID retrieval</td>
              <td><span class="vg-chip vg-chip-green">BM25 still wins ✓</span></td>
              <td><span class="vg-chip vg-chip-amber">Can miss rare tokens</span></td>
            </tr>
            <tr>
              <td>Cold start (zero labels)</td>
              <td><span class="vg-chip vg-chip-green">Cosine sim on day one ✓</span></td>
              <td><span class="vg-chip vg-chip-amber">Needs fine-tuning data</span></td>
            </tr>
          </tbody>
        </table>
      </div>
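      <p>The first row of the table is easy to verify yourself (using the same hypothetical synonym pair):</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = ["audio excellent", "sound great"]  # synonyms, zero shared tokens
X = TfidfVectorizer().fit_transform(pairs)
sim = cosine_similarity(X[0], X[1])[0, 0]
print(sim)  # 0.0 — lexical vectors see no overlap at all
```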
    </div>

  </div><!-- /vg-post-body -->

  <!-- KEY TAKEAWAYS -->
  <div class="vg-takeaways-section">
    <p class="vg-section-eyebrow">Summary</p>
    <h2>Key <em>takeaways</em></h2>
    <div class="vg-takeaways-grid">
      <div class="vg-takeaway-card vg-reveal" data-num="01">
        <h4>Preprocessing is 80% of the game</h4>
        <p>Before you touch any encoder, understand exactly what &#8220;a token&#8221; means in your corpus. Lowercase, stopwords, lemmatization — each step has a specific purpose.</p>
      </div>
      <div class="vg-takeaway-card vg-reveal vg-d1" data-num="02">
        <h4>Always inspect a single document&#8217;s top features</h4>
        <p>It is the fastest way to develop intuition about what your encoding is actually rewarding. Print OHE vs BoW vs TF-IDF side-by-side at least once.</p>
      </div>
      <div class="vg-takeaway-card vg-reveal vg-d2" data-num="03">
        <h4>Watch sparsity and class imbalance</h4>
        <p>Both will bite you long before modelling choices do. Use CSR storage. Never trust a single accuracy number on skewed data — always check per-class recall.</p>
      </div>
      <div class="vg-takeaway-card vg-reveal vg-d3" data-num="04">
        <h4>Know why you would pick the classical tool</h4>
        <p>If your answer is only &#8220;because it is in every tutorial&#8221;, reach for an embedding model. If your answer is &#8220;interpretability and speed&#8221; — BoW/TF-IDF are still excellent choices.</p>
      </div>
    </div>
  </div>

  <!-- INTERVIEW CHEAT SHEET -->
  <div class="vg-interview-section">
    <p class="vg-section-eyebrow">Interview Prep</p>
    <h2>Cheat sheet — <em>quick definitions to remember</em></h2>
    <div class="vg-qa-list">

      <div class="vg-qa-item vg-reveal">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Define</span><br>
          What is One-Hot Encoding in NLP?
        </div>
        <div class="vg-qa-a">
          <strong>Binary presence vector</strong> over the vocabulary. 1 if the word appears in the document, 0 otherwise. No frequency, no order. Size = vocabulary length.
          <div class="vg-memory-row">
            <span class="vg-memory-pill">Binary: 0 or 1</span>
            <span class="vg-memory-pill">Ignores frequency</span>
            <span class="vg-memory-pill amber">Simplest encoder</span>
          </div>
        </div>
      </div>

      <div class="vg-qa-item vg-reveal vg-d1">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Define</span><br>
          What is Bag of Words?
        </div>
        <div class="vg-qa-a">
          <strong>Word count vector</strong> over the vocabulary. Stores how many times each word appears. Frequency-aware but order-blind — treats a document as an unordered bag of tokens.
          <div class="vg-memory-row">
            <span class="vg-memory-pill">Counts, not binary</span>
            <span class="vg-memory-pill">Order-blind</span>
            <span class="vg-memory-pill amber">CountVectorizer in sklearn</span>
          </div>
        </div>
      </div>

      <div class="vg-qa-item vg-reveal vg-d2">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Define</span><br>
          What is TF-IDF and why does it outperform BoW?
        </div>
        <div class="vg-qa-a">
          <strong>Term Frequency × Inverse Document Frequency.</strong> Scales BoW counts down for words that appear in many documents. Words like &#8220;good&#8221; that are everywhere get suppressed; rare words that are distinctive get amplified. Formula: <code>tf(t,d) · log(N / (1 + df(t)))</code>
          <div class="vg-memory-row">
            <span class="vg-memory-pill teal">Rewards rarity</span>
            <span class="vg-memory-pill teal">Penalises ubiquity</span>
            <span class="vg-memory-pill">TfidfVectorizer</span>
          </div>
        </div>
      </div>

      <div class="vg-qa-item vg-reveal">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Compare</span><br>
          When would you use BoW over TF-IDF?
        </div>
        <div class="vg-qa-a">
          Use raw BoW counts with <strong>Multinomial Naive Bayes</strong> — its probability estimates are built from word counts; IDF rescaling can hurt it. Otherwise, TF-IDF almost always gives a better signal for classifiers.
          <div class="vg-memory-row">
            <span class="vg-memory-pill amber">Naive Bayes → BoW</span>
            <span class="vg-memory-pill teal">Logistic Regression → TF-IDF</span>
          </div>
        </div>
      </div>
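The pairing can be sketched as two sklearn pipelines (toy labelled reviews, invented for illustration):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Toy labelled reviews (1 = positive, 0 = negative) — illustrative only
docs = ["great sound", "terrible battery", "love the sound", "battery died"]
labels = [1, 0, 1, 0]

# Naive Bayes works on raw counts
nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)

# Logistic Regression pairs naturally with TF-IDF weights
lr = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(docs, labels)

print(nb.predict(["the sound is great"]))  # [1]
```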

      <div class="vg-qa-item vg-reveal vg-d1">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Gotcha</span><br>
          What is sparsity and why does it matter?
        </div>
        <div class="vg-qa-a">
          A BoW/TF-IDF matrix is typically <strong>95–99% zeros</strong> because documents are short and vocabularies are large. Always store in <strong>sparse format (CSR)</strong> — a dense matrix of 1M docs × 200k vocab = 200B cells, which won&#8217;t fit in RAM.
          <div class="vg-memory-row">
            <span class="vg-memory-pill">98% zeros = normal</span>
            <span class="vg-memory-pill amber">Always use CSR format</span>
          </div>
        </div>
      </div>
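You don't have to do anything special to get this: sklearn's vectorizers already return a scipy CSR matrix, and only the nonzero cells are stored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import issparse

# Small toy corpus — real review matrices are far sparser than this
docs = [
    "great sound quality",
    "battery life is terrible",
    "excellent screen and great price",
    "the battery died after a week",
]

X = TfidfVectorizer().fit_transform(docs)

print(issparse(X), X.format)          # True csr — never a dense array
density = X.nnz / (X.shape[0] * X.shape[1])
print(f"density: {density:.0%}")      # only the nonzeros take up memory
```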

      <div class="vg-qa-item vg-reveal vg-d2">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Weakness</span><br>
          What can&#8217;t BoW/TF-IDF do that embeddings can?
        </div>
        <div class="vg-qa-a">
          They are <strong>lexical, not semantic.</strong> &#8220;Audio is excellent&#8221; and &#8220;sound is great&#8221; share no content words → near-zero similarity. &#8220;Battery lasts&#8221; and &#8220;battery dies&#8221; share most tokens → high similarity. Embeddings fix both by mapping meaning, not just words.
          <div class="vg-memory-row">
            <span class="vg-memory-pill">No synonyms</span>
            <span class="vg-memory-pill">No negation</span>
            <span class="vg-memory-pill teal">Use embeddings for semantics</span>
          </div>
        </div>
      </div>
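Both failure modes are easy to reproduce (function words dropped from the toy sentences to keep the vocabularies cleanly disjoint):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "audio excellent",         # same meaning as the next doc...
    "sound great",             # ...but zero shared tokens
    "battery drains quickly",  # opposite meaning to the next doc...
    "battery charges quickly", # ...but mostly shared tokens
]

X = TfidfVectorizer().fit_transform(docs)

print(cosine_similarity(X[0], X[1])[0, 0])  # 0.0 — synonyms are invisible
print(cosine_similarity(X[2], X[3])[0, 0])  # high — the flipped meaning is invisible
```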

      <div class="vg-qa-item vg-reveal">
        <div class="vg-qa-q">
          <span class="vg-q-badge">Use Case</span><br>
          When do classical methods still win in 2026?
        </div>
        <div class="vg-qa-a">
          <strong>4 scenarios where BoW/TF-IDF beat neural alternatives:</strong> (1) exact-match / keyword search — BM25 still outperforms embeddings for identifier queries; (2) interpretability requirements; (3) training speed at millions of documents on a laptop; (4) cold-start with zero labelled data.
          <div class="vg-memory-row">
            <span class="vg-memory-pill teal">BM25 search</span>
            <span class="vg-memory-pill teal">Interpretability</span>
            <span class="vg-memory-pill teal">Cold start</span>
            <span class="vg-memory-pill teal">Speed</span>
          </div>
        </div>
      </div>

    </div>
  </div>

  <!-- FOOTER CTA -->
  <div class="vg-post-footer">
    <p>Full pipeline — scraper to classifier — in the <strong>GenAI Mastery Series</strong> source repo.</p>
    <a href="https://github.com/vijaygokarn130/ml-classic-concepts" class="vg-source-link" target="_blank">View Source on GitHub ↗</a>
  </div>

</div><!-- /vg-blog-wrap -->

<script>
(function(){
  var obs = new IntersectionObserver(function(entries){
    entries.forEach(function(e){ if(e.isIntersecting) e.target.classList.add('vg-visible'); });
  }, {threshold: 0.08});
  document.querySelectorAll('.vg-reveal').forEach(function(el){ obs.observe(el); });
})();
</script>
<p>The post <a href="https://vijay-gokarn.com/from-amazon-reviews-to-numbers-a-hands-on-tour-of-one-hot-bag-of-words-and-tf-idf/">From Amazon Reviews to Numbers: A Hands-On Tour of One-Hot, Bag of Words, and TF-IDF</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">263</post-id>	</item>
		<item>
		<title>Pandas Remove Duplicates</title>
		<link>https://vijay-gokarn.com/pandas-remove-duplicates/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=pandas-remove-duplicates</link>
		
		<dc:creator><![CDATA[Vijay Gokarn]]></dc:creator>
		<pubDate>Tue, 09 Jul 2024 11:12:55 +0000</pubDate>
				<category><![CDATA[ai-agents]]></category>
		<category><![CDATA[databricks]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[generative-ai]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[data-analysis]]></category>
		<category><![CDATA[gen-ai]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">https://vijay-gokarn.com/?p=119</guid>

					<description><![CDATA[<p>Data Engineering · Python · Pandas · Data Cleaning Handling Duplicate Rows in Pandas — Identify, Remove &#038; Export Clean Data Librarypandas Methodsduplicated() · drop_duplicates() · reset_index() OutputCleaned CSV Stack Python pandas df.duplicated() drop_duplicates() reset_index() to_csv() Duplicate rows are one of the most common data quality issues — and one of the most damaging to [&#8230;]</p>
<p>The post <a href="https://vijay-gokarn.com/pandas-remove-duplicates/">Pandas Remove Duplicates</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Cormorant+Garamond:ital,wght@0,300;0,400;0,600;1,300;1,400&#038;family=DM+Sans:wght@300;400;500&#038;family=DM+Mono:wght@400&#038;display=swap" rel="stylesheet">

<style>
.vg8 {
  --ink: #0e0e0e; --paper: #f7f4ef; --paper-dark: #ede9e1;
  --teal: #0f6e56; --teal-light: #1d9e75; --teal-muted: #e1f5ee;
  --amber: #ba7517; --amber-light: #fac775; --amber-muted: #faeeda;
  --charcoal: #2c2c2a; --muted: #888780;
  --border: rgba(14,14,14,0.12); --border-strong: rgba(14,14,14,0.25);
  --code-bg: #161b22; --code-header: #2d333b; --code-border: rgba(255,255,255,0.06);
  font-family: 'DM Sans', sans-serif; font-weight: 300;
  color: var(--ink); background: var(--paper); line-height: 1.75; font-size: 16px; overflow-x: hidden;
}
.vg8 *, .vg8 *::before, .vg8 *::after { box-sizing: border-box; margin: 0; padding: 0; }

/* HERO */
.vg8-hero { background: #0d1117; padding: 5rem 4rem 4rem; position: relative; overflow: hidden; }
.vg8-hero::before {
  content: '⊕'; font-family: 'Cormorant Garamond', serif; font-size: 22rem;
  font-weight: 300; color: rgba(255,255,255,0.025); position: absolute;
  right: 1rem; bottom: -5rem; line-height: 1; pointer-events: none;
}
.vg8-hero-inner { position: relative; z-index: 1; max-width: 900px; }
.vg8-eyebrow { font-size: 0.68rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; margin-bottom: 1.25rem; display: flex; align-items: center; gap: 0.75rem; }
.vg8-eyebrow::before { content: ''; display: inline-block; width: 1.5rem; height: 1px; background: var(--teal-light); }
.vg8-hero h1 { font-family: 'Cormorant Garamond', serif; font-size: clamp(2.2rem, 5vw, 3.8rem); font-weight: 300; line-height: 1.1; color: var(--paper); letter-spacing: -0.02em; margin-bottom: 1.5rem; max-width: 28ch; }
.vg8-hero h1 em { font-style: italic; color: var(--amber-light); }
.vg8-meta-row { display: flex; gap: 2rem; flex-wrap: wrap; }
.vg8-meta { font-size: 0.7rem; letter-spacing: 0.1em; text-transform: uppercase; color: rgba(247,244,239,0.35); }
.vg8-meta span { color: rgba(247,244,239,0.7); margin-left: 0.4rem; }

/* STACK BAND */
.vg8-stack-band { background: var(--teal); padding: 1.1rem 4rem; display: flex; gap: 0.75rem; flex-wrap: wrap; align-items: center; }
.vg8-stack-label { font-size: 0.63rem; letter-spacing: 0.18em; text-transform: uppercase; color: rgba(255,255,255,0.6); font-weight: 400; margin-right: 0.4rem; }
.vg8-stack-pill { font-size: 0.7rem; letter-spacing: 0.05em; padding: 0.28rem 0.85rem; background: rgba(255,255,255,0.12); color: #fff; border: 0.5px solid rgba(255,255,255,0.2); }

/* INTRO */
.vg8-intro { background: var(--teal-muted); padding: 2.5rem 4rem; border-left: 4px solid var(--teal); }
.vg8-intro p { font-size: 1.05rem; line-height: 1.85; color: var(--charcoal); font-weight: 300; max-width: 80ch; }
.vg8-intro strong { color: var(--teal); font-weight: 500; }

/* BODY */
.vg8-body { max-width: 900px; margin: 0 auto; padding: 4rem; }
.vg8-step { margin-bottom: 3.5rem; }
.vg8-step-label { font-size: 0.63rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-step-label::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg8-step h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.4rem, 3vw, 2rem); font-weight: 300; line-height: 1.2; color: var(--ink); margin-bottom: 1rem; }
.vg8-step h2 em { font-style: italic; color: var(--teal); }
.vg8-step p { font-size: 0.93rem; line-height: 1.9; color: var(--charcoal); font-weight: 300; margin-bottom: 1rem; }
.vg8-step p strong { color: var(--ink); font-weight: 500; }
.vg8-divider { border: none; border-top: 0.5px solid var(--border); margin: 3rem 0; }
.vg8-ic { font-family: 'DM Mono', monospace; font-size: 0.82rem; background: rgba(14,14,14,0.07); padding: 0.1rem 0.4rem; color: var(--ink); }

/* CALLOUT */
.vg8-callout { background: var(--paper-dark); border-left: 3px solid var(--amber); padding: 1.25rem 1.5rem; margin: 1.25rem 0; font-size: 0.87rem; line-height: 1.8; color: var(--charcoal); }
.vg8-callout strong { color: var(--amber); font-weight: 500; }
.vg8-callout.teal { border-color: var(--teal); }
.vg8-callout.teal strong { color: var(--teal); }

/* STRATEGY CARDS */
.vg8-strategy-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1.25rem; margin: 1.5rem 0; }
.vg8-strategy-card { background: var(--paper); border: 0.5px solid var(--border-strong); padding: 1.5rem; position: relative; }
.vg8-strategy-card::before { content: ''; position: absolute; top: 0; left: 0; width: 100%; height: 4px; }
.vg8-strategy-card:nth-child(1)::before { background: var(--muted); }
.vg8-strategy-card:nth-child(2)::before { background: var(--amber); }
.vg8-strategy-card:nth-child(3)::before { background: var(--teal); }
.vg8-strategy-card .vg8-strat-tag { font-family: 'DM Mono', monospace; font-size: 0.65rem; letter-spacing: 0.1em; text-transform: uppercase; color: var(--muted); margin-bottom: 0.5rem; display: block; }
.vg8-strategy-card:nth-child(2) .vg8-strat-tag { color: var(--amber); }
.vg8-strategy-card:nth-child(3) .vg8-strat-tag { color: var(--teal); }
.vg8-strategy-card h3 { font-family: 'Cormorant Garamond', serif; font-size: 1.15rem; font-weight: 400; color: var(--ink); margin-bottom: 0.4rem; }
.vg8-strategy-card p { font-size: 0.82rem; line-height: 1.7; color: var(--charcoal); font-weight: 300; }

/* PIPELINE */
.vg8-pipeline { display: flex; flex-direction: column; gap: 0; margin: 1.5rem 0; }
.vg8-pipeline-step { display: grid; grid-template-columns: 52px 1fr; gap: 1.5rem; padding: 1.25rem 0; border-top: 0.5px solid var(--border); align-items: start; }
.vg8-pipeline-step:last-child { border-bottom: 0.5px solid var(--border); }
.vg8-pipeline-num { width: 36px; height: 36px; background: var(--teal); display: flex; align-items: center; justify-content: center; font-family: 'Cormorant Garamond', serif; font-size: 1.1rem; font-weight: 300; color: var(--paper); flex-shrink: 0; }
.vg8-pipeline-body h4 { font-family: 'Cormorant Garamond', serif; font-size: 1.1rem; font-weight: 400; color: var(--ink); margin-bottom: 0.3rem; }
.vg8-pipeline-body p { font-size: 0.83rem; line-height: 1.7; color: var(--charcoal); font-weight: 300; }

/* CODE BLOCKS */
.vg8-code-wrap { margin: 1.25rem 0; border: 0.5px solid var(--code-border); overflow: hidden; }
.vg8-code-header { background: var(--code-header); padding: 0.6rem 1.25rem; display: flex; justify-content: space-between; align-items: center; border-bottom: 0.5px solid var(--code-border); }
.vg8-code-filename { font-family: 'DM Mono', monospace; font-size: 0.68rem; color: rgba(247,244,239,0.45); letter-spacing: 0.04em; }
.vg8-code-lang { font-size: 0.6rem; letter-spacing: 0.14em; text-transform: uppercase; color: var(--teal-light); font-weight: 500; }
.vg8-code-body { background: var(--code-bg); padding: 1.5rem; overflow-x: auto; }
.vg8-code-body pre { margin: 0; }
.vg8-code-body code { font-family: 'DM Mono', monospace; font-size: 0.82rem; line-height: 1.85; color: #e6edf3; white-space: pre; display: block; }
/* tokens */
.t8-k { color: #ff7b72; }
.t8-s { color: #a5d6ff; }
.t8-c { color: #8b949e; font-style: italic; }
.t8-f { color: #d2a8ff; }
.t8-n { color: #79c0ff; }
.t8-v { color: #ffa657; }
.t8-b { color: var(--amber-light); }

/* FULL SCRIPT SECTION */
.vg8-full-section { background: var(--paper-dark); padding: 4rem; }
.vg8-full-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--teal); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-full-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--teal); }
.vg8-full-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.4rem); font-weight: 300; color: var(--ink); margin-bottom: 0.75rem; }
.vg8-full-section > h2 em { font-style: italic; color: var(--teal); }
.vg8-full-section > p { font-size: 0.9rem; color: var(--charcoal); font-weight: 300; line-height: 1.8; margin-bottom: 2rem; max-width: 70ch; }

/* INTERVIEW */
.vg8-interview-section { background: var(--ink); padding: 4rem; }
.vg8-interview-eyebrow { font-size: 0.65rem; letter-spacing: 0.22em; text-transform: uppercase; color: var(--amber-light); font-weight: 500; margin-bottom: 0.5rem; display: flex; align-items: center; gap: 0.6rem; }
.vg8-interview-eyebrow::before { content: ''; display: inline-block; width: 1.25rem; height: 1px; background: var(--amber-light); }
.vg8-interview-section > h2 { font-family: 'Cormorant Garamond', serif; font-size: clamp(1.6rem, 3vw, 2.4rem); font-weight: 300; color: var(--paper); margin-bottom: 2.5rem; }
.vg8-interview-section > h2 em { font-style: italic; color: var(--amber-light); }
.vg8-qa-list { display: flex; flex-direction: column; }
.vg8-qa-item { display: grid; grid-template-columns: 1fr 1.4fr; gap: 2rem; padding: 1.5rem 0; border-top: 0.5px solid rgba(247,244,239,0.1); align-items: start; }
.vg8-qa-item:last-child { border-bottom: 0.5px solid rgba(247,244,239,0.1); }
.vg8-qa-q { font-family: 'Cormorant Garamond', serif; font-size: 1.05rem; font-weight: 400; color: var(--paper); line-height: 1.4; }
.vg8-q-badge { font-family: 'DM Mono', monospace; font-size: 0.58rem; letter-spacing: 0.1em; text-transform: uppercase; background: var(--teal); color: var(--paper); padding: 0.15rem 0.5rem; margin-bottom: 0.5rem; display: inline-block; }
.vg8-qa-a { font-size: 0.83rem; line-height: 1.8; color: rgba(247,244,239,0.65); font-weight: 300; }
.vg8-qa-a strong { color: var(--amber-light); font-weight: 400; }
.vg8-qa-a code { font-family: 'DM Mono', monospace; font-size: 0.77rem; background: rgba(247,244,239,0.08); padding: 0.1rem 0.35rem; color: var(--paper); }
.vg8-pills { display: flex; flex-wrap: wrap; gap: 0.5rem; margin-top: 0.75rem; }
.vg8-pill { font-size: 0.67rem; letter-spacing: 0.06em; padding: 0.25rem 0.75rem; border: 0.5px solid rgba(247,244,239,0.15); color: rgba(247,244,239,0.5); }
.vg8-pill.t { border-color: var(--teal-light); color: var(--teal-light); }
.vg8-pill.a { border-color: var(--amber-light); color: var(--amber-light); }

/* FOOTER */
.vg8-footer { background: #0d1117; padding: 3rem 4rem; display: flex; justify-content: space-between; align-items: center; flex-wrap: wrap; gap: 1.5rem; border-top: 0.5px solid rgba(247,244,239,0.06); }
.vg8-footer p { font-size: 0.82rem; color: rgba(247,244,239,0.35); font-weight: 300; }
.vg8-footer p strong { color: rgba(247,244,239,0.65); font-weight: 400; }
.vg8-footer-links { display: flex; gap: 1rem; }
.vg8-btn { display: inline-block; padding: 0.65rem 1.75rem; font-size: 0.7rem; letter-spacing: 0.12em; text-transform: uppercase; text-decoration: none; font-weight: 400; }
.vg8-btn.primary { background: var(--teal); color: var(--paper); }
.vg8-btn.ghost { background: transparent; color: rgba(247,244,239,0.55); border: 0.5px solid rgba(247,244,239,0.2); }

/* REVEAL */
.vg8-reveal { opacity: 0; transform: translateY(20px); transition: opacity 0.55s ease, transform 0.55s ease; }
.vg8-reveal.vg8-vis { opacity: 1; transform: translateY(0); }
.vg8-d1 { transition-delay: 0.1s; } .vg8-d2 { transition-delay: 0.2s; } .vg8-d3 { transition-delay: 0.3s; }
</style>

<div class="vg8">

<!-- HERO -->
<div class="vg8-hero">
  <div class="vg8-hero-inner">
    <p class="vg8-eyebrow">Data Engineering · Python · Pandas · Data Cleaning</p>
    <h1>Handling Duplicate Rows in Pandas — <em>Identify, Remove &#038; Export Clean Data</em></h1>
    <div class="vg8-meta-row">
      <p class="vg8-meta">Library<span>pandas</span></p>
      <p class="vg8-meta">Methods<span>duplicated() · drop_duplicates() · reset_index()</span></p>
      <p class="vg8-meta">Output<span>Cleaned CSV</span></p>
    </div>
  </div>
</div>

<!-- STACK BAND -->
<div class="vg8-stack-band">
  <span class="vg8-stack-label">Stack</span>
  <span class="vg8-stack-pill">Python</span>
  <span class="vg8-stack-pill">pandas</span>
  <span class="vg8-stack-pill">df.duplicated()</span>
  <span class="vg8-stack-pill">drop_duplicates()</span>
  <span class="vg8-stack-pill">reset_index()</span>
  <span class="vg8-stack-pill">to_csv()</span>
</div>

<!-- INTRO -->
<div class="vg8-intro">
  <p>Duplicate rows are one of the most common data quality issues — and one of the most damaging to model accuracy and analysis reliability. <strong>Pandas</strong> gives you precise tools to detect, inspect, and remove duplicates with a single line of code. This guide walks through the full pipeline: load, detect, choose a strategy, clean, and export.</p>
</div>

<!-- BODY -->
<div class="vg8-body">

  <!-- WHY IT MATTERS -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Context</p>
    <h2>Why duplicates <em>matter</em></h2>
    <p>Duplicate rows skew aggregations, inflate record counts, bias ML model training, and produce misleading visualizations. A sales total that counts the same transaction twice, a classifier trained on repeated samples — both produce results that look correct but aren&#8217;t. <strong>Clean data is the foundation everything else is built on.</strong></p>
    <div class="vg8-strategy-grid">
      <div class="vg8-strategy-card vg8-reveal vg8-d1">
        <span class="vg8-strat-tag">keep=&#8217;first&#8217;</span>
        <h3>Keep First</h3>
        <p>Drop all duplicates <em>except</em> the first occurrence. The original record is preserved. Most common default choice.</p>
      </div>
      <div class="vg8-strategy-card vg8-reveal vg8-d2">
        <span class="vg8-strat-tag">keep=&#8217;last&#8217;</span>
        <h3>Keep Last</h3>
        <p>Drop all duplicates <em>except</em> the last occurrence. Useful when later records represent updated values.</p>
      </div>
      <div class="vg8-strategy-card vg8-reveal vg8-d3">
        <span class="vg8-strat-tag">keep=False</span>
        <h3>Drop All</h3>
        <p>Remove every instance of a duplicated row — including the first. Use when any duplicated record is invalid.</p>
      </div>
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- PIPELINE OVERVIEW -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Pipeline</p>
    <h2>The four-step <em>deduplication pipeline</em></h2>
    <div class="vg8-pipeline">
      <div class="vg8-pipeline-step vg8-reveal">
        <div class="vg8-pipeline-num">1</div>
        <div class="vg8-pipeline-body"><h4>Load</h4><p>Read the raw CSV into a DataFrame with <code class="vg8-ic">pd.read_csv()</code>.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d1">
        <div class="vg8-pipeline-num">2</div>
        <div class="vg8-pipeline-body"><h4>Detect</h4><p>Use <code class="vg8-ic">df.duplicated()</code> to identify and inspect all duplicate rows before touching the data.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d2">
        <div class="vg8-pipeline-num">3</div>
        <div class="vg8-pipeline-body"><h4>Remove</h4><p>Call <code class="vg8-ic">drop_duplicates(keep=...)</code> with your chosen strategy. Reset the index for a clean sequential result.</p></div>
      </div>
      <div class="vg8-pipeline-step vg8-reveal vg8-d3">
        <div class="vg8-pipeline-num">4</div>
        <div class="vg8-pipeline-body"><h4>Export</h4><p>Write the cleaned DataFrame back to CSV with <code class="vg8-ic">to_csv()</code> for downstream use.</p></div>
      </div>
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 1 — LOAD -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 1</p>
    <h2>Load <em>your dataset</em></h2>
    <p>Start by reading your data into a pandas DataFrame. <code class="vg8-ic">pd.read_csv()</code> is the standard entry point for flat files. From here, all deduplication operations work on the in-memory DataFrame — your source file is never modified.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">load_data.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-k">import</span> pandas <span class="t8-k">as</span> pd

<span class="t8-c"># Read the raw dataset into a DataFrame</span>
df = pd.<span class="t8-f">read_csv</span>(<span class="t8-s">'your_data_file.csv'</span>)

<span class="t8-c"># Quick shape check before cleaning</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Rows: {df.shape[0]:,}  |  Columns: {df.shape[1]}"</span>)</code></pre></div>
    </div>
    <div class="vg8-callout teal">
      <strong>Other sources:</strong> The same deduplication logic applies regardless of how you load your data. Use <code class="vg8-ic">pd.read_excel()</code> for XLSX, <code class="vg8-ic">pd.read_parquet()</code> for Parquet, or query a database with <code class="vg8-ic">pd.read_sql()</code> — all return a DataFrame you can clean the same way.
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 2 — DETECT -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 2</p>
    <h2>Detect <em>&#038; inspect duplicates</em></h2>
    <p><code class="vg8-ic">df.duplicated()</code> returns a boolean Series — <code class="vg8-ic">True</code> for every row that is a duplicate of an earlier row. Always <strong>inspect before you remove</strong> — understanding what the duplicates look like helps you choose the right strategy.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">detect_duplicates.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># Boolean mask: True for every row that is a duplicate</span>
duplicate_mask = df.<span class="t8-f">duplicated</span>()

<span class="t8-c"># How many duplicates exist?</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Duplicate rows found: {duplicate_mask.sum():,}"</span>)

<span class="t8-c"># Inspect the duplicate rows themselves</span>
duplicates = df[df.<span class="t8-f">duplicated</span>()]
<span class="t8-f">print</span>(duplicates)

<span class="t8-c"># See ALL occurrences of duplicated rows (including originals)</span>
all_dupes = df[df.<span class="t8-f">duplicated</span>(keep=<span class="t8-b">False</span>)]
<span class="t8-f">print</span>(all_dupes.<span class="t8-f">sort_values</span>(by=df.columns.<span class="t8-f">tolist</span>()))</code></pre></div>
    </div>
    <div class="vg8-callout">
      <strong>Subset duplicates:</strong> By default <code class="vg8-ic">duplicated()</code> checks all columns. To flag rows that are duplicates only on specific columns (e.g. same customer_id): <code class="vg8-ic">df.duplicated(subset=['customer_id'])</code>. This is useful for finding logical duplicates even when other columns differ.
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 3 — REMOVE -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 3</p>
    <h2>Remove duplicates — <em>three strategies</em></h2>
    <p><code class="vg8-ic">drop_duplicates()</code> returns a new DataFrame by default — the original is untouched. The <code class="vg8-ic">keep</code> parameter controls which occurrence survives. After removing, <code class="vg8-ic">reset_index(drop=True)</code> gives you a clean sequential index starting from 0.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">remove_duplicates.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># ── Strategy 1: keep the FIRST occurrence (default) ──</span>
df_keep_first = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'first'</span>)

<span class="t8-c"># ── Strategy 2: keep the LAST occurrence ──</span>
<span class="t8-c">#    useful when later rows represent updated/corrected records</span>
df_keep_last = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'last'</span>)

<span class="t8-c"># ── Strategy 3: drop ALL occurrences of any duplicated row ──</span>
<span class="t8-c">#    use when any repeated row is invalid data</span>
df_drop_all = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-b">False</span>)

<span class="t8-c"># ── Subset: deduplicate only on specific columns ──</span>
df_subset = df.<span class="t8-f">drop_duplicates</span>(subset=[<span class="t8-s">'customer_id'</span>, <span class="t8-s">'order_date'</span>], keep=<span class="t8-s">'first'</span>)

<span class="t8-c"># ── Reset the index after removal (clean 0-based index) ──</span>
df_cleaned = df_keep_first.<span class="t8-f">reset_index</span>(drop=<span class="t8-b">True</span>, inplace=<span class="t8-b">False</span>)

<span class="t8-c"># Confirm rows removed</span>
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Before: {len(df):,}  |  After: {len(df_cleaned):,}  |  Removed: {len(df) - len(df_cleaned):,}"</span>)</code></pre></div>
    </div>
    <div class="vg8-callout teal">
      <strong>inplace vs assignment:</strong> <code class="vg8-ic">drop_duplicates(inplace=True)</code> modifies the DataFrame in place and returns <code class="vg8-ic">None</code>. Prefer the assignment pattern (<code class="vg8-ic">df_cleaned = df.drop_duplicates()</code>) — it preserves the original for comparison and makes your code easier to debug.
    </div>
  </div>

  <hr class="vg8-divider">

  <!-- STEP 4 — EXPORT -->
  <div class="vg8-step vg8-reveal">
    <p class="vg8-step-label">Step 4</p>
    <h2>Export <em>the clean data</em></h2>
    <p>Write the deduplicated DataFrame back to a CSV. Setting <code class="vg8-ic">index=False</code> prevents pandas from writing the row index as an extra column — your downstream consumers will thank you.</p>
    <div class="vg8-code-wrap">
      <div class="vg8-code-header"><span class="vg8-code-filename">export.py</span><span class="vg8-code-lang">Python</span></div>
      <div class="vg8-code-body"><pre><code><span class="t8-c"># Export to CSV — index=False keeps the file clean</span>
df_cleaned.<span class="t8-f">to_csv</span>(<span class="t8-s">'cleaned_data.csv'</span>, index=<span class="t8-b">False</span>)

<span class="t8-f">print</span>(<span class="t8-s">"Cleaned data exported to cleaned_data.csv"</span>)

<span class="t8-c"># Optional: also export to Parquet for better performance at scale</span>
df_cleaned.<span class="t8-f">to_parquet</span>(<span class="t8-s">'cleaned_data.parquet'</span>, index=<span class="t8-b">False</span>)</code></pre></div>
    </div>
  </div>

</div><!-- /vg8-body -->

<!-- FULL SCRIPT -->
<div class="vg8-full-section">
  <p class="vg8-full-eyebrow">Complete Reference</p>
  <h2>Full deduplication <em>script</em></h2>
  <p>Everything in one place — load, detect, remove (keep last), reset index, and export.</p>
  <div class="vg8-code-wrap vg8-reveal">
    <div class="vg8-code-header"><span class="vg8-code-filename">deduplicate.py — full script</span><span class="vg8-code-lang">Python</span></div>
    <div class="vg8-code-body"><pre><code><span class="t8-k">import</span> pandas <span class="t8-k">as</span> pd

<span class="t8-c"># ── 1. Load ─────────────────────────────────────────────</span>
df = pd.<span class="t8-f">read_csv</span>(<span class="t8-s">'your_data_file.csv'</span>)
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Loaded {len(df):,} rows"</span>)

<span class="t8-c"># ── 2. Detect ────────────────────────────────────────────</span>
duplicates = df[df.<span class="t8-f">duplicated</span>()]
<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Duplicate rows found: {len(duplicates):,}"</span>)
<span class="t8-f">print</span>(duplicates)

<span class="t8-c"># ── 3a. Keep last occurrence of each duplicate row ───────</span>
df_cleaned = df.<span class="t8-f">drop_duplicates</span>(keep=<span class="t8-s">'last'</span>)

<span class="t8-c"># ── 3b. Keep first occurrence (swap in if preferred) ─────</span>
<span class="t8-c"># df_cleaned = df.drop_duplicates(keep='first')</span>

<span class="t8-c"># ── 3c. Reset the index to a clean 0-based sequence ──────</span>
df_cleaned = df_cleaned.<span class="t8-f">reset_index</span>(drop=<span class="t8-b">True</span>)

<span class="t8-f">print</span>(<span class="t8-f">f</span><span class="t8-s">"Rows after cleaning: {len(df_cleaned):,}"</span>)

<span class="t8-c"># ── 4. Export ─────────────────────────────────────────────</span>
df_cleaned.<span class="t8-f">to_csv</span>(<span class="t8-s">'cleaned_data.csv'</span>, index=<span class="t8-b">False</span>)
<span class="t8-f">print</span>(<span class="t8-s">"Exported to cleaned_data.csv"</span>)</code></pre></div>
  </div>
</div>

<!-- INTERVIEW CHEAT SHEET -->
<div class="vg8-interview-section">
  <p class="vg8-interview-eyebrow">Interview Prep</p>
  <h2>Cheat sheet — <em>quick definitions to remember</em></h2>
  <div class="vg8-qa-list">

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Define</span><br>What does <code>df.duplicated()</code> return?</div>
      <div class="vg8-qa-a">A <strong>boolean Series</strong> the same length as the DataFrame — <code>True</code> for every row that is a duplicate of a previously seen row, <code>False</code> otherwise. The first occurrence is marked <code>False</code> by default.
        <div class="vg8-pills"><span class="vg8-pill t">Boolean Series</span><span class="vg8-pill">True = duplicate</span><span class="vg8-pill a">First = False by default</span></div>
      </div>
    </div>
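A tiny demonstration (toy DataFrame, invented for illustration):

```python
import pandas as pd

# Rows 0 and 1 are identical; row 2 is unique
df = pd.DataFrame({"item": ["pen", "pen", "book"], "qty": [1, 1, 2]})

mask = df.duplicated()
print(mask.tolist())   # [False, True, False] — only the repeat is flagged
print(mask.sum())      # 1 duplicate row
```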

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Compare</span><br>keep=&#8217;first&#8217; vs keep=&#8217;last&#8217; vs keep=False</div>
      <div class="vg8-qa-a"><strong>first</strong> — keeps the first occurrence, drops all subsequent duplicates. <strong>last</strong> — keeps the final occurrence, useful for updated records. <strong>False</strong> — drops every occurrence of any duplicated row, leaving only rows that were unique to begin with.
        <div class="vg8-pills"><span class="vg8-pill t">first = keep original</span><span class="vg8-pill a">last = keep latest</span><span class="vg8-pill">False = drop all copies</span></div>
      </div>
    </div>
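The three options, side by side on a toy frame with one duplicated row:

```python
import pandas as pd

# Rows 0 and 2 are identical; row 1 is unique
df = pd.DataFrame({"id": [1, 2, 1], "val": ["a", "b", "a"]})

kept_first  = df.drop_duplicates(keep="first")  # rows 0 and 1 survive
kept_last   = df.drop_duplicates(keep="last")   # rows 1 and 2 survive
only_unique = df.drop_duplicates(keep=False)    # only row 1 survives
print(len(kept_first), len(kept_last), len(only_unique))  # → 2 2 1
```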

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Explain</span><br>What does the <code>subset</code> parameter do?</div>
      <div class="vg8-qa-a">By default, <code>duplicated()</code> and <code>drop_duplicates()</code> compare <strong>all columns</strong>. The <code>subset</code> parameter restricts the comparison to specific columns — for example <code>subset=['customer_id']</code> finds rows with the same customer ID even if other columns differ.
        <div class="vg8-pills"><span class="vg8-pill t">Default = all columns</span><span class="vg8-pill">subset = logical dedup</span></div>
      </div>
    </div>
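A quick sketch of the difference, using a hypothetical <code>customer_id</code> column: the full-row comparison finds nothing, while <code>subset</code> dedups on the business key alone.

```python
import pandas as pd

# Two rows share a customer_id but differ in order_total
df = pd.DataFrame({"customer_id": [101, 102, 101],
                   "order_total": [25.0, 40.0, 30.0]})

print(df.duplicated().sum())                      # → 0 (no full-row duplicates)
dedup = df.drop_duplicates(subset=["customer_id"])
print(dedup["customer_id"].tolist())              # → [101, 102]
```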

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Gotcha</span><br>Why call <code>reset_index(drop=True)</code> after deduplication?</div>
      <div class="vg8-qa-a">After dropping rows, the DataFrame retains the <strong>original row indices</strong> — you&#8217;d have gaps like 0, 1, 4, 7 instead of 0, 1, 2, 3. <code>reset_index(drop=True)</code> renumbers from 0 continuously. <code>drop=True</code> prevents the old index from being added as a column.
        <div class="vg8-pills"><span class="vg8-pill a">Index gaps after drop</span><span class="vg8-pill t">reset_index fixes gaps</span><span class="vg8-pill">drop=True prevents extra col</span></div>
      </div>
    </div>
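The gap pattern is easy to reproduce on toy data:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2, 2, 3]})

deduped = df.drop_duplicates()
print(deduped.index.tolist())   # → [0, 2, 4]  (gaps left by the dropped rows)

clean = deduped.reset_index(drop=True)
print(clean.index.tolist())     # → [0, 1, 2]
print(clean.columns.tolist())   # → ['x']  (drop=True: no leftover 'index' column)
```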

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Gotcha</span><br>inplace=True vs reassignment — which is preferred?</div>
      <div class="vg8-qa-a">Prefer <strong>reassignment</strong> (<code>df_cleaned = df.drop_duplicates()</code>) — it preserves the original DataFrame for comparison and makes pipelines easier to debug. <code>inplace=True</code> modifies the object and returns <code>None</code>, which can cause confusion when chaining operations. Many pandas best-practice guides now recommend avoiding inplace.
        <div class="vg8-pills"><span class="vg8-pill t">Reassignment = safer</span><span class="vg8-pill a">inplace returns None</span></div>
      </div>
    </div>
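The "returns <code>None</code>" gotcha in two lines, on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2]})

# inplace=True mutates the copy and returns None — easy to lose your frame
result = df.copy().drop_duplicates(inplace=True)
print(result)                     # → None

# Reassignment keeps the original intact for before/after comparison
df_cleaned = df.drop_duplicates()
print(len(df), len(df_cleaned))   # → 3 2
```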

    <div class="vg8-qa-item vg8-reveal vg8-d1">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Best Practice</span><br>How do you handle duplicates in a production data pipeline?</div>
      <div class="vg8-qa-a"><strong>Three layers:</strong> (1) <strong>Detect and log</strong> before removing — store duplicate counts as data quality metrics. (2) <strong>Deduplicate at ingestion</strong>, not at query time — clean once, use many times. (3) Add a <strong>unique constraint</strong> in your database or Delta Lake table to prevent duplicates from re-entering at source.
        <div class="vg8-pills"><span class="vg8-pill t">Log before removing</span><span class="vg8-pill t">Clean at ingestion</span><span class="vg8-pill a">DB unique constraints</span></div>
      </div>
    </div>
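Layers (1) and (2) can be sketched in a few lines — the logger name and toy frame are illustrative, and layer (3) lives in your database DDL, not in pandas:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

df = pd.DataFrame({"id": [1, 2, 2, 3, 3]})

# Layer 1: detect and log as a data-quality metric BEFORE dropping
dup_count = int(df.duplicated().sum())
logging.info("duplicate rows at ingestion: %d of %d", dup_count, len(df))

# Layer 2: deduplicate once, at ingestion
df_clean = df.drop_duplicates().reset_index(drop=True)
```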

    <div class="vg8-qa-item vg8-reveal">
      <div class="vg8-qa-q"><span class="vg8-q-badge">Use Case</span><br>When should you NOT remove duplicates?</div>
      <div class="vg8-qa-a">When the repeated rows represent <strong>legitimate repeated events</strong> — a customer placing the same order twice on different days, a sensor reading the same value consecutively, or audit log entries. Always validate with domain knowledge before dropping. Use <code>subset</code> to deduplicate on business keys, not entire rows.
        <div class="vg8-pills"><span class="vg8-pill a">Repeated events = valid</span><span class="vg8-pill t">Use subset= for business keys</span></div>
      </div>
    </div>
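A cautionary sketch (hypothetical order data): full-row dedup correctly keeps every row because the dates differ, while deduping on the business key alone would silently drop a real repeat order.

```python
import pandas as pd

# Same customer ordering the same item on different days: valid repeats
orders = pd.DataFrame({
    "customer_id": [7, 7, 8],
    "item": ["mug", "mug", "pen"],
    "order_date": ["2026-01-03", "2026-01-10", "2026-01-04"],
})

print(orders.duplicated().sum())   # → 0 (the dates make each row unique)

by_key = orders.drop_duplicates(subset=["customer_id", "item"])
print(len(by_key))                 # → 2 — one legitimate repeat order lost
```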

  </div>
</div>

<!-- FOOTER -->
<div class="vg8-footer">
  <p><strong>GenAI Mastery Series</strong> — vijay-gokarn.com · Vijay Gokarn</p>
  <div class="vg8-footer-links">
    <a href="https://github.com/vijaygokarn130" class="vg8-btn ghost">GitHub ↗</a>
    <a href="https://vijay-gokarn.com" class="vg8-btn primary">Back to Blog ↗</a>
  </div>
</div>

</div><!-- /vg8 -->

<script>
(function(){
  var obs = new IntersectionObserver(function(e){
    e.forEach(function(x){ if(x.isIntersecting) x.target.classList.add('vg8-vis'); });
  }, {threshold: 0.08});
  document.querySelectorAll('.vg8-reveal').forEach(function(el){ obs.observe(el); });
})();
</script>
<p>The post <a href="https://vijay-gokarn.com/pandas-remove-duplicates/">Pandas Remove Duplicates</a> appeared first on <a href="https://vijay-gokarn.com">Vijay Gokarn</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">119</post-id>	</item>
	</channel>
</rss>
