Context
Why bother with “classical” text features at all?
If you have been anywhere near an LLM in the last two years, you have probably heard that “embeddings solved text.” They did — for a lot of problems. But if you are building a spam filter with 100k labelled examples, a BM25-powered search box, a cold-start classifier for a brand-new product line, or a compliance-audited system where a human needs to understand why the model fired — then Bag of Words and TF-IDF are still in the toolbox.
They are fast, deterministic, interpretable, and an honest baseline you should always beat before reaching for a neural model.
Step 1
Get real data — not toy sentences
Every blog post on TF-IDF uses the same three cooked-up sentences about cats and dogs. I wanted the messiness of real user-generated content, so I wrote a BeautifulSoup scraper and ran it across ~20 popular ASINs: Echo Dots, AirPods Pro, Kindles, an Apple Watch, a Ninja blender, a PS5 controller, a Nespresso machine, and so on.
Send a browser-like User-Agent header, or Amazon returns a stripped page. Anchor the parser on [data-hook="review-body"] inside celwidget blocks, not on the div[data-hook="review"] wrapper used on the dedicated reviews page. A few reviews came back in Spanish and Arabic, a lovely reminder that real data never matches the shape your slides promised.
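A minimal sketch of that selector logic. It runs here against an inline HTML snippet instead of a live fetch, and the header string and markup are illustrative, not the exact page Amazon serves:

```python
from bs4 import BeautifulSoup

# A browser-like User-Agent; the exact string is illustrative, not magic.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Inline snippet standing in for a fetched product page, so the selector
# logic can run without hitting Amazon.
html = """
<div class="celwidget">
  <span data-hook="review-body"><span>Great sound, weak bass.</span></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Anchor on the review-body hook, not the outer review wrapper.
reviews = [el.get_text(" ", strip=True)
           for el in soup.select('[data-hook="review-body"]')]
print(reviews)  # ['Great sound, weak bass.']
```

In the real scraper the `html` variable comes from `requests.get(url, headers=HEADERS).text`, looped over the product pages.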
Step 2
Clean the text — the boring part that matters most
A review like “I LOVE it!!! Sound is 🔥. Read more” is not something a counting-based model can work with. Each cleaning step kills a specific kind of noise:
| Step | What it kills | Why it matters |
|---|---|---|
| Lowercase | LOVE vs love | Avoids vocabulary duplicates |
| Drop “Read more” | Amazon truncation marker | Otherwise becomes one of the most frequent tokens |
| Strip punctuation / digits | !!!, $199 | They rarely help classical models |
| Tokenize | — | Gives you units to count |
| Remove stopwords | the, and, is | Appear in every document → no signal |
| Lemmatize | speakers → speaker | Tightens the vocabulary |
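The table above maps onto a pipeline roughly like this. For self-containment, a tiny inline stopword list and a naive plural-stripping rule stand in for NLTK's stopword set and WordNetLemmatizer:

```python
import re

# Tiny inline stopword list; the real pipeline uses NLTK's full set.
STOPWORDS = {"i", "it", "is", "the", "and", "a", "an", "to", "of"}

def clean(review: str) -> list[str]:
    text = review.lower()                   # LOVE -> love
    text = text.replace("read more", " ")   # drop Amazon's truncation marker
    text = re.sub(r"[^a-z\s]", " ", text)   # strip punctuation, digits, emoji
    tokens = text.split()                   # whitespace tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive plural-stripping stand-in for a real lemmatizer
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(clean("I LOVE it!!! Sound is 🔥. Read more"))
# -> ['love', 'sound']
```

Each line corresponds to one row of the table, in order, which makes it easy to ablate a step and watch the vocabulary change.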
After processing: 11,138 tokens spanning a 3,461-word vocabulary. Top words were exactly the product-review clichés you would expect — use, one, like, great, noise, sound, quality — a perfect sanity check.
Step 3
Three ways to turn text into numbers
One-Hot Encoding
OHE · Binary presence
For each review, build a binary vector over the whole vocabulary: 1 if the word appears, 0 otherwise. Simplest thing that works, easiest to explain to a non-technical stakeholder.
⚠ Throws away frequency — “amazing” once and ten times look identical.
Bag of Words
BoW · CountVectorizer
Same vector shape, but store actual counts. A review that hammers on “sound” three times ranks differently from one that drops the word once. Frequency-aware.
⚠ Still order-blind — “not good, very bad” ≈ “good, not very bad”.
TF-IDF
TfidfVectorizer · The trick
Take the BoW count and scale it down by how common the word is across the whole corpus (the inverse-document-frequency weight). Generic words like “good” get pushed toward zero. Rare, distinctive words like “cancellation” stay loud.
✓ Best signal for downstream classifiers.
In my corpus, the highest-IDF words were exactly the long-tail product features that appeared in just one review. The lowest-IDF words were the generic review vocabulary. That is the whole story of TF-IDF in one experiment.
Step 4
The “aha” moment — one review, three lenses
Encode the same review three times and print the top-weighted tokens:
OHE is the least informative lens: every present word ties at 1, so there is no ranking to read. BoW surfaces the most repeated words, which are almost always filler like one, like, use. TF-IDF surfaces the words this review says that few others do. That is exactly what a downstream classifier wants to see.
Once you have seen this side-by-side even once, you stop reaching for plain BoW unless you have a very specific reason. (Naive Bayes is one — its underlying math prefers raw counts.)
Step 5
Sparsity — the thing nobody warns you about
Every one of my three matrices came out ~98.15% zero. That is normal: reviews are short, vocabularies are long, and most words do not appear in most documents. Two huge practical implications:
1. Keep everything in sparse format. scikit-learn hands you a scipy sparse matrix for a reason; densifying it with .toarray() on a real corpus at this sparsity level blows up memory by a factor of ~50.
2. Classical pipelines do not scale forever. Once you are in the tens-of-millions-of-documents range, even sparse storage becomes painful, which is one reason industry moved to dense embedding pipelines for web-scale retrieval.
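Measuring sparsity is one line once you know the matrix is scipy-sparse under the hood. On a three-review toy corpus the number is far below my 98.15%, but the computation is identical:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great sound quality",
    "battery life disappointing",
    "love the noise cancellation",
]
# fit_transform returns a scipy CSR matrix: only nonzeros are stored
X = CountVectorizer().fit_transform(docs)
sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])
print(f"{sparsity:.2%} of cells are zero")
```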
Step 6
A mini sentiment classifier — and a class imbalance lesson
Labels: 4–5 stars = positive, 1–2 stars = negative, 3-star reviews dropped. Two models per feature set: Logistic Regression with class_weight="balanced" and Multinomial Naive Bayes.
Headline accuracy looks great — ~97% on the test split. But the test split has 31 positives and 1 negative. The interesting metric is recall on the negative class, and with only five one-star reviews in the whole corpus, no model is going to learn that cleanly. Amazon surfaces highly-rated reviews first, so any pipeline that scrapes top-of-page reviews inherits the same lopsided distribution.
Naive Bayes prefers raw BoW counts — rescaling with IDF can actually hurt it.
Never trust a single accuracy number on imbalanced data. Always print per-class precision/recall.
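A compressed sketch of the Logistic Regression arm of that experiment. The six hard-coded reviews and labels are stand-ins for the real train/test split, but the imbalance (5 positive, 1 negative) mirrors the scraped corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Tiny imbalanced stand-in for the scraped corpus: 5 positive, 1 negative
texts = [
    "great sound love it", "amazing quality", "works great",
    "love this speaker", "excellent battery life",
    "terrible broke after a week",
]
labels = [1, 1, 1, 1, 1, 0]

X = TfidfVectorizer().fit_transform(texts)
# class_weight="balanced" upweights the rare negative class
clf = LogisticRegression(class_weight="balanced").fit(X, labels)

# Per-class precision/recall is the number that matters, not accuracy
print(classification_report(labels, clf.predict(X), zero_division=0))
```

classification_report prints precision, recall, and F1 per class, which is exactly the view that exposes a model coasting on the majority class.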
Step 7
Where these techniques break — and where they still win
| Scenario | BoW / TF-IDF | Embeddings |
|---|---|---|
| Semantic similarity: “audio excellent” vs “sound great” | Zero shared tokens → fails | Maps synonyms close ✓ |
| Negation: “battery lasts” vs “battery dies” | Near-identical vectors → fails | Directional context ✓ |
| Interpretability | Each feature is a word ✓ | 1024-dim black box |
| Training speed | Millions of docs, minutes, laptop ✓ | GPU required at scale |
| Exact keyword / ID retrieval | BM25 still wins ✓ | Can miss rare tokens |
| Cold start (zero labels) | Cosine sim on day one ✓ | Needs fine-tuning data |
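The first failure row is easy to verify directly: a synonym pair with zero token overlap scores exactly zero cosine similarity under TF-IDF:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["audio excellent", "sound great"]
X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X[0], X[1])[0, 0]
print(sim)  # 0.0 -- no shared tokens, so a synonym pair looks maximally unrelated
```

An embedding model would place these two phrases close together; a token-counting model cannot, by construction.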