From Amazon Reviews to Numbers: A Hands-On Tour of One-Hot, Bag of Words, and TF-IDF

8 min read

NLP · Machine Learning · Text Feature Engineering

How I took 128 real Amazon product reviews and turned them into features a machine-learning model can actually chew on — and what I learned about where these classical techniques still shine in 2026.

Context

Why bother with “classical” text features at all?

If you have been anywhere near an LLM in the last two years, you have probably heard that “embeddings solved text.” They did — for a lot of problems. But if you are building a spam filter with 100k labelled examples, a BM25-powered search box, a cold-start classifier for a brand-new product line, or a compliance-audited system where a human needs to understand why the model fired — then Bag of Words and TF-IDF are still in the toolbox.

They are fast, deterministic, interpretable, and an honest baseline you should always beat before reaching for a neural model.


Step 1

Get real data — not toy sentences

Every blog post on TF-IDF uses the same three cooked-up sentences about cats and dogs. I wanted the messiness of real user-generated content, so I wrote a BeautifulSoup scraper across ~20 popular ASINs — Echo Dots, AirPods Pro, Kindles, an Apple Watch, a Ninja blender, a PS5 controller, a Nespresso machine, and so on.

128 Real Reviews · 14 Products · 3,461 Unique Tokens
Scraper gotchas: Set a real User-Agent header or Amazon returns a stripped page. Anchor on [data-hook="review-body"] inside celwidget blocks — not the div[data-hook="review"] wrapper on the dedicated reviews page. A few reviews came back in Spanish and Arabic — a lovely reminder that real data never matches the shape your slides promised.
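The extraction logic itself is a few lines of BeautifulSoup. Here is a sketch against a static snippet that mimics the markup described above; the selectors are the ones that worked for my run, not a stable API, since Amazon changes its markup regularly (and remember the User-Agent note above for the actual HTTP request):

```python
from bs4 import BeautifulSoup

# Static snippet mimicking the review markup: review text lives in
# [data-hook="review-body"] inside celwidget blocks.
html = """
<div class="celwidget">
  <span data-hook="review-body"><span>Great sound, battery could be better.</span></span>
</div>
<div class="celwidget">
  <span data-hook="review-body"><span>Stopped working after two weeks.</span></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
reviews = [el.get_text(strip=True) for el in soup.select('[data-hook="review-body"]')]
print(reviews)
```

The same `select` call works unchanged on a fetched page, as long as the request carried a browser-like User-Agent header.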

Step 2

Clean the text — the boring part that matters most

A review like “I LOVE it!!! Sound is 🔥. Read more” is not something a counting-based model can work with. Each cleaning step kills a specific kind of noise:

Step | What it kills | Why it matters
--- | --- | ---
Lowercase | LOVE vs love | Avoids vocabulary duplicates
Drop “Read more” | Amazon truncation marker | Otherwise becomes one of the most frequent tokens
Strip punctuation / digits | !!!, $199 | They rarely help classical models
Tokenize | n/a | Gives you units to count
Remove stopwords | the, and, is | Appear in every document → no signal
Lemmatize | speakers → speaker | Tightens the vocabulary
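The whole pipeline fits in one small function. This is a minimal, dependency-free sketch: the stopword list is a tiny illustrative stand-in for NLTK's, and the plural-stripping line is a naive stand-in for a real WordNet lemmatizer.

```python
import re

# Tiny illustrative stopword list; the real pipeline used NLTK's full list
STOPWORDS = {"the", "and", "is", "it", "a", "i", "to", "of"}

def clean(review: str) -> list[str]:
    text = review.lower()                        # LOVE -> love
    text = text.replace("read more", " ")        # drop Amazon truncation marker
    text = re.sub(r"[^a-z\s]", " ", text)        # strip punctuation, digits, emoji
    tokens = text.split()                        # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stand-in for lemmatization: naive plural stripping (real pipeline: WordNet)
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(clean("I LOVE it!!! Sound is 🔥. Read more"))  # → ['love', 'sound']
```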

After processing: 11,138 tokens spanning a 3,461-word vocabulary. Top words were exactly the product-review clichés you would expect — use, one, like, great, noise, sound, quality — a perfect sanity check.


Step 3

Three ways to turn text into numbers

One-Hot Encoding

OHE · Binary presence

For each review, build a binary vector over the whole vocabulary: 1 if the word appears, 0 otherwise. Simplest thing that works, easiest to explain to a non-technical stakeholder.

⚠ Throws away frequency — “amazing” once and ten times look identical.

Bag of Words

BoW · CountVectorizer

Same vector shape, but store actual counts. A review that hammers on “sound” three times ranks differently from one that drops the word once. Frequency-aware.

⚠ Still order-blind — “not good, very bad” ≈ “good, not very bad”.
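Dropping `binary=True` gives you Bag of Words. On the same toy corpus, the repeated word now counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["great sound great battery", "sound cuts out"]

bow = CountVectorizer()  # default: raw counts
X = bow.fit_transform(reviews)

# Row 0: "great" now counts as 2, unlike the one-hot version
print(X.toarray())
```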

TF-IDF

TfidfVectorizer · The trick

Take the BoW count and weight it down by how common the word is across the whole corpus (the log of the inverse document frequency). Generic words like “good” get pushed toward zero. Rare, distinctive words like “cancellation” stay loud.

✓ Best signal for downstream classifiers.

TF-IDF formula: tfidf(t, d) = tf(t, d) · log(N / (1 + df(t)))

In my corpus, the highest-IDF words were exactly the long-tail product features that appeared in just one review. The lowest-IDF words were the generic review vocabulary. That is the whole story of TF-IDF in one experiment.


Step 4

The “aha” moment — one review, three lenses

Encode the same review three times and print the top-weighted tokens:

OHE just lists every unique word in the review. No ranking.

BoW surfaces the most repeated words — almost always filler like one, like, use.

TF-IDF surfaces the words this review says that few others do. That is exactly what a downstream classifier wants to see.

Once you have seen this side-by-side even once, you stop reaching for plain BoW unless you have a very specific reason. (Naive Bayes is one — its underlying math prefers raw counts.)

Step 5

Sparsity — the thing nobody warns you about

Every one of my three matrices came out ~98.15% zero. That is normal — reviews are short, vocabularies are long, and most words do not appear in most documents. Two huge practical implications:

Never store these dense. A 1-million-document × 200k-vocab corpus is a 200-billion-cell matrix. It must live in CSR or equivalent compressed form.

Classical pipelines do not scale forever. Once you are in the tens-of-millions-of-documents range, even sparse storage becomes painful — which is one reason industry moved to dense embedding pipelines for web-scale retrieval.
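The memory gap is easy to demonstrate with scipy. A toy matrix at roughly the sparsity level of my review data (the dimensions here are arbitrary stand-ins):

```python
import numpy as np
from scipy.sparse import csr_matrix

# 1,000 docs x 20,000 vocab at ~0.2% density (~99.8% zeros)
rng = np.random.default_rng(0)
dense = (rng.random((1000, 20000)) < 0.002).astype(np.float64)
sparse = csr_matrix(dense)

dense_mb = dense.nbytes / 1e6
csr_mb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.0f} MB  CSR: {csr_mb:.2f} MB")  # roughly 300x smaller
```

CSR stores only the non-zero values plus their column indices and row offsets, which is why the savings scale directly with sparsity.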

Step 6

A mini sentiment classifier — and a class imbalance lesson

4–5 star = positive, 1–2 star = negative, 3-star dropped. Two models per feature set: Logistic Regression with class_weight="balanced" and Multinomial Naive Bayes.

Headline accuracy looks great — ~97% on the test split. But the test split has 31 positives and 1 negative. The interesting metric is recall on the negative class, and with only five one-star reviews in the whole corpus, no model is going to learn that cleanly. Amazon surfaces highly-rated reviews first, so any pipeline that scrapes top-of-page reviews inherits the same lopsided distribution.

TF-IDF gives Logistic Regression a small, consistent edge by silencing filler words.

Naive Bayes prefers raw BoW counts — rescaling with IDF can actually hurt it.

Never trust a single accuracy number on imbalanced data. Always print per-class precision/recall.
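The pipeline shape, sketched on an illustrative stand-in corpus (the 128 scraped reviews are not reproduced here) that is deliberately as imbalanced as scraped top-of-page reviews:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Stand-in corpus: 29 positives, 3 negatives, mimicking the scraped skew
texts = (["love it great sound quality"] * 15
         + ["amazing battery very happy"] * 14
         + ["terrible broke after a week"] * 2
         + ["awful waste of money"] * 1)
labels = [1] * 29 + [0] * 3

X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression(class_weight="balanced").fit(X, labels)

# The habit that matters: per-class precision/recall, never just accuracy
print(classification_report(labels, clf.predict(X),
                            target_names=["negative", "positive"]))
```

`class_weight="balanced"` reweights the loss inversely to class frequency, so the three negatives are not simply drowned out; the per-class report then shows whether negative recall actually survived.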

Step 7

Where these techniques break — and where they still win

Scenario | BoW / TF-IDF | Embeddings
--- | --- | ---
Semantic similarity (“audio excellent” vs “sound great”) | Zero shared tokens → fails | Maps synonyms close ✓
Negation (“battery lasts” vs “battery dies”) | Near-identical vectors → fails | Directional context ✓
Interpretability | Each feature is a word ✓ | 1024-dim black box
Training speed | Millions of docs, minutes, laptop ✓ | GPU required at scale
Exact keyword / ID retrieval | BM25 still wins ✓ | Can miss rare tokens
Cold start (zero labels) | Cosine sim on day one ✓ | Needs fine-tuning data

Summary

Key takeaways

Preprocessing is 80% of the game

Before you touch any encoder, understand exactly what “a token” means in your corpus. Lowercase, stopwords, lemmatization — each step has a specific purpose.

Always inspect a single document’s top features

It is the fastest way to develop intuition about what your encoding is actually rewarding. Print OHE vs BoW vs TF-IDF side-by-side at least once.

Watch sparsity and class imbalance

Both will bite you long before modelling choices do. Use CSR storage. Never trust a single accuracy number on skewed data — always check per-class recall.

Know why you would pick the classical tool

If your answer is only “because it is in every tutorial”, reach for an embedding model. If your answer is “interpretability and speed” — BoW/TF-IDF are still excellent choices.

Interview Prep

Cheat sheet — quick definitions to remember

Define
What is One-Hot Encoding in NLP?
Binary presence vector over the vocabulary. 1 if the word appears in the document, 0 otherwise. No frequency, no order. Size = vocabulary length.
Binary: 0 or 1 · Ignores frequency · Simplest encoder
Define
What is Bag of Words?
Word count vector over the vocabulary. Stores how many times each word appears. Frequency-aware but order-blind — treats a document as an unordered bag of tokens.
Counts, not binary · Order-blind · CountVectorizer in sklearn
Define
What is TF-IDF and why does it outperform BoW?
Term Frequency × Inverse Document Frequency. Scales BoW counts down for words that appear in many documents. Words like “good” that are everywhere get suppressed; rare words that are distinctive get amplified. Formula: tf(t,d) · log(N / (1 + df(t)))
Rewards rarity · Penalises ubiquity · TfidfVectorizer
Compare
When would you use BoW over TF-IDF?
Use raw BoW counts with Naive Bayes — its probability estimates are count-based; IDF rescaling can hurt it. Otherwise, TF-IDF almost always gives a better signal for classifiers.
Naive Bayes → BoW · Logistic Regression → TF-IDF
Gotcha
What is sparsity and why does it matter?
A BoW/TF-IDF matrix is typically 95–99% zeros because documents are short and vocabularies are large. Always store in sparse format (CSR) — a dense matrix of 1M docs × 200k vocab = 200B cells, which won’t fit in RAM.
98% zeros = normal · Always use CSR format
Weakness
What can’t BoW/TF-IDF do that embeddings can?
They are lexical, not semantic. “Audio is excellent” and “sound is great” share zero tokens → zero similarity. “Battery lasts” and “battery dies” share most tokens → high similarity. Embeddings fix both by mapping meaning, not just words.
No synonyms · No negation · Use embeddings for semantics
Use Case
When do classical methods still win in 2026?
4 scenarios where BoW/TF-IDF beat neural alternatives: (1) exact-match / keyword search — BM25 still outperforms embeddings for identifier queries; (2) interpretability requirements; (3) training speed at millions of documents on a laptop; (4) cold-start with zero labelled data.
BM25 search · Interpretability · Cold start · Speed
