Context
Why bother with “classical” text features at all?
If you have been anywhere near an LLM in the last two years, you have probably heard that “embeddings solved text.” They did — for a lot of problems. But if you are building a spam filter with 100k labelled examples, a BM25-powered search box, a cold-start classifier for a brand-new product line, or a compliance-audited system where a human needs to understand why the model fired — then Bag of Words and TF-IDF are still in the toolbox.
They are fast, deterministic, interpretable, and an honest baseline you should always beat before reaching for a neural model.
Step 1
Get real data — not toy sentences
Every blog post on TF-IDF uses the same three cooked-up sentences about cats and dogs. I wanted the messiness of real user-generated content, so I wrote a BeautifulSoup scraper and ran it across ~20 popular ASINs: Echo Dots, AirPods Pro, Kindles, an Apple Watch, a Ninja blender, a PS5 controller, a Nespresso machine, and so on.
Send a browser-like User-Agent header, or Amazon returns a stripped page. Anchor the parser on [data-hook="review-body"] inside celwidget blocks, not on the div[data-hook="review"] wrapper used on the dedicated reviews page. A few reviews came back in Spanish and Arabic, a lovely reminder that real data never matches the shape your slides promised.
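A minimal sketch of that selector logic. It runs here against an inline HTML snippet instead of a live fetch, and the header string and markup are illustrative, not the exact page Amazon serves:

```python
from bs4 import BeautifulSoup

# A browser-like User-Agent; the exact string is illustrative, not magic.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Inline snippet standing in for a fetched product page, so the selector
# logic can run without hitting Amazon.
html = """
<div class="celwidget">
  <span data-hook="review-body"><span>Great sound, weak bass.</span></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Anchor on the review-body hook, not the outer review wrapper.
reviews = [el.get_text(" ", strip=True)
           for el in soup.select('[data-hook="review-body"]')]
print(reviews)  # ['Great sound, weak bass.']
```

In the real scraper the `html` variable comes from `requests.get(url, headers=HEADERS).text`, looped over the product pages.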
Step 2
Clean the text — the boring part that matters most
A review like “I LOVE it!!! Sound is 🔥. Read more” is not something a counting-based model can work with. Each cleaning step kills a specific kind of noise:
| Step | What it kills | Why it matters |
|---|---|---|
| Lowercase | LOVE vs love | Avoids vocabulary duplicates |
| Drop “Read more” | Amazon truncation marker | Otherwise becomes one of the most frequent tokens |
| Strip punctuation / digits | !!!, $199 | They rarely help classical models |
| Tokenize | — | Gives you units to count |
| Remove stopwords | the, and, is | Appear in every document → no signal |
| Lemmatize | speakers → speaker | Tightens the vocabulary |
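The table above maps onto a pipeline roughly like this. For self-containment, a tiny inline stopword list and a naive plural-stripping rule stand in for NLTK's stopword set and WordNetLemmatizer:

```python
import re

# Tiny inline stopword list; the real pipeline uses NLTK's full set.
STOPWORDS = {"i", "it", "is", "the", "and", "a", "an", "to", "of"}

def clean(review: str) -> list[str]:
    text = review.lower()                   # LOVE -> love
    text = text.replace("read more", " ")   # drop Amazon's truncation marker
    text = re.sub(r"[^a-z\s]", " ", text)   # strip punctuation, digits, emoji
    tokens = text.split()                   # whitespace tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive plural-stripping stand-in for a real lemmatizer
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(clean("I LOVE it!!! Sound is 🔥. Read more"))
# -> ['love', 'sound']
```

Each line corresponds to one row of the table, in order, which makes it easy to ablate a step and watch the vocabulary change.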
After processing: 11,138 tokens spanning a 3,461-word vocabulary. Top words were exactly the product-review clichés you would expect — use, one, like, great, noise, sound, quality — a perfect sanity check.
Step 3
Three ways to turn text into numbers
One-Hot Encoding
OHE · Binary presence
For each review, build a binary vector over the whole vocabulary: 1 if the word appears, 0 otherwise. Simplest thing that works, easiest to explain to a non-technical stakeholder.
⚠ Throws away frequency — “amazing” once and ten times look identical.
Bag of Words
BoW · CountVectorizer
Same vector shape, but store actual counts. A review that hammers on “sound” three times ranks differently from one that drops the word once. Frequency-aware.
⚠ Still order-blind — “not good, very bad” ≈ “good, not very bad”.
TF-IDF
TfidfVectorizer · The trick
Take the BoW count and scale it down by how common the word is across the whole corpus (the inverse-document-frequency weight). Generic words like “good” get pushed toward zero. Rare, distinctive words like “cancellation” stay loud.
✓ Best signal for downstream classifiers.
In my corpus, the highest-IDF words were exactly the long-tail product features that appeared in just one review. The lowest-IDF words were the generic review vocabulary. That is the whole story of TF-IDF in one experiment.
Step 4
The “aha” moment — one review, three lenses
Encode the same review three times and print the top-weighted tokens:
OHE is the least informative lens: every present word ties at 1, so there is no ranking to read. BoW surfaces the most repeated words, which are almost always filler like one, like, use. TF-IDF surfaces the words this review says that few others do. That is exactly what a downstream classifier wants to see.
Once you have seen this side-by-side even once, you stop reaching for plain BoW unless you have a very specific reason. (Naive Bayes is one — its underlying math prefers raw counts.)
Step 5
Sparsity — the thing nobody warns you about
Every one of my three matrices came out ~98.15% zero. That is normal: reviews are short, vocabularies are long, and most words do not appear in most documents. Two huge practical implications:
1. Keep everything in sparse format. scikit-learn hands you a scipy sparse matrix for a reason; densifying it with .toarray() on a real corpus at this sparsity level blows up memory by a factor of ~50.
2. Classical pipelines do not scale forever. Once you are in the tens-of-millions-of-documents range, even sparse storage becomes painful, which is one reason industry moved to dense embedding pipelines for web-scale retrieval.
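Measuring sparsity is one line once you know the matrix is scipy-sparse under the hood. On a three-review toy corpus the number is far below my 98.15%, but the computation is identical:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great sound quality",
    "battery life disappointing",
    "love the noise cancellation",
]
# fit_transform returns a scipy CSR matrix: only nonzeros are stored
X = CountVectorizer().fit_transform(docs)
sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])
print(f"{sparsity:.2%} of cells are zero")
```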
Step 6
A mini sentiment classifier — and a class imbalance lesson
Labels: 4–5 stars = positive, 1–2 stars = negative, 3-star reviews dropped. Two models per feature set: Logistic Regression with class_weight="balanced" and Multinomial Naive Bayes.
Headline accuracy looks great — ~97% on the test split. But the test split has 31 positives and 1 negative. The interesting metric is recall on the negative class, and with only five one-star reviews in the whole corpus, no model is going to learn that cleanly. Amazon surfaces highly-rated reviews first, so any pipeline that scrapes top-of-page reviews inherits the same lopsided distribution.
Naive Bayes prefers raw BoW counts — rescaling with IDF can actually hurt it.
Never trust a single accuracy number on imbalanced data. Always print per-class precision/recall.
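A compressed sketch of the Logistic Regression arm of that experiment. The six hard-coded reviews and labels are stand-ins for the real train/test split, but the imbalance (5 positive, 1 negative) mirrors the scraped corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Tiny imbalanced stand-in for the scraped corpus: 5 positive, 1 negative
texts = [
    "great sound love it", "amazing quality", "works great",
    "love this speaker", "excellent battery life",
    "terrible broke after a week",
]
labels = [1, 1, 1, 1, 1, 0]

X = TfidfVectorizer().fit_transform(texts)
# class_weight="balanced" upweights the rare negative class
clf = LogisticRegression(class_weight="balanced").fit(X, labels)

# Per-class precision/recall is the number that matters, not accuracy
print(classification_report(labels, clf.predict(X), zero_division=0))
```

classification_report prints precision, recall, and F1 per class, which is exactly the view that exposes a model coasting on the majority class.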
Step 7
Where these techniques break — and where they still win
| Scenario | BoW / TF-IDF | Embeddings |
|---|---|---|
| Semantic similarity: “audio excellent” vs “sound great” | Zero shared tokens → fails | Maps synonyms close ✓ |
| Negation: “battery lasts” vs “battery dies” | Near-identical vectors → fails | Directional context ✓ |
| Interpretability | Each feature is a word ✓ | 1024-dim black box |
| Training speed | Millions of docs, minutes, laptop ✓ | GPU required at scale |
| Exact keyword / ID retrieval | BM25 still wins ✓ | Can miss rare tokens |
| Cold start (zero labels) | Cosine sim on day one ✓ | Needs fine-tuning data |
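The first failure row is easy to verify directly: a synonym pair with zero token overlap scores exactly zero cosine similarity under TF-IDF:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["audio excellent", "sound great"]
X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X[0], X[1])[0, 0]
print(sim)  # 0.0 -- no shared tokens, so a synonym pair looks maximally unrelated
```

An embedding model would place these two phrases close together; a token-counting model cannot, by construction.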