Analyzing Wikipedia Articles with Langchain and OpenAI in Databricks

9 min read

GenAI Mastery Series · NLP · Databricks · LangChain

Categorizing Wikipedia at Scale with OpenAI, LangChain & Databricks

Dataset: wikimedia/wikipedia · 10,000 articles

Model: ChatOpenAI (GPT-4)

Output: 50-category JSON classifier

Stack: Databricks Notebook · LangChain Core · langchain_openai · HuggingFace Datasets · ChatPromptTemplate · Batch Inference · JSON Parsing

A complete walkthrough of a large-scale text classification pipeline built inside a Databricks notebook — from loading 10,000 Wikipedia articles to batch-classifying them into 50 categories using OpenAI’s language model via LangChain. Every step includes the real working code.

Prerequisites
Databricks Account · Python (basic) · OpenAI API Key · HuggingFace Access

Overview

Pipeline architecture

The full pipeline runs end-to-end inside a single Databricks notebook. Wikipedia articles are loaded from HuggingFace, cleaned to first-line summaries, batched, and sent to GPT-4 via LangChain’s chain interface. Responses are parsed from JSON into a DataFrame.

📦 HuggingFace (wikimedia/wikipedia dataset) → ✂️ Clean (first-line extraction) → LangChain (prompt + ChatOpenAI) → 🔄 Batch of 8 (rate-limit safe) → 📊 DataFrame (id + category)


Step 1

Install required packages

In a Databricks notebook, use %pip magic commands to install packages into the cluster. The %restart_python command refreshes the interpreter to pick up the new packages without restarting the whole cluster.

Databricks Notebook — Cell 1 · Python / Magic

%pip install langchain_openai
%pip install --upgrade langchain_core langchain_openai
%restart_python

Step 2

Import libraries

Standard Python utilities (json, time, os) combined with LangChain for the LLM interface, HuggingFace Datasets for Wikipedia data loading, and tqdm for progress visibility during batch processing.

Databricks Notebook — Cell 2 · Python

import json
import time
import os
import getpass

import pandas as pd
from datasets import Dataset, load_dataset
from tqdm import tqdm

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

Step 3

Load & clean the dataset

The HuggingFace wikimedia/wikipedia dataset is massive — we take a 10,000 article slice from the English November 2023 snapshot. The cleaning step extracts only the first line of each article (the summary sentence), which is sufficient for category classification and drastically reduces token usage.

Databricks Notebook — Cell 3 · Python

# Load the Wikipedia English dataset (Nov 2023 snapshot)
dataset = load_dataset("wikimedia/wikipedia", "20231101.en")

# Take a 10k article sample
NUM_SAMPLES = 10000
articles = dataset["train"][:NUM_SAMPLES]["text"]
ids = dataset["train"][:NUM_SAMPLES]["id"]

# Clean: keep only the first line (article summary) to reduce tokens
articles = [x.split("\n")[0] for x in articles]

# Sanity check
print(len(articles))   # → 10000
print(articles[99])    # inspect a sample article
Why first line only? Wikipedia first lines are dense, self-contained summaries. Using the full article would cost roughly 10–50x more tokens per classification with minimal accuracy gain. At an average of ~150 tokens per article, 10k articles would be ~1.5M input tokens, which is already significant; first-line summaries (roughly 20 tokens each) bring that down to ~200k tokens. A quick way to check the estimate on your own slice is sketched below.
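To verify the token estimate, you can count tokens with tiktoken. This is a hedged sketch: tiktoken is an extra dependency not installed by the cells above (add %pip install tiktoken), and the encoding is matched to gpt-3.5-turbo.

# Hedged sketch: requires %pip install tiktoken (not part of the original notebook)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Token counts for the cleaned first-line summaries
token_counts = [len(enc.encode(a)) for a in articles]
total = sum(token_counts)
print(f"Total summary tokens: {total:,}")
print(f"Average tokens per article: {total / len(token_counts):.1f}")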

Step 4

Configure OpenAI + LangChain

Use getpass to securely prompt for the API key without echoing it to the notebook output. Then initialize ChatOpenAI — LangChain’s wrapper around the OpenAI Chat Completions API.

Databricks Notebook — Cell 4 & 5 · Python

# Securely enter API key (won't echo to notebook output)
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

# Initialize the LangChain ChatOpenAI wrapper
llm = ChatOpenAI()
print(llm.model_name)  # → "gpt-3.5-turbo" (default) or your configured model
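If you want the GPT-4 runs described in the overview rather than whatever the library default resolves to, pin the model explicitly. A minimal sketch, assuming your OpenAI account has access to the "gpt-4" model; temperature=0 is my addition to make classifications more deterministic:

# Pin the model instead of relying on the ChatOpenAI() default
llm = ChatOpenAI(model="gpt-4", temperature=0)
print(llm.model_name)  # → "gpt-4"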

Step 5 — Core Logic

Define the prompt template

The ChatPromptTemplate structures the conversation: a system message sets the classification task with all 50 categories, and the human message carries the article payload. The double curly braces {{ }} in the JSON schema escape the literal braces so LangChain doesn’t treat them as template variables.

Databricks Notebook — Cell 6 · Python

prompt = ChatPromptTemplate.from_messages([
    ("system", """Your task is to assess the article and categorize it into one of the following predefined categories:
'History', 'Geography', 'Science', 'Technology', 'Mathematics', 'Literature', 'Art', 'Music', 'Film', 'Television',
'Sports', 'Politics', 'Philosophy', 'Religion', 'Sociology', 'Psychology', 'Economics', 'Business', 'Medicine',
'Biology', 'Chemistry', 'Physics', 'Astronomy', 'Environmental Science', 'Engineering', 'Computer Science',
'Linguistics', 'Anthropology', 'Archaeology', 'Education', 'Law', 'Military', 'Architecture', 'Fashion', 'Cuisine',
'Travel', 'Mythology', 'Folklore', 'Biography', 'Social Issues', 'Human Rights', 'Technology Ethics',
'Climate Change', 'Conservation', 'Urban Studies', 'Demographics', 'Journalism', 'Cryptocurrency',
'Artificial Intelligence'

Output ONLY a JSON object — no extra text:
{{ "id": string, "category": string }}"""),
    ("human", "{input}")
])
Prompt engineering note: Listing all valid categories explicitly in the system prompt constrains the model to valid outputs — reducing hallucinated or free-form category names. The strict JSON output instruction combined with downstream json.loads() parsing creates a simple but robust structured output pipeline.

Step 6

Build the chain & test it

LangChain’s pipe operator | composes the prompt template and the LLM into a reusable chain. One call to .invoke() with a single article validates the whole setup before committing to batch processing.

Databricks Notebook — Cell 7 · Python

# Compose prompt → llm into a reusable chain
chain = prompt | llm

# Test with article[0] before running the full batch
content = json.dumps({"id": ids[0], "article": articles[0]})
response = chain.invoke(content)
print(response.content)  # → {"id": "1", "category": "History"}

Step 7 — Core Loop

Batch processing with rate-limit handling

Processing 1,000 articles one-by-one would quickly hit OpenAI’s requests-per-minute limit. The solution: accumulate inputs into batches of 8 and call .batch() with a 1.5-second sleep between each batch. tqdm wraps the loop to give live progress in the notebook.

Databricks Notebook — Cell 8 · Python

results = []
BATCH_SIZE = 8
inputs = []

for index, article in tqdm(enumerate(articles[:1000])):
    inputs.append(
        json.dumps({"id": ids[index], "article": articles[index]})
    )
    if len(inputs) == BATCH_SIZE:
        time.sleep(1.5)  # respect rate limits
        response = chain.batch(inputs)
        results += response
        inputs = []  # reset buffer

# Flush any remaining articles in the last partial batch
if inputs:
    response = chain.batch(inputs)
    results += response
Rate limit strategy: batch size 8 with a 1.5-second pause means at most one batch every 1.5 seconds, i.e. roughly 5 requests/sec (~320 requests/min) before API latency. For the free OpenAI tier (3 RPM), reduce batch size to 1 and increase the sleep to 20s. For production use, implement exponential backoff with tenacity (sketched below).
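A minimal sketch of the tenacity approach, wrapping the batch call with exponential backoff. The safe_batch helper name is mine, and tenacity is an extra dependency not installed by the cells above:

# Hedged sketch: requires %pip install tenacity
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1, min=2, max=60), stop=stop_after_attempt(5))
def safe_batch(batch_inputs):
    # Retries on any exception (rate limits, transient network errors) with exponential backoff
    return chain.batch(batch_inputs)

# Drop-in replacement for chain.batch(inputs) inside the loop above:
# response = safe_batch(inputs)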

Step 8

Parse results into a DataFrame

Not every LLM response will be valid JSON — network hiccups, model refusals, and malformed outputs all happen at scale. The pattern below separates successful parses from failures so you can inspect and retry the failures without losing the successful results.

Databricks Notebook — Cell 9 · Python

success = []
failure = []

for output in results:
    content = output.content
    try:
        content = json.loads(content)
        success.append(content)
    except ValueError:
        failure.append(content)  # keep for retry / inspection

print(f"Success: {len(success)} | Failure: {len(failure)}")

# Convert to DataFrame for analysis / export
df = pd.DataFrame(success)
df.head(10)

Sample Output

What the pipeline produces

10k Articles Loaded · 1k Articles Classified · 50 Categories · 8 Batch Size
LLM Responses — Sample parsed output (JSON)

[
  { "id": "1", "category": "History" },
  { "id": "4", "category": "Computer Science" },
  { "id": "7", "category": "Biology" }
]

All 50 available classification categories:

History · Geography · Science · Technology · Mathematics · Literature · Art · Music · Film · Television · Sports · Politics · Philosophy · Religion · Sociology · Psychology · Economics · Business · Medicine · Biology · Chemistry · Physics · Astronomy · Environmental Science · Engineering · Computer Science · Linguistics · Anthropology · Archaeology · Education · Law · Military · Architecture · Fashion · Cuisine · Travel · Mythology · Folklore · Biography · Social Issues · Human Rights · Artificial Intelligence · Cryptocurrency · Climate Change · Conservation · Urban Studies · Journalism · Technology Ethics · Demographics

Interview Prep

Cheat sheet — quick definitions to remember

Define
What is LangChain and what problem does it solve?
A framework for composing LLM-powered applications from modular building blocks — prompts, models, chains, memory, tools, and agents. It solves the orchestration problem: how do you connect a prompt template to an LLM, parse the output, and chain multiple steps together cleanly?
Prompt + LLM + Output · Composable chains · Pipe operator |
Explain
What is a ChatPromptTemplate?
A reusable message template that structures the conversation for a chat model. Defines the system role (task instructions) and the human turn (variable input). The {input} placeholder gets filled at runtime. Separating instructions from data is a core prompt engineering best practice.
System = instructions · Human = data · {input} placeholder
Explain
Why use .batch() instead of looping .invoke()?
.batch() sends multiple requests concurrently (LangChain fans the calls out over a thread pool; chain.abatch() is the asyncio variant), while .invoke() is sequential. For a batch of 8 articles, that is roughly an 8x speedup. The sleep between batches manages rate limits: you get concurrency within a batch and pacing across batches. The sketch below shows how to cap that concurrency explicitly.
Concurrent within batch · Sleep between batches · 8x throughput gain
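If you want the throughput of .batch() but need to stay under a stricter requests-per-minute ceiling, LangChain also lets you cap concurrency within a batch through the runnable config. A small sketch; the value 4 is an arbitrary example:

# Cap LangChain at 4 concurrent requests within a single batch call
response = chain.batch(inputs, config={"max_concurrency": 4})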
Gotcha
Why separate success and failure lists instead of crashing on parse error?
At 1,000+ LLM calls, some will fail — network timeouts, content policy refusals, or models that occasionally output extra text before the JSON. A try/except pattern collects failures without losing the successful results. Failures can be inspected and retried separately.
Never crash on parse error · Inspect failures separately · Retry pattern
Best Practice
How do you get reliable structured JSON from an LLM?
Three layers: (1) Constrain in the prompt — list valid values, specify exact schema, say “output ONLY JSON”. (2) Use LangChain’s output parsers (JsonOutputParser) for automatic parsing and retry. (3) Validate with Pydantic — define a model and parse the JSON through it to catch type errors.
Constrain schema in prompt · JsonOutputParser · Pydantic validation
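A sketch of layers (2) and (3) applied to this pipeline: JsonOutputParser added to the chain and a Pydantic model for validation. The ArticleCategory model is illustrative and not part of the original notebook:

from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel

class ArticleCategory(BaseModel):
    id: str
    category: str

# The parser converts the raw model message into a Python dict automatically
structured_chain = prompt | llm | JsonOutputParser()

parsed = structured_chain.invoke(json.dumps({"id": ids[0], "article": articles[0]}))
validated = ArticleCategory.model_validate(parsed)  # raises a ValidationError on schema mismatch
print(validated.category)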
Explain
Why use Databricks for this pipeline?
Databricks provides a managed Spark + Python environment that scales horizontally. For 10k–10M articles, you can parallelize across a cluster using Spark UDFs or pandas_udf. It also integrates with Delta Lake for storing results, MLflow for experiment tracking, and Unity Catalog for data governance.
Horizontal scale · Delta Lake storage · MLflow tracking
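For example, the classified results from Step 8 could be persisted as a Delta table directly from the notebook. A minimal sketch; the table name is illustrative and `spark` is the session Databricks provides by default:

# Convert the pandas results to a Spark DataFrame and save as a managed Delta table
spark_df = spark.createDataFrame(df)
spark_df.write.format("delta").mode("overwrite").saveAsTable("wiki_article_categories")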
Improve
How would you scale this to 10 million articles?
Three upgrades: (1) Wrap the chain call in a Spark pandas_udf so it runs in parallel across the cluster. (2) Replace time.sleep() with exponential backoff via tenacity. (3) Use LangChain’s async batch with chain.abatch() and asyncio for maximum concurrency per node.
Spark pandas_udf · chain.abatch() · tenacity backoff
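A rough sketch of upgrade (1): calling the chain from a Spark pandas_udf so classification runs in parallel across executors. classify_udf, the column names, and the usage line are illustrative; each executor also needs OPENAI_API_KEY available (for example via a Databricks secret scope):

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def classify_udf(id_col: pd.Series, summary_col: pd.Series) -> pd.Series:
    # Rebuild the LLM on the executor; the prompt template ships with the closure
    local_chain = prompt | ChatOpenAI()
    payloads = [
        json.dumps({"id": i, "article": a})
        for i, a in zip(id_col, summary_col)
    ]
    responses = local_chain.batch(payloads)
    return pd.Series([r.content for r in responses])

# Hypothetical usage on a Spark DataFrame with "id" and "summary" columns:
# result_sdf = articles_sdf.withColumn("category_json", classify_udf("id", "summary"))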
