GenAI Mastery Series · NLP · Databricks · LangChain
Categorizing Wikipedia at Scale with OpenAI, LangChain & Databricks
A complete walkthrough of a large-scale text classification pipeline built inside a Databricks notebook — from loading 10,000 Wikipedia articles to batch-classifying them into 50 categories using OpenAI’s language model via LangChain. Every step includes the real working code.
Overview
Pipeline architecture
The full pipeline runs end-to-end inside a single Databricks notebook. Wikipedia articles are loaded from HuggingFace, cleaned to first-line summaries, batched, and sent to an OpenAI chat model via LangChain's chain interface. Responses are parsed from JSON into a DataFrame.
Pipeline stages: HuggingFace (wikimedia/wikipedia dataset) → Clean (first-line extraction) → LangChain (prompt + ChatOpenAI) → Batch of 8 (rate-limit safe) → DataFrame (id + category)
Step 1
Install required packages
In a Databricks notebook, use %pip magic commands to install packages into the cluster. The %restart_python command refreshes the interpreter to pick up the new packages without restarting the whole cluster.
%pip install langchain_openai
%pip install --upgrade langchain_core langchain_openai
%restart_python
Step 2
Import libraries
Standard Python utilities (json, time, os) combined with LangChain for the LLM interface, HuggingFace Datasets for Wikipedia data loading, and tqdm for progress visibility during batch processing.
import json
import time
import os
import getpass
import pandas as pd
from datasets import Dataset, load_dataset
from tqdm import tqdm
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
Step 3
Load & clean the dataset
The HuggingFace wikimedia/wikipedia dataset is massive — we take a 10,000 article slice from the English November 2023 snapshot. The cleaning step extracts only the first line of each article (the summary sentence), which is sufficient for category classification and drastically reduces token usage.
# Load the Wikipedia English dataset (Nov 2023 snapshot)
dataset = load_dataset("wikimedia/wikipedia", "20231101.en")
# Take a 10k article sample
NUM_SAMPLES = 10000
articles = dataset["train"][:NUM_SAMPLES]["text"]
ids = dataset["train"][:NUM_SAMPLES]["id"]
# Clean: keep only the first line (article summary) to reduce tokens
articles = [x.split("\n")[0] for x in articles]
# Sanity check
print(len(articles)) # → 10000
print(articles[99])  # inspect a sample article
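If downloading the full snapshot is too heavy for your cluster, the same slice can be pulled lazily. This sketch is not part of the original pipeline; it assumes the datasets streaming mode with the same split and field names as above:
from itertools import islice
from datasets import load_dataset

# Stream the snapshot instead of downloading it entirely
stream = load_dataset("wikimedia/wikipedia", "20231101.en",
                      split="train", streaming=True)
sample = list(islice(stream, 10000))
articles = [row["text"].split("\n")[0] for row in sample]
ids = [row["id"] for row in sample]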
Step 4
Configure OpenAI + LangChain
Use getpass to securely prompt for the API key without echoing it to the notebook output. Then initialize ChatOpenAI — LangChain’s wrapper around the OpenAI Chat Completions API.
# Securely enter API key (won't echo to notebook output)
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
# Initialize the LangChain ChatOpenAI wrapper
llm = ChatOpenAI()
print(llm.model_name)  # → "gpt-3.5-turbo" (default) or your configured model
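If you want a specific model rather than the default, ChatOpenAI takes standard constructor arguments; the model name below is only an example, not the one used in the original run:
# Pin the model and make outputs deterministic for classification
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)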
Step 5 — Core Logic
Define the prompt template
The ChatPromptTemplate structures the conversation: a system message sets the classification task with all 50 categories, and the human message carries the article payload. The double curly braces {{ }} in the JSON schema escape the literal braces so LangChain doesn’t treat them as template variables.
prompt = ChatPromptTemplate.from_messages([
    ("system", """Your task is to assess the article and categorize it
into one of the following predefined categories:
'History', 'Geography', 'Science', 'Technology', 'Mathematics',
'Literature', 'Art', 'Music', 'Film', 'Television', 'Sports',
'Politics', 'Philosophy', 'Religion', 'Sociology', 'Psychology',
'Economics', 'Business', 'Medicine', 'Biology', 'Chemistry',
'Physics', 'Astronomy', 'Environmental Science', 'Engineering',
'Computer Science', 'Linguistics', 'Anthropology', 'Archaeology',
'Education', 'Law', 'Military', 'Architecture', 'Fashion',
'Cuisine', 'Travel', 'Mythology', 'Folklore', 'Biography',
'Social Issues', 'Human Rights', 'Technology Ethics',
'Climate Change', 'Conservation', 'Urban Studies', 'Demographics',
'Journalism', 'Cryptocurrency', 'Artificial Intelligence'
Output ONLY a JSON object — no extra text:
{{
"id": string,
"category": string
}}"""),
    ("human", "{input}")
])

This strict "output only JSON" instruction combined with json.loads() parsing creates a simple but robust structured output pipeline.
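To confirm the escaped braces render as literal JSON braces, you can format the prompt locally before spending any tokens:
# Each {{ }} pair should appear as a single literal brace in the output
print(prompt.format_messages(input="test article")[0].content)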
Step 6
Build the chain & test it
LangChain’s pipe operator | composes the prompt template and the LLM into a reusable chain. One call to .invoke() with a single article validates the whole setup before committing to batch processing.
# Compose prompt → llm into a reusable chain
chain = prompt | llm
# Test with article[0] before running the full batch
content = json.dumps({"id": ids[0], "article": articles[0]})
response = chain.invoke(content)
print(response.content)
# → {"id": "1", "category": "History"}
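It is worth confirming the test response survives the same json.loads() round trip the batch step will rely on:
parsed = json.loads(response.content)
assert set(parsed.keys()) == {"id", "category"}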
Step 7 — Core Loop
Batch processing with rate-limit handling
Processing 1,000 articles one-by-one would quickly hit OpenAI’s requests-per-minute limit. The solution: accumulate inputs into batches of 8 and call .batch() with a 1.5-second sleep between each batch. tqdm wraps the loop to give live progress in the notebook.
results = []
BATCH_SIZE = 8
inputs = []

for index, article in tqdm(enumerate(articles[:1000])):
    inputs.append(
        json.dumps({"id": ids[index], "article": articles[index]})
    )
    if len(inputs) == BATCH_SIZE:
        time.sleep(1.5)  # respect rate limits
        response = chain.batch(inputs)
        results += response
        inputs = []  # reset buffer

# Flush any remaining articles in the last partial batch
if inputs:
    response = chain.batch(inputs)
    results += response

For production runs, the fixed sleep can be hardened further by retrying transient failures with exponential backoff via tenacity.
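One way to add that hardening, sketched with tenacity (the decorator parameters are illustrative, not from the original notebook):
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5),
       wait=wait_exponential(multiplier=1, min=2, max=30))
def classify_batch(batch_inputs):
    # Retries transient API errors with exponential backoff
    return chain.batch(batch_inputs)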
Step 8
Parse results into a DataFrame
Not every LLM response will be valid JSON — network hiccups, model refusals, and malformed outputs all happen at scale. The pattern below separates successful parses from failures so you can inspect and retry the failures without losing the successful results.
success = []
failure = []

for output in results:
    content = output.content
    try:
        content = json.loads(content)
        success.append(content)
    except ValueError:
        failure.append(content)  # keep for retry / inspection

print(f"Success: {len(success)} | Failure: {len(failure)}")

# Convert to DataFrame for analysis / export
df = pd.DataFrame(success)
df.head(10)
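A quick way to sanity-check the output distribution before exporting:
# Most common categories in the classified sample
print(df["category"].value_counts().head(10))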
Sample Output
What the pipeline produces
Each article comes back as a row pairing its id with one of the 50 predefined categories; the full category list is the one given in the prompt template in Step 5.
Interview Prep
Cheat sheet — quick definitions to remember
What is LangChain and what problem does it solve?
A framework that standardizes how applications talk to LLMs: prompts, models, and parsers become composable components, so you can swap providers or chain steps without rewriting glue code. In this pipeline it supplies the prompt template, the ChatOpenAI wrapper, and the | composition operator.
What is a ChatPromptTemplate?
A template that structures a chat conversation into typed messages (system, human) with variables. The system message carries the fixed instructions, while the {input} placeholder gets filled at runtime. Separating instructions from data is a core prompt engineering best practice.
Why use .batch() instead of looping .invoke()?
.batch() runs multiple requests in parallel (a thread pool under the hood; chain.abatch() is the asyncio variant), while .invoke() is sequential. For 8 articles, batch is roughly 8x faster. The sleep between batches manages rate limits: you get concurrency within a batch, pacing across batches.
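If rate limits bite even within a batch, the per-call concurrency can be capped; a one-line sketch using the standard RunnableConfig option:
# Limit in-flight requests within a single batch call
response = chain.batch(inputs, config={"max_concurrency": 4})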
Why separate success and failure lists instead of crashing on parse error?
At scale some responses will be malformed, and a single bad payload should not destroy hundreds of successful classifications. Collecting failures separately preserves the good results and gives you a retry queue to inspect or reprocess later.
How do you get reliable structured JSON from an LLM?
(1) Constrain the prompt: demand ONLY a JSON object, show the exact schema, and escape literal braces. (2) Use a structured output parser (e.g., LangChain's JsonOutputParser) for automatic parsing and retry. (3) Validate with Pydantic: define a model and parse the JSON through it to catch type errors.
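A minimal sketch of option (3), assuming Pydantic v2; the Classification model and parse_safely helper are illustrative names, not from the article:
from pydantic import BaseModel, ValidationError

class Classification(BaseModel):
    id: str
    category: str

def parse_safely(raw: str) -> Classification | None:
    # Returns None instead of raising, so bad outputs go to the failure queue
    try:
        return Classification.model_validate_json(raw)
    except ValidationError:
        return None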
Why use Databricks for this pipeline?
Notebook-scoped %pip installs and %restart_python make dependency management painless, and the same code can later be distributed across Spark workers with a pandas_udf. It also integrates with Delta Lake for storing results, MLflow for experiment tracking, and Unity Catalog for data governance.
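A rough sketch of the pandas_udf route, assuming each executor can reach the OpenAI API, OPENAI_API_KEY is set cluster-wide, and the prompt object serializes to the workers; classify_udf and the column names are illustrative:
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
from langchain_openai import ChatOpenAI

@pandas_udf(StringType())
def classify_udf(summaries: pd.Series) -> pd.Series:
    # Each executor builds its own chain from the shipped prompt template
    worker_chain = prompt | ChatOpenAI()
    responses = worker_chain.batch(summaries.tolist())
    return pd.Series([r.content for r in responses])

spark_df = spark.createDataFrame(pd.DataFrame({"summary": articles}))
result = spark_df.withColumn("raw_json", classify_udf("summary"))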
How would you scale this to 10 million articles?
(1) Distribute the work across Spark workers with a pandas_udf so each partition runs its own batch loop. (2) Replace the fixed time.sleep() with exponential backoff via tenacity. (3) Use LangChain's async batch with chain.abatch() and asyncio for maximum concurrency per node.
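A minimal async sketch of option (3), reusing the chain from Step 6; batch size and pacing are illustrative:
import asyncio

async def classify_async(all_inputs, batch_size=8, pause=1.5):
    # abatch fires each batch's requests concurrently on the event loop
    results = []
    for i in range(0, len(all_inputs), batch_size):
        results += await chain.abatch(all_inputs[i:i + batch_size])
        await asyncio.sleep(pause)  # pacing across batches
    return results

# Databricks notebooks support top-level await:
# results = await classify_async(inputs)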