This blog post will walk you through a project aimed at categorizing Wikipedia articles using OpenAI’s language model integrated into a Databricks notebook. We’ll cover the installation of necessary packages, dataset loading, and the categorization process.
Prerequisites
- Databricks account
- Basic understanding of Python
- OpenAI API key
Step-by-Step Guide
1. Install Necessary Packages
First, we need to install the required libraries, langchain_openai and langchain_core.
# Databricks notebook source
# MAGIC %pip install langchain_openai
# MAGIC %pip install --upgrade langchain_core langchain_openai
# COMMAND ----------
# MAGIC %restart_python
2. Import Required Libraries
Import the necessary libraries for the project.
import json
import time
import os
import getpass
import pandas as pd
from datasets import Dataset, load_dataset
from tqdm import tqdm
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
3. Load the Dataset
We will load the English Wikipedia dataset, keep the first 10,000 articles, and use only the first line of each article's text.
dataset = load_dataset("wikimedia/wikipedia", "20231101.en")
NUM_SAMPLES = 10000
articles = dataset["train"][:NUM_SAMPLES]["text"]
ids = dataset["train"][:NUM_SAMPLES]["id"]
articles = [x.split("\n")[0] for x in articles]
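An equivalent, slightly more compact way to get the same sample (not part of the original notebook) is to request the slice directly in load_dataset; if downloading the full dump is a concern, the datasets library also supports streaming=True as an option.
# Alternative sketch: ask datasets for only the first NUM_SAMPLES rows up front.
subset = load_dataset("wikimedia/wikipedia", "20231101.en", split=f"train[:{NUM_SAMPLES}]")
articles = [row["text"].split("\n")[0] for row in subset]
ids = [row["id"] for row in subset]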
4. Display Dataset Information
Check the number of articles and display a sample article.
len(articles)
articles[99]
5. Enter OpenAI API Key
Prompt the user to enter their OpenAI API key.
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
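A small quality-of-life tweak (an addition, not in the original notebook): only prompt when the key is not already set, so re-running the notebook top to bottom does not block on input. On Databricks you could also pull the key from a secret scope instead of typing it interactively.
# Only prompt for the key when it is not already present in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")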
6. Initialize the Language Model
Set up the language model from OpenAI.
llm = ChatOpenAI()
llm.model_name
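ChatOpenAI() falls back to the library's default model. To make runs reproducible, you can pin the model and temperature explicitly; the model name below is only an example, so swap in whichever one you have access to.
# Example only: pin a specific model and use a deterministic temperature.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
llm.model_name  # now reflects the pinned model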
7. Define the Prompt Template
Create a template for the prompt that will be sent to the language model.
prompt = ChatPromptTemplate.from_messages([
    ("system", """Your task is to assess the article and categorize the article into one of the following predefined categories:
'History', 'Geography', 'Science', 'Technology', 'Mathematics', 'Literature', 'Art', 'Music', 'Film', 'Television', 'Sports', 'Politics', 'Philosophy', 'Religion', 'Sociology', 'Psychology', 'Economics', 'Business', 'Medicine', 'Biology', 'Chemistry', 'Physics', 'Astronomy', 'Environmental Science', 'Engineering', 'Computer Science', 'Linguistics', 'Anthropology', 'Archaeology', 'Education', 'Law', 'Military', 'Architecture', 'Fashion', 'Cuisine', 'Travel', 'Mythology', 'Folklore', 'Biography', 'Social Issues', 'Human Rights', 'Technology Ethics', 'Climate Change', 'Conservation', 'Urban Studies', 'Demographics', 'Journalism', 'Cryptocurrency', 'Artificial Intelligence'
you will output a json object containing the following information:
{{
    "id": string,
    "category": string
}}
"""),
    ("human", "{input}")
])
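Before wiring the prompt into a chain, it can help to render it once with a sample payload and eyeball what the model will actually receive. This preview step is an extra check, not part of the original notebook:
# Render the template with one article to inspect the final messages.
preview = prompt.format_messages(input=json.dumps({"id": ids[0], "article": articles[0]}))
for message in preview:
    print(message.type, "->", message.content[:200])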
8. Create the Chain
Link the prompt to the language model.
chain = prompt | llm
9. Test the Chain
Invoke the chain with a sample article to see the output.
content = json.dumps({"id": ids[0], "article": articles[0]})
response = chain.invoke(content)
response.content
10. Process Articles in Batches
Process the articles in batches, pausing between batches to stay under the API rate limits.
results = []
BATCH_SIZE = 8
inputs = []

for index, article in tqdm(enumerate(articles[:1000])):
    inputs.append(json.dumps({"id": ids[index], "article": article}))
    if len(inputs) == BATCH_SIZE:
        time.sleep(1.5)
        response = chain.batch(inputs)
        results += response
        inputs = []

if inputs:
    response = chain.batch(inputs)
    results += response
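One variant worth knowing about (not used above): Runnable.batch accepts a config with max_concurrency to cap parallel requests, and return_exceptions=True so a single failed article does not abort the whole batch. A minimal sketch of the batch call:
# Sketch: cap concurrency and keep going past per-item failures.
response = chain.batch(inputs, config={"max_concurrency": 4}, return_exceptions=True)
# With return_exceptions=True, failed items come back as Exception objects,
# so filter them out before extending the results list.
results += [r for r in response if not isinstance(r, Exception)]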
11. Analyze Results
Separate successful and failed responses and convert the successful ones into a DataFrame.
success = []
failure = []

for output in results:
    content = output.content
    try:
        content = json.loads(content)
        success.append(content)
    except ValueError:
        failure.append(content)

pd.DataFrame(success)
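As a quick follow-up (not in the original notebook), you can summarize how the model distributed the articles across categories and how many responses failed to parse:
# Illustrative summary of the categorization results.
df = pd.DataFrame(success)
print(f"Parsed {len(success)} responses, {len(failure)} failed to parse")
print(df["category"].value_counts().head(10))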
12. Conclusion
This project demonstrates how to use OpenAI’s language model to categorize Wikipedia articles efficiently. By following this guide, you can apply similar techniques to other text classification tasks.
Final Notes
The project showcases the power of integrating language models with Databricks, providing a robust framework for large-scale text analysis tasks. Remember to handle API keys securely and manage rate limits effectively when working with large datasets.
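On the rate-limit point: the fixed time.sleep(1.5) between batches is fine for small runs, but larger jobs usually want retries with exponential backoff. Below is a minimal sketch using a hypothetical helper around chain.invoke; langchain_core's built-in Runnable.with_retry() is another option worth a look.
# Hypothetical helper: retry a single invocation with exponential backoff.
def invoke_with_backoff(chain, payload, max_attempts=5, base_delay=2.0):
    for attempt in range(max_attempts):
        try:
            return chain.invoke(payload)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

response = invoke_with_backoff(chain, json.dumps({"id": ids[0], "article": articles[0]}))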