GenAI Mastery Series · RAG · NLP · HuggingFace · FAISS
Retrieval-Augmented Generation — Build a RAG Pipeline from Scratch
RAG (Retrieval-Augmented Generation) combines the factual accuracy of information retrieval with the fluent language generation of large models. Instead of encoding all knowledge into model weights, it retrieves relevant documents at inference time and uses them as live context — producing more accurate, up-to-date, and traceable answers.
Concept
What is RAG?
RAG stands for Retrieval-Augmented Generation. It’s a three-stage technique that grounds a language model’s output in retrieved facts — dramatically reducing hallucinations and enabling access to knowledge that wasn’t in the model’s training data.
Retrieval
Search a large knowledge base to find documents relevant to the query. Uses dense vector similarity (DPR + FAISS) or sparse BM25.
Augmentation
Inject the retrieved documents into the prompt as context — supplementing the original query with verified external knowledge.
Generation
The language model generates its response conditioned on both the original query and the retrieved context — grounded, not hallucinated.
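The three stages can be sketched as a minimal control flow. Here `retrieve` and `llm_generate` are placeholder callables standing in for the DPR/FAISS retriever and BART generator built later in this guide:

```python
def rag_answer(query, knowledge_base, retrieve, llm_generate, top_k=3):
    # 1. Retrieval: find the top_k most relevant documents
    docs = retrieve(query, knowledge_base, top_k)
    # 2. Augmentation: prepend the retrieved text to the query
    prompt = " ".join(docs) + " " + query
    # 3. Generation: condition the model on query + context
    return llm_generate(prompt)
```

Any concrete retriever and generator can be plugged into these three steps — the rest of this guide fills them in with DPR, FAISS, and BART.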
Motivation
Why RAG matters
Reduces Hallucinations
Retrieved documents anchor the model to factual content, reducing fabricated answers that plague purely parametric models.
Up-to-Date Knowledge
The knowledge base can be updated without retraining the entire model — keeping responses current without a costly fine-tune cycle.
Source Transparency
Every response is traceable back to specific retrieved documents — critical for compliance, audit, and trust in enterprise deployments.
Domain Customization
Swap the knowledge base to instantly specialize the system for healthcare, legal, finance, or any vertical — without model retraining.
Compute Efficiency
It’s far cheaper to retrieve specific knowledge at inference than to encode it all into billions of model parameters during training.
Contextual Relevance
Each query pulls its own specific context — the model isn’t constrained to a fixed knowledge snapshot, it adapts per-query.
Architecture
Pipeline overview
This implementation uses Dense Passage Retrieval (DPR) for encoding both questions and documents into a shared vector space, FAISS for fast nearest-neighbor search, and BART as the sequence-to-sequence generator.
Query
User input
DPR Encoder
Question embedding
FAISS Search
Top-K similar docs
Concat
Query + context
BART
Generate response
Step 1
Install dependencies
Four libraries cover everything: transformers for DPR and BART models, datasets for data loading, faiss-cpu for vector indexing, and torch as the deep learning backend.
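The install is a single pip command (package names as published on PyPI; versions unpinned here — any recent release of each should work):

```shell
pip install transformers datasets faiss-cpu torch
```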
Swap faiss-cpu for faiss-gpu to get GPU-accelerated indexing on large corpora. For the retriever and generator models, make sure CUDA is available where possible — BART generation on CPU is significantly slower.
Step 2
Load models & tokenizers
RAG uses two separate encoders from DPR: a Question Encoder for query embeddings and a Context Encoder for document embeddings. These are trained jointly so their vectors are comparable in the same space. BART is the conditional generation model that produces the final answer.
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
from transformers import BartTokenizer, BartForConditionalGeneration
# ── RETRIEVER: encodes the user's question into a dense vector ──
question_encoder = DPRQuestionEncoder.from_pretrained(
    'facebook/dpr-question_encoder-single-nq-base'
)
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(
    'facebook/dpr-question_encoder-single-nq-base'
)
# ── RETRIEVER: encodes each document into a dense vector ──
context_encoder = DPRContextEncoder.from_pretrained(
    'facebook/dpr-ctx_encoder-single-nq-base'
)
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained(
    'facebook/dpr-ctx_encoder-single-nq-base'
)
# ── GENERATOR: seq2seq model that produces the final answer ──
generator_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
generator = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
Step 3
Prepare your documents
Each document needs a title and text field. In production this could be chunks from a PDF, database rows, web-scraped articles, or any domain corpus. Keep chunk sizes manageable — 256 to 512 tokens is a common sweet spot for retrieval quality.
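A minimal chunker along those lines, sketched with a whitespace split as a rough stand-in for a real subword tokenizer (the `chunk_size` and `overlap` defaults are illustrative, not canonical):

```python
def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into overlapping word windows.

    Word count is used as a cheap proxy for token count; in practice
    you would count tokens with the context encoder's tokenizer.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The small overlap between consecutive windows reduces the chance that a fact straddling a chunk boundary is lost to the retriever.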
# Each document is a dict with title + text
# In production: load from PDF, DB, S3, or a vector store
documents = [
    {"title": "Document 1", "text": "This is the text of document 1."},
    {"title": "Document 2", "text": "This is the text of document 2."},
    # Add your real documents here...
]
Step 4
Build the FAISS index
Each document’s text is encoded by the DPR Context Encoder into a 768-dimensional vector. These embeddings are stacked into a numpy array and added to a FAISS IndexFlatL2 — a flat L2-distance index that does exact nearest-neighbor search.
import faiss
import numpy as np
context_embeddings = []
for doc in documents:
    inputs = context_tokenizer(doc['text'], return_tensors='pt')
    embeddings = context_encoder(**inputs).pooler_output.detach().numpy()
    context_embeddings.append(embeddings[0])
# Stack into (num_docs, 768) numpy array
context_embeddings = np.array(context_embeddings)
# Build FAISS flat L2 index — exact search, best for small-medium corpora
index = faiss.IndexFlatL2(context_embeddings.shape[1])
index.add(context_embeddings)
print(f"Index built: {index.ntotal} vectors of dim {context_embeddings.shape[1]}")
IndexFlatL2 does exhaustive search — fine for thousands of docs. For millions, switch to IndexIVFFlat (inverted file) or IndexHNSWFlat (graph-based ANN) for approximate but fast search. Managed options: Pinecone, Weaviate, pgvector.
Step 5
Retrieve relevant documents
For a given query, the Question Encoder produces a dense vector in the same space as the document embeddings. FAISS finds the top-K nearest documents by L2 distance — these are the most semantically similar docs to the query.
def retrieve_documents(query, top_k=5):
    # Encode the query into a dense vector
    inputs = question_tokenizer(query, return_tensors='pt')
    question_embedding = question_encoder(**inputs).pooler_output.detach().numpy()
    # Search FAISS for the top_k nearest document vectors
    _, indices = index.search(question_embedding, top_k)
    # Return the actual document dicts (FAISS pads results with -1
    # when top_k exceeds the number of indexed vectors)
    return [documents[idx] for idx in indices[0] if idx != -1]

# ── Test it ──
query = "What is the text of document 1?"
retrieved_docs = retrieve_documents(query)
print(retrieved_docs)
Step 6
Generate the response
The retrieved document texts are concatenated and prepended to the query. BART processes this combined input using beam search (num_beams=4) to generate a fluent, coherent response grounded in the retrieved facts.
def generate_response(query, retrieved_docs):
    # Concat all retrieved doc texts into one context string
    context = " ".join([doc['text'] for doc in retrieved_docs])
    # Tokenize query + context together (truncate at 1024 tokens)
    inputs = generator_tokenizer(
        query + " " + context,
        return_tensors='pt',
        max_length=1024,
        truncation=True
    )
    # Generate with beam search for higher quality output
    summary_ids = generator.generate(
        inputs['input_ids'],
        num_beams=4,
        max_length=512,
        early_stopping=True
    )
    return generator_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# ── End-to-end test ──
query = "What is the text of document 1?"
retrieved_docs = retrieve_documents(query)
response = generate_response(query, retrieved_docs)
print(response)
num_beams=1 is greedy — fast but lower quality. num_beams=4 keeps 4 candidate sequences at each step and returns the highest-scoring one. For production, num_beams=4 with early_stopping=True is the standard BART configuration.
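To make the greedy-vs-beam trade-off concrete, here is a toy beam search over a hand-written next-token distribution. The vocabulary and probabilities are invented for illustration (real decoding operates on model logits), but they are chosen so greedy and beam decoding disagree:

```python
import math

# Toy next-token model: context tuple -> {token: probability}.
# Invented numbers, arranged so the greedy path is not the best path.
TOY_MODEL = {
    (): {"a": 0.55, "b": 0.45},
    ("a",): {"<eos>": 0.6, "x": 0.4},
    ("b",): {"x": 0.95, "<eos>": 0.05},
    ("a", "x"): {"<eos>": 1.0},
    ("b", "x"): {"<eos>": 1.0},
}

def beam_search(model, num_beams=2, max_len=3):
    """Keep the num_beams highest log-prob partial sequences per step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beam carries over
                continue
            for tok, p in model[seq].items():
                candidates.append((seq + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: -c[1])[:num_beams]
    return beams[0][0]
```

With num_beams=1, decoding greedily takes "a" (p=0.55) and stops early; with num_beams=2, the lower-probability first token "b" survives long enough for its strong continuation to win overall — exactly why beam search outscores greedy decoding.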
Interview Prep
Cheat sheet — quick definitions to remember
What is RAG in one sentence?
What is DPR and why two encoders?
What is FAISS and when would you replace it?
IndexFlatL2 does exact exhaustive search — fine for thousands of docs but O(n) per query. For millions of docs, switch to IndexIVFFlat (approximate, partitioned) or a managed vector DB like Pinecone, Weaviate, or pgvector.
RAG vs fine-tuning — when to use which?
What is the “lost in the middle” problem in RAG?
What does beam search do in generation?
num_beams=4 explores 4 paths simultaneously — better quality than greedy, with manageable compute overhead.
How would you modernize this RAG stack for production?