GenAI Mastery Series · RAG · NLP · HuggingFace · FAISS
Retrieval-Augmented Generation — Build a RAG Pipeline from Scratch
RAG (Retrieval-Augmented Generation) combines the factual accuracy of information retrieval with the fluent language generation of large models. Instead of encoding all knowledge into model weights, it retrieves relevant documents at inference time and uses them as live context — producing more accurate, up-to-date, and traceable answers.
Concept
What is RAG?
RAG stands for Retrieval-Augmented Generation. It’s a three-stage technique that grounds a language model’s output in retrieved facts — dramatically reducing hallucinations and enabling access to knowledge that wasn’t in the model’s training data.
Retrieval
Search a large knowledge base to find documents relevant to the query. Uses dense vector similarity (DPR + FAISS) or sparse BM25.
Augmentation
Inject the retrieved documents into the prompt as context — supplementing the original query with verified external knowledge.
Generation
The language model generates its response conditioned on both the original query and the retrieved context — grounded, not hallucinated.
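The three stages can be sketched as a minimal control flow. Here `retrieve` and `llm_generate` are placeholder callables standing in for the DPR/FAISS retriever and BART generator built later in this guide:

```python
def rag_answer(query, knowledge_base, retrieve, llm_generate, top_k=3):
    # 1. Retrieval: find the top_k most relevant documents
    docs = retrieve(query, knowledge_base, top_k)
    # 2. Augmentation: prepend the retrieved text to the query
    prompt = " ".join(docs) + " " + query
    # 3. Generation: condition the model on query + context
    return llm_generate(prompt)
```

Any concrete retriever and generator can be plugged into these three steps — the rest of this guide fills them in with DPR, FAISS, and BART.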
Motivation
Why RAG matters
Reduces Hallucinations
Retrieved documents anchor the model to factual content, reducing fabricated answers that plague purely parametric models.
Up-to-Date Knowledge
The knowledge base can be updated without retraining the entire model — keeping responses current without a costly fine-tune cycle.
Source Transparency
Every response is traceable back to specific retrieved documents — critical for compliance, audit, and trust in enterprise deployments.
Domain Customization
Swap the knowledge base to instantly specialize the system for healthcare, legal, finance, or any vertical — without model retraining.
Compute Efficiency
It’s far cheaper to retrieve specific knowledge at inference than to encode it all into billions of model parameters during training.
Contextual Relevance
Each query pulls its own specific context — the model isn’t constrained to a fixed knowledge snapshot, it adapts per-query.
Architecture
Pipeline overview
This implementation uses Dense Passage Retrieval (DPR) for encoding both questions and documents into a shared vector space, FAISS for fast nearest-neighbor search, and BART as the sequence-to-sequence generator.
Query
User input
DPR Encoder
Question embedding
FAISS Search
Top-K similar docs
Concat
Query + context
BART
Generate response
Step 1
Install dependencies
Four libraries cover everything: transformers for DPR and BART models, datasets for data loading, faiss-cpu for vector indexing, and torch as the deep learning backend.
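The install is a single pip command (package names as published on PyPI; versions unpinned here — any recent release of each should work):

```shell
pip install transformers datasets faiss-cpu torch
```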
Swap faiss-cpu for faiss-gpu to get GPU-accelerated indexing on large corpora. For the retriever and generator models, make sure CUDA is available where possible — BART generation on CPU is significantly slower.
Step 2
Load models & tokenizers
RAG uses two separate encoders from DPR: a Question Encoder for query embeddings and a Context Encoder for document embeddings. These are trained jointly so their vectors are comparable in the same space. BART is the conditional generation model that produces the final answer.
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
from transformers import BartTokenizer, BartForConditionalGeneration
# ── RETRIEVER: encodes the user's question into a dense vector ──
question_encoder = DPRQuestionEncoder.from_pretrained(
    'facebook/dpr-question_encoder-single-nq-base'
)
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(
    'facebook/dpr-question_encoder-single-nq-base'
)
# ── RETRIEVER: encodes each document into a dense vector ──
context_encoder = DPRContextEncoder.from_pretrained(
    'facebook/dpr-ctx_encoder-single-nq-base'
)
context_tokenizer = DPRContextEncoderTokenizer.from_pretrained(
    'facebook/dpr-ctx_encoder-single-nq-base'
)
# ── GENERATOR: seq2seq model that produces the final answer ──
generator_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
generator = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
Step 3
Prepare your documents
Each document needs a title and text field. In production this could be chunks from a PDF, database rows, web-scraped articles, or any domain corpus. Keep chunk sizes manageable — 256 to 512 tokens is a common sweet spot for retrieval quality.
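A minimal chunker along those lines, sketched with a whitespace split as a rough stand-in for a real subword tokenizer (the `chunk_size` and `overlap` defaults are illustrative, not canonical):

```python
def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into overlapping word windows.

    Word count is used as a cheap proxy for token count; in practice
    you would count tokens with the context encoder's tokenizer.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The small overlap between consecutive windows reduces the chance that a fact straddling a chunk boundary is lost to the retriever.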
# Each document is a dict with title + text
# In production: load from PDF, DB, S3, or a vector store
documents = [
    {"title": "Document 1", "text": "This is the text of document 1."},
    {"title": "Document 2", "text": "This is the text of document 2."},
    # Add your real documents here...
]
Step 4
Build the FAISS index
Each document’s text is encoded by the DPR Context Encoder into a 768-dimensional vector. These embeddings are stacked into a numpy array and added to a FAISS IndexFlatL2 — a flat L2-distance index that does exact nearest-neighbor search.
import faiss
import numpy as np
context_embeddings = []
for doc in documents:
    inputs = context_tokenizer(doc['text'], return_tensors='pt')
    embeddings = context_encoder(**inputs).pooler_output.detach().numpy()
    context_embeddings.append(embeddings[0])
# Stack into (num_docs, 768) numpy array
context_embeddings = np.array(context_embeddings)
# Build FAISS flat L2 index — exact search, best for small-medium corpora
index = faiss.IndexFlatL2(context_embeddings.shape[1])
index.add(context_embeddings)
print(f"Index built: {index.ntotal} vectors of dim {context_embeddings.shape[1]}")
IndexFlatL2 does exhaustive search — fine for thousands of docs. For millions, switch to IndexIVFFlat (inverted file) or IndexHNSWFlat (graph-based ANN) for approximate but fast search. Managed options: Pinecone, Weaviate, pgvector.
Step 5
Retrieve relevant documents
For a given query, the Question Encoder produces a dense vector in the same space as the document embeddings. FAISS finds the top-K nearest documents by L2 distance — these are the most semantically similar docs to the query.
def retrieve_documents(query, top_k=5):
    # Encode the query into a dense vector
    inputs = question_tokenizer(query, return_tensors='pt')
    question_embedding = question_encoder(**inputs).pooler_output.detach().numpy()
    # Search FAISS for the top_k nearest document vectors
    _, indices = index.search(question_embedding, top_k)
    # Return the actual document dicts (FAISS pads results with -1
    # when top_k exceeds the number of indexed vectors)
    return [documents[idx] for idx in indices[0] if idx != -1]

# ── Test it ──
query = "What is the text of document 1?"
retrieved_docs = retrieve_documents(query)
print(retrieved_docs)
Step 6
Generate the response
The retrieved document texts are concatenated and prepended to the query. BART processes this combined input using beam search (num_beams=4) to generate a fluent, coherent response grounded in the retrieved facts.
def generate_response(query, retrieved_docs):
    # Concat all retrieved doc texts into one context string
    context = " ".join([doc['text'] for doc in retrieved_docs])
    # Tokenize query + context together (truncate at 1024 tokens)
    inputs = generator_tokenizer(
        query + " " + context,
        return_tensors='pt',
        max_length=1024,
        truncation=True
    )
    # Generate with beam search for higher quality output
    summary_ids = generator.generate(
        inputs['input_ids'],
        num_beams=4,
        max_length=512,
        early_stopping=True
    )
    return generator_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# ── End-to-end test ──
query = "What is the text of document 1?"
retrieved_docs = retrieve_documents(query)
response = generate_response(query, retrieved_docs)
print(response)
num_beams=1 is greedy — fast but lower quality. num_beams=4 keeps 4 candidate sequences at each step and returns the highest-scoring one. For production, num_beams=4 with early_stopping=True is the standard BART configuration.
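To make the greedy-vs-beam trade-off concrete, here is a toy beam search over a hand-written next-token distribution. The vocabulary and probabilities are invented for illustration (real decoding operates on model logits), but they are chosen so greedy and beam decoding disagree:

```python
import math

# Toy next-token model: context tuple -> {token: probability}.
# Invented numbers, arranged so the greedy path is not the best path.
TOY_MODEL = {
    (): {"a": 0.55, "b": 0.45},
    ("a",): {"<eos>": 0.6, "x": 0.4},
    ("b",): {"x": 0.95, "<eos>": 0.05},
    ("a", "x"): {"<eos>": 1.0},
    ("b", "x"): {"<eos>": 1.0},
}

def beam_search(model, num_beams=2, max_len=3):
    """Keep the num_beams highest log-prob partial sequences per step."""
    beams = [((), 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished beam carries over
                continue
            for tok, p in model[seq].items():
                candidates.append((seq + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda c: -c[1])[:num_beams]
    return beams[0][0]
```

With num_beams=1, decoding greedily takes "a" (p=0.55) and stops early; with num_beams=2, the lower-probability first token "b" survives long enough for its strong continuation to win overall — exactly why beam search outscores greedy decoding.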
Interview Prep
Cheat sheet — quick definitions to remember
What is RAG in one sentence?
What is DPR and why two encoders?
What is FAISS and when would you replace it?
IndexFlatL2 does exact exhaustive search — fine for thousands of docs but O(n) per query. For millions of docs, switch to IndexIVFFlat (approximate, partitioned) or a managed vector DB like Pinecone, Weaviate, or pgvector.
RAG vs fine-tuning — when to use which?
What is the “lost in the middle” problem in RAG?
What does beam search do in generation?
num_beams=4 explores 4 paths simultaneously — better quality than greedy, with manageable compute overhead.
How would you modernize this RAG stack for production?