How to Build Your First RAG Pipeline (Step by Step)

Large language models are brilliant conversationalists with one inconvenient flaw: they will confidently make things up about your data, because they have never seen it. Ask a raw model about your company’s refund policy and it will improvise something that sounds plausible and is completely wrong.

Retrieval-Augmented Generation (RAG) is the fix, and it is refreshingly un-magical. Instead of fine-tuning a model or praying to the hallucination gods, you simply fetch the relevant facts first and hand them to the model along with the question. The model stops guessing and starts reading. Think of it as giving your LLM an open-book exam instead of making it cram.

Let’s build a minimal, end-to-end pipeline in Python. Five stages: load, chunk, embed, retrieve, generate.

Step 1: Install the toolkit

We’ll keep dependencies lean. ChromaDB handles embeddings and vector storage in one tidy package (its default embedding function runs a small Sentence Transformers model locally and downloads it on first use), and the anthropic SDK does the generation.

pip install chromadb anthropic

Step 2: Load and chunk your documents

You can’t shove a 40-page PDF into a prompt and hope for the best. You break documents into chunks small enough to embed precisely but large enough to retain meaning.

Chunk size is the dial everyone fiddles with. Too small (say, single sentences) and chunks lose context; too large and retrieval gets fuzzy because one chunk covers five topics. A sweet spot is roughly 300–500 tokens with a little overlap so a sentence split across a boundary isn’t orphaned.

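Here’s a deliberately simple splitter to show the idea. It counts whitespace-separated words rather than tokens, so its sizes only approximate the token range above: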

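def chunk_text(text, chunk_size=400, overlap=50):
    # Sizes are measured in whitespace-separated words, not model tokens.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        # Stop once this window reaches the end of the text, so the final
        # chunk isn't just a repeat of the previous chunk's tail.
        if i + chunk_size >= len(words):
            break
    return chunks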

documents = [
    "Our refund policy allows returns within 30 days of purchase...",
    "Premium support is available 24/7 for enterprise customers...",
    # ...load these from your real files
]

chunks = [c for doc in documents for c in chunk_text(doc)]

In production you’d reach for a recursive splitter that respects paragraph and sentence boundaries, but word-counting gets you surprisingly far for a first build.
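As a rough illustration of what "respecting boundaries" can mean, here is a minimal sketch that packs whole paragraphs into chunks instead of cutting mid-sentence. The function name chunk_by_paragraph and its max_words parameter are made up for this example, and it is far simpler than a real recursive splitter: it assumes paragraphs are separated by blank lines and never splits inside one.

```python
import re

def chunk_by_paragraph(text, max_words=400):
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk;
    a fuller splitter would fall back to sentence boundaries in that case.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = (
    "Refunds are accepted within 30 days.\n\n"
    "Support is open 24/7.\n\n"
    "Shipping takes 3 to 5 days."
)
paragraph_chunks = chunk_by_paragraph(sample, max_words=10)
```

With the tiny max_words used here, the first two paragraphs share a chunk and the third starts a new one, so no chunk ever ends in the middle of a sentence.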

Step 3: Embed and store in a vector database

An embedding turns text into a list of numbers that captures its meaning, so “cancel my order” lands near “how do I get a refund” even though they share no words. We store these vectors so we can later search by meaning rather than keywords.
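To make "lands near" concrete, here is a small sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are invented for illustration; real embeddings from a model have hundreds of dimensions, and you would not normally compute this by hand because the vector store does it for you.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, NOT real model output.
cancel_order = [0.9, 0.1, 0.2]   # "cancel my order"
get_refund = [0.8, 0.2, 0.3]     # "how do I get a refund"
office_hours = [0.1, 0.9, 0.1]   # "what are your office hours"

sim_related = cosine_similarity(cancel_order, get_refund)
sim_unrelated = cosine_similarity(cancel_order, office_hours)
```

Here sim_related comes out much higher than sim_unrelated, which is the property retrieval relies on: texts with similar meaning point in similar directions.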

ChromaDB’s default embedding model is all-MiniLM-L6-v2, a small, fast, and genuinely decent model. Using PersistentClient writes everything to disk so you don’t re-embed on every run.

import chromadb

client = chromadb.PersistentClient(path="./rag_store")
collection = client.get_or_create_collection(name="knowledge_base")

collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)

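That’s it. Chroma embeds each chunk for you and indexes the vectors. (One gotcha: ids must be unique. Depending on your Chroma version, calling add with an ID that already exists is ignored or rejected rather than overwriting the record, so changed text under an old ID leaves the stale version in place. Use collection.upsert when you want to replace it.)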
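Position-based IDs such as chunk-0 point at different text as soon as your documents are reordered or edited. One option, sketched below, is to derive each ID from the chunk’s content so the same text always gets the same ID. The helper name chunk_id and the 16-character truncation are choices made for this example, not anything Chroma requires.

```python
import hashlib

def chunk_id(text):
    """Derive a stable ID from the chunk's content (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

example_chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 for enterprise customers.",
    "Our refund policy allows returns within 30 days of purchase.",  # exact duplicate
]

# dict.fromkeys drops exact duplicates while preserving order, so no two
# entries in the batch end up with the same ID.
unique_chunks = list(dict.fromkeys(example_chunks))
ids = [chunk_id(c) for c in unique_chunks]
```

The resulting unique_chunks and ids lists could be passed as the documents and ids arguments when adding to the collection.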

Step 4: Retrieve the top-k chunks

When a question arrives, you embed it the same way and ask the store for its nearest neighbours. n_results is your top-k, the number of chunks you pull back.

def retrieve(question, k=3):
    results = collection.query(query_texts=[question], n_results=k)
    return results["documents"][0]

context_chunks = retrieve("Can I return something I bought three weeks ago?")

Top-k is the other dial worth tuning. Too low (k=1) and you risk missing the chunk that actually held the answer; too high and you flood the prompt with noise, raise your token bill, and sometimes distract the model. Start with k between 3 and 5 and adjust based on how large your chunks are.
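A fixed k always returns k chunks, even when only one of them is relevant. Chroma’s query result also carries a "distances" list alongside "documents" (lower means closer), so one possible refinement is to drop chunks that are too far away. The sketch below works on the raw result dictionary rather than the list returned by retrieve above; the helper name filter_by_distance and the cutoff of 1.0 are assumptions for illustration, and a sensible cutoff depends on your embedding model and the collection’s distance metric.

```python
def filter_by_distance(results, max_distance):
    """Keep only retrieved chunks whose distance is at most max_distance."""
    docs = results["documents"][0]       # results for the first (only) query
    distances = results["distances"][0]  # same order as docs
    return [doc for doc, dist in zip(docs, distances) if dist <= max_distance]

# Hand-written stand-in shaped like a query result, with made-up distances.
fake_results = {
    "documents": [["Refunds within 30 days.", "Support is 24/7.", "Office dog policy."]],
    "distances": [[0.42, 0.97, 1.63]],
}
kept = filter_by_distance(fake_results, max_distance=1.0)
```

In this made-up example the third chunk is dropped, so the prompt carries two chunks instead of three.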

Step 5: Build the augmented prompt and generate

Now we stitch the retrieved context into the prompt and let the model answer from it. The instruction to stick to the context is doing real work here. It’s what turns “creative writing” into “reading comprehension.”

from anthropic import Anthropic

llm = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def answer(question):
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context doesn't contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = llm.messages.create(
        model="claude-haiku-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

print(answer("Can I return something I bought three weeks ago?"))

Three weeks is inside the 30-day window, so the model answers correctly, grounded in your actual policy rather than its imagination. Swap in claude-sonnet-4-6 for harder reasoning, or any other provider’s chat model; the pattern is identical.

The honest gotchas

Garbage in, garbage out. RAG can only retrieve what you stored. If a fact isn’t in your documents, no amount of clever prompting conjures it.
Chunk size and top-k interact. Tiny chunks usually want a higher k; big chunks want a lower one. Tune them together.
Always tell the model to admit ignorance. “Say you don’t know” is the cheapest hallucination guard you’ll ever write.
Persist your vectors. Re-embedding a corpus on every restart is a slow, expensive habit.
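The interaction between chunk size and top-k is easiest to see as arithmetic: the context you send is roughly k times the chunk size. The sketch below estimates that budget. The function name is made up, and the 1.3 tokens-per-word ratio is only a rough rule of thumb for English text; use your model’s tokenizer if you need real counts.

```python
def estimated_context_tokens(k, chunk_words, tokens_per_word=1.3):
    """Rough size of the retrieved context, in tokens (heuristic only)."""
    return round(k * chunk_words * tokens_per_word)

# Many small chunks versus a few large ones.
small_chunks = estimated_context_tokens(k=8, chunk_words=100)
large_chunks = estimated_context_tokens(k=3, chunk_words=400)
```

Under these assumptions, eight 100-word chunks cost less context than three 400-word chunks, which is why smaller chunks can usually afford a higher k.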

Your takeaway checklist

Building your first RAG pipeline comes down to five moves:

1. Load your documents.
2. Chunk them (~300–500 tokens, small overlap).
3. Embed and store the chunks in a vector database.
4. Retrieve the top 3–5 most similar chunks for each query.
5. Augment the prompt with that context and generate.

Get this loop working on a handful of documents first. Once your LLM is answering from your data instead of its dreams, then worry about scale, hybrid search, and reranking. Open-book exams beat cramming every single time.
