
Exercise 4: Full RAG Pipeline

Pairs with Stage 6 — Memory & RAG Exercise 4.

Task

Tie Exercises 1-3 together:

doc → chunk_doc → embed → ChromaDB → top_k retrieve → LLM generation

Sample KB is a company onboarding doc with 4 sections (vacation / remote / expenses / tech stack).

How to run — two paths

Path A (default, free, local)

pip install -r requirements.txt
ollama pull qwen2.5:3b
ollama serve
python starter.py

Budget: $0.

Path B (Anthropic, cloud-quality answers)

pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-ant-...
python starter_anthropic.py

Budget: ~$0.001 per run.

Validate the logic

python test.py             # 5 tests — mock LLM, exercise full pipeline
python test_anthropic.py   # Anthropic mock

test.py uses a mock LLM through the full pipeline (chunking → retrieval → generation), confirming the prompt actually includes context and generate sees the retrieval output.

RAG in 4 steps

def rag(query, doc):
    collection = build_kb(doc)                # 1-2. chunk + embed + index (one-time)
    contexts = retrieve(collection, query)    # 3. top-k semantic search
    answer = generate(query, contexts)        # 4. LLM reads context, answers
    return {"contexts": contexts, "answer": answer}

Each step has independent trade-offs:

| Step | Main knob | Affects |
|---|---|---|
| chunk | size / overlap / strategy | retrieval ceiling |
| embed | model size / multilingual | retrieval precision |
| retrieve | top_k / metadata filter / reranker | recall vs precision |
| generate | prompt / model / temperature | answer quality |
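To make the four steps concrete without ChromaDB or a running model, here is a minimal stdlib-only sketch. It is not the code in starter.py: the word-count "embedding", the in-memory list standing in for the vector store, the `llm` callable, and the `SAMPLE_DOC` text are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter

# Made-up sample text, NOT the exercise's onboarding doc.
SAMPLE_DOC = """Vacation: employees get 20 days of paid vacation per year.

Remote: remote work is allowed up to 3 days per week.

Expenses: submit expense reports within 30 days.

Tech stack: the backend uses Python and PostgreSQL."""


def chunk_doc(doc: str) -> list[str]:
    """Step 1 (chunk): one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]


def embed(text: str) -> Counter:
    """Step 2 (embed): toy word-count vector, a stand-in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_kb(doc: str) -> list[tuple[str, Counter]]:
    """In-memory list standing in for a ChromaDB collection."""
    return [(chunk, embed(chunk)) for chunk in chunk_doc(doc)]


def retrieve(kb, query: str, top_k: int = 2) -> list[str]:
    """Step 3 (retrieve): rank chunks by cosine similarity, keep top_k."""
    q = embed(query)
    ranked = sorted(kb, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def generate(query: str, contexts: list[str], llm) -> str:
    """Step 4 (generate): llm is any callable that maps prompt -> text."""
    context_text = "\n\n".join(contexts)
    prompt = f"Context:\n{context_text}\n\nQuestion: {query}\n\nAnswer:"
    return llm(prompt)


def rag(query: str, doc: str, llm, top_k: int = 2) -> dict:
    kb = build_kb(doc)
    contexts = retrieve(kb, query, top_k)
    return {"contexts": contexts, "answer": generate(query, contexts, llm)}
```

Passing an echo function such as `llm=lambda prompt: prompt` is a cheap way to inspect exactly what the generation step would see. Each knob in the table maps to one function here, so swapping the chunker or the embedder does not touch the rest.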

Generation prompt pattern

prompt = f"""Answer the user's question based ONLY on the context below.
If the context doesn't contain the answer, say "I don't have that information".

Context:
{context_text}

Question: {query}

Answer:"""

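Three key instructions:

  1. Based ONLY on context: keeps the model from filling gaps with training data, which reduces (but does not eliminate) hallucination.
  2. If the answer is missing, say so: gives the LLM an out, so it isn't forced to answer.
  3. Context first, then Question: the question sits right before "Answer:", which tends to work better when the context is long.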
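The prompt pattern above can be wrapped in a small helper that assembles `context_text` from the retrieved chunks. This is a sketch, not the exercise's own `generate`: the name `build_prompt` and the `[chunk_N]` labels in front of each chunk are assumptions (the labels are optional and only matter if you later ask the model to cite its sources).

```python
def build_prompt(query: str, contexts: list[str]) -> str:
    """Assemble the grounded-answer prompt: instructions, context, question."""
    # Label each chunk so an answer can refer back to it (hypothetical format).
    context_text = "\n\n".join(
        f"[chunk_{i}] {chunk}" for i, chunk in enumerate(contexts)
    )
    return (
        "Answer the user's question based ONLY on the context below.\n"
        "If the context doesn't contain the answer, "
        'say "I don\'t have that information".\n\n'
        f"Context:\n{context_text}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )
```

Keeping prompt assembly in its own function makes it easy to unit-test the layout (context before question, refusal instruction present) without calling any model.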

Path comparison

| Observation | Anthropic Claude Haiku | Ollama qwen2.5:3b |
|---|---|---|
| Grounding in context | Stable (sticks to context) | Sometimes drifts, fills with general knowledge |
| "I don't have that info" rate | High (follows rules) | Low (forces an answer) |
| Fluency | High | Medium |
| Multi-context integration | Good | Sometimes only looks at the first |
| Speed | 1-3s | 5-15s on CPU |
| Cost | ~$0.001 per run | $0 |

Production reality: RAG quality is roughly retrieval quality × generation quality. If retrieval misses, the LLM has nothing to ground on and tends to hallucinate; if retrieval is good but the LLM is weak, the answer is still low quality. Stage 7 production setups often use local / mid-size models for retrieval and Claude / GPT for generation.

Common pitfalls

  • No "only based on context" instruction: LLM goes off-script, fills from training data — uncontrolled
  • top_k too high: long context, attention diffuses, wrong answers
  • top_k too low: misses key sections, can't answer
  • Context after the question: with long contexts, models tend to answer better when the question comes last, right before the answer; put context first
  • No eval for "say unknown when you can't answer": production needs 5-10 eval cases for this
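The last pitfall can be checked with a very small eval: feed the pipeline questions the knowledge base cannot answer and count how often it refuses. The sketch below assumes a hypothetical `answer_fn(question) -> str` wrapper around your pipeline and assumes the refusal phrase from the generation prompt; the example questions are made up.

```python
REFUSAL_PHRASE = "i don't have that information"


def is_refusal(answer: str) -> bool:
    """True if the answer contains the refusal phrase (case-insensitive)."""
    # Normalize curly apostrophes, which some models emit instead of "'".
    normalized = answer.lower().replace("\u2019", "'")
    return REFUSAL_PHRASE in normalized


def refusal_rate(answer_fn, unanswerable_questions: list[str]) -> float:
    """Fraction of unanswerable questions for which answer_fn refuses."""
    if not unanswerable_questions:
        return 0.0
    refused = sum(is_refusal(answer_fn(q)) for q in unanswerable_questions)
    return refused / len(unanswerable_questions)


# Hypothetical eval cases: topics an onboarding doc would not cover.
UNANSWERABLE = [
    "What is the CEO's salary?",
    "What is the office wifi password?",
]
```

A rate well below 1.0 on such a set suggests the model is forcing answers, which is the behavior the comparison table attributes to the smaller local model.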

Production-ready RAG

  • Persistent ChromaDB: chromadb.PersistentClient(path=...) to skip re-indexing
  • Reranker: retrieve top-20, cross-encoder rerank, keep top-3
  • Citation: prompt "cite which context section you used", LLM tags [chunk_0]
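  • Streaming: client.chat.completions.create(stream=True) on an OpenAI-compatible client (for example one pointed at Ollama); with the Anthropic SDK, client.messages.stream(...)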
  • LangGraph integration: turn retrieve → generate into graph nodes with fallback path
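Two of the items above, reranking and citation, can be sketched without any model. The `rerank` helper takes the scorer as a parameter; `overlap_score` is only a word-overlap stand-in, not a cross-encoder, and in a real setup you would pass a cross-encoder's scoring function instead. All function names here are hypothetical, and the `[chunk_N]` tag format is assumed from the citation bullet.

```python
import re


def rerank(query: str, candidates: list[str], score_fn, keep: int = 3) -> list[str]:
    """Re-order first-stage candidates by score_fn(query, chunk), keep the best."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:keep]


def overlap_score(query: str, chunk: str) -> int:
    """Stand-in scorer: number of shared lowercase words. NOT a cross-encoder."""
    q_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    c_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q_words & c_words)


def cited_chunks(answer: str, n_chunks: int) -> list[int]:
    """Indices cited as [chunk_N] in the answer: in-range only, no duplicates."""
    seen: list[int] = []
    for match in re.finditer(r"\[chunk_(\d+)\]", answer):
        index = int(match.group(1))
        if index < n_chunks and index not in seen:
            seen.append(index)
    return seen
```

Dropping out-of-range tags in `cited_chunks` guards against the model inventing a citation for a chunk it was never shown.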

Extensions

  • Query rewriting: LLM rewrites the user query into something better for retrieval (for example HyDE, which embeds an LLM-written hypothetical answer instead of the raw query)
  • Multi-hop RAG: first retrieve gives partial answer, use partial answer to retrieve more
  • Plug into Exercise 5 long-term memory: dialogue history also goes into vector store
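As one possible shape for the multi-hop extension, the loop below feeds each partial answer back into the next retrieval. It is a sketch under assumptions: `retrieve(query) -> list[str]` and `generate(query, contexts) -> str` are hypothetical callables you supply, and the fake versions at the bottom exist only to show the control flow.

```python
def multi_hop_rag(query: str, retrieve, generate, max_hops: int = 2) -> dict:
    """Retrieve, answer, then retrieve again using the partial answer."""
    contexts: list[str] = []
    search_query = query
    answer = ""
    for _ in range(max_hops):
        # Keep only chunks we have not already collected.
        new_chunks = [c for c in retrieve(search_query) if c not in contexts]
        if not new_chunks:
            break  # nothing new to learn, stop early
        contexts.extend(new_chunks)
        answer = generate(query, contexts)
        # The partial answer widens the next search.
        search_query = f"{query} {answer}"
    return {"contexts": contexts, "answer": answer}


# Fake components for illustration only (keyword match, echo "LLM").
def fake_retrieve(q: str) -> list[str]:
    hits = []
    if "laptop" in q.lower():
        hits.append("Laptops are ordered through the IT portal.")
    if "it portal" in q.lower():
        hits.append("The IT portal requires VPN access.")
    return hits


def fake_generate(q: str, contexts: list[str]) -> str:
    return " ".join(contexts)
```

With the fakes, the first hop only finds the laptop chunk; the second hop finds the VPN chunk because the partial answer mentions the IT portal. `max_hops` bounds the cost, since every hop adds one retrieval and one generation call.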