Exercise 2: Eval Pipeline ("pytest for LLMs")
Pairs with Stage 7 — Multi-Agent & Production Exercise 2.
Task
Write 5 eval cases for a production agent, run a baseline, track regression. Without eval, you ship blind.
The 5 cases cover:
- Cases 1-2: Math (deterministic answers)
- Cases 3-4: Geography (factual recall)
- Case 5: Grounding test (fake word "flrgglemerk"; the agent should say "don't know", not hallucinate)
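As a rough illustration, the five cases could be written as plain dictionaries. The ids, questions, instructions, and expected substrings below are hypothetical placeholders, not the contents of starter.py; only the field names follow the eval shape used in this exercise. `validate_cases` is an assumed helper that catches malformed cases before any LLM call is spent.

```python
# Hypothetical eval cases: the ids, questions, and instructions are illustrative only.
EVAL_CASES = [
    {"id": "math-1", "input": "What is 2+2?",
     "expected_substring": "4", "instruction": "Answer with the number."},
    {"id": "math-2", "input": "What is 12 * 12?",
     "expected_substring": "144", "instruction": "Answer with the number."},
    {"id": "geo-1", "input": "What is the capital of Japan?",
     "expected_substring": "Tokyo", "instruction": "Name the city."},
    {"id": "geo-2", "input": "What is the capital of France?",
     "expected_substring": "Paris", "instruction": "Name the city."},
    {"id": "ground-1", "input": "What is a flrgglemerk?",
     "expected_substring": "don't know", "instruction": "Say you don't know if unsure."},
]

def validate_cases(cases):
    """Fail fast on missing fields or duplicate ids; return the number of cases."""
    required = {"id", "input", "expected_substring"}
    seen = set()
    for case in cases:
        missing = required - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id')!r} is missing {sorted(missing)}")
        if case["id"] in seen:
            raise ValueError(f"duplicate case id: {case['id']}")
        seen.add(case["id"])
    return len(cases)
```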
Two evaluators:
| Method | When | Cost |
|---|---|---|
| String match | The answer must contain a known, fixed substring | $0, instant |
| LLM-as-judge | Open-ended answers (recommendation / explanation) | One extra LLM call |
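The two evaluators in the table might look like the sketch below. It assumes both share the signature `(answer, case) -> bool`, that matching is case-insensitive, and that the judge is reached through `judge_llm`, a hypothetical callable taking a prompt string and returning the model's reply. The judge prompt wording is likewise an assumption.

```python
def substring_eval(answer, case):
    """Pass when the expected substring appears in the answer (case-insensitive)."""
    return case["expected_substring"].lower() in answer.lower()

# Assumed prompt template; tune the wording for your own judge model.
JUDGE_PROMPT = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Grading instruction: {instruction}\n"
    "Answer: {answer}\n"
    "Reply with exactly PASS or FAIL."
)

def make_llm_judge(judge_llm):
    """Build an evaluator from judge_llm, a callable mapping prompt text to reply text."""
    def llm_judge_eval(answer, case):
        prompt = JUDGE_PROMPT.format(
            question=case["input"], instruction=case["instruction"], answer=answer
        )
        verdict = judge_llm(prompt).strip().upper()
        return verdict.startswith("PASS")
    return llm_judge_eval
```

Injecting `judge_llm` keeps the judge testable with a stub and makes it easy to point the judge at a different model than the agent.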
How to run
Path A (default, free, local)
pip install -r requirements.txt
ollama pull qwen2.5:3b
ollama serve
python starter.py
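The contents of starter.py are not shown here, so the following is only a sketch of how an agent function could talk to the local Ollama server, assuming it uses Ollama's HTTP `/api/generate` endpoint on the default port 11434 with streaming turned off. The names `build_request` and `ollama_agent` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address

def build_request(prompt, model="qwen2.5:3b"):
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ollama_agent(prompt, timeout=60):
    """Send one prompt and return the model's text reply (requires `ollama serve`)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```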
Path B (Anthropic)
pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-ant-...
python starter_anthropic.py
Budget: 5 cases × 1 call ≈ $0.003 (Claude Haiku).
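The budget figure implies roughly $0.0006 per call. A small sketch for scaling that to larger eval sets follows; it assumes every call costs the same as in the 5-case run and that a judge call costs as much as an agent call, neither of which is guaranteed, since real cost depends on token counts. `estimate_eval_cost` is a hypothetical helper.

```python
def estimate_eval_cost(num_cases, calls_per_case=1, cost_per_call=0.003 / 5):
    """Scale the per-call cost implied by the budget line to a larger eval run."""
    return num_cases * calls_per_case * cost_per_call

baseline_run = estimate_eval_cost(5)                      # the 5-case exercise
larger_run = estimate_eval_cost(200)                      # a 200-case production set
judged_run = estimate_eval_cost(200, calls_per_case=2)    # agent call + judge call per case
print(f"${baseline_run:.4f}  ${larger_run:.2f}  ${judged_run:.2f}")
```

Under these assumptions a 200-case run stays around $0.12, or about $0.24 when every case also needs a judge call.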
Validate the logic
python test.py # 7 tests: evaluators + run_eval aggregation
python test_anthropic.py # Anthropic agent mock
Production value of eval
Without eval:
PR merge → ship → user complains → only then you discover the regression
With eval:
PR → run eval → pass_rate drops 95% → 70% → block merge
→ find which cases regressed → fix prompt / model / retry → recover
Pin a baseline: capture the initial pass_rate (e.g., 80%) and require that no later release falls below it.
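One way to pin and enforce a baseline is sketched below. It assumes each report is a dict with `pass_rate` plus a `results` list of `{"id", "passed"}` entries; that per-case list is an assumption beyond the `pass_count` / `pass_rate` keys shown in this exercise, and all function names are hypothetical.

```python
import json
from pathlib import Path

def save_baseline(report, path):
    """Persist the pinned baseline report as JSON."""
    Path(path).write_text(json.dumps(report, indent=2))

def load_baseline(path):
    """Read a previously pinned baseline report."""
    return json.loads(Path(path).read_text())

def find_regressions(baseline, current):
    """Return ids that passed in the baseline but fail (or are missing) now."""
    now = {r["id"]: r["passed"] for r in current["results"]}
    return [
        r["id"] for r in baseline["results"]
        if r["passed"] and not now.get(r["id"], False)
    ]

def check_against_baseline(baseline, current):
    """Flag the run when pass_rate drops below the baseline and list the culprits."""
    return {
        "ok": current["pass_rate"] >= baseline["pass_rate"],
        "regressed_ids": find_regressions(baseline, current),
    }
```

Listing the regressed ids alongside the overall verdict covers the "find which cases regressed" step: the pass rate says that something broke, the ids say where.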
Classic eval shape
eval_cases = [
    {"id": ..., "input": ..., "expected_substring": ..., "instruction": ...},
    ...
]

def run_eval(cases, agent_fn, eval_fn):
    results = [...]
    return {"pass_count": ..., "pass_rate": ...}
Three keys:
1. id required — pinpoint which case regressed
2. expected_substring not full match — LLM answers have variability
3. Eval function decoupled from agent — swap evaluators against the same cases
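A minimal runnable version of this shape could look like the following. It is a sketch, not the starter's actual implementation: the `eval_fn(answer, case)` signature, the extra `results` key in the report, and the `fake_agent` stand-in are all assumptions made so the example runs without a model.

```python
def run_eval(cases, agent_fn, eval_fn):
    """Run every case through the agent and score it with the supplied evaluator."""
    results = []
    for case in cases:
        answer = agent_fn(case["input"])
        passed = bool(eval_fn(answer, case))
        results.append({"id": case["id"], "answer": answer, "passed": passed})
    pass_count = sum(r["passed"] for r in results)
    pass_rate = pass_count / len(results) if results else 0.0
    return {"pass_count": pass_count, "pass_rate": pass_rate, "results": results}

def substring_eval(answer, case):
    """String-match evaluator: pass when the expected substring is present."""
    return case["expected_substring"] in answer

def fake_agent(prompt):
    """Stand-in agent (hypothetical) that only knows one answer."""
    return {"What is 2+2?": "2+2 equals 4."}.get(prompt, "I don't know.")

cases = [
    {"id": "math-1", "input": "What is 2+2?", "expected_substring": "4"},
    {"id": "geo-1", "input": "What is the capital of Japan?", "expected_substring": "Tokyo"},
]
report = run_eval(cases, fake_agent, substring_eval)
failed_ids = [r["id"] for r in report["results"] if not r["passed"]]
```

Because `agent_fn` and `eval_fn` are plain parameters, the same cases can be rerun against another model or another evaluator without touching the case list.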
When to use LLM-as-judge
| Scenario | Substring | LLM-as-judge |
|---|---|---|
| "2+2=?" | ✅ "4" | overkill |
| "summarize this article" | ❌ no fixed substring | ✅ |
| "is the tone professional?" | ❌ | ✅ |
| "count tokens used" | ✅ regex | overkill |
Rule of thumb: roughly 80% of cases can be scored with substring matching plus heuristics; the remaining 20% or so need LLM-as-judge, which adds cost and latency.
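The split between cheap checks and the judge can be made per case. The sketch below assumes a case without any fixed target falls through to the judge, and introduces `expected_regex` as a hypothetical field for regex heuristics; neither is part of the original case shape.

```python
import re

def substring_eval(answer, case):
    """Free check: the expected substring is present."""
    return case["expected_substring"] in answer

def regex_eval(answer, case):
    """Free check: the pattern in the hypothetical expected_regex field matches."""
    return re.search(case["expected_regex"], answer) is not None

def pick_evaluator(case, judge_eval):
    """Use a free check when a fixed target exists; otherwise fall back to the judge."""
    if "expected_substring" in case:
        return substring_eval
    if "expected_regex" in case:
        return regex_eval
    return judge_eval

def judge_share(cases):
    """Fraction of cases with no fixed target, i.e. those that cost an extra LLM call."""
    needs_judge = [
        c for c in cases
        if "expected_substring" not in c and "expected_regex" not in c
    ]
    return len(needs_judge) / len(cases)
```

Checking `judge_share` on a real eval set is a quick way to see how far it sits from the 80/20 split and what the judge will cost per run.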
Production-ready tools
- promptfoo: YAML config + CLI runner + diff reports
- Anthropic Workbench eval: official UI, prompts as code
- LangSmith: LangChain ecosystem eval + observability
- Weights & Biases Weave: generic LLM eval framework
- Braintrust: cross-model / version A/B, dashboards built for production use
Path observations
| Observation | Anthropic Claude | Ollama qwen2.5:3b |
|---|---|---|
| Math pass rate | ~100% | ~80% |
| Geography pass rate | ~100% | ~70-90% |
| Grounding test (flrgglemerk) | Stays grounded, says don't know | Occasionally fabricates |
| Overall pass_rate | 95-100% | 70-85% |
Takeaway: for production, build a 50-200-case eval set for your specific use case and use it to decide which model to deploy.
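A per-category breakdown like the table above can be computed from per-case results. This sketch assumes each result carries a `category` field (for example "math", "geography", "grounding"), which is not part of the original case shape, and that results for each model are collected under the model's name.

```python
from collections import defaultdict

def pass_rate_by_category(results):
    """Group per-case results by category and compute each group's pass rate."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for r in results:
        totals[r["category"]][0] += int(r["passed"])
        totals[r["category"]][1] += 1
    return {cat: passed / total for cat, (passed, total) in totals.items()}

def compare_models(results_by_model):
    """Map each model name to its per-category pass rates for side-by-side reading."""
    return {
        model: pass_rate_by_category(results)
        for model, results in results_by_model.items()
    }
```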
Common pitfalls
- Eval set too small (< 10): noise dominates, regressions invisible
- Eval set too close to training data: model memorizes, real user queries fail
- No grounding test: production hallucination is the deadliest bug — always test "should say I don't know"
- expected_substring too strict: "The capital is Tokyo, Japan." as expected, "Tokyo" as answer = fail. Match only key tokens
- LLM-as-judge bias: same model as agent + judge → self-preference. Use a different model for judge
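Two of the pitfalls above are easy to demonstrate. The sketch below reproduces the over-strict substring failure with the Tokyo example and adds a guard against using the same model as agent and judge; `check_judge_independence` and its plain string comparison of model names are assumptions.

```python
def contains(answer, expected_substring):
    """Plain substring check, as used by the string-match evaluator."""
    return expected_substring in answer

answer = "Tokyo"
too_strict = contains(answer, "The capital is Tokyo, Japan.")  # full sentence as target
key_token = contains(answer, "Tokyo")                          # key token as target
print(too_strict, key_token)  # prints: False True

def check_judge_independence(agent_model, judge_model):
    """Reject a setup where the agent would grade itself (self-preference risk)."""
    if agent_model == judge_model:
        raise ValueError(f"judge model {judge_model!r} is the same as the agent model")
```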
Extensions
- Track regression: write {"date": ..., "pass_rate": ...} to sqlite, plot trend
- CI integration: GitHub Actions runs eval, pass_rate < 90% blocks merge
- A/B model comparison: same eval, run qwen / Claude / GPT, compare accuracy
- Connect to observability (Exercise 3): eval failures → alert
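The first two extensions can be sketched with the standard library alone. The table name `eval_runs`, the helper names, and the 0.90 default threshold (taken from the CI bullet) are assumptions; plotting and the workflow file itself are left out.

```python
import sqlite3
from datetime import date

def record_run(conn, pass_rate, run_date=None):
    """Append one row per eval run so the trend can be plotted later."""
    conn.execute("CREATE TABLE IF NOT EXISTS eval_runs (date TEXT, pass_rate REAL)")
    conn.execute(
        "INSERT INTO eval_runs (date, pass_rate) VALUES (?, ?)",
        (run_date or date.today().isoformat(), pass_rate),
    )
    conn.commit()

def history(conn):
    """Return all recorded runs, oldest first, as (date, pass_rate) tuples."""
    return conn.execute(
        "SELECT date, pass_rate FROM eval_runs ORDER BY date"
    ).fetchall()

def ci_exit_code(pass_rate, threshold=0.90):
    """Return 0 when the gate passes and 1 when the merge should be blocked."""
    return 0 if pass_rate >= threshold else 1
```

In a CI job, passing `ci_exit_code(report["pass_rate"])` to `sys.exit` is enough to fail the step, because a non-zero exit code marks the step as failed.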