Stage 7 — Multi-Agent · Productionization¶
⏱ Estimated Time: 2-4 weeks (approx. 15-30 hours)
💡 High density of terminology (multi-agent / handoff / eval / observability / guardrails...) → Refer to
resources/glossary.md4 + 6.📋 Chapter Composition: [What is Multi-Agent · Productionization (Positioning) + Three-layer engineering split + When to use multi-agent] → Learning Objectives → Entry Conditions → Required Reading → Harness Engineering (8 core components including Cost/Latency) → Hands-on Exercises (including Exercise 6 Cost Optimization) → Agent Benchmark Landscape: how to read it, not just the leaderboard → Recommended Tools → Featured Projects → Self-Check 🔑 Key Terms: See
resources/glossary.md4 + 6 (multi-agent / orchestration / handoff / eval / observability / harness (the execution and control layer around the model))
This is the final stage. You are moving from "I can build an agent" to "I can make an agent truly stable for people to use"—with multiple agents collaborating, with eval, with observability, and deployable to a usable environment. "Productionization" ≠ enterprise scale—as long as an agent can produce stable output and be used by others, it falls within the scope of this stage.
🎯 What Is Multi-Agent · Productionization (Positioning)¶
This stage = how multiple agents collaborate + how to push an agent from prototype to the point where others can use it stably. Three sentences clarify the scope:
- It is not just about learning frameworks — Stage 4 already taught how to choose frameworks
- It does not have to be enterprise scale — as long as an agent can be used by others, it counts as productionization
- The core is harness engineering — 8 core components + eval + observability + cost / latency control
Division of labor with adjacent stages:
- Stage 4 = how to choose a single-agent framework, patterns like ReAct / Plan-Execute
- This stage = multi-agent collaboration + harness engineering (execution-system engineering) + deployment to a usable environment / observability / eval
The three-layer engineering split: Prompt → Context → Harness¶
Engineering work can be split into three layers, corresponding to different positions in the stack (not the difference between one call and many calls):
| Layer | Concept | Core question | Unit of concern | Corresponding stage |
|---|---|---|---|---|
| 1 | Prompt Engineering | How should I ask this time? | single LLM call | Stage 2 |
| 2 | Context Engineering | What information should the model receive this time? | context across multiple interactions | Stage 6 |
| 3 | Harness Engineering (This stage) |
How does the whole workflow run? | executable LLM workflow / system | This stage |
Plain-language difference: - Prompt = design a good way of asking so the model answers correctly this time - Context = dynamically decide which background, memory, documents, and tool results to include so the model understands the current situation - Harness = connect prompt, context, tools, state, flow control, and error handling into a system that can actually run
This stage's 3 core questions:
- Multi-agent collaboration — debate / planner-executor / peer review / handoff / supervisor-worker patterns
- Harness Engineering — agent loop / tool registry (the list of tools an agent can call + interface definitions) / context manager / safety / retry / telemetry / eval / cost (8 core components, detailed below)
- Productionization — eval harness / observability / cost & latency optimization / deployment to a usable environment
Division of labor with Stage 5 (to avoid confusion):
| Comparison | What's Covered There | What's Covered in This Stage |
|---|---|---|
| Stage 5.5 Subagents | Claude Code's native subagent mechanism (markdown-based, no coding) | General multi-agent frameworks (autogen / crewAI / langgraph, vendor-agnostic) |
| Stage 5.6 Claude Code source | Claude Code source dissection (reference implementation case study) | General harness engineering principles (not tied to a specific vendor) |
⚠ But do you really need multi-agent?¶
Multi-agent is not the default; it is a design you use only when the task truly needs it. In most scenarios, you should first try a simple workflow or a single agent; only when the task is naturally decomposable, needs parallel exploration, a single context is not enough, or explicit role separation is needed, is multi-agent worth introducing. Forcing it brings 3-10× token cost, difficult debugging, and severe context fragmentation (context gets split across multiple agents, and no one sees the whole picture).
📌 The decision framework's canonical home is Stage 4: the full Anthropic / Cognition stance comparison + the 4 "should you go multi-agent" signals + each signal's corresponding pattern live in Stage 4 §When do you really need multi-agent (design-phase decision). This section is only the last sanity check before production — none of the 4 signals present? → a single agent + good prompts + tool use is enough; don't force multi-agent. The harness engineering part of this stage (8 components / eval / observability) still applies even if you end up using a single agent—so this stage is still required reading even if you decide against multi-agent.
📌 Learning Objectives¶
- Design multi-agent orchestration patterns (debate, planner-executor, peer review)
- Build an evaluation harness for your agent
- Add observability (tracing, logging, cost tracking)
- Use the Anthropic SDK / OpenAI SDK for production deployment (advanced features: streaming, prompt caching, batching)
- Deploy an agent to production (Docker, serverless, monitoring)
🚪 Entry Conditions¶
You should already have: - Completed Stage 4 (used at least one agent framework to run a multi-agent demo) - Completed Stage 5 (understand the roles of MCP / Skills / Plugins / Subagents, and have dissected a harness internally in 5.6) - Completed Stage 6 (know basic RAG, can explain the differences between memory patterns) - Basic familiarity with Docker / git / CI (will be used in production deployment)
If not, go back and complete the previous stages. This stage is about "combining everything you've learned → running in production," and any missing piece will be a blocker.
📚 Required Reading¶
- Anthropic — Building Effective Agents — Reread it from a production perspective
- Anthropic — Prompt Caching — A technique for 90% cost reduction
- Anthropic — Message Batches API — Asynchronous batch jobs
- anthropics/courses — Prompt Evaluations ⭐⭐⭐⭐⭐ ★ 21k+ — Anthropic's official 5-course umbrella; module 4 "Prompt Evaluations" maps to this stage's eval / observability section. Jupyter notebooks covering systematic evaluation of prompt and agent behavior.
- Documentation for any eval framework — promptfoo, LangSmith, or weave
- ai-boost/awesome-harness-engineering (★ 940) — A collection of tools / patterns / eval / memory / MCP / observability for agent harnesses
- ZhangHanDong/harness-engineering-from-cc-to-ai-coding (★ 1.3k+) — Learning harness design from Claude Code's source code (in Chinese)
🏗 Harness Engineering — Engineering Design for a Production Agent Runtime ⭐ Core Concept of This Stage¶
Positioning: The execution and control layer around the model¶
To turn an LLM into a usable agent, you usually run into three layers of engineering problems. These three layers correspond to different engineering positions, not simply "one call" versus "many calls."
💡 Simon Willison 2025: "coding agent = LLM + harness"; harness = all the code that is not the model itself.
💡 OpenAI also uses the term "Harness Engineering" in 2026 (see the OpenAI Harness Engineering article, published 2026-02).
| Layer | What you are engineering | Where to learn it |
|---|---|---|
| 1. Prompt Engineering | The string sent into the LLM (system prompt / few-shot / format) | Stage 2 |
| 2. Context Engineering | The information placed inside the window (RAG / memory / tool defs / history assembly) | Stage 6 |
| 3. Harness Engineering (This section) |
The execution and control layer around the model (loop / retry / sandbox / observability / deployment) | This stage |
How do you tell which layer you are working on? Ask:
- Am I changing the string itself? → Prompt engineering
- Am I changing the information put into the window? → Context engineering
- Am I changing the surrounding program that calls the model? → Harness engineering
→ The three layers are orthogonal: a one-call RAG app is still doing context engineering (the key is how the window is assembled); a 50-call chatbot with no retrieval is still only doing prompt engineering.
The 8 Core Components of a Harness¶
Harness Engineering (agent runtime design) = connecting the LLM, tools, memory, state, workflow control, error handling, eval, observability, and deployment into an executable, observable, maintainable agent system.
→ Everything that is not part of the model weights, and is not just the prompt string itself, falls under harness. A production-grade agent runtime includes these 8 core components (the first 6 are built into the runtime, the 7th, eval, is a plug-in tool, and the 8th, cost / latency, is a cross-cutting concern):
| Component | What it Does | Corresponding Exercise in this Stage |
|---|---|---|
| Agent loop | The "LLM → tool → result → LLM" loop, stably handling multiple turns | Exercise 1 Multi-agent debate |
| Tool registry | Dynamic tool dispatch, permission gates, sandboxing | (Present in every framework / SDK) |
| Context manager | Message history management, context window control, auto-compaction | Stage 6 + This stage's Exercise 4 SDK |
| Safety layer | Permission prompts, sandboxed execution, interception of destructive ops | (Built into Claude Code, customizable in SDKs) |
| Retry / recovery | How to handle tool failure (exception vs. the LLM reflecting on the error itself) | Exercise 4 SDK Advanced |
| Telemetry / Observability | Metrics, logging, token counting, trace export | Exercise 3 Observability |
| Eval harness | Regression testing, quality gates, A/B testing | Exercise 2 Eval |
| Cost / Latency optimization ⭐ Required for 2024-2026 | Prompt caching, model routing, thinking budget, batching, semantic cache | Exercise 6 Cost optimization (New) |
Framework vs. Harness: key difference: - Framework (Stage 4) defines the API — what the interface you call looks like - Harness (this section) defines the runtime — how it runs, how it recovers, and how it is observed
Reference Implementations¶
Want to see what a production-grade harness looks like? Two references:
- The entire Claude Code runtime — is a reference harness implementation. For a source-reading exercise, see Stage 5.6 (clone
claude-agent-sdk-pythonand dissect the main loop + where the first 6 runtime components from the table above live; the 7th, Eval harness, is a plugin, and the 8th, Cost / Latency, is cross-cutting, see the deep-dive below) anthropics/claude-agent-sdk-pythonsource — the specific repo used in the exercise above
→ The remaining 6 exercises in this stage (multi-agent / eval / observability / SDK / deploy / cost) each cover one facet of the harness. Completing the full stage = assembling a complete mental model of harness engineering.
Deep dive into the 8th core component — Cost / Latency Optimization (Required for 2024-2026 Productionization)¶
When a production agent runs long enough, cost and latency will eat up most of your budget and user experience. From 2024-2026, frontier models have treated this as a first-class API feature—knowing how to use it = saving 50-90% on cost / latency.
| Technique | How it Saves | 2026 Status |
|---|---|---|
| Prompt caching | Recurring prefixes (system prompt, long context) are billed once, subsequent cache hits get a ~90% discount | Fully supported by Anthropic / OpenAI / Gemini, automatic or manual tagging |
| Model routing / cascade | Simple queries → smaller model, difficult queries → frontier model | Built into RouteLLM / OpenRouter for production |
| Thinking budget | Controllable thinking token limit for reasoning models, trading off latency / quality | Claude / Gemini API parameter, high by default in o-series |
| Speculative decoding | A smaller model predicts N tokens, a larger model validates them at once, ×2-3 speedup for a single model | Built into vLLM / TGI, automatic at the inference layer |
| Batching | Processing multiple queries in parallel for higher GPU utilization | vLLM, production inference layer |
| Semantic caching | Sharing answers for similar queries (not just exact matches) | Built into GPTCache / Helicone |
How Track A can use this (for those using CLI agents):
- Set up prompt caching in Claude Code / Cursor to save 50-90% on daily session costs
- Use RouteLLM / OpenRouter to dynamically switch models (simple questions use Haiku / Flash, difficult ones use Opus / Pro)
- Use the thinking_budget parameter in the Claude API to control the token limit for reasoning models
How Track B can build this (for those writing their own agents): - Build a custom cascade router that maps query embeddings → classifier → model - Monitor token cost within the agent loop and automatically downgrade if the budget is exceeded - Integrate a semantic cache layer in the deployed environment - Observability platforms like Helicone / langfuse already have these features built in, so you do not have to write them yourself
🛠 Hands-on Exercises (Basic Illustrative Exercises)¶
Exercise 1: Multi-Agent Debate¶
Have two agents debate a topic (e.g., "Should we use Python or Rust for the backend?"), with a third agent acting as a referee. Observe the patterns of convergence or divergence in the debate.
Exercise 2: Eval¶
Write an eval for your previous agent, run it N times, and measure the success rate. Replace the habit of "I'll just take a look" with a quantitative approach.
Exercise 3: Observability¶
Connect LangSmith, Helicone, or weave to an agent and view the full trace. Understand that "debugging an agent without observability = a black box".
Exercise 4: Advanced SDK¶
Use streaming + prompt caching + tool use in a single call. See how the costs come down.
Exercise 5: Deploy¶
Package an agent into Docker and deploy it to the cloud (any provider will do). Learn to turn a prototype into something that can be run by others.
Exercise 6: Cost Optimization (New) ⭐¶
Measure the token cost of any of your previous exercise agents, then add prompt caching and measure again. Observe the relationship between cache hit rate and cost reduction. Bonus: Connect to RouteLLM or OpenRouter and implement cascade routing (simple queries → Haiku / difficult queries → Opus), and measure the average cost.
📊 Agent Benchmark Landscape: How to read it, not just the leaderboard + ⚠ Reward-Hacking Warning¶
Before choosing a model or building an agent, you'll want to look at benchmark numbers—but in April 2026, UC Berkeley discovered that all 8 major agent benchmarks can be reward-hacked to ~100%. Below is the 2026 leaderboard status + how to read it without getting misled.
Mainstream Agent Benchmark SOTA as of 2026-05¶
| Benchmark | Domain | 2026-05 SOTA | Leading Model |
|---|---|---|---|
| SWE-bench Verified | Software Engineering / code agent | 87.6% | Claude Opus 4.7 |
| Terminal-Bench | terminal tasks | Leading | Claude Opus 4.5 / 4.7 |
| GAIA | general assistant | 74.6% | Claude Sonnet 4.5 (Princeton HAL) |
| WebArena | web navigation | 68.7% | Claude Mythos Preview |
| OSWorld | OS-level desktop control | 76.26% (SOTA, superhuman vs 72.36% human baseline) | OpenAI CUA 38%, most frontier models still under 50% |
| τ-bench | multi-turn dialogue with tool use | (Harder to hack) | Anthropic / OpenAI leading |
| RE-bench | research engineering | (Harder to hack, close to human baseline) | Frontier models |
→ For detailed rankings + live updates: Agent Benchmark Leaderboard 2026, Rapid Claw AI Agent Framework Scorecard 2026
⚠ Berkeley 2026-04 Reward-Hacking Warning¶
UC Berkeley RDI Report 2026-04-12: An automated scanning agent systematically audited 8 major benchmarks (SWE-bench / WebArena / OSWorld / GAIA / Terminal-Bench / FieldWorkArena / CAR-bench, etc.) and found that every single one could be reward-hacked to nearly 100% without the agent actually solving a single task.
This means that for numbers on the leaderboard like "Claude 87.6% / GPT 85.0%", some X% of that might be hacked, not from genuinely solving the task.
How to Read Benchmarks Without Being Misled¶
| Approach | Recommendation |
|---|---|
| Only look at the leaderboard top | ❌ All 8 above have been proven hackable |
| Look at task-level success rate breakdown | ✅ Most hacks are concentrated in a few tasks |
| Run your own hold-out test set | ✅✅ The most reliable method, a must for production agents |
| Check the trajectory / log to see if the task was really solved | ✅ Distinguishes reward hacking from a genuine solve |
| Look at multiple benchmarks + your own use case | ✅ Don't rely on a single metric |
Which benchmarks are harder to hack (as of 2026-05): - τ-bench — Multi-turn dialogue + tool use, denser reward function - RE-bench — Real-world research engineering tasks - Your own production eval set ⭐ Always the most reliable
💡 The discipline of production agent eval: - Don't take external benchmark numbers as ground truth; they tell you the "upper limit," not "real-world performance." - Your own eval set (internal hold-out test) is the basis for go-live decisions. - Every time a model is upgraded → run it against your internal eval set for validation, don't just look at the vendor's published benchmark improvements. - Connect to langfuse / promptfoo to automate eval and run it with every deployment.
🎯 Recommended Tools for Multi-Agent / Production (by Use Case)¶
Don't know where to start choosing tools? Below are the common pairings in the industry for 2025-2026—use "Scenario" as your entry point, and click the repo link for a deeper dive:
| Scenario | Recommended Tool | Why |
|---|---|---|
| Writing your first multi-agent (fastest to get started) | crewAI | Role-based, get it running in a few lines of code, straightforward production patterns |
| Want a group debate / brainstorm pattern | AutoGen | GroupChat for free-form debate, from Microsoft |
| Need an audit trail / checkpoint / human-in-the-loop for production | LangGraph | State machine approach, most complete control |
| Standardizing eval (a must for CI / regression) | promptfoo ⭐ | YAML config, cross-model comparison, ★ 20k+ |
| Eval + observability on the same platform | langfuse ⭐ | OSS, tracing + eval + prompt mgmt, ★ 26k+ |
| Quick instrumentation without code changes | Helicone | Proxy-based, not tied to a framework |
| Entire stack is on LangChain | LangSmith (Commercial) | Official observability from LangChain |
| Building a Claude agent (programmatically) | claude-agent-sdk-python ⭐ | Official agent SDK from Anthropic, same runtime as Claude Code |
| Deploying an agent as an API service | BentoML | The most complete, Docker + serving |
| Self-hosting an open-source LLM (to replace paid APIs) | vLLM | High-throughput serving, ★ 79k+ |
| Fine-tuning an open-source LLM | LLaMA-Factory | Unified SFT/DPO/PPO/GRPO for 100+ models, no-code Web UI, widest Chinese community, ★ 70k+ |
Suggested adoption sequence: 1. First multi-agent: crewAI (role-based, simplest) 2. Add eval: promptfoo (YAML, CI integration) 3. Add observability: langfuse (OSS, complete) 4. Production upgrade: Switch to LangGraph (stronger control) + BentoML (deploy) 5. Advanced: Self-host LLMs with vLLM, fine-tune with LLaMA-Factory
🎯 Featured Projects (Templates / SDKs / Tool Collections)¶
Categorized by use case, a single table to get you started with 22 projects. Use "Who is it for" as your entry point, and click the repo link for a deeper dive.
| Category | Project | ⭐ | Who is it for | Why it's recommended / Notes |
|---|---|---|---|---|
| Multi-Agent Orchestration | microsoft/autogen | ⭐⭐⭐⭐⭐ | Those who want a GroupChat free-debate pattern | Introduced in Stage 4, revisit for multi-agent debate / brainstorming patterns in production scenarios |
| crewAIInc/crewAI | ⭐⭐⭐⭐⭐ | Those who want a role-based assembly line | Role-based multi-agent (research → writer → reviewer), the simplest production pattern | |
| langchain-ai/langgraph | ⭐⭐⭐⭐⭐ | Those needing an audit trail / checkpoint / human-in-the-loop | State machine approach, strongest for production control | |
| Eval Frameworks | promptfoo ⭐ | ⭐⭐⭐⭐⭐ | To standardize the eval process, CI integration | YAML config, cross-model comparison. ★ 20k+, MIT |
| lm-evaluation-harness | ⭐⭐⭐⭐ | For academic benchmarks (MMLU / HellaSwag / GSM8K) | Academic grade. ★ 12k+, MIT | |
| openai/evals | ⭐⭐⭐⭐ | For OpenAI-specific evals / want to contribute upstream | ★ 18k+ | |
| Observability | langfuse ⭐ | ⭐⭐⭐⭐⭐ | For self-hosting production observability | OSS LangSmith alternative, traces + sessions + evals + prompt mgmt. ★ 26k+, MIT |
| LangSmith (Commercial) | ⭐⭐⭐⭐ | For those with their entire stack on LangChain / LangGraph | Official from LangChain, hosted version only | |
| Helicone | ⭐⭐⭐⭐ | For quick instrumentation without code changes | Proxy-based, get logging + caching for free. ★ 5.7k+, Apache 2.0 | |
| weave (W&B) | ⭐⭐⭐⭐ | For teams already using W&B for ML experiment tracking | W&B tracing + eval, integrates with wandb | |
| Advanced Anthropic SDK | anthropic-sdk-python | ⭐⭐⭐⭐⭐ | For building applications directly on the Claude API | Official Python SDK: streaming / async / tool use / prompt caching / batches / files |
| anthropic-sdk-typescript | ⭐⭐⭐⭐ | For TypeScript / Node / web apps | The TS version of the Python SDK | |
| claude-agent-sdk-python ⭐ | ⭐⭐⭐⭐⭐ | For building Claude-based agents, not just API calls | Built-in tool use loop / file access / sandbox / subagent orchestration; same runtime as Claude Code, read the source to see how it works internally. ★ 6.9k+, MIT | |
| claude-agent-sdk-typescript | ⭐⭐⭐⭐ | For Claude agents in a Node / web app environment | The TS version of the Claude Agent SDK. ★ 1.4k+ | |
| Anthropic Cookbook (Advanced) | ⭐⭐⭐⭐ | For seeing official advanced SDK patterns | Especially the prompt_caching.ipynb / tool_use/ / multimodal/ notebooks |
|
| Deployment | BentoML | ⭐⭐⭐⭐ | For packaging an agent into a production API service | Docker + serving framework. ★ 8k+, Apache 2.0 |
| LangServe | ⭐⭐⭐⭐ | For quickly deploying a LangChain agent | Based on FastAPI | |
| vLLM | ⭐⭐⭐⭐ | For self-hosting an open-source LLM to replace paid APIs | High-throughput LLM serving for Llama / Qwen, etc. ★ 79k+, Apache 2.0 | |
| Chinese Deploy / Fine-tune | datawhalechina/self-llm | ⭐⭐⭐⭐ | For Chinese teams wanting to self-host open-source LLMs | A complete Chinese guide from training-to-deployment, for Qwen / Llama / GLM / multimodal. ★ 30k+, Apache 2.0 |
| hiyouga/LLaMA-Factory | ⭐⭐⭐⭐⭐ | For fine-tuning open-source LLMs (beyond just prompt eng) | Unified SFT/DPO/PPO/GRPO for 100+ models, no-code Web UI, widest Chinese community. ★ 70k+, Apache 2.0 | |
| Multi-Agent Case Studies | geekan/MetaGPT | ⭐⭐⭐⭐⭐ | For seeing role division + artifact handoff patterns | An SOP-based multi-agent team of PM / Architect / Engineer, generating PRD → design → code. ★ 67k+, MIT |
| OpenBMB/ChatDev | ⭐⭐⭐⭐ | For seeing agent debate / peer-review patterns | Conversational software development where agents debate each other on design / code / test. ★ 33k+, Apache 2.0, has a zh README | |
| princeton-nlp/SWE-agent | ⭐⭐⭐⭐ | To understand why tool design > prompt tuning | The Agent-Computer Interface (ACI) design philosophy, backed by a Princeton paper, a leading method on SWE-Bench. ★ 19k+, MIT |
🌳 For Claude's native subagent mechanism (multi-agent without a framework), see Stage 5.5. This stage focuses on frameworks / production; Stage 5.5 focuses on markdown-based subagent orchestration.
✅ Self-Check After Stage 7¶
Can you: - [ ] Design a multi-agent system and clearly explain its collaboration protocol? - [ ] Run an automated eval pipeline in CI? - [ ] Connect observability (tracing) to a production agent? - [ ] Measure the cost difference before and after implementing prompt caching on a real workload? - [ ] Deploy an agent to the cloud (any provider)?
If you can do all of these → first go to Stage 7.5 — Advanced Agentic Concepts Map (1 week, no coding — build a frontier concept map and locate which advanced concepts the industry is still debating), then proceed to Stage 8 — Agent Interfaces (a shared hub for both tracks) to learn how agents interact with the non-API world (Computer Use / Browser Use / Sandbox). Or, pick a specialized branch, or come back and contribute to this repo.
💡 What's Next¶
You now have the foundational skills. For the next 6-12 months, you should focus on: 1. Picking one production system and taking it from prototype to production. 2. Contributing upstream (LangGraph, AutoGen, MCP servers, Anthropic cookbook). 3. Reading papers—agent research is moving fast. 4. Making something tangible—open-source a real tool, stop just writing tutorials.