
Exercise 1: Multi-Agent Debate

Pairs with Stage 7 — Multi-Agent & Production Exercise 1.

Task

Three agents (PRO + CON + Judge) debate the same question:

[Figure: multi-agent debate with PRO, CON, and Judge]

PRO and CON are called independently: neither sees the other's argument, which keeps one side's framing from leaking into the other. The Judge sees both arguments and decides.

Why this pattern matters

  • Reduces single-LLM bias: one LLM tends to bake in a stance and ignore counterarguments
  • Strengthens reasoning: forcing both sides to articulate produces cleaner traces
  • Auditability: high-stakes production decisions (policy / medical / legal review) need an audit trail, and the debate transcript provides one
  • Disagreement = signal: when agents disagree, the question may be ambiguous or the model uncertain

How to run

Path A (default, free, local)

pip install -r requirements.txt
ollama serve               # run in a separate terminal; skip if Ollama is already running
ollama pull qwen2.5:3b     # needs the server above to be up
python starter.py

Budget: $0. The three LLM calls take roughly 15-45 s in total on CPU.

Path B (Anthropic)

pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-ant-...
python starter_anthropic.py

Budget: ~$0.003 per run (3 calls × short prompts × claude-haiku-4-5).

Validate the logic

python test.py             # 3 tests with mock LLM, verify judge sees pro+con
python test_anthropic.py

Key design points

# Same model, different system prompts
pro = llm_call(system="argue PRO position", user=question)
con = llm_call(system="argue CON position", user=question)

# Judge sees question + pro + con
judge = llm_call(
    system="neutral judge, output WINNER=PRO or WINNER=CON",
    user=f"Question: {question}\n\nPRO: {pro}\n\nCON: {con}",
)

Key: PRO and CON are independent calls. Don't pass PRO's output into CON — CON would then react to PRO rather than think independently, amplifying bias.
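The independence rule can be checked mechanically. Below is a minimal sketch, assuming `llm_call` is any callable that accepts `system` and `user` keyword arguments and returns a string; `run_debate`, `fake_llm`, and the exact prompt wording are hypothetical names chosen for illustration, not necessarily what `starter.py` uses.

```python
from typing import Callable, Dict


def run_debate(question: str, llm_call: Callable[..., str]) -> Dict[str, str]:
    """Run one PRO / CON / Judge round with an injected LLM callable."""
    # PRO and CON each receive only the question, never the other side's text.
    pro = llm_call(system="Argue the PRO position.", user=question)
    con = llm_call(system="Argue the CON position.", user=question)
    # Only the Judge sees both arguments.
    verdict = llm_call(
        system="You are a neutral judge. End with WINNER=PRO or WINNER=CON.",
        user=f"Question: {question}\n\nPRO: {pro}\n\nCON: {con}",
    )
    return {"pro": pro, "con": con, "verdict": verdict}


calls = []  # every (system, user) pair the stub receives, in call order


def fake_llm(system: str, user: str) -> str:
    """Stand-in for a real model (assumption: canned replies keyed on the system prompt)."""
    calls.append((system, user))
    if system.startswith("Argue the PRO"):
        return "pro-argument"
    if system.startswith("Argue the CON"):
        return "con-argument"
    return "WINNER=PRO"


result = run_debate("Should we adopt a four-day week?", fake_llm)
```

Because the stub records its inputs, `calls[1]` shows that CON received the bare question, and `calls[2]` shows that the Judge received both arguments.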

Production-ready variants

  • N-way debate: 3+ agents holding different perspectives (e.g., "engineer / PM / customer view")
  • Iterative debate: PRO and CON see each other and rebut for N rounds; first to concede loses
  • Different models: PRO uses Claude, CON uses GPT, Judge uses Gemini — cross-model debate finds blind spots
  • Self-consistency: run the debate 3 times, see how stable the Judge's verdict is
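The self-consistency variant reduces to counting verdicts across repeated runs. The sketch below assumes a hypothetical `run_once` callable that performs one full debate and returns the parsed winner (`"PRO"` or `"CON"`); the agreement ratio is one possible stability measure, not one the exercise prescribes.

```python
from collections import Counter
from typing import Callable, Tuple


def verdict_stability(run_once: Callable[[], str], runs: int = 3) -> Tuple[str, float]:
    """Repeat the debate and return (majority verdict, share of runs that agree)."""
    verdicts = [run_once() for _ in range(runs)]
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner, count / runs


# Usage with canned verdicts standing in for three real debate runs.
canned = iter(["PRO", "CON", "PRO"])
majority, agreement = verdict_stability(lambda: next(canned), runs=3)
```

An agreement share well below 1.0 suggests the verdict depends on sampling noise, which ties back to the "disagreement = signal" point above.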

Path observations

| Observation | Anthropic Claude | Ollama qwen2.5:3b |
|---|---|---|
| PRO / CON hold their positions | Stable | Sometimes both turn "balanced", with no clear stance |
| Judge outputs clear WINNER | Stable | Occasionally skips the WINNER= format |
| Reasoning quality | High | Medium |
| Cost per run | ~$0.003 | $0 |

Common pitfalls

  • Identical system prompt for PRO and CON: outputs converge, debate is meaningless
  • Fixed PRO-then-CON order in Judge prompt: may bias the Judge toward whichever argument comes first (primacy) or last (recency). Production code should randomize the order
  • No structured Judge output: without WINNER=PRO or CON format, downstream parsing is painful
  • Prompts too short: 1-sentence PRO and CON give the Judge nothing to weigh
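Two of the pitfalls above, fixed ordering and unstructured Judge output, can be handled with a few lines each. This is a sketch under assumptions: `build_judge_prompt` and `parse_winner` are hypothetical helpers, and the regular expression tolerates case and spacing variations that the exercise itself does not specify.

```python
import random
import re
from typing import Optional

# Accepts "WINNER=PRO", "winner = con", and similar variations.
WINNER_RE = re.compile(r"WINNER\s*=\s*(PRO|CON)\b", re.IGNORECASE)


def build_judge_prompt(question: str, pro: str, con: str, rng: random.Random) -> str:
    """Present the two arguments in random order to dilute position bias."""
    sides = [("PRO", pro), ("CON", con)]
    rng.shuffle(sides)
    body = "\n\n".join(f"{label}: {text}" for label, text in sides)
    return f"Question: {question}\n\n{body}"


def parse_winner(judge_text: str) -> Optional[str]:
    """Return "PRO" or "CON", or None if the Judge skipped the format."""
    matches = WINNER_RE.findall(judge_text)
    # Take the last match, in case the Judge restates the instruction first.
    return matches[-1].upper() if matches else None


prompt = build_judge_prompt("Q?", "argument A", "argument B", random.Random(0))
winner = parse_winner("CON made the stronger case. winner = con")
missing = parse_winner("Both sides make fair points.")
```

Returning `None` rather than guessing lets the caller retry or escalate when a small model skips the `WINNER=` format.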

Extensions

  • Plug into LangGraph: PRO/CON become parallel nodes, Judge a join
  • Use AutoGen: its documentation covers multi-agent debate as a design pattern
  • Add confidence: Judge outputs confidence 0-1; low confidence escalates to a human
  • Plug into eval (Exercise 2): run debate on 50 cases vs. single-agent baseline
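The confidence extension could be wired up roughly as follows. The `CONFIDENCE=<number>` output format, the `needs_human_review` helper, and the 0.7 threshold are all assumptions made for this sketch; the exercise only says that low confidence should escalate to a human.

```python
import re

# Assumed Judge format: "... WINNER=PRO CONFIDENCE=0.85"
CONFIDENCE_RE = re.compile(r"CONFIDENCE\s*=\s*([01](?:\.\d+)?)")


def needs_human_review(judge_text: str, threshold: float = 0.7) -> bool:
    """Escalate when confidence is missing, out of range, or below the threshold."""
    match = CONFIDENCE_RE.search(judge_text)
    if match is None:
        return True  # no parseable confidence: fail safe and escalate
    value = float(match.group(1))
    if not 0.0 <= value <= 1.0:
        return True  # e.g. "1.5" is not a valid probability-like score
    return value < threshold


confident = needs_human_review("WINNER=PRO CONFIDENCE=0.9")
unsure = needs_human_review("WINNER=CON CONFIDENCE=0.4")
unparsed = needs_human_review("WINNER=PRO")
```

Treating a missing or malformed score as "escalate" keeps format failures from being silently auto-approved.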