Exercise 5: Deploy (FastAPI + Docker)¶

Pairs with Stage 7 — Multi-Agent & Production Exercise 5.

Task¶

Package the agent as a production-style HTTP API:

FastAPI app with /health + /chat endpoints
Structured logging with request_id
Proper HTTP status codes (200 / 422 / 429 / 503 / 500)
Pydantic schema validation (FastAPI free)
Dockerfile (covers both Ollama and Anthropic deploys)

How to run¶

Local Ollama¶

pip install -r requirements.txt
ollama pull qwen2.5:3b
ollama serve

uvicorn starter:app --reload --port 8000

# In another shell:
curl -X POST http://localhost:8000/chat \
  -H 'Content-Type: application/json' \
  -d '{"message": "hi"}'

Local Anthropic¶

pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-ant-...
uvicorn starter_anthropic:app --reload --port 8000

Docker¶

docker build -t agent-api .

# Ollama path (host must run ollama)
docker run -p 8000:8000 \
  -e OLLAMA_API_BASE=http://host.docker.internal:11434/v1 \
  agent-api

# Anthropic path
docker run -p 8000:8000 \
  -e APP_MODULE=starter_anthropic:app \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  agent-api

Validate without starting the server¶

python test.py             # 5 tests via fastapi.TestClient
python test_anthropic.py   # 3 tests (incl. 429 rate limit)

fastapi.TestClient uses in-process ASGI — no real port, no Docker.

Production essentials¶

Element	Why	In this starter
`/health` endpoint	K8s liveness/readiness probes	✅
`request_id` per call	Trace / debug	✅ uuid4
Structured logging	ELK / Datadog / Loki parseable	✅ JSON-like format
Pydantic schema validation	Malformed JSON → 422 automatically	✅ FastAPI built-in
Specific exception → HTTP status	503 ≠ 500 — client knows whether to retry	✅ APIConnectionError → 503
Token tracking in response	Cost / token usage transparency	✅ Path B includes input/output tokens

Status codes¶

Situation	HTTP code	Client should
LLM answered	200	Use answer
Missing `message` field	422	Fix request, don't retry
Anthropic rate limit (429)	429	Exponential backoff retry
LLM service disconnected	503	Retry (transient)
Other unexpected	500	Log + alert, don't auto-retry

Deploy targets¶

Target	Good for	Watch out
Local uvicorn	Dev	1 worker, not for prod
Docker + uvicorn	Small prod	Add `--workers N`, put nginx in front
K8s	Scalable prod	Use `/health` for liveness/readiness
AWS Lambda + API Gateway	Sporadic traffic	Slow cold starts, fits light agents
Cloud Run / Fargate	Mid-scale prod	Scale-to-zero, simple
Anthropic Computer Use / Skills	Very specific use cases	See Stage 5

Common pitfalls¶

No health check: load balancer can't detect dead instances
Heavy /health: calling the LLM to verify = wasted cost + slow startup gets you killed
Missing request_id: traces scattered across logs, can't correlate
All errors → 500: client can't distinguish transient (retry) vs permanent. Use specific codes
Synchronous LLM call in def: FastAPI blocks the event loop. Use async def + await client.messages.create(...) or a thread pool
No rate limiting: attackers or buggy clients explode your LLM bill. Add slowapi / nginx rate limit
Hard-coded secrets: API key in code = git leak. Use env vars + secret manager

Connecting back to earlier exercises¶

Exercise 3 observability: add TraceContext to endpoint, log latency / tokens / errors per request
Exercise 2 eval: post-deploy CI eval, pass_rate < 90% triggers rollback
Exercise 4 caching: cache_control on the system prompt — 90% cost cut immediately
Stage 6 RAG: endpoint wires up vector DB + memory store

Extensions¶

Streaming endpoint: @app.post("/chat/stream") with StreamingResponse + SSE format
Auth: FastAPI Depends(verify_token) + JWT / API key
Cost limit: per-user / per-day token cap, reject above limit
OpenTelemetry: tracer.start_as_current_span("chat_endpoint") ships traces to Datadog
K8s manifests: Deployment + Service + HPA + ConfigMap