Exercise 4: Advanced SDK (streaming + prompt caching)¶

Pairs with Stage 7 — Multi-Agent & Production Exercise 4.

Two SDK features production needs¶

Streaming — send tokens to UI as they're generated; user sees first token in 0.3-1s instead of waiting for full answer
Prompt caching (Anthropic-only) — repeated long system prompts / tools / context save 90% cost

How to run¶

Path A (default, free, local, streaming demo)¶

pip install -r requirements.txt
ollama pull qwen2.5:3b
ollama serve
python starter.py

Path B (Anthropic, streaming + caching)¶

pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-ant-...
python starter_anthropic.py

Budget: streaming + caching demo ≈ $0.005 (2 calls + cached ~2000 tokens).

Validate the logic¶

python test.py             # 3 tests, mock OpenAI streaming
python test_anthropic.py   # mock Anthropic streaming + cache_control

Streaming¶

OpenAI / Ollama¶

stream = client.chat.completions.create(
    model=..., messages=[...],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Anthropic¶

with client.messages.stream(
    model=..., max_tokens=300, messages=[...]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

UX impact: non-streaming = 5s wait for the answer; streaming = first token in 0.5s. Perception is dramatically different.

Prompt caching (Anthropic-only)¶

resp = client.messages.create(
    model="claude-haiku-4-5",
    system=[
        {
            "type": "text",
            "text": "[2000-token reference material...]",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "..."}]
)

First call: cache_creation_input_tokens=2000 (25% write premium) Subsequent calls within 5 min: cache_read_input_tokens=2000 (10% cost = 90% off)

When to use: - Long system prompts called repeatedly (chatbots) - Tool schemas reused across calls (multi-tool agents) - Same document queried multiple times (RAG on a fixed doc)

When not: - Every prompt is unique - Fewer than 1 call per 5 min (cache expires)

Production math¶

For an agent at 1000 req/min with 5000-token system prompt:

Mode	Input cost / req	Monthly (30 days)
No caching	5000 × $1/M = $0.005	$216,000
With caching	500 × $1/M = $0.0005	$21,600

90% savings — this is why production agents universally use caching.

Path observations¶

Observation	Anthropic Claude	Ollama qwen2.5:3b
Streaming	✅ smooth	✅ smooth
First-token latency	0.3-0.8s	0.5-2s (CPU)
Prompt caching	✅ 90% off	❌ no API
Production fit	Full caching + streaming	Mostly for dev / local demo

Common pitfalls¶

Streaming¶

Forgetting flush=True: buffered output, user still waits
Not handling None deltas: first/last chunks may have delta.content is None — skip them
Mid-stream disconnection: catch + restart
Token counting: streaming responses may not include usage — tokenize or sum chunks yourself

Prompt caching¶

cache_control in the wrong place: attach to the segment you want cached. Can cache system + tools + first few messages simultaneously
Cache key includes model name: switching haiku → sonnet invalidates
5-minute TTL: low-QPS scenarios expire often, pay 25% premium without saving
Minimum 1024 tokens: shorter content won't cache

Extensions¶

Streaming + tool use: tool_use blocks also stream; use event_type to dispatch
Anthropic Batch API: non-realtime work, batch costs 50% less, 24h turnaround (great for eval, bulk processing)
Files API: upload 100MB+ docs, combine with cache_control
OpenAI Responses API: OpenAI also has prompt caching (different API, automatic) — different rules
Wire to observability (Exercise 3): log cache_read_input_tokens to track cache hit rate