Stage 2 — Prompt Engineering
⏱ Time estimate: 1-2 weeks (~5-12 hours)
👋 Coming from Stage 1? Good — you can call an API. The next 5-12 hours: write reusable structured prompts, use few-shot and chain-of-thought for hard reasoning tasks, and quantify prompt improvement with evals. Jumped straight here? Make sure you can call an LLM API and estimate cost in tokens — if not, head back to Stage 1.
💡 Term-unfamiliar? (prompt / few-shot / CoT / system prompt / …) → see resources/glossary.en.md.
📌 Learning Goals
After this stage you will be able to:
- Write structured prompts (role + task + format + examples)
- Apply few-shot prompting and know when it helps
- Use chain-of-thought (CoT) for reasoning tasks
- Iteratively refine a prompt and measure improvement
- Recognize when prompting hits its limit (and you need tools / agents)
🚪 Entry Conditions
You should already:
- Be able to call an LLM API (Stage 1)
- Be able to parse / iterate over API responses
📚 Required Reading
- anthropics/prompt-eng-interactive-tutorial ⭐⭐⭐⭐⭐ ★ 35k+ — Anthropic's official interactive tutorial, 9 chapters of Jupyter notebooks (basic / intermediate / advanced + appendix), with playground and answer key. Runs on Claude 3 Haiku (cheapest). The canonical hands-on resource for Stage 2. Also packaged as module 2 of the anthropics/courses 5-course umbrella — for broader coverage (API Fundamentals / Real World Prompting / Eval / Tool Use) go straight to the umbrella
- anthropics/courses — Real World Prompting ⭐⭐⭐⭐ ★ 21k+ — Module 3 of the same umbrella, "how to actually use prompting in real situations": chatbot / legal / financial / coding case walkthroughs. Read #1 first, then this.
- Anthropic Prompt Engineering Guide — official docs, pairs with #1
- OpenAI Prompt Engineering — OpenAI's perspective
- dair-ai Prompt Engineering Guide — academic-flavored, in-depth
- Anthropic — Prompting Best Practices — be clear and direct
🛠 Hands-on Exercises
🦙 This stage defaults to Ollama gemma4:e4b (cost-driven; $0/run). Prompt engineering is especially instructive on small models — they are sensitive to prompt quality, so you can clearly see how much each technique (system prompts, few-shot, CoT, refinement) improves output. Every exercise has Path A (Ollama, default) + Path B (Anthropic, optional).
💰 Stage 2 budget estimate (4 exercises, 3-5 runs each): all local = $0, all haiku ≈ $0.20, all sonnet ≈ $0.60. The few-shot classifier alone is 12 calls × 5 reps ≈ $0.30 haiku / $0.90 sonnet. Full budget: examples/README.en.md#recommended-llm-list. Full three-path trade-off in examples/README.en.md.
Exercise 1: System Prompt
Same user message, three different system prompts. Watch the personality / output format change.
📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_1.py)
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import json
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

SYSTEM_PROMPTS = {
    "Strict lawyer": "You are a precise contract lawyer. Cite statute numbers, avoid subjective adjectives.",
    "Kindergarten teacher": "You are a kind kindergarten teacher speaking to a 5-year-old. Use analogies, colloquial language, under 80 words.",
    "JSON machine": "Reply only in JSON. schema: {\"answer\": string, \"confidence\": float}",
}
USER_MSG = "Explain what a lease agreement is."

outputs = {}
for label, system in SYSTEM_PROMPTS.items():
    # Ollama (OpenAI-compatible) puts system in the messages array (Anthropic uses system=)
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=200,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": USER_MSG},
        ],
    )
    outputs[label] = r.choices[0].message.content
    print(f"\n--- [{label}] ---")
    print(outputs[label])

# === Self-check ===
last_text = outputs["JSON machine"]
assert "{" in last_text and "}" in last_text, "JSON-machine output should contain JSON braces"
try:
    # Parse from the first "{" to the last "}" so prose around the JSON is ignored
    parsed = json.loads(last_text[last_text.find("{"): last_text.rfind("}") + 1])
    assert "answer" in parsed, "schema expects an 'answer' field"
except json.JSONDecodeError:
    pass  # some models emit malformed or truncated JSON; tolerate that
print("\n✅ Exercise 1 passed — same question, three different personas / formats / tones")
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_1_anthropic.py)
# Requires: pip install anthropic
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
import anthropic
client = anthropic.Anthropic()

SYSTEM_PROMPTS = {
    "Strict lawyer": "You are a precise contract lawyer. Cite statute numbers, avoid subjective adjectives.",
    "Kindergarten teacher": "You are a kind kindergarten teacher speaking to a 5-year-old. Use analogies, colloquial language, under 80 words.",
    "JSON machine": "Reply only in JSON. schema: {\"answer\": string, \"confidence\": float}",
}
USER_MSG = "Explain what a lease agreement is."

outputs = {}
for label, system in SYSTEM_PROMPTS.items():
    # Anthropic uses `system=` parameter (not part of messages array)
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        system=system,
        messages=[{"role": "user", "content": USER_MSG}],
    )
    outputs[label] = msg.content[0].text
    print(f"\n--- [{label}] ---")
    print(outputs[label])

# Self-check (same JSON-shape assert; schema is cross-backend)
json_output = outputs["JSON machine"]
assert "{" in json_output and "}" in json_output
print("\n✅ Exercise 1 passed (Anthropic)")
Exercise 2: Few-Shot
Pick a classification task. Run it 0-shot, then 3-shot. Measure accuracy difference.
📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_2.py)
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Sentiment classifier: positive / negative / neutral
TEST_SET = [
    ("This movie was amazing — I want to watch it again!", "positive"),
    ("Boring plot, awkward acting.", "negative"),
    ("This is a 2019 film.", "neutral"),
    ("Not sure how I feel about it, might think more.", "neutral"),
    ("Season 1 was great but season 2 fell apart.", "negative"),
    ("Left in a great mood — recommended!", "positive"),
]
FEW_SHOT_EXAMPLES = """Examples:
input: The steak at this place made me cry tears of joy.
output: positive
input: The waiter was rude. Never coming back.
output: negative
input: This shop is in New Taipei City.
output: neutral
"""

def classify(text: str, *, use_few_shot: bool) -> str:
    prefix = FEW_SHOT_EXAMPLES + "\n" if use_few_shot else ""
    prompt = f"{prefix}input: {text}\noutput:"
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    # Keep only the first line, lowercased, so "Positive" still matches "positive";
    # an empty reply becomes "" instead of raising IndexError
    lines = (r.choices[0].message.content or "").strip().splitlines()
    return lines[0].lower() if lines else ""

def evaluate(use_few_shot: bool) -> tuple[int, int]:
    correct = 0
    for text, label in TEST_SET:
        pred = classify(text, use_few_shot=use_few_shot)
        ok = label in pred
        print(f" {'✓' if ok else '✗'} [{label}] {text[:30]}... → '{pred}'")
        if ok:
            correct += 1
    return correct, len(TEST_SET)

print("=== 0-shot ===")
c0, n = evaluate(use_few_shot=False)
print(f"correct {c0}/{n} = {c0/n:.0%}")
print("\n=== 3-shot ===")
c3, _ = evaluate(use_few_shot=True)
print(f"correct {c3}/{n} = {c3/n:.0%}")

# === Self-check ===
# Assert first, so "passed" is only printed when the check actually holds
assert c3 >= c0, f"expected 3-shot ≥ 0-shot, got {c3} < {c0}"
print(f"\n✅ Exercise 2 passed — 0-shot {c0}/{n}, 3-shot {c3}/{n}")
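With only six test items, one flipped prediction moves accuracy by about 17 points, so a single run can make 3-shot look worse than 0-shot by chance. One way to reduce that noise, sketched here under assumptions, is to repeat the evaluation and compare averages. The helper `repeat_eval` and the hard-coded scores are hypothetical; in practice `run_once` would wrap `evaluate()` and return `correct / total`.

```python
import statistics
from typing import Callable


def repeat_eval(run_once: Callable[[], float], reps: int = 5) -> dict[str, float]:
    """Call run_once `reps` times and summarize the accuracy scores it returns."""
    scores = [run_once() for _ in range(reps)]
    return {"mean": statistics.mean(scores), "min": min(scores), "max": max(scores)}


# Stand-in for real runs: fixed accuracies keep this demo offline and free.
fake_scores = iter([4 / 6, 5 / 6, 5 / 6, 6 / 6, 5 / 6])
summary = repeat_eval(lambda: next(fake_scores), reps=5)
print({k: round(v, 2) for k, v in summary.items()})
```

Comparing the 0-shot and 3-shot means, and checking whether their min/max ranges overlap, gives a steadier signal than one `c3 >= c0` comparison.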
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_2_anthropic.py)
# Requires: pip install anthropic
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
import anthropic
client = anthropic.Anthropic()

# Same TEST_SET / FEW_SHOT_EXAMPLES as Path A — only the classify() body changes:
def classify(text: str, *, use_few_shot: bool) -> str:
    prefix = FEW_SHOT_EXAMPLES + "\n" if use_few_shot else ""
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=10,
        messages=[{"role": "user", "content": f"{prefix}input: {text}\noutput:"}],
    )
    # Keep only the first line, lowercased, so "Positive" still matches "positive"
    lines = msg.content[0].text.strip().splitlines()
    return lines[0].lower() if lines else ""

# Rest of TEST_SET / FEW_SHOT_EXAMPLES / evaluate() stays identical to Path A
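The starter code packs all three examples into one user string. An alternative worth trying is to present each example as its own user/assistant turn, which is the "system message + user message + 3 example messages" shape. The sketch below only builds the message list and makes no API call; `FEW_SHOT_PAIRS` and `build_few_shot_messages` are hypothetical names, and the system prompt wording is an assumption.

```python
FEW_SHOT_PAIRS = [
    ("The steak at this place made me cry tears of joy.", "positive"),
    ("The waiter was rude. Never coming back.", "negative"),
    ("This shop is in New Taipei City.", "neutral"),
]


def build_few_shot_messages(text: str, system: str) -> list[dict[str, str]]:
    """Return a chat transcript: system, then alternating example turns, then the query."""
    messages = [{"role": "system", "content": system}]
    for example_input, example_label in FEW_SHOT_PAIRS:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_label})
    messages.append({"role": "user", "content": text})
    return messages


messages = build_few_shot_messages(
    "Boring plot, awkward acting.",
    system="Classify the sentiment as positive, negative, or neutral. Reply with one word.",
)
for m in messages:
    print(f"{m['role']:>9}: {m['content']}")
```

For the OpenAI-compatible Ollama endpoint the whole list can be passed as `messages=`. For Anthropic, the system text goes in `system=` and only `messages[1:]` goes in `messages=`, matching the difference noted in Exercise 1.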
Exercise 3: CoT
Pick a math word problem. Compare:
- Plain prompt
- Plain prompt + "Let's think step by step"
- Plain prompt + worked example showing CoT
📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_3.py)
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import re
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

QUESTION = "Tom has 3 apples. He gives Sarah 1, then mom gives him 5 more, then he eats 2. How many does he have now?"
ANSWER = 5  # 3 - 1 + 5 - 2 = 5
COT_EXAMPLE = """Example:
Q: A chicken has 2 legs. 3 chickens and 1 person — how many legs total?
A: Let me work through this step by step. 3 chickens × 2 legs = 6 legs. 1 person has 2 legs. Total 6 + 2 = 8 legs. The answer is 8.
"""

def ask(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

def extract_number(text: str) -> int | None:
    """Pull the last number from the response as the answer."""
    nums = re.findall(r"-?\d+", text)
    return int(nums[-1]) if nums else None

out_a = ask(QUESTION)
ans_a = extract_number(out_a)
out_b = ask(QUESTION + "\nLet's think step by step.")
ans_b = extract_number(out_b)
out_c = ask(COT_EXAMPLE + "\n\nQ: " + QUESTION + "\nA:")
ans_c = extract_number(out_c)

results = [
    ("A plain", out_a, ans_a),
    ("B +step-by-step", out_b, ans_b),
    ("C +CoT example", out_c, ans_c),
]
for label, out, ans in results:
    print(f"\n--- [{label}] answer={ans} {'✓' if ans == ANSWER else '✗'} ---")
    print(out[:200])

# === Self-check ===
correct = sum(1 for a in (ans_a, ans_b, ans_c) if a == ANSWER)
assert correct >= 1, f"at least 1 of 3 prompts should be correct, got {correct}/3"
assert ans_b == ANSWER or ans_c == ANSWER, "B (step-by-step) or C (CoT example) must be correct — CoT is non-negotiable for small models"
print(f"\n✅ Exercise 3 passed — {correct}/3 correct")
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_3_anthropic.py)
Same logic as Path A, just swap the client and `ask()`:
import anthropic
client = anthropic.Anthropic()

def ask(prompt: str) -> str:
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

# Rest (QUESTION, ANSWER, COT_EXAMPLE, extract_number, 3 calls, assert) stays identical
Exercise 4: Iterative Refinement
Take a vague prompt, refine it 5 times. Track the iterations. Notice what changes improve quality.
📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_4.py) — this exercise has no "right answer"; the point is observing the process
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# 5 versions: v1 is the vague baseline, each later version adds constraints
PROMPTS = {
    "v1 vague": "Write a paragraph about ReAct.",
    "v2 +audience": "Write a paragraph about ReAct for software engineers who know Python.",
    "v3 +format": "Write a paragraph about ReAct for software engineers who know Python. Under 100 words, single paragraph.",
    "v4 +example": "Write a paragraph about ReAct for software engineers who know Python. Under 100 words, single paragraph, ending with a concrete example (e.g. weather lookup).",
    "v5 +bans": "Write a paragraph about ReAct for software engineers who know Python. Under 100 words, single paragraph, ending with a concrete example (e.g. weather lookup). Avoid words like 'empower', 'leverage', 'intelligent'.",
}

outputs = {}
for label, prompt in PROMPTS.items():
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    text = r.choices[0].message.content
    outputs[label] = text
    print(f"\n--- [{label}] ({len(text)} chars) ---")
    print(text)

# === Self-check ===
v1_len, v5_len = len(outputs["v1 vague"]), len(outputs["v5 +bans"])
banned_words = ("empower", "leverage", "intelligent")
v5_text_lower = outputs["v5 +bans"].lower()
v5_has_banned = any(w in v5_text_lower for w in banned_words)
assert v5_len > 0, "v5 must have output"
assert not v5_has_banned, f"v5 should avoid banned words; got: {[w for w in banned_words if w in v5_text_lower]}"
print(f"\n✅ Exercise 4 passed — v5 length {v5_len}, no banned words")
print(f"💡 Observe: v1 ({v1_len} chars) is typically looser than v5 ({v5_len} chars); constraints tighten the output")
print("💡 The 5 dimensions: (1) target audience (2) format (3) length (4) example demand (5) banned words")
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_4_anthropic.py)
Same loop and PROMPTS as Path A, with Anthropic SDK:
import anthropic
client = anthropic.Anthropic()

outputs = {}
for label, prompt in PROMPTS.items():
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    outputs[label] = msg.content[0].text

# Rest (length compare, banned-word assert) stays identical
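To track the iterations rather than eyeball them, each version's output can be scored against the constraints the prompts ask for. The sketch below is one possible scoring helper, not part of the starter code: `score_version`, the metric names, and the two sample outputs are assumptions for illustration.

```python
BANNED_WORDS = ("empower", "leverage", "intelligent")


def score_version(text: str, max_words: int = 100) -> dict[str, object]:
    """Check one output against the word limit, paragraph, and banned-word constraints."""
    lowered = text.lower()
    words = len(text.split())
    return {
        "words": words,
        "within_limit": words < max_words,  # the prompts say "under 100 words"
        "single_paragraph": "\n\n" not in text.strip(),
        "banned_hits": [w for w in BANNED_WORDS if w in lowered],
    }


# Made-up outputs standing in for real model replies.
sample_outputs = {
    "v1 vague": "ReAct can empower agents.\n\nIt interleaves reasoning and acting.",
    "v5 +bans": "ReAct alternates a reasoning step with a tool call, e.g. a weather lookup.",
}
for label, text in sample_outputs.items():
    print(label, score_version(text))
```

Running `score_version` over the real `outputs` dict gives one row per version, which makes it easier to see which added constraint changed which metric.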
🎯 Curated Projects
dair-ai/Prompt-Engineering-Guide
| Field | Value |
|---|---|
| Stars | ★ 60k+ |
| License | MIT |
| Recommendation | ⭐⭐⭐⭐⭐ |
What it teaches: End-to-end prompt engineering from basics to advanced (CoT, ToT, ReAct, RAG). Academic-flavored but practical.
Best for: Reference. Skim once, return when you need a specific technique.
f/awesome-chatgpt-prompts
| Field | Value |
|---|---|
| Stars | ★ 130k+ |
| License | CC0 |
| Recommendation | ⭐⭐⭐ |
What it teaches: Hundreds of role-based prompts. "Act as a [role]..." patterns.
Best for: Inspiration when stuck. Don't copy verbatim — adapt the patterns.
PromptingGuide.ai
What it teaches: Same content as dair-ai's GitHub but in website format with live examples.
Best for: Mobile reading.
microsoft/prompt-engine
| Field | Value |
|---|---|
| Recommendation | ⭐⭐⭐ |
What it teaches: TypeScript library for managing prompts at scale (templating, conversation history).
Best for: When you start managing many prompts in production.
microsoft/promptflow
| Field | Value |
|---|---|
| Stars | ★ 10k+ |
| Recommendation | ⭐⭐⭐ |
What it teaches: Visual prompt design + evaluation tooling.
Best for: Teams building prompt-heavy apps with eval needs.
GoogleCloudPlatform/generative-ai
| Field | Value |
|---|---|
| Recommendation | ⭐⭐⭐ |
What it teaches: Google Cloud's prompting cookbook (notebooks, PaLM/Gemini focus).
Best for: Cross-vendor perspective if you use Google's stack.
Anthropic Cookbook — Prompt patterns
Already cited in Stage 1. Specifically the misc/prompt_caching.ipynb and multimodal/ notebooks teach advanced prompting patterns.
stanfordnlp/dspy
| Field | Value |
|---|---|
| Language | Python |
| Stars | ★ 34k+ |
| License | MIT |
| Recommendation | ⭐⭐⭐⭐⭐ |
What it teaches: Prompt-as-code — define signatures + modules, optimize prompts via compilers / teleprompters instead of hand-tuning f-strings. The natural Stage 2 → Stage 3 bridge. From Stanford NLP.
Best for: Readers who finished dair-ai's guide and ask "how do I scale prompts beyond hard-coded strings?"
Notes: It's a framework, not a tutorial — higher learning bar than prompt-engineering-guide. Pair with the official tutorial site dspy.ai.
NirDiamant/Prompt_Engineering
| Field | Value |
|---|---|
| Language | Python / Jupyter |
| Stars | ★ 7k+ |
| License | NOASSERTION (custom terms, research/non-commercial — read before use) |
| Recommendation | ⭐⭐⭐⭐ |
What it teaches: 22 prompt-engineering techniques as runnable Jupyter notebooks (zero-shot → CoT → ReAct → constitutional). 2025 vintage, more hands-on than dair-ai.
Best for: Learners who prefer "run-and-learn." Each technique is a standalone notebook — pick whatever interests you.
🔭 Advanced: The Three Layers of Prompt → Context → Harness Engineering
Engineering practice for LLM-powered systems can be divided into three stack layers. This is not about "one call vs. many calls." Each layer engineers a different object:
- Prompt Engineering (this stage) = engineering the string sent into the model
- Context Engineering (Stage 6) = engineering what information goes into the context window on each call — dynamically assembling RAG retrieval results, memory, tool definitions, and conversation history
- Harness Engineering (Stage 7) = engineering the execution and control layer around the model — agent loops, retry, sandboxing, observability, deployment, and all other non-LLM code
→ The three layers are orthogonal: a one-call RAG app is still doing context engineering (the point is assembling context, not how many calls happen); a 50-call chatbot with no retrieval is still only doing prompt engineering.
Full three-layer lineage in this roadmap:
| Discipline | What is being engineered | Where this roadmap teaches it fully |
|---|---|---|
| 1. Prompt Engineering | The string sent into the LLM itself (system prompt / few-shot / format) | This stage (Stage 2) |
| 2. Context Engineering | What information goes into the context window (RAG / memory / tool defs / history) | Stage 6 — Context Engineering: RAG and Memory |
| 3. Harness Engineering | The execution and control layer around the model (agent loop / retry / sandbox / observability) | Stage 7 — Multi-Agent · Productionization |
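The three layers can be seen side by side in a toy example. This is a deliberately simplified sketch under stated assumptions: keyword overlap stands in for real retrieval, a stub function stands in for the model, and every name here (`SYSTEM_PROMPT`, `assemble_context`, `call_with_retry`, `flaky_model`) is hypothetical.

```python
import re
from typing import Callable

# Layer 1, prompt engineering: the fixed instruction string.
SYSTEM_PROMPT = "Answer using only the provided context. Reply 'unknown' if it is not covered."


def tokens(text: str) -> set[str]:
    """Lowercased word set, used for a naive relevance score."""
    return set(re.findall(r"\w+", text.lower()))


# Layer 2, context engineering: decide what goes into the window for this call.
def assemble_context(question: str, documents: list[str], max_docs: int = 2) -> str:
    q = tokens(question)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    context = "\n".join(f"- {d}" for d in ranked[:max_docs])
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"


# Layer 3, harness engineering: non-LLM control code around the call (here, retry).
def call_with_retry(model: Callable[[str], str], prompt: str, attempts: int = 3) -> str:
    last_error = None
    for _ in range(attempts):
        try:
            return model(prompt)
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"model failed after {attempts} attempts") from last_error


# Stub model: fails once, then succeeds, so the retry path is exercised offline.
calls = {"n": 0}

def flaky_model(prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return f"(stub reply to a {len(prompt)}-char prompt)"


docs = ["A lease is a rental contract.", "ReAct mixes reasoning and tool calls.", "Taipei is a city."]
prompt = assemble_context("What is a lease?", docs)
print(call_with_retry(flaky_model, prompt))
```

Changing `SYSTEM_PROMPT` is prompt engineering, changing how `assemble_context` picks documents is context engineering, and changing `call_with_retry` is harness engineering; each can be modified without touching the other two.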
💡 Karpathy 2025-06: context engineering is the delicate art and science of filling the context window with just the right information for the next step.
💡 Simon Willison / Addy Osmani: "coding agent = LLM + harness"; a harness is all the code that is not the model itself. OpenAI also used the term "Harness Engineering" in February 2026.
You do not need to finish the latter two layers in this stage. This section only gives you the direction so that Stage 6 / 7 feel like a continuation of the same lineage.
Further reading (optional, for when you want to dig deeper):
- Meirtz/Awesome-Context-Engineering (★ 3k+) — comprehensive survey from prompt engineering to production agents
- Windy3f3f3f3f/how-claude-code-works (★ 2.4k+) — Claude Code internals, includes a context-engineering chapter
✅ Self-Check Before Stage 3
Can you:
- [ ] Write a prompt with system message + user message + 3 example messages (few-shot)
- [ ] Demonstrate CoT improving accuracy on a reasoning task
- [ ] Iteratively refine a prompt 5 times tracking each version
- [ ] Identify when prompting is the wrong tool (and tool use is needed)
If yes → proceed to Stage 3 — Tool Use & Agent Intro. This is the most important stage — don't rush past prompts but also don't get stuck here.