Skip to content

Stage 2 — Prompt Engineering

繁體中文 | 简体中文 | English

Time estimate: 1-2 weeks (~5-12 hours)

👋 Coming from Stage 1? Good — you can call an API. The next 5-12 hours: write reusable structured prompts, use few-shot and chain-of-thought for hard reasoning tasks, and quantify prompt improvement with evals. Jumped straight here? Make sure you can call an LLM API and estimate cost in tokens — if not, head back to Stage 1.

💡 Term-unfamiliar? (prompt / few-shot / CoT / system prompt / …) → see resources/glossary.en.md.

📌 Learning Goals

After this stage you will be able to: - Write structured prompts (role + task + format + examples) - Apply few-shot prompting and know when it helps - Use chain-of-thought (CoT) for reasoning tasks - Iteratively refine a prompt and measure improvement - Recognize when prompting hits its limit (and you need tools / agents)

🚪 Entry Conditions

You should already: - Be able to call an LLM API (Stage 1) - Be able to parse / iterate over API responses

📚 Required Reading

  1. anthropics/prompt-eng-interactive-tutorial ⭐⭐⭐⭐⭐ ★ 35k+ — Anthropic's official interactive tutorial, 9 chapters of Jupyter notebooks (basic / intermediate / advanced + appendix), with playground and answer key. Runs on Claude 3 Haiku (cheapest). The canonical hands-on resource for Stage 2. Also packaged as module 2 of the anthropics/courses 5-course umbrella — for broader coverage (API Fundamentals / Real World Prompting / Eval / Tool Use) go straight to the umbrella
  2. anthropics/courses — Real World Prompting ⭐⭐⭐⭐ ★ 21k+ — Module 3 of the same umbrella, "how to actually use prompting in real situations": chatbot / legal / financial / coding case walkthroughs. Read #1 first, then this.
  3. Anthropic Prompt Engineering Guide — official docs, pairs with #1
  4. OpenAI Prompt Engineering — OpenAI's perspective
  5. dair-ai Prompt Engineering Guide — academic-flavored, in-depth
  6. Anthropic — Prompting Best Practices — be clear and direct

🛠 Hands-on Exercises

🦙 This stage defaults to Ollama gemma4:e4b (cost-driven; $0/run). Prompt engineering is especially instructive on small models — they are sensitive to prompt quality, so you can clearly see how much each technique (system prompts, few-shot, CoT, refinement) improves output. Every exercise has Path A (Ollama, default) + Path B (Anthropic, optional).

💰 Stage 2 budget estimate (4 exercises, 3-5 runs each): all local = $0, all haiku ≈ $0.20, all sonnet ≈ $0.60. The few-shot classifier alone is 12 calls × 5 reps ≈ $0.30 haiku / $0.90 sonnet. Full budget: examples/README.en.md#recommended-llm-list.

Full three-path trade-off in examples/README.en.md.

Exercise 1: System Prompt

Same user message, three different system prompts. Watch the personality / output format change.

📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_1.py)
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

SYSTEM_PROMPTS = {
    "Strict lawyer": "You are a precise contract lawyer. Cite statute numbers, avoid subjective adjectives.",
    "Kindergarten teacher": "You are a kind kindergarten teacher speaking to a 5-year-old. Use analogies, colloquial language, under 80 words.",
    "JSON machine": "Reply only in JSON. schema: {\"answer\": string, \"confidence\": float}",
}

USER_MSG = "Explain what a lease agreement is."

outputs = {}
for label, system in SYSTEM_PROMPTS.items():
    # Ollama (OpenAI-compatible) puts system in the messages array (Anthropic uses system=)
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=200,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": USER_MSG},
        ],
    )
    outputs[label] = r.choices[0].message.content
    print(f"\n--- [{label}] ---")
    print(outputs[label])

# === Self-check ===
import json
last_text = outputs["JSON machine"]
assert "{" in last_text and "}" in last_text, "JSON-machine output should contain JSON braces"
try:
    parsed = json.loads(last_text.strip().split("\n")[-1] if "\n" in last_text else last_text)
    assert "answer" in parsed, "schema expects an 'answer' field"
except json.JSONDecodeError:
    pass  # some models add prose around the JSON; tolerate that

print(f"\n✅ Exercise 1 passed — same question, three different personas / formats / tones")
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_1_anthropic.py)
# Requires: pip install anthropic
import sys, json
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic
client = anthropic.Anthropic()

SYSTEM_PROMPTS = {
    "Strict lawyer": "You are a precise contract lawyer. Cite statute numbers, avoid subjective adjectives.",
    "Kindergarten teacher": "You are a kind kindergarten teacher speaking to a 5-year-old. Use analogies, colloquial language, under 80 words.",
    "JSON machine": "Reply only in JSON. schema: {\"answer\": string, \"confidence\": float}",
}
USER_MSG = "Explain what a lease agreement is."

outputs = {}
for label, system in SYSTEM_PROMPTS.items():
    # Anthropic uses `system=` parameter (not part of messages array)
    msg = client.messages.create(model="claude-haiku-4-5", max_tokens=200,
                                 system=system, messages=[{"role": "user", "content": USER_MSG}])
    outputs[label] = msg.content[0].text
    print(f"\n--- [{label}] ---")
    print(outputs[label])

# Self-check (same JSON-shape assert; schema is cross-backend)
json_output = outputs["JSON machine"]
assert "{" in json_output and "}" in json_output
print(f"\n✅ Exercise 1 passed (Anthropic)")
**Key difference**: Anthropic uses `system=` parameter; OpenAI/Ollama puts system in messages array. Claude follows system prompts more strictly than gemma4:e4b — the "Strict lawyer" persona will actually cite statute numbers. **Cost**: ~$0.003 / 3 personas.

Exercise 2: Few-Shot

Pick a classification task. Run it 0-shot, then 3-shot. Measure accuracy difference.

📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_2.py)
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Sentiment classifier: positive / negative / neutral
TEST_SET = [
    ("This movie was amazing — I want to watch it again!", "positive"),
    ("Boring plot, awkward acting.", "negative"),
    ("This is a 2019 film.", "neutral"),
    ("Not sure how I feel about it, might think more.", "neutral"),
    ("Season 1 was great but season 2 fell apart.", "negative"),
    ("Left in a great mood — recommended!", "positive"),
]

FEW_SHOT_EXAMPLES = """Examples:
input: The steak at this place made me cry tears of joy.
output: positive

input: The waiter was rude. Never coming back.
output: negative

input: This shop is in New Taipei City.
output: neutral
"""


def classify(text: str, *, use_few_shot: bool) -> str:
    prefix = FEW_SHOT_EXAMPLES + "\n" if use_few_shot else ""
    prompt = f"{prefix}input: {text}\noutput:"
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content.strip().splitlines()[0]


def evaluate(use_few_shot: bool) -> tuple[int, int]:
    correct = 0
    for text, label in TEST_SET:
        pred = classify(text, use_few_shot=use_few_shot)
        ok = label in pred
        print(f"  {'✓' if ok else '✗'} [{label}] {text[:30]}... → '{pred}'")
        if ok:
            correct += 1
    return correct, len(TEST_SET)


print("=== 0-shot ===")
c0, n = evaluate(use_few_shot=False)
print(f"correct {c0}/{n} = {c0/n:.0%}")

print("\n=== 3-shot ===")
c3, _ = evaluate(use_few_shot=True)
print(f"correct {c3}/{n} = {c3/n:.0%}")

# === Self-check ===
print(f"\n✅ Exercise 2 passed — 0-shot {c0}/{n}, 3-shot {c3}/{n}")
assert c3 >= c0, f"expected 3-shot ≥ 0-shot, got {c3} < {c0}"
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_2_anthropic.py)
# Requires: pip install anthropic
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

import anthropic
client = anthropic.Anthropic()

# Same TEST_SET / FEW_SHOT_EXAMPLES as Path A — only the classify() body changes:
def classify(text: str, *, use_few_shot: bool) -> str:
    prefix = FEW_SHOT_EXAMPLES + "\n" if use_few_shot else ""
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=10,
        messages=[{"role": "user", "content": f"{prefix}input: {text}\noutput:"}],
    )
    return msg.content[0].text.strip().splitlines()[0]
# Rest of TEST_SET / FEW_SHOT_EXAMPLES / evaluate() stays identical to Path A
**Cost**: 12 calls ≈ $0.005. Claude is usually accurate at 0-shot already, so the few-shot lift is smaller than on gemma4:e4b — that contrast is the actual teaching point.

Exercise 3: CoT

Pick a math word problem. Compare: - Plain prompt - Plain prompt + "Let's think step by step" - Plain prompt + worked example showing CoT

📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_3.py)
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys, re
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

QUESTION = "Tom has 3 apples. He gives Sarah 1, then mom gives him 5 more, then he eats 2. How many does he have now?"
ANSWER = 5  # 3 - 1 + 5 - 2 = 5

COT_EXAMPLE = """Example:
Q: A chicken has 2 legs. 3 chickens and 1 person — how many legs total?
A: Let me work through this step by step. 3 chickens × 2 legs = 6 legs. 1 person has 2 legs. Total 6 + 2 = 8 legs. The answer is 8.
"""


def ask(prompt: str) -> str:
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content


def extract_number(text: str) -> int | None:
    """Pull the last number from the response as the answer."""
    nums = re.findall(r"-?\d+", text)
    return int(nums[-1]) if nums else None


out_a = ask(QUESTION)
ans_a = extract_number(out_a)

out_b = ask(QUESTION + "\nLet's think step by step.")
ans_b = extract_number(out_b)

out_c = ask(COT_EXAMPLE + "\n\nQ: " + QUESTION + "\nA:")
ans_c = extract_number(out_c)

for label, out, ans in [("A plain", out_a, ans_a), ("B +step-by-step", out_b, ans_b), ("C +CoT example", out_c, ans_c)]:
    print(f"\n--- [{label}] answer={ans} {'✓' if ans == ANSWER else '✗'} ---")
    print(out[:200])

# === Self-check ===
correct = sum(1 for a in (ans_a, ans_b, ans_c) if a == ANSWER)
assert correct >= 1, f"at least 1 of 3 prompts should be correct, got {correct}/3"
assert ans_b == ANSWER or ans_c == ANSWER, "B (step-by-step) or C (CoT example) must be correct — CoT is non-negotiable for small models"
print(f"\n✅ Exercise 3 passed — {correct}/3 correct")
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_3_anthropic.py) Same logic as Path A, just swap the client and `ask()`:
import anthropic
client = anthropic.Anthropic()

def ask(prompt: str) -> str:
    msg = client.messages.create(model="claude-haiku-4-5", max_tokens=300,
                                 messages=[{"role": "user", "content": prompt}])
    return msg.content[0].text
# Rest (QUESTION, ANSWER, COT_EXAMPLE, extract_number, 3 calls, assert) stays identical
**Claude typically gets 3/3 right** including the plain-prompt baseline — that contrast with gemma4:e4b (where CoT is essential) is the actual teaching point. **Cost**: 3 calls ≈ $0.002.

Exercise 4: Iterative Refinement

Take a vague prompt, refine it 5 times. Track the iterations. Notice what changes improve quality.

📋 Starter code — Path A (local Ollama gemma4:e4b, default) (copy to practice_4.py) — this exercise has no "right answer"; the point is observing the process
# Requires: pip install openai
# Pre-req: ollama pull gemma4:e4b && ollama serve
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# 5 iterations, each adds one constraint
PROMPTS = {
    "v1 vague": "Write a paragraph about ReAct.",
    "v2 +audience": "Write a paragraph about ReAct for software engineers who know Python.",
    "v3 +format": "Write a paragraph about ReAct for software engineers who know Python. Under 100 words, single paragraph.",
    "v4 +example": "Write a paragraph about ReAct for software engineers who know Python. Under 100 words, single paragraph, ending with a concrete example (e.g. weather lookup).",
    "v5 +bans": "Write a paragraph about ReAct for software engineers who know Python. Under 100 words, single paragraph, ending with a concrete example (e.g. weather lookup). Avoid words like 'empower', 'leverage', 'intelligent'.",
}

outputs = {}
for label, prompt in PROMPTS.items():
    r = client.chat.completions.create(
        model="gemma4:e4b",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    text = r.choices[0].message.content
    outputs[label] = text
    print(f"\n--- [{label}] ({len(text)} chars) ---")
    print(text)

# === Self-check ===
v1_len, v5_len = len(outputs["v1 vague"]), len(outputs["v5 +bans"])
banned_words = ("empower", "leverage", "intelligent")
v5_text_lower = outputs["v5 +bans"].lower()
v5_has_banned = any(w in v5_text_lower for w in banned_words)
assert v5_len > 0, "v5 must have output"
assert not v5_has_banned, f"v5 should avoid banned words; got: {[w for w in banned_words if w in v5_text_lower]}"
print(f"\n✅ Exercise 4 passed — v5 length {v5_len}, no banned words")
print(f"💡 Observe: v1 ({v1_len} chars) is typically looser than v5 ({v5_len} chars); constraints tighten prompts")
print("💡 The 5 dimensions: (1) target audience (2) format (3) length (4) example demand (5) banned words")
📋 Starter code — Path B (Anthropic API, optional) (copy to practice_4_anthropic.py) Same loop and PROMPTS as Path A, with Anthropic SDK:
import anthropic
client = anthropic.Anthropic()

outputs = {}
for label, prompt in PROMPTS.items():
    msg = client.messages.create(model="claude-haiku-4-5", max_tokens=200,
                                 messages=[{"role": "user", "content": prompt}])
    outputs[label] = msg.content[0].text
# Rest (length compare, banned-word assert) stays identical
**Cost**: 5 calls ≈ $0.002. **Claude's v1 is already coherent** so the v5 lift is smaller; gemma4:e4b makes the lift more dramatic.

🎯 Curated Projects

dair-ai/Prompt-Engineering-Guide

Field Value
Stars ★ 60k+
License MIT
Recommendation ⭐⭐⭐⭐⭐

What it teaches: End-to-end prompt engineering from basics to advanced (CoT, ToT, ReAct, RAG). Academic-flavored but practical.

Best for: Reference. Skim once, return when you need a specific technique.


f/awesome-chatgpt-prompts

Field Value
Stars ★ 130k+
License CC0
Recommendation ⭐⭐⭐

What it teaches: Hundreds of role-based prompts. "Act as a [role]..." patterns.

Best for: Inspiration when stuck. Don't copy verbatim — adapt the patterns.


PromptingGuide.ai

What it teaches: Same content as dair-ai's GitHub but in website format with live examples.

Best for: Mobile reading.


microsoft/prompt-engine

Recommendation ⭐⭐⭐

What it teaches: TypeScript library for managing prompts at scale (templating, conversation history).

Best for: When you start managing many prompts in production.


microsoft/promptflow

Field Value
Stars ★ 10k+
Recommendation ⭐⭐⭐

What it teaches: Visual prompt design + evaluation tooling.

Best for: Teams building prompt-heavy apps with eval needs.


GoogleCloudPlatform/generative-ai

Recommendation ⭐⭐⭐

What it teaches: Google Cloud's prompting cookbook (notebooks, PaLM/Gemini focus).

Best for: Cross-vendor perspective if you use Google's stack.


Anthropic Cookbook — Prompt patterns

Already cited in Stage 1. Specifically the misc/prompt_caching.ipynb and multimodal/ notebooks teach advanced prompting patterns.


stanfordnlp/dspy

Field Value
Language Python
Stars ★ 34k+
License MIT
Recommendation ⭐⭐⭐⭐⭐

What it teaches: Prompt-as-code — define signatures + modules, optimize prompts via compilers / teleprompters instead of hand-tuning f-strings. The natural Stage 2 → Stage 3 bridge. From Stanford NLP.

Best for: Readers who finished dair-ai's guide and ask "how do I scale prompts beyond hard-coded strings?"

Notes: It's a framework, not a tutorial — higher learning bar than prompt-engineering-guide. Pair with the official tutorial site dspy.ai.


NirDiamant/Prompt_Engineering

Field Value
Language Python / Jupyter
Stars ★ 7k+
License NOASSERTION (custom terms, research/non-commercial — read before use)
Recommendation ⭐⭐⭐⭐

What it teaches: 22 prompt-engineering techniques as runnable Jupyter notebooks (zero-shot → CoT → ReAct → constitutional). 2025 vintage, more hands-on than dair-ai.

Best for: Learners who prefer "run-and-learn." Each technique is a standalone notebook — pick whatever interests you.


🔭 Advanced: The Three Layers of Prompt → Context → Harness Engineering

Engineering practice for LLM-powered systems can be divided into three stack layers. This is not about "one call vs. many calls." Each layer engineers a different object:

  • Prompt Engineering (this stage) = engineering the string sent into the model
  • Context Engineering (Stage 6) = engineering what information goes into the context window on each call — dynamically assembling RAG retrieval results, memory, tool definitions, and conversation history
  • Harness Engineering (Stage 7) = engineering the execution and control layer around the model — agent loops, retry, sandboxing, observability, deployment, and all other non-LLM code

→ The three layers are orthogonal: a one-call RAG app is still doing context engineering (the point is assembling context, not how many calls happen); a 50-call chatbot with no retrieval is still only doing prompt engineering.

Full three-layer lineage in this roadmap:

Discipline What is being engineered Where this roadmap teaches it fully
1. Prompt Engineering The string sent into the LLM itself (system prompt / few-shot / format) This stage (Stage 2)
2. Context Engineering What information goes into the context window (RAG / memory / tool defs / history) Stage 6 — Context Engineering: RAG and Memory
3. Harness Engineering The execution and control layer around the model (agent loop / retry / sandbox / observability) Stage 7 — Multi-Agent · Productionization

💡 Karpathy 2025-06: context engineering is the delicate art of putting information that is just useful for the next step into the context window.

💡 Simon Willison / Addy Osmani: "coding agent = LLM + harness"; a harness is all the code that is not the model itself. OpenAI also used the term "Harness Engineering" in February 2026.

You do not need to finish the latter two layers in this stage. This section only gives you the direction so that Stage 6 / 7 feel like a continuation of the same lineage.

Further reading (optional, for when you want to dig deeper):

✅ Self-Check Before Stage 3

Can you: - [ ] Write a prompt with system message + user message + 3 example messages (few-shot) - [ ] Demonstrate CoT improving accuracy on a reasoning task - [ ] Iteratively refine a prompt 5 times tracking each version - [ ] Identify when prompting is the wrong tool (and tool use is needed)

If yes → proceed to Stage 3 — Tool Use & Agent Intro. This is the most important stage — don't rush past prompts but also don't get stuck here.