Glossary¶
The roadmap leans heavily on terms like LLM, RAG, MCP, agent. Look up unfamiliar ones here in 30 seconds, then go back to reading the stage.
Each entry gives the smallest usable definition (30-80 words + which stage covers it in depth) — not Wikipedia.
🌐 Unified terminology table (English ↔ Chinese, consistent across stages)¶
This table is the project's enforced naming convention — every stage uses the same Chinese readability label for each English term. If you see drift inside a stage, please file an issue.
| English term | Chinese readability label | Primary stage |
|---|---|---|
| Prompt Engineering | Prompt 設計 / Prompt 设计 | Stage 2 |
| Context Engineering | 上下文管理 | Stage 6 |
| Harness Engineering | Agent 執行系統設計 / Agent 执行系统设计 | Stage 7 |
| Tool Use | 工具使用 | Stage 3 |
| Function Calling | 函式 / 函数 / 工具呼叫 | Stage 3 |
| Structured Output | 結構化輸出 / 结构化输出 | Stage 3 |
| Agent Loop | Agent 執行迴圈 / Agent 执行循环 | Stage 3 |
| Framework | 框架 | Stage 4 |
| Orchestration | 協調與編排 / 协调与编排 | Stage 4 / 7 |
| Handoff | 任務交接 / 任务交接 | Stage 7 |
| Supervisor / Worker | 協調者 / 執行者 (协调者 / 执行者) | Stage 7 |
| Runtime | 執行層 / 执行层 | Stage 7 |
| Scaffolding | 支撐架構 / 支撑架构 | Stage 7 |
| Observability | 觀測與紀錄 / 观测与记录 | Stage 7 |
| Telemetry | 運行紀錄 / 运行记录 | Stage 7 |
| Eval | 效果評估 / 效果评估 | Stage 7 |
| Evaluation Harness | 評估框架 / 评估框架 | Stage 7 |
| Production | 可穩定使用 / 上線化 (可稳定使用 / 上线化) | Stage 7 |
| Production-grade | 可長期穩定使用的 / 可长期稳定使用的 | Stage 7 |
| Deployment | 部署 | Stage 7 |
| Cost Tracking | 成本追蹤 / 成本追踪 | Stage 7 |
| Latency | 延遲 / 等待時間 (延迟 / 等待时间) | Stage 7 |
| Vector DB | 向量資料庫 / 向量数据库 | Stage 6 |
| Retrieval | 檢索 / 检索 | Stage 6 |
| Reranking | 重排序 | Stage 6 |
| Long Context | 長上下文 / 长上下文 | Stage 6 |
| Fine-tuning | 模型微調 / 模型微调 | Stage 6 |
| Agent Interfaces | Agent 操作介面 / Agent 操作界面 | Stage 8 |
| Code Sandbox | 隔離程式執行環境 / 隔离程序执行环境 | Stage 8 |
| Cold Start | 啟動延遲 / 启动延迟 | Stage 8 |
| Reward Hacking | 鑽評分漏洞 / 钻评分漏洞 | Stage 7 / 8 |
→ For full definitions, see the sections below.
1. Basic concepts¶
LLM (Large Language Model)¶
GPT, Claude, Gemini — models that take text in and produce text out. Fundamentally a pure function: input prompt → output text. They don't browse the web, they don't remember past conversations — those need to be wired up externally.
📍 Detail: Stage 1
Token¶
LLMs see tokens (sub-word units), not characters. Roughly 1 English word ≈ 1.3 tokens, 1 Chinese character ≈ 1.5–2 tokens. LLM pricing and context windows are measured in tokens. "1M-token context" ≈ 750k English words.
📍 Detail: Stage 1
Context Window¶
The maximum tokens an LLM can "see" in one call. 2026 frontier: Claude Sonnet 4.6 / Opus 4.7 1M, GPT-5.5 ~400k, Gemini 3.1 Pro 2M. Bigger isn't always better — beyond a length the LLM gets "Lost in the Middle".
Prompt¶
The text you feed an LLM. Prompt engineering = designing that text to get good answers. Basic structure: system prompt (role/rules) + user prompt (the actual ask).
📍 Detail: Stage 2
Few-shot / Zero-shot¶
- Zero-shot: ask directly without examples.
- Few-shot: give 2–5 input → output examples first. Few-shot usually improves accuracy a lot, especially for strict formatting.
Chain-of-Thought (CoT)¶
Make the LLM "think before answering" — have it output the reasoning process before the conclusion. Two common forms:
- Few-shot CoT (Wei et al. 2022): put a few examples with reasoning steps into the prompt, and the LLM imitates that style of thinking
- Zero-shot CoT (Kojima et al. 2022): add "Let's think step by step" at the end of the prompt to trigger a reasoning trace
Accuracy usually improves, at the cost of more tokens. Few-shot usually beats zero-shot.
2. Agents / Tool Use¶
Agent¶
A system centered on an LLM that can perceive state → make decisions → take actions → observe results in a loop, repeating until the goal is completed. Three core elements:
- LLM (reasoning / planning / deciding)
- Actions (ways to do things — not limited to function calls. This can include writing and running code (CodeAct), operating a browser (computer use), retrieving from a KB (RAG retrieval), calling an MCP server, or pure planning / task decomposition)
- Loop (the heartbeat — the fundamental difference between an agent and plain LLM Q&A)
The difference is this: plain LLM = Q&A; agent = the three elements + a continuing loop until the goal is reached or the budget runs out. ReAct is one agent pattern, not the definition of an agent — CodeAct, computer-use, and planning agents are all agents.
📍 Detail: Stage 3
Tool Use / Function Calling¶
Lets the LLM call functions you defined (DB lookup, math, browser, …). Instead of plain text, the LLM returns {"function": "search", "args": {…}}. Your code executes it and feeds the result back to the LLM.
The concept is the same; the API schema differs:
- Anthropic "Tool Use": uses input_schema (JSON Schema directly)
- OpenAI / Ollama "Function Calling": wraps it in an outer {"type": "function", "function": {...}}
- The token representation the LLM sees differs internally, so when writing a cross-vendor SDK you need to map them correctly
📍 Detail: Stage 3 📍 How to write good schemas: Function Schema Design cheatsheet
ReAct (Reasoning + Acting)¶
The classic agent pattern: Thought → Action (call tool) → Observation (see result) → Thought ... loop until done. Most agent frameworks implement this internally.
📍 Detail: Stage 3
Structured Output¶
Make the LLM output JSON or another fixed schema instead of free text. All major LLM APIs have response_format or similar. Agent frameworks rely on it for LLM ↔ code communication.
Agent Loop¶
The "LLM → tool → result → LLM" repeated cycle. Termination: LLM says "done" / step budget exhausted / cost cap hit.
Self-Refine (Basic reflection / no memory)¶
The agent evaluates the previous round's output and changes the next round's behavior — an "Actor answers → Critic finds issues → Actor reads feedback and answers again" single-session loop. It does not need a persistent memory layer; it is purely a reasoning-loop mechanism, a sibling pattern to ReAct. Production agents (Cursor / Cline / Claude Code) run variants of this every day.
Representative paper: Self-Refine (Madaan 2023). For the full Reflexion version (with episodic memory), see 3 Memory / Retrieval / RAG.
📍 Detail + routing: Stage 3 Reflection
3. Memory / Retrieval / RAG¶
Memory — Two Orthogonal Axes¶
"Memory" often gets lumped together, but there are actually 2 orthogonal classification axes:
- Time axis: short-term (current conversation) vs long-term (persistent across sessions)
- Content axis (CoALA framework): Working (scratch space) / Episodic (past experiences) / Semantic (factual knowledge) / Procedural (how to do things)
→ The two axes do not conflict: long-term memory can contain at the same time episodic memory (what the user said last time), semantic memory (facts from the company's knowledge base), and procedural memory (tool sequences that worked before).
📍 Detail: Stage 6 What is Memory + How to Design It + Stage 6 CoALA Framework
RAG (Retrieval-Augmented Generation)¶
Two-stage architectural pattern:
- Ingest (one-time / periodic): document → chunk → embed → store in a vector store (build a retrievable KB)
- Query (every user question): embed the question → semantic search (or hybrid + BM25) → top-K chunks → put them into the prompt → LLM answers
Solves the problem that the LLM does not know your private / changing / stale data. Retrieval is not limited to dense embeddings — the production default is hybrid (dense + BM25) + reranker.
📍 Detail: Stage 6 📍 Paper: Lewis et al. 2020
Reflexion (Full reflection / with episodic memory)¶
Unlike Self-Refine (2 Agents), Reflexion requires a persistent episodic memory store — after each trial, the agent writes a reflection summary into memory, then retrieves it into the prompt at the start of the next trial. Accumulating lessons across trials is the essence of Reflexion (not a single-session loop).
It is placed in 3 instead of 2 because it is fundamentally a memory pattern — the episodic memory store is core, not optional.
Representative paper: Reflexion (Shinn 2023).
📍 Detail: Stage 6 Advanced: Full Reflexion with Persistent Memory
Embedding¶
Turn text / images into N-dimensional vectors so that things with similar meanings are close together. This roadmap defaults to dense embeddings (dense vectors produced by sentence-transformers / OpenAI ada-002, etc.); there are also sparse embeddings (BM25 / SPLADE, etc., based on lexical token matching) — production RAG often uses both together for hybrid search.
📍 Detail: Stage 6
Vector DB¶
The storage layer for storing + efficiently querying embeddings. The main query type = approximate nearest-neighbor (ANN) — the whole point of a vector DB is that ANN is hundreds of times faster than brute-force cosine scanning. Examples: Pinecone / Chroma / Qdrant / Weaviate / pgvector.
📍 Detail: Stage 6
Semantic Search¶
Use embeddings to compare "meaning similarity" rather than "exact string match". "How do I charge an EV" can retrieve "electric car battery tutorial". Traditional keyword search (BM25, etc.) can't do this.
Chunking¶
Splitting long documents into embedding-friendly small pieces (typically 200–1000 tokens). Chunk strategy directly affects RAG quality — too small loses context, too long blurs relevance. Common: fixed-size, by paragraph, by structure (heading-based).
Hybrid Search¶
Run semantic search and keyword search together, merge and rerank. Usually beats either alone. Production-grade RAG default.
Reranking¶
After first-pass retrieval pulls top-50, use a more expensive but more accurate model (cross-encoder) to rerank to top-5 for the LLM. Cohere Rerank, bge-reranker, etc.
Contextual Retrieval¶
Anthropic 2024 method — embed each chunk together with a summary of the document it came from, so "this chunk taken alone makes no sense" doesn't break retrieval.
📍 Detail: Stage 6
Fine-tuning¶
Re-train the model on your own data, baking knowledge or behavior into the weights (unlike RAG — RAG injects data into the context at inference time and never changes the weights). Good for making the model reliably learn a format / style / domain vocabulary; not for stuffing in "the latest facts" (that is RAG's job — facts fine-tuned in go stale and are hard to update). In most agent scenarios, try prompt + RAG first, and only consider fine-tuning if that genuinely is not enough.
📍 Detail: Stage 6
4. Multi-Agent¶
Multi-Agent¶
Multiple agents collaborating on one task. Common patterns:
- Supervisor + Worker: one agent plans/dispatches, others execute.
- Swarm: peer agents, no fixed supervisor.
- Debate: agents argue different positions, then form consensus.
📍 Detail: Stage 7
Handoff¶
One agent transfers a task to another. Adds "how to pass context" and "who handles failure" beyond a plain function call.
A2A (Agent-to-Agent) Protocol¶
Google's protocol for agent ↔ agent communication. Sibling to MCP, but for agent-to-agent rather than agent-to-tool.
5. Claude Code Ecosystem¶
MCP (Model Context Protocol)¶
Anthropic's open protocol, introduced in 2024, that lets any LLM host (Claude Code, Cursor, your own agent) connect to external tool servers through one interface. Think "USB for LLMs".
Technically it standardizes 3 primitives:
- Tools: functions an LLM can call (read DB / search web / send email…)
- Resources: data an LLM can read (file contents, API responses, DB rows…)
- Prompts: reusable prompt templates (triggered inside the host with /)
Architecture: server / client pattern — the tool server runs locally or remotely, and the LLM host connects as the client. The server exposes those primitives over one of three transports: stdio / SSE / HTTP.
📍 Detail: Stage 5.2
Skills / SKILL.md¶
Claude Code's "behavior bundles". A Skill = a folder containing SKILL.md (describing "what to do in what situations, and which tools can be called") + optional reference files / scripts.
Trigger mechanism (many people do not know this, but it matters): before Claude Code handles each message, it scans the frontmatter description field of every available skill — if it matches the current situation, the corresponding SKILL.md is auto-loaded. So the quality of the description directly determines whether the skill gets triggered. In practice, starting with "Use when ..." works best.
📍 Detail: Stage 5.3
Plugin / Marketplace¶
Package multiple Skills + slash commands + hooks + MCP configs into one shippable unit. A Marketplace is a catalog of plugins; users claude plugin install to grab community-built ones.
📍 Detail: Stage 5.4
Slash Command¶
Commands inside Claude Code starting with / (/help, /compact, /plan, etc.). Custom-definable — drop a prompt into .claude/commands/<name>.md and it becomes /name.
CLAUDE.md¶
A markdown file at project root that Claude Code reads on every launch. Project-level rules / conventions / context (language, code style, files to avoid, etc.).
Hooks¶
Scripts that run before or after specific Claude Code events. The official system supports 7 event types:
| Hook | Trigger time | Typical use |
|---|---|---|
PreToolUse |
Before a tool call | Block dangerous operations (rm -rf, destructive ops), rewrite parameters |
PostToolUse |
After a tool call | Logging, auto-format files that were just written |
UserPromptSubmit |
When the user submits a message | Add context (git status / current time) |
Notification |
When Claude Code emits a notification | Desktop toast / Slack ping |
Stop |
When the session ends | Auto-commit / cleanup |
PreCompact |
Before auto-compact | Promote important decisions into memory |
PostCompact |
After compact | Check what context got compressed |
Configuration: add a "hooks" block in .claude/settings.json and point it at your script path.
Subagent¶
A spawned agent from the main Claude Code session, with its own context window, dedicated to a specific task. E.g. "spin up a code-reviewer subagent for this diff."
How to set up: put frontmatter + system prompt + tool whitelist in .claude/agents/<name>.md. The main session invokes it with the Task tool (automatic parallel / sequential). Compared with framework-based multi-agent: a subagent does not require LangGraph / CrewAI or similar frameworks; markdown is enough, but it is tied to the Claude Code runtime. Full guide: Stage 5.5; 15 copy-paste dispatch recipes → subagent-cookbook.en.md.
6. Production / Eval / Cost¶
Eval (Evaluation Framework)¶
Run a test set against your agent and quantify accuracy / latency / cost. A production agent without eval has no tests. Common: promptfoo, LangSmith, langfuse evals.
📍 Detail: Stage 7
Observability¶
Capture every internal step (which LLM call, which tool, what result). Lets you replay when bugs hit. Common: langfuse, Helicone, weave.
📍 Detail: Stage 7
Prompt Caching¶
LLM caches the prefix of a prompt; on repeat, only the new suffix is billed at full price (Anthropic 90% off cached, OpenAI 50% off). Long-context repeated queries save a lot.
Streaming¶
LLM returns tokens as they're generated (one at a time) instead of waiting for the full response. Better UX (looks like typing); technically uses SSE or chunked transfer. Default for production interactive apps. Trade-offs: client must handle partial responses; ReAct tool-call parsing waits for stream end.
Batch API¶
Bundle a large number of LLM requests for delayed (≤24h) processing. Typically 50% off (Anthropic, OpenAI). Good for non-interactive: batch summarization, batch classification, eval suites, ETL pipelines. Don't use for interactive chat — latency unacceptable.
Token Cost / Inference Cost¶
Per LLM call: input tokens × input price + output tokens × output price. Costs of an agent's ReAct loop add up fast — a single grep over a large codebase can run 100k tokens.
Guardrails¶
Rule layer that prevents the LLM from doing bad things — block prompt injection, PII leakage, harmful output, etc. NeMo Guardrails, Guardrails AI, etc.
7. Buzzwords / Loose Terms¶
CLI Agent¶
Agents that run in a terminal (Claude Code, Codex, Aider, Gemini CLI, etc.). Versus IDE-bound (Cursor, Continue) or web-based (ChatGPT, Claude.ai).
📍 Detail: Track A A1, resources/cli-agents-guide.en.md
BYO API Key (Bring Your Own)¶
Tool that supports user-provided API keys instead of bundled subscriptions. Aider / OpenCode / goose are BYO; Claude Code / Codex default to subscriptions.
Local LLM / On-Device¶
Models running on your own hardware (Ollama, llama.cpp, MLX, LocalAI, etc.). Data stays local — privacy-friendly but capabilities lag frontier models.
📍 Detail: Stage 1
Quantization¶
Compress model weights from fp16 down to int8 / int4 to save memory and increase speed at small accuracy cost. Local LLM users see this constantly (Q4_K_M, Q8_0, etc.).
Hallucination¶
The LLM "confidently asserts something false" — invents APIs, fabricates numbers and presents them as fact. Every production agent needs defenses (RAG / structured output / eval / guardrails).
Frontier Model¶
The current top tier (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, etc.). Use frontier for hard reasoning; use cheap small models for simple classification / translation to save cost.
Context Engineering¶
The discipline of engineering what information goes into the context window on each LLM call — dynamically assembling RAG retrieval results, memory, tool definitions, and conversation history into the context the model can see. Karpathy 2025: the delicate art of putting just the right information for the next step into the window. The key question is what goes in the window, not "how many calls are involved." The next layer above prompt engineering — prompt engineering shapes strings; context engineering shapes information.
📍 Detail: Stage 2 closing / Stage 6 / Stage 7
📍 Further: Meirtz/Awesome-Context-Engineering
Harness Engineering¶
The discipline of engineering the execution and control layer around the model — everything that is not model weights and not just the prompt string itself: agent loop / tool registry / context manager / permissions / safety layer / memory layer / eval / observability / retry / circuit breaker, etc. Simon Willison 2025: coding agent = LLM + harness. Addy Osmani: harness = all the code that is not the model itself. OpenAI also used the term "Harness Engineering" in February 2026. Claude Code, Cursor, OpenCode, etc. are harnesses. A framework wraps an LLM into an agent; a harness wraps an agent into a product that can actually go live.
Contrast: - Framework (Stage 4) defines the API: what the interface you call looks like - Harness (this term) defines the runtime: how it runs, how it recovers, how it is observed
📍 Discipline-level concept (8 core components / prompt→context→harness three-layer engineering split / framework vs harness): Stage 7 Harness Engineering
📍 Reference implementation case study (reading Claude Code source): Stage 5 5.6
📍 Further: anthropics/claude-agent-sdk-python, ai-boost/awesome-harness-engineering, ZhangHanDong/harness-engineering-from-cc-to-ai-coding
8. Agent Interfaces¶
Computer Use (screen-level agent)¶
An agent operates real desktop apps via screenshot → vision → coordinates → simulated mouse/keyboard — no API needed, the agent uses the screen like a human. Representative: Anthropic Claude Computer Use (Opus 4.7 / Sonnet 4.6), OpenAI Codex desktop, Google Gemini in Chrome. Anthropic public beta opened Oct 2024; OSWorld benchmark reached 76.26% (superhuman) by May 2026.
📍 Full coverage + 4-vendor comparison: Stage 8 Computer Use
Browser Use (web-level agent)¶
An agent operates web pages, primarily via DOM-aware navigation (direct CSS selector queries) with vision fallback. Closed-source: Atlas / Comet / Dia / Gemini in Chrome. OSS leader: browser-use (★ 86k+).
📍 Full coverage + 5-vendor comparison + OSS frameworks: Stage 8 Browser Use
Sandbox (code execution isolation)¶
Runs agent-written code in an isolated environment instead of the host — avoids rm -rf /, internet data exfiltration, credential theft. Representatives: E2B (Firecracker microVM), Daytona (container), Modal (GPU sandbox), Vercel, Cloudflare. OpenAI Agents SDK natively supports these as of April 2026.
📍 Full 9-row terminology glossary + 7-vendor comparison: Stage 8 Code Sandbox
microVM (micro Virtual Machine)¶
A slimmed-down VM with minimal footprint, < 100ms startup, yet still has an independent kernel — sits between Docker containers (fast + weak isolation) and full VMs (slow + strong isolation). Most agent sandboxes choose microVM. Implementation example: Firecracker (AWS, used by E2B).
📍 Full comparison: Stage 8 terminology glossary
Firecracker¶
AWS's open-source microVM, written in Rust, the underlying technology of AWS Lambda and E2B sandbox isolation. Provides strong isolation + fast startup.
gVisor¶
Google's "user-space kernel" — intercepts syscalls and emulates them itself, no hypervisor required. Sits between containers and VMs.
Term not here?¶
- Read the actual stage content: Stage 5.2 MCP / 5.3 Skills / 5.4 Plugins
- Required reading lists in Stage 1 / Stage 6 / Stage 7 / Stage 8
- Missing? Open an issue or PR a new entry.