Skip to content

Glossary

繁體中文 | 简体中文 | English

The roadmap leans heavily on terms like LLM, RAG, MCP, agent. Look up unfamiliar ones here in 30 seconds, then go back to reading the stage.

Each entry gives the smallest usable definition (30-80 words + which stage covers it in depth) — not Wikipedia.

🌐 Unified terminology table (English ↔ Chinese, consistent across stages)

This table is the project's enforced naming convention — every stage uses the same Chinese readability label for each English term. If you see drift inside a stage, please file an issue.

English term Chinese readability label Primary stage
Prompt Engineering Prompt 設計 / Prompt 设计 Stage 2
Context Engineering 上下文管理 Stage 6
Harness Engineering Agent 執行系統設計 / Agent 执行系统设计 Stage 7
Tool Use 工具使用 Stage 3
Function Calling 函式 / 函数 / 工具呼叫 Stage 3
Structured Output 結構化輸出 / 结构化输出 Stage 3
Agent Loop Agent 執行迴圈 / Agent 执行循环 Stage 3
Framework 框架 Stage 4
Orchestration 協調與編排 / 协调与编排 Stage 4 / 7
Handoff 任務交接 / 任务交接 Stage 7
Supervisor / Worker 協調者 / 執行者 (协调者 / 执行者) Stage 7
Runtime 執行層 / 执行层 Stage 7
Scaffolding 支撐架構 / 支撑架构 Stage 7
Observability 觀測與紀錄 / 观测与记录 Stage 7
Telemetry 運行紀錄 / 运行记录 Stage 7
Eval 效果評估 / 效果评估 Stage 7
Evaluation Harness 評估框架 / 评估框架 Stage 7
Production 可穩定使用 / 上線化 (可稳定使用 / 上线化) Stage 7
Production-grade 可長期穩定使用的 / 可长期稳定使用的 Stage 7
Deployment 部署 Stage 7
Cost Tracking 成本追蹤 / 成本追踪 Stage 7
Latency 延遲 / 等待時間 (延迟 / 等待时间) Stage 7
Vector DB 向量資料庫 / 向量数据库 Stage 6
Retrieval 檢索 / 检索 Stage 6
Reranking 重排序 Stage 6
Long Context 長上下文 / 长上下文 Stage 6
Fine-tuning 模型微調 / 模型微调 Stage 6
Agent Interfaces Agent 操作介面 / Agent 操作界面 Stage 8
Code Sandbox 隔離程式執行環境 / 隔离程序执行环境 Stage 8
Cold Start 啟動延遲 / 启动延迟 Stage 8
Reward Hacking 鑽評分漏洞 / 钻评分漏洞 Stage 7 / 8

→ For full definitions, see the sections below.


1. Basic concepts

LLM (Large Language Model)

GPT, Claude, Gemini — models that take text in and produce text out. Fundamentally a pure function: input prompt → output text. They don't browse the web, they don't remember past conversations — those need to be wired up externally.

📍 Detail: Stage 1

Token

LLMs see tokens (sub-word units), not characters. Roughly 1 English word ≈ 1.3 tokens, 1 Chinese character ≈ 1.5–2 tokens. LLM pricing and context windows are measured in tokens. "1M-token context" ≈ 750k English words.

📍 Detail: Stage 1

Context Window

The maximum tokens an LLM can "see" in one call. 2026 frontier: Claude Sonnet 4.6 / Opus 4.7 1M, GPT-5.5 ~400k, Gemini 3.1 Pro 2M. Bigger isn't always better — beyond a length the LLM gets "Lost in the Middle".

Prompt

The text you feed an LLM. Prompt engineering = designing that text to get good answers. Basic structure: system prompt (role/rules) + user prompt (the actual ask).

📍 Detail: Stage 2

Few-shot / Zero-shot

  • Zero-shot: ask directly without examples.
  • Few-shot: give 2–5 input → output examples first. Few-shot usually improves accuracy a lot, especially for strict formatting.

Chain-of-Thought (CoT)

Make the LLM "think before answering" — have it output the reasoning process before the conclusion. Two common forms:

  • Few-shot CoT (Wei et al. 2022): put a few examples with reasoning steps into the prompt, and the LLM imitates that style of thinking
  • Zero-shot CoT (Kojima et al. 2022): add "Let's think step by step" at the end of the prompt to trigger a reasoning trace

Accuracy usually improves, at the cost of more tokens. Few-shot usually beats zero-shot.


2. Agents / Tool Use

Agent

A system centered on an LLM that can perceive state → make decisions → take actions → observe results in a loop, repeating until the goal is completed. Three core elements:

  • LLM (reasoning / planning / deciding)
  • Actions (ways to do things — not limited to function calls. This can include writing and running code (CodeAct), operating a browser (computer use), retrieving from a KB (RAG retrieval), calling an MCP server, or pure planning / task decomposition)
  • Loop (the heartbeat — the fundamental difference between an agent and plain LLM Q&A)

The difference is this: plain LLM = Q&A; agent = the three elements + a continuing loop until the goal is reached or the budget runs out. ReAct is one agent pattern, not the definition of an agent — CodeAct, computer-use, and planning agents are all agents.

📍 Detail: Stage 3

Tool Use / Function Calling

Lets the LLM call functions you defined (DB lookup, math, browser, …). Instead of plain text, the LLM returns {"function": "search", "args": {…}}. Your code executes it and feeds the result back to the LLM.

The concept is the same; the API schema differs: - Anthropic "Tool Use": uses input_schema (JSON Schema directly) - OpenAI / Ollama "Function Calling": wraps it in an outer {"type": "function", "function": {...}} - The token representation the LLM sees differs internally, so when writing a cross-vendor SDK you need to map them correctly

📍 Detail: Stage 3 📍 How to write good schemas: Function Schema Design cheatsheet

ReAct (Reasoning + Acting)

The classic agent pattern: Thought → Action (call tool) → Observation (see result) → Thought ... loop until done. Most agent frameworks implement this internally.

📍 Detail: Stage 3

Structured Output

Make the LLM output JSON or another fixed schema instead of free text. All major LLM APIs have response_format or similar. Agent frameworks rely on it for LLM ↔ code communication.

Agent Loop

The "LLM → tool → result → LLM" repeated cycle. Termination: LLM says "done" / step budget exhausted / cost cap hit.

Self-Refine (Basic reflection / no memory)

The agent evaluates the previous round's output and changes the next round's behavior — an "Actor answers → Critic finds issues → Actor reads feedback and answers again" single-session loop. It does not need a persistent memory layer; it is purely a reasoning-loop mechanism, a sibling pattern to ReAct. Production agents (Cursor / Cline / Claude Code) run variants of this every day.

Representative paper: Self-Refine (Madaan 2023). For the full Reflexion version (with episodic memory), see 3 Memory / Retrieval / RAG.

📍 Detail + routing: Stage 3 Reflection


3. Memory / Retrieval / RAG

Memory — Two Orthogonal Axes

"Memory" often gets lumped together, but there are actually 2 orthogonal classification axes:

  • Time axis: short-term (current conversation) vs long-term (persistent across sessions)
  • Content axis (CoALA framework): Working (scratch space) / Episodic (past experiences) / Semantic (factual knowledge) / Procedural (how to do things)

→ The two axes do not conflict: long-term memory can contain at the same time episodic memory (what the user said last time), semantic memory (facts from the company's knowledge base), and procedural memory (tool sequences that worked before).

📍 Detail: Stage 6 What is Memory + How to Design It + Stage 6 CoALA Framework

RAG (Retrieval-Augmented Generation)

Two-stage architectural pattern:

  1. Ingest (one-time / periodic): document → chunk → embed → store in a vector store (build a retrievable KB)
  2. Query (every user question): embed the question → semantic search (or hybrid + BM25) → top-K chunks → put them into the prompt → LLM answers

Solves the problem that the LLM does not know your private / changing / stale data. Retrieval is not limited to dense embeddings — the production default is hybrid (dense + BM25) + reranker.

📍 Detail: Stage 6 📍 Paper: Lewis et al. 2020

Reflexion (Full reflection / with episodic memory)

Unlike Self-Refine (2 Agents), Reflexion requires a persistent episodic memory store — after each trial, the agent writes a reflection summary into memory, then retrieves it into the prompt at the start of the next trial. Accumulating lessons across trials is the essence of Reflexion (not a single-session loop).

It is placed in 3 instead of 2 because it is fundamentally a memory pattern — the episodic memory store is core, not optional.

Representative paper: Reflexion (Shinn 2023).

📍 Detail: Stage 6 Advanced: Full Reflexion with Persistent Memory

Embedding

Turn text / images into N-dimensional vectors so that things with similar meanings are close together. This roadmap defaults to dense embeddings (dense vectors produced by sentence-transformers / OpenAI ada-002, etc.); there are also sparse embeddings (BM25 / SPLADE, etc., based on lexical token matching) — production RAG often uses both together for hybrid search.

📍 Detail: Stage 6

Vector DB

The storage layer for storing + efficiently querying embeddings. The main query type = approximate nearest-neighbor (ANN) — the whole point of a vector DB is that ANN is hundreds of times faster than brute-force cosine scanning. Examples: Pinecone / Chroma / Qdrant / Weaviate / pgvector.

📍 Detail: Stage 6

Use embeddings to compare "meaning similarity" rather than "exact string match". "How do I charge an EV" can retrieve "electric car battery tutorial". Traditional keyword search (BM25, etc.) can't do this.

Chunking

Splitting long documents into embedding-friendly small pieces (typically 200–1000 tokens). Chunk strategy directly affects RAG quality — too small loses context, too long blurs relevance. Common: fixed-size, by paragraph, by structure (heading-based).

Run semantic search and keyword search together, merge and rerank. Usually beats either alone. Production-grade RAG default.

Reranking

After first-pass retrieval pulls top-50, use a more expensive but more accurate model (cross-encoder) to rerank to top-5 for the LLM. Cohere Rerank, bge-reranker, etc.

Contextual Retrieval

Anthropic 2024 method — embed each chunk together with a summary of the document it came from, so "this chunk taken alone makes no sense" doesn't break retrieval.

📍 Detail: Stage 6

Fine-tuning

Re-train the model on your own data, baking knowledge or behavior into the weights (unlike RAG — RAG injects data into the context at inference time and never changes the weights). Good for making the model reliably learn a format / style / domain vocabulary; not for stuffing in "the latest facts" (that is RAG's job — facts fine-tuned in go stale and are hard to update). In most agent scenarios, try prompt + RAG first, and only consider fine-tuning if that genuinely is not enough.

📍 Detail: Stage 6


4. Multi-Agent

Multi-Agent

Multiple agents collaborating on one task. Common patterns:

  • Supervisor + Worker: one agent plans/dispatches, others execute.
  • Swarm: peer agents, no fixed supervisor.
  • Debate: agents argue different positions, then form consensus.

📍 Detail: Stage 7

Handoff

One agent transfers a task to another. Adds "how to pass context" and "who handles failure" beyond a plain function call.

A2A (Agent-to-Agent) Protocol

Google's protocol for agent ↔ agent communication. Sibling to MCP, but for agent-to-agent rather than agent-to-tool.


5. Claude Code Ecosystem

MCP (Model Context Protocol)

Anthropic's open protocol, introduced in 2024, that lets any LLM host (Claude Code, Cursor, your own agent) connect to external tool servers through one interface. Think "USB for LLMs".

Technically it standardizes 3 primitives: - Tools: functions an LLM can call (read DB / search web / send email…) - Resources: data an LLM can read (file contents, API responses, DB rows…) - Prompts: reusable prompt templates (triggered inside the host with /)

Architecture: server / client pattern — the tool server runs locally or remotely, and the LLM host connects as the client. The server exposes those primitives over one of three transports: stdio / SSE / HTTP.

📍 Detail: Stage 5.2

Skills / SKILL.md

Claude Code's "behavior bundles". A Skill = a folder containing SKILL.md (describing "what to do in what situations, and which tools can be called") + optional reference files / scripts.

Trigger mechanism (many people do not know this, but it matters): before Claude Code handles each message, it scans the frontmatter description field of every available skill — if it matches the current situation, the corresponding SKILL.md is auto-loaded. So the quality of the description directly determines whether the skill gets triggered. In practice, starting with "Use when ..." works best.

📍 Detail: Stage 5.3

Plugin / Marketplace

Package multiple Skills + slash commands + hooks + MCP configs into one shippable unit. A Marketplace is a catalog of plugins; users claude plugin install to grab community-built ones.

📍 Detail: Stage 5.4

Slash Command

Commands inside Claude Code starting with / (/help, /compact, /plan, etc.). Custom-definable — drop a prompt into .claude/commands/<name>.md and it becomes /name.

CLAUDE.md

A markdown file at project root that Claude Code reads on every launch. Project-level rules / conventions / context (language, code style, files to avoid, etc.).

Hooks

Scripts that run before or after specific Claude Code events. The official system supports 7 event types:

Hook Trigger time Typical use
PreToolUse Before a tool call Block dangerous operations (rm -rf, destructive ops), rewrite parameters
PostToolUse After a tool call Logging, auto-format files that were just written
UserPromptSubmit When the user submits a message Add context (git status / current time)
Notification When Claude Code emits a notification Desktop toast / Slack ping
Stop When the session ends Auto-commit / cleanup
PreCompact Before auto-compact Promote important decisions into memory
PostCompact After compact Check what context got compressed

Configuration: add a "hooks" block in .claude/settings.json and point it at your script path.

Subagent

A spawned agent from the main Claude Code session, with its own context window, dedicated to a specific task. E.g. "spin up a code-reviewer subagent for this diff."

How to set up: put frontmatter + system prompt + tool whitelist in .claude/agents/<name>.md. The main session invokes it with the Task tool (automatic parallel / sequential). Compared with framework-based multi-agent: a subagent does not require LangGraph / CrewAI or similar frameworks; markdown is enough, but it is tied to the Claude Code runtime. Full guide: Stage 5.5; 15 copy-paste dispatch recipessubagent-cookbook.en.md.


6. Production / Eval / Cost

Eval (Evaluation Framework)

Run a test set against your agent and quantify accuracy / latency / cost. A production agent without eval has no tests. Common: promptfoo, LangSmith, langfuse evals.

📍 Detail: Stage 7

Observability

Capture every internal step (which LLM call, which tool, what result). Lets you replay when bugs hit. Common: langfuse, Helicone, weave.

📍 Detail: Stage 7

Prompt Caching

LLM caches the prefix of a prompt; on repeat, only the new suffix is billed at full price (Anthropic 90% off cached, OpenAI 50% off). Long-context repeated queries save a lot.

Streaming

LLM returns tokens as they're generated (one at a time) instead of waiting for the full response. Better UX (looks like typing); technically uses SSE or chunked transfer. Default for production interactive apps. Trade-offs: client must handle partial responses; ReAct tool-call parsing waits for stream end.

Batch API

Bundle a large number of LLM requests for delayed (≤24h) processing. Typically 50% off (Anthropic, OpenAI). Good for non-interactive: batch summarization, batch classification, eval suites, ETL pipelines. Don't use for interactive chat — latency unacceptable.

Token Cost / Inference Cost

Per LLM call: input tokens × input price + output tokens × output price. Costs of an agent's ReAct loop add up fast — a single grep over a large codebase can run 100k tokens.

Guardrails

Rule layer that prevents the LLM from doing bad things — block prompt injection, PII leakage, harmful output, etc. NeMo Guardrails, Guardrails AI, etc.


7. Buzzwords / Loose Terms

CLI Agent

Agents that run in a terminal (Claude Code, Codex, Aider, Gemini CLI, etc.). Versus IDE-bound (Cursor, Continue) or web-based (ChatGPT, Claude.ai).

📍 Detail: Track A A1, resources/cli-agents-guide.en.md

BYO API Key (Bring Your Own)

Tool that supports user-provided API keys instead of bundled subscriptions. Aider / OpenCode / goose are BYO; Claude Code / Codex default to subscriptions.

Local LLM / On-Device

Models running on your own hardware (Ollama, llama.cpp, MLX, LocalAI, etc.). Data stays local — privacy-friendly but capabilities lag frontier models.

📍 Detail: Stage 1

Quantization

Compress model weights from fp16 down to int8 / int4 to save memory and increase speed at small accuracy cost. Local LLM users see this constantly (Q4_K_M, Q8_0, etc.).

Hallucination

The LLM "confidently asserts something false" — invents APIs, fabricates numbers and presents them as fact. Every production agent needs defenses (RAG / structured output / eval / guardrails).

Frontier Model

The current top tier (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, etc.). Use frontier for hard reasoning; use cheap small models for simple classification / translation to save cost.

Context Engineering

The discipline of engineering what information goes into the context window on each LLM call — dynamically assembling RAG retrieval results, memory, tool definitions, and conversation history into the context the model can see. Karpathy 2025: the delicate art of putting just the right information for the next step into the window. The key question is what goes in the window, not "how many calls are involved." The next layer above prompt engineering — prompt engineering shapes strings; context engineering shapes information.

📍 Detail: Stage 2 closing / Stage 6 / Stage 7 📍 Further: Meirtz/Awesome-Context-Engineering

Harness Engineering

The discipline of engineering the execution and control layer around the model — everything that is not model weights and not just the prompt string itself: agent loop / tool registry / context manager / permissions / safety layer / memory layer / eval / observability / retry / circuit breaker, etc. Simon Willison 2025: coding agent = LLM + harness. Addy Osmani: harness = all the code that is not the model itself. OpenAI also used the term "Harness Engineering" in February 2026. Claude Code, Cursor, OpenCode, etc. are harnesses. A framework wraps an LLM into an agent; a harness wraps an agent into a product that can actually go live.

Contrast: - Framework (Stage 4) defines the API: what the interface you call looks like - Harness (this term) defines the runtime: how it runs, how it recovers, how it is observed

📍 Discipline-level concept (8 core components / prompt→context→harness three-layer engineering split / framework vs harness): Stage 7 Harness Engineering 📍 Reference implementation case study (reading Claude Code source): Stage 5 5.6 📍 Further: anthropics/claude-agent-sdk-python, ai-boost/awesome-harness-engineering, ZhangHanDong/harness-engineering-from-cc-to-ai-coding


8. Agent Interfaces

Computer Use (screen-level agent)

An agent operates real desktop apps via screenshot → vision → coordinates → simulated mouse/keyboard — no API needed, the agent uses the screen like a human. Representative: Anthropic Claude Computer Use (Opus 4.7 / Sonnet 4.6), OpenAI Codex desktop, Google Gemini in Chrome. Anthropic public beta opened Oct 2024; OSWorld benchmark reached 76.26% (superhuman) by May 2026.

📍 Full coverage + 4-vendor comparison: Stage 8 Computer Use

Browser Use (web-level agent)

An agent operates web pages, primarily via DOM-aware navigation (direct CSS selector queries) with vision fallback. Closed-source: Atlas / Comet / Dia / Gemini in Chrome. OSS leader: browser-use (★ 86k+).

📍 Full coverage + 5-vendor comparison + OSS frameworks: Stage 8 Browser Use

Sandbox (code execution isolation)

Runs agent-written code in an isolated environment instead of the host — avoids rm -rf /, internet data exfiltration, credential theft. Representatives: E2B (Firecracker microVM), Daytona (container), Modal (GPU sandbox), Vercel, Cloudflare. OpenAI Agents SDK natively supports these as of April 2026.

📍 Full 9-row terminology glossary + 7-vendor comparison: Stage 8 Code Sandbox

microVM (micro Virtual Machine)

A slimmed-down VM with minimal footprint, < 100ms startup, yet still has an independent kernel — sits between Docker containers (fast + weak isolation) and full VMs (slow + strong isolation). Most agent sandboxes choose microVM. Implementation example: Firecracker (AWS, used by E2B).

📍 Full comparison: Stage 8 terminology glossary

Firecracker

AWS's open-source microVM, written in Rust, the underlying technology of AWS Lambda and E2B sandbox isolation. Provides strong isolation + fast startup.

gVisor

Google's "user-space kernel" — intercepts syscalls and emulates them itself, no hypervisor required. Sits between containers and VMs.


Term not here?