Build Your First AI Agent in 7 Steps
📌 This walkthrough is for Track B (Agent Builder): it teaches you to write an agent from scratch. Track A (CLI Power User) learners do not need to run it, but reading it is an optional advanced supplement that shows how an agent is composed step by step, from a raw LLM API call to production.
This is a concrete cross-stage walkthrough — the same agent, traced from Stage 1 through Stage 7, with executable code skeletons at each stage.
How to read this: each section extends the previous one. Later snippets assume earlier stage files are in the same directory. To run:
1. Set up the environment in Stage 0
2. Save each stage to a new file (step1_*.py, step2_*.py, …)
3. Later stages import from earlier ones via from step1_xxx import ...
Install all deps at once:
pip install anthropic openai requests beautifulsoup4 langgraph langchain-anthropic langchain-core chromadb langfuse fastapi uvicorn pydantic
The agent to build: Paper Summary Bot — given an arXiv paper URL, output a 3-paragraph summary + 5 keywords + comparison with related work.
Each stage adds one capability to the same agent. By the end it's a multi-LLM, memory-equipped, deployable agent.
📋 Overview
| Stage | Capability you add | Code complexity |
|---|---|---|
| 0 | Environment (Python, API key, git) | — |
| 1 | First LLM API call | ~10 lines |
| 2 | Write a professional prompt | ~20 lines |
| 3 | Tool use: auto-fetch arXiv | ~80 lines |
| 4 | Rewrite with framework + reflection | ~40 lines (framework abstracts the loop) |
| 5 | Package as Claude Code Skill | SKILL.md + 30 lines |
| 6 | Add RAG memory: compare with past papers | ~60 lines |
| 7 | Add eval, observability, deploy | ~100 lines |
Total: ~350 lines of Python + structured config = a concrete example you watch grow from zero to production.
Stage 0 — Environment
# Check your Python version (3.11+ required)
python --version
# Virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Install all packages used across stages (one-time; later stages won't pip install again)
pip install anthropic openai requests beautifulsoup4 \
langgraph langchain-anthropic langchain-core \
chromadb langfuse fastapi uvicorn pydantic
# Claude API key (apply at console.anthropic.com)
export ANTHROPIC_API_KEY="sk-ant-..."
# Init repo
mkdir paper-summary-bot && cd paper-summary-bot
git init
printf ".env\n.venv/\n__pycache__/\n" > .gitignore  # printf expands \n; bash's echo does not by default
Checkpoint: python -c "from anthropic import Anthropic; print('OK')" should work without error.
Stage 1 — First LLM Call
# step1_hello_llm.py
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": "Explain ReAct agents in 3 sentences."
    }]
)
print(response.content[0].text)
print(f"\n--- Tokens: input={response.usage.input_tokens}, "
      f"output={response.usage.output_tokens} ---")
Run: python step1_hello_llm.py
What you learn: API call shape, messages structure, how usage counts tokens.
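Since usage reports input and output tokens separately, a rough per-call cost estimate is simple arithmetic. The sketch below assumes two placeholder prices (they are not actual Anthropic pricing; substitute the numbers from the current price list), and `estimate_cost` is a hypothetical helper name.

```python
# Placeholder prices in USD per million tokens. These are assumptions for
# illustration only, not real pricing.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one call from its token counts."""
    total = (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    )
    return total / 1_000_000


# Example: a call that used 20 input tokens and 100 output tokens.
print(f"${estimate_cost(20, 100):.6f}")
```

In Stage 1 you would pass response.usage.input_tokens and response.usage.output_tokens instead of the literal numbers.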
Stage 2 — Professional Prompt
# step2_paper_summary.py
from anthropic import Anthropic
client = Anthropic()
SYSTEM_PROMPT = """You are an academic paper summarization assistant. Your task:
1. Write a 3-paragraph summary describing: (a) motivation, (b) method, (c) results.
2. List 5 keywords.
3. Bullet 2-3 differences from mainstream approaches.
Format requirements:
- Each summary paragraph ≤ 60 words
- Keywords in English (technical terms)
- Total ≤ 300 words
- Don't fabricate; if not stated, say "not stated in the paper"."""
PAPER_TEXT = """[Paste paper abstract here]"""
# Guarded so later stages can import SYSTEM_PROMPT without triggering an API call
if __name__ == "__main__":
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=800,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": PAPER_TEXT}]
    )
    print(response.content[0].text)
What you learn: system prompt vs user message split, explicit format constraints, anti-hallucination via "say not stated."
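Explicit format constraints are useful partly because they can be checked mechanically. The following is a minimal sketch of such a check, assuming the caller has already split the model output into paragraphs and keywords (the walkthrough does not fix an output layout, so that split is left out). `check_summary` is a hypothetical helper, and counting whitespace-separated tokens is only a rough proxy for "words".

```python
def check_summary(paragraphs: list[str], keywords: list[str],
                  max_words: int = 60) -> list[str]:
    """Return a list of format violations; an empty list means the output passes."""
    problems = []
    if len(paragraphs) != 3:
        problems.append(f"expected 3 paragraphs, got {len(paragraphs)}")
    for i, para in enumerate(paragraphs, start=1):
        n_words = len(para.split())  # whitespace-separated tokens
        if n_words > max_words:
            problems.append(
                f"paragraph {i} has {n_words} words (limit {max_words})"
            )
    if len(keywords) != 5:
        problems.append(f"expected 5 keywords, got {len(keywords)}")
    return problems


good = check_summary(
    ["Motivation text.", "Method text.", "Results text."],
    ["agents", "reasoning", "tool use", "prompting", "LLM"],
)
print(good)  # []
```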
Stage 3 — Tool Use: Auto-Fetch Papers
# step3_tool_use.py
import requests
from anthropic import Anthropic
from step2_paper_summary import SYSTEM_PROMPT # written in the previous stage
client = Anthropic()
# Define tool
TOOLS = [{
    "name": "fetch_arxiv",
    "description": "Fetch arXiv paper abstract by URL",
    "input_schema": {
        "type": "object",
        "properties": {
            "arxiv_url": {"type": "string"}
        },
        "required": ["arxiv_url"]
    }
}]
def fetch_arxiv(arxiv_url: str) -> str:
    """Tool implementation."""
    arxiv_id = arxiv_url.split("/")[-1].replace(".pdf", "")
    api_url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    r = requests.get(api_url, timeout=30)  # avoid hanging forever on a slow response
    # Simplified: real version should parse XML
    return r.text[:5000]
# ReAct loop
def run_agent(user_query: str) -> str:
    messages = [{"role": "user", "content": user_query}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2000,
            tools=TOOLS,
            messages=messages,
            system=SYSTEM_PROMPT,  # from Stage 2
        )
        # No more tool calls → done (also covers max_tokens and other stop reasons)
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")
        # Handle every tool call in this turn: each tool_use needs a matching tool_result
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            if block.name == "fetch_arxiv":
                result = fetch_arxiv(**block.input)
            else:
                result = f"Unknown tool: {block.name}"
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
# Run (guarded so later stages can import run_agent without triggering a call)
if __name__ == "__main__":
    print(run_agent("Summarize this paper: https://arxiv.org/abs/2210.03629"))
What you learn: tool schema syntax, ReAct loop mechanics, stop_reason for termination, tool_result round-trip.
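The tool above derives the paper ID with split("/")[-1], which breaks on trailing slashes or query strings. A slightly more defensive sketch uses a regular expression instead. `extract_arxiv_id` is a hypothetical helper, and it only covers new-style IDs of the form YYMM.NNNNN with an optional version suffix; old-style IDs such as hep-th/9901001 are not handled.

```python
import re

# New-style arXiv IDs: four digits, a dot, four or five digits, optional "vN".
ARXIV_ID_RE = re.compile(r"\d{4}\.\d{4,5}(?:v\d+)?")


def extract_arxiv_id(url: str) -> str:
    """Return the first new-style arXiv ID found in the URL, or raise ValueError."""
    match = ARXIV_ID_RE.search(url)
    if match is None:
        raise ValueError(f"no arXiv ID found in {url!r}")
    return match.group(0)


print(extract_arxiv_id("https://arxiv.org/abs/2210.03629"))        # 2210.03629
print(extract_arxiv_id("https://arxiv.org/pdf/2210.03629v2.pdf"))  # 2210.03629v2
```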
Stage 3 is the biggest leap in the walkthrough: your code goes from "calling the LLM" to "the LLM calling your code."
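The fetch_arxiv tool returns the first 5000 characters of raw XML and notes that a real version should parse it. The arXiv API responds with an Atom feed, so one way to do that is with the standard library's ElementTree. This is a sketch under assumptions: `parse_arxiv_feed` is a hypothetical helper, and SAMPLE_FEED is a hand-written stand-in, not a real API response.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def _squash(text: str) -> str:
    """Collapse line wraps and repeated spaces inside a text node."""
    return " ".join(text.split())


def parse_arxiv_feed(xml_text: str) -> str:
    """Extract the title and abstract of the first entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    entry = root.find("atom:entry", ATOM_NS)
    if entry is None:
        return "No entry found for this arXiv ID."
    title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
    summary = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
    return f"Title: {_squash(title)}\nAbstract: {_squash(summary)}"


# Hand-written sample feed for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>A Toy
      Paper</title>
    <summary>  We study toy problems.
    </summary>
  </entry>
</feed>"""

print(parse_arxiv_feed(SAMPLE_FEED))
```

Returning only the title and abstract keeps the tool result short, which also keeps the token count of each loop iteration down.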
Stage 4 — Framework + Reflection
Install (skip if you already ran the Stage 0 install):
pip install langgraph langchain-anthropic langchain-core
Rewrite with LangGraph and add a self-review node:
# step4_langgraph.py
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langgraph.prebuilt import create_react_agent
from langgraph.graph.message import add_messages
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage
@tool
def fetch_arxiv(arxiv_url: str) -> str:
    """Fetch arXiv paper abstract."""
    import requests
    arxiv_id = arxiv_url.split("/")[-1].replace(".pdf", "")
    r = requests.get(f"http://export.arxiv.org/api/query?id_list={arxiv_id}")
    return r.text[:5000]
class State(TypedDict):
    messages: Annotated[list, add_messages]
    revisions: int  # bound the loop
llm = ChatAnthropic(model="claude-sonnet-4-6")
react_agent = create_react_agent(llm, tools=[fetch_arxiv])
MAX_REVISIONS = 2
def reflect(state: State) -> State:
    """Have the LLM review the previous summary and decide whether to redo."""
    last_summary = state["messages"][-1].content
    # Use an explicit yes/no verdict instead of keyword-matching prose
    review_prompt = (
        f"Does the following summary satisfy: 3 paragraphs, each ≤60 words, "
        f"5 English keywords, no fabrication?\n\n"
        f"{last_summary}\n\n"
        "Reply with PASS or NEEDS_REVISION only, with no explanation."
    )
    verdict = llm.invoke(review_prompt).content.strip().upper()
    return {
        "messages": [HumanMessage(content=f"[Reviewer verdict: {verdict}]")],
        "revisions": state.get("revisions", 0) + 1,
    }
def should_continue(state: State) -> str:
    """Decide whether to loop back to agent or terminate."""
    last_msg = state["messages"][-1].content
    if state["revisions"] >= MAX_REVISIONS:
        return END  # bound reached, exit unconditionally
    if "NEEDS_REVISION" in last_msg:
        return "agent"  # redo
    return END  # PASS → exit
# Build graph
graph = StateGraph(State)
graph.add_node("agent", react_agent)
graph.add_node("reflect", reflect)
graph.add_edge("agent", "reflect")
graph.add_conditional_edges("reflect", should_continue, {"agent": "agent", END: END})
graph.set_entry_point("agent")
app = graph.compile()
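# Run (guarded so Stage 6 can import from this file without triggering a run)
if __name__ == "__main__":
    result = app.invoke({
        "messages": [HumanMessage(content="Summarize https://arxiv.org/abs/2210.03629")],
        "revisions": 0,
    })
    # The last message is the reviewer verdict, so print the last AI message instead
    summary = next(m for m in reversed(result["messages"]) if m.type == "ai")
    print(summary.content)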
What you learn: what the framework abstracts (while loop, message structure, tool registration), how to define conditional branches with proper termination, how the reflection pattern lets an agent self-correct within a bounded number of rounds (no infinite loop).
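To see what the graph is doing without the framework, the same bounded reflection pattern can be written as a plain loop. This is a framework-free sketch, not the LangGraph implementation: `run_with_reflection` is a hypothetical helper, and the generate and review callables are stubs standing in for the agent and the reviewer LLM call.

```python
from typing import Callable


def run_with_reflection(
    generate: Callable[[str], str],
    review: Callable[[str], str],
    task: str,
    max_revisions: int = 2,
) -> tuple[str, int]:
    """Draft, review, and redo until the reviewer says PASS or the bound is hit.

    Returns (final_draft, number_of_reviews_performed).
    """
    draft = generate(task)
    reviews = 0
    while True:
        verdict = review(draft).strip().upper()
        reviews += 1
        # Same exit rule as should_continue: stop on PASS or when the bound is reached.
        if verdict == "PASS" or reviews >= max_revisions:
            return draft, reviews
        # Feed the verdict back so the next draft can react to it.
        draft = generate(f"{task}\n[Reviewer verdict: {verdict}]")


# Stubs: the first draft fails review, the second passes.
drafts = iter(["too long draft", "fixed draft"])
final, rounds = run_with_reflection(
    generate=lambda _prompt: next(drafts),
    review=lambda d: "PASS" if d.startswith("fixed") else "NEEDS_REVISION",
    task="Summarize the paper",
)
print(final, rounds)  # fixed draft 2
```

The bound is checked on every pass, so a reviewer that never returns PASS still terminates after max_revisions reviews.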
Note: After Stage 4 we don't show LangGraph state internals again — later stages treat the LangGraph agent as a black box.
Stage 5 — Claude Code Project Skill
This step is not Python. It repackages the logic from Stages 1-4 as a Claude Code project skill that Claude loads natively. With a clear description, Claude will auto-trigger it when the user makes a relevant request.
In your repo, create:
your-repo/
└── .claude/
    └── skills/
        └── paper-summary/
            └── SKILL.md
SKILL.md content:
---
name: paper-summary
description: Summarize arXiv papers. Trigger when the user pastes an arXiv URL, mentions a paper ID (e.g. 2210.03629), or asks "summarize this paper / 摘要論文". Outputs a 3-paragraph summary, 5 keywords, and differences from mainstream approaches.
---
# Paper Summary Skill
## What this does
Summarize an arXiv paper into 3 structured paragraphs + keywords + difference points.
## When Claude should use this
The user:
- Pastes an arXiv URL (`https://arxiv.org/abs/...` or `arxiv.org/pdf/...`)
- Mentions a specific paper (title or ID) and asks for a summary
- Asks "how does this paper differ from other approaches"
## How to do it
1. Fetch paper content from the URL (use Claude Code's built-in WebFetch tool; or Read tool if a PDF is attached)
2. Apply this prompt structure:
   - Motivation (≤60 words)
   - Method (≤60 words)
   - Results (≤60 words)
   - 5 English keywords
   - 2-3 differences from mainstream
3. If something isn't stated, say "not stated in the paper" — never fabricate
## References
- `references/example-summaries.md` — 3 example outputs in the target style
Once placed, open Claude Code in this repo — project-level skills auto-load (no install command needed). Claude triggers the skill when the user's input matches the description.
To verify it works: paste https://arxiv.org/abs/2210.03629 into a Claude Code session and check whether Claude responds in the format you defined.
What you learn: the difference between project skills and plugin marketplace skills (this one is project-level, active as soon as you're in the repo; plugins are a separate distribution layer); description is the discovery mechanism (not a magic trigger_phrases field); how references/ extends a skill with longer examples.
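Because the description field carries the discovery logic, a quick pre-commit check that the front matter is present and complete can catch silent failures. The sketch below is an assumption-laden illustration: `read_front_matter` and `missing_fields` are hypothetical helpers, the parser handles only flat single-line key: value pairs (not general YAML), and the required fields are simply the two used in this walkthrough's SKILL.md.

```python
REQUIRED_FIELDS = ("name", "description")  # the two fields used in this walkthrough


def read_front_matter(skill_md: str) -> dict[str, str]:
    """Parse the flat 'key: value' block between the two '---' lines."""
    lines = skill_md.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' line")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("front matter is missing its closing '---' line")


def missing_fields(skill_md: str) -> list[str]:
    """Return the required fields that are absent or empty."""
    fields = read_front_matter(skill_md)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


# Shortened sample for illustration.
SAMPLE = """---
name: paper-summary
description: Summarize arXiv papers. Trigger when the user pastes an arXiv URL.
---
# Paper Summary Skill
"""
print(missing_fields(SAMPLE))  # []
```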
Going further: if you want to package this skill as a shareable plugin (so others can install it in their own Claude Code), see Stage 5.4 Plugins & Marketplaces. This walkthrough doesn't cover plugin packaging.
Stage 6 — RAG Memory
Make the agent remember papers it has seen, comparing new ones against the past.
# step6_memory.py
import chromadb
from chromadb.utils import embedding_functions
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6")
# Local vector DB
chroma = chromadb.PersistentClient(path="./paper_memory")
embed_fn = embedding_functions.DefaultEmbeddingFunction()
collection = chroma.get_or_create_collection(
name="papers",
embedding_function=embed_fn,
)
def store_paper(arxiv_id: str, summary: str):
    """Store summary in vector DB."""
    collection.add(
        documents=[summary],
        ids=[arxiv_id],
        metadatas=[{"arxiv_id": arxiv_id}],
    )
def find_similar(query_summary: str, top_k: int = 3) -> list[dict]:
    """Find the top_k most similar past papers."""
    results = collection.query(query_texts=[query_summary], n_results=top_k)
    return [
        {"id": id_, "summary": doc}
        for id_, doc in zip(results["ids"][0], results["documents"][0])
    ]
# Modify Stage 4's agent: add a compare_with_memory step
def compare_with_memory(state):
    # The last message is the reviewer verdict from reflect, so take the last AI message
    new_summary = next(m for m in reversed(state["messages"]) if m.type == "ai").content
    similar = find_similar(new_summary, top_k=3)
    if not similar:
        comparison = "(no related papers in DB)"
    else:
        similar_lines = "\n".join(f"- {p['id']}: {p['summary'][:200]}" for p in similar)
        compare_prompt = (
            f"New paper summary: {new_summary}\n"
            f"Top 3 similar papers in DB:\n{similar_lines}\n"
            "List 2-3 unique contributions of the new paper not covered above."
        )
        comparison = llm.invoke(compare_prompt).content
    # Store the new paper in both branches, otherwise an empty DB never fills up.
    # "..." is a placeholder: pass the real arXiv ID here.
    store_paper(arxiv_id="...", summary=new_summary)
    return {"comparison": comparison}
Wire compare_with_memory into the Stage 4 graph:
# step6_memory.py (continued)
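from step4_langgraph import State, react_agent, reflect, should_continue, MAX_REVISIONS
from langgraph.graph import StateGraph, END
class MemoryState(State):
    comparison: str  # written by compare_with_memory; the Stage 4 State has no such key
graph = StateGraph(MemoryState)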
graph.add_node("agent", react_agent)
graph.add_node("reflect", reflect)
graph.add_node("compare", compare_with_memory) # the new node
graph.add_edge("agent", "reflect")
graph.add_conditional_edges("reflect", should_continue, {"agent": "agent", END: "compare"})
graph.add_edge("compare", END)
graph.set_entry_point("agent")
app_with_memory = graph.compile()
What you learn: how to use a vector DB, embeddings + similarity queries, taking an agent from "stateless" to "stateful," persistent storage design, and how to extend a graph with a new node without rewriting earlier logic.
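The idea behind "embeddings + similarity queries" can be shown with a brute-force search over toy vectors. This is only a conceptual sketch: Chroma computes real embeddings and uses its own index and distance setting, while the code below ranks hand-written 2-D vectors by cosine similarity. The paper IDs, vectors, and the helpers `cosine_similarity` and `top_k` are all made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the IDs of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda paper_id: cosine_similarity(query, store[paper_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings"; a real embedding has hundreds of dimensions.
store = {
    "paper-a": [1.0, 0.0],
    "paper-b": [0.7, 0.7],
    "paper-c": [0.0, 1.0],
}
print(top_k([1.0, 0.1], store, k=2))  # ['paper-a', 'paper-b']
```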
Stage 7 — Eval + Observability + Deploy
7.1 Eval (promptfoo)
Install:
npm install -g promptfoo
Promptfoo's Python provider expects a callable function, not a module variable. So wrap a thin provider:
# eval_provider.py
"""Promptfoo Python provider — function called by promptfoo."""
from step2_paper_summary import SYSTEM_PROMPT
from step3_tool_use import run_agent # ReAct loop from Stage 3
def call_api(prompt: str, options: dict, context: dict) -> dict:
    """Promptfoo passes vars (context['vars']) + prompt."""
    paper_url = context["vars"]["paper_url"]
    output = run_agent(f"Summarize this paper: {paper_url}")
    return {"output": output}
# promptfooconfig.yaml
prompts:
  - "Summarize: {{paper_url}}"
providers:
  - id: file://eval_provider.py
    label: paper-summary-agent
tests:
  - description: "ReAct paper"
    vars:
      paper_url: "https://arxiv.org/abs/2210.03629"
    assert:
      # icontains is case-insensitive, so "Reasoning" and "reasoning" both match
      - type: icontains
        value: "reasoning"
      - type: llm-rubric
        value: "Output contains 5 English keywords, each paragraph ≤ 60 words"
  - description: "BEIR retrieval benchmark paper"
    vars:
      paper_url: "https://arxiv.org/abs/2104.08663"
    assert:
      - type: icontains
        value: "retrieval"
Run: promptfoo eval && promptfoo view
7.2 Observability (langfuse)
Install:
pip install langfuse
Env vars (apply at cloud.langfuse.com):
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com" # or your self-hosted URL
# step7_observability.py
# Langfuse SDK v3 exposes observe at the top level.
# On SDK v2, use instead: from langfuse.decorators import observe
from langfuse import observe
from step3_tool_use import run_agent  # agent from earlier stages
@observe(name="paper-summary-agent")
def run_paper_agent(arxiv_url: str) -> str:
    return run_agent(f"Summarize {arxiv_url}")
if __name__ == "__main__":
    out = run_paper_agent("https://arxiv.org/abs/2210.03629")
    print(out)
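After running, open the Langfuse dashboard to view the trace for each call (input, output, latency). Token cost and individual tool calls show up only if the inner LLM calls are instrumented as well.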
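To make the decorator pattern less of a black box, here is a tiny stand-in that records the same kind of data locally. It is an illustration of the pattern only and is not how Langfuse is implemented: `traced`, `TRACE_LOG`, and `fake_agent` are hypothetical names, and the records go to an in-memory list instead of a tracing backend.

```python
import functools
import time

TRACE_LOG: list[dict] = []  # in-memory sink, for illustration only


def traced(name: str):
    """Decorator factory: record name, input, output, and latency of each call."""
    def decorator(fn):
        @functools.wraps(fn)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE_LOG.append({
                "name": name,
                "input": {"args": args, "kwargs": kwargs},
                "output": result,
                "latency_s": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced(name="fake-agent")
def fake_agent(url: str) -> str:
    """Stub standing in for the real agent call."""
    return f"summary of {url}"


fake_agent("https://arxiv.org/abs/2210.03629")
print(TRACE_LOG[0]["name"], "->", TRACE_LOG[0]["output"])
```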
7.3 Deploy (Docker + FastAPI)
Install:
pip install fastapi uvicorn pydantic
# main.py
from fastapi import FastAPI
from pydantic import BaseModel
from step7_observability import run_paper_agent # the Langfuse-wrapped version
app = FastAPI()
class PaperRequest(BaseModel):
    arxiv_url: str
@app.post("/summarize")
def summarize(req: PaperRequest):
    return {"summary": run_paper_agent(req.arxiv_url)}
# requirements.txt
anthropic
requests
langgraph
langchain-anthropic
langchain-core
chromadb
langfuse
fastapi
uvicorn
pydantic
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
docker build -t paper-summary-bot .
docker run -p 8000:8000 \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
-e LANGFUSE_PUBLIC_KEY=$LANGFUSE_PUBLIC_KEY \
-e LANGFUSE_SECRET_KEY=$LANGFUSE_SECRET_KEY \
paper-summary-bot
# Or deploy to Cloud Run / Fly.io / Railway / your own K8s
What you learn: eval as regression test, observability for debugging deployed agents, taking an agent from script to service.
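Once the container is running, the service can be exercised with a small client. The sketch below uses only the standard library and assumes the service is reachable at http://localhost:8000 (the port mapping from the docker run command above). `build_summarize_request` is a hypothetical helper; the actual network call is left commented out so the snippet runs without the server.

```python
import json
import urllib.request


def build_summarize_request(base_url: str, arxiv_url: str) -> urllib.request.Request:
    """Build a POST /summarize request with the JSON body the endpoint expects."""
    body = json.dumps({"arxiv_url": arxiv_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/summarize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_summarize_request(
    "http://localhost:8000", "https://arxiv.org/abs/2210.03629"
)
print(req.get_method(), req.full_url)

# To actually send it (requires the container to be running):
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["summary"])
```

A generous timeout is a reasonable choice here because one request may involve several LLM round-trips.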
✅ After the full walkthrough you should be able to:
- [ ] Build a ReAct agent from scratch (Stage 3)
- [ ] Rewrite with a framework and add advanced patterns (Stage 4)
- [ ] Package an agent as a Claude Code skill (Stage 5)
- [ ] Add RAG memory to make the agent stateful (Stage 6)
- [ ] Write evals + connect observability + deploy (Stage 7)
This example is ~350 lines of Python — more than a typical framework example, but every line is something you'll actually use.
🚧 Advanced extensions
If you want to go deeper, this paper-summary-bot can extend into:
- Multi-agent paper review: two agents play supportive vs adversarial reviewer, a third plays area chair → for-researcher branch
- Conference report generator: given a conference proceedings URL, produce per-track high-level summaries → knowledge-worker branch
- Topic trend tracker: weekly arXiv scan, compare new papers against existing memory, produce a weekly digest → personal-assistant branch
Each maps to a specialized branch.
💡 Maintaining this walkthrough
This example will evolve over time — SDK interfaces change, frameworks evolve, best practices shift. If something breaks:
- Open an issue with the exact error + your env (Python version, package versions)
- PR fixes should explain "why this change"
- Don't refactor this file to demo only your favorite framework — this is a multi-framework learning example