Exercise 5: Long-term Memory (remember across turns)¶

Pairs with Stage 6 — Memory & RAG Exercise 5.

Task¶

Agent remembers things across conversation turns. Implementation:

Turn 1: user: "I live in Taipei and prefer Python."
        → maybe_remember_fact() catches "I + verb...", stores in vector store
Turn 2: user: "What's 2+2?"        → recall returns nothing relevant, pure arithmetic
Turn 3: user: "Recommend a language for me."
        → recall pulls "prefer Python", drops it into the system prompt → LLM recommends Python

This is RAG's other side — what you retrieve isn't documents, it's conversation history.

How to run¶

Path A (default, free, local)¶

pip install -r requirements.txt
ollama pull qwen2.5:3b
ollama serve
python starter.py

Budget: $0.

Path B (Anthropic)¶

pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-ant-...
python starter_anthropic.py

Budget: ~$0.001 per run.

Validate the logic¶

python test.py             # 5 tests with mock LLM
python test_anthropic.py   # Anthropic mock

MemoryStore + chat flow¶

class MemoryStore:
    def remember(self, fact: str) -> str:   # add to vector store
    def recall(self, query: str) -> list:    # top-k semantic search

def chat(user_msg, memory):
    memories = memory.recall(user_msg, top_k=3)   # 1. fetch relevant memories
    system = f"Relevant memories: {memories}"      # 2. put them in system prompt
    return llm.invoke(system + user_msg)           # 3. LLM uses them

Key: don't store memory as "context window history" (would blow the limit) — store as "semantic search index" — LLM only sees relevant ones.

vs. plain chat history¶

Dimension	Chat history (in-context)	Vector memory store
Stored where	messages array	ChromaDB
Capacity	Context window (200k-2M tokens)	Unbounded (millions of facts)
Cross-session	❌ Session ends, gone	✅ Persistent
Retrieval	All history in prompt (eats tokens)	top-k semantic search (precise)
Good for	Short conversations, single session	Long-term user relationships, multi-session, persona

Production: use both — last N turns in-context, beyond N + cross-session via vector memory.

What's a "memory-worthy fact"¶

This demo uses a heuristic: user says "I + verb..." → store. Production is more sophisticated:

Explicit trigger: user says "remember that..."
Profile facts: location / language / role / preferences
Past decisions: how the agent handled some situation before
Negative feedback: "don't suggest X" must persist
LLM-extracted: each turn, use an LLM to extract facts (mem0 / Letta / MemGPT all do this)

Common pitfalls¶

Add to memory every turn: vector store explodes. Filter with fact extraction
No dedup: user says "I live in Taipei" 5 turns in a row, 5 copies stored. Add dedup (similarity > 0.95 = duplicate)
No forget / update mechanism: user moved — "I now live in Tokyo". Old "Taipei" memory? Need a supersede concept
No context size control: top-k too big, context bloats, LLM distracted
Privacy / GDPR: user requests deletion; need forget(user_id) API

Production-ready tools¶

mem0: full memory pipeline — auto-fact-extraction, forgetting, user-scoped namespaces
Letta (formerly MemGPT): two-tier memory (working + archival), OS-paging concept for LLMs
CrewAI memory: built-in short / long-term memory
LangGraph checkpointer + persistent storage: thread-level memory out of the box

Extensions¶

Dedup: if similarity(new, existing) > 0.95: skip
LLM-based fact extraction: each turn, a small LLM extracts facts — beats heuristics
user_id scoping: MemoryStore takes user_id filter so users don't contaminate each other
Plug into mem0: don't roll your own memory pipeline in production