Exercise 2: Vector DB (ChromaDB) + semantic vs keyword

Pairs with Stage 6 — Memory & RAG Exercise 2.

Task

Index 8 docs into Chroma; compare semantic (vector) vs keyword (substring) retrieval on the same query.

How to run

pip install -r requirements.txt
python starter.py   # auto-downloads embedding model on first run

Budget: $0. Runs in in-memory mode; the index is released when the process exits.

python test.py             # 5 tests for index/query/ranking
python test_anthropic.py   # Path B concept demo (same as starter)

When to use a vector DB

Scenario	List + cosine	ChromaDB
< 100 docs	✅ enough	overkill
100-10K docs	Slow (linear scan; docs re-embedded every run)	✅ persistent + indexed
10K+ docs	No	✅ (consider Qdrant / Weaviate at huge scale)
Persistence	Re-embed	✅ SQLite backend
Filter / metadata	DIY	✅ where clause
Hybrid search	DIY	Partial: where_document keyword filter + vector; BM25 support depends on the Chroma version
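
For reference, the "List + cosine" column in the table above can be sketched in a few lines of plain Python. This is a minimal illustration, not the exercise's starter code: the three-dimensional vectors and the `top_k` helper are hypothetical stand-ins for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Brute-force scan: score every document, keep the k best."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings (real ones have hundreds of dimensions).
doc_vecs = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.0, 1.0, 0.0],
    "doc3": [0.7, 0.7, 0.0],
}
results = top_k([1.0, 0.2, 0.0], doc_vecs, k=2)
print([doc_id for doc_id, _ in results])
```

Every query touches every vector, which is why this approach is fine for a handful of docs and becomes the bottleneck as the list grows.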

Rule of thumb: experimentation = EphemeralClient; production = PersistentClient(path=...).

Semantic vs keyword

Query: "where to drink good coffee in Asian cities"

📝 Keyword (substring) → misses doc 3
    The full query string never appears verbatim in doc 3, even though doc 3 contains the word "coffee"

🔍 Semantic (vector) → hits doc 3
    "Coffee shops in Taipei often serve pour-over..."
    Semantic alignment, not literal match
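
A small sketch of why the substring check misses, assuming the keyword side tests whether the whole query occurs inside the document (an assumption about the starter; the helper names here are hypothetical). The doc 3 text is the truncated excerpt shown above.

```python
import re

def substring_match(query, doc):
    """Whole-query substring test, case-insensitive."""
    return query.lower() in doc.lower()

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def token_overlap(query, doc):
    """Words shared by query and document."""
    return tokens(query) & tokens(doc)

query = "where to drink good coffee in Asian cities"
doc3 = "Coffee shops in Taipei often serve pour-over"  # truncated excerpt

whole_query_hit = substring_match(query, doc3)
shared = token_overlap(query, doc3)
print(whole_query_hit, sorted(shared))
```

The two texts share words, but the query as a single string is not contained in the document, so a naive substring search returns nothing. Token-level scoring (the idea behind BM25) or embeddings recover the match.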

Dimension	Keyword	Semantic
Synonyms ("car" vs "auto")	Miss	Catch
Rephrasings	Miss	Catch
Typos	Miss	Catch (minor typos only)
Exact proper nouns	Strong	Occasionally confused
Negation (NOT)	Easy	Hard (embeddings don't grok negation)
Speed	Fast	Medium (need to embed the query)
Production	BM25 + vector hybrid	Same

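Production takeaway: use both; hybrid search (BM25 + vector) is common practice. Built-in BM25 support in Chroma depends on the version, so check the docs for the release you install, or pair Chroma with an external scorer such as rank_bm25.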
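
One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The source does not name a fusion method, so treat this as one possible sketch: the function name, the `k=60` constant, and the doc ids are all illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists: each list adds 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of a keyword scorer and a vector query.
keyword_ranking = ["doc5", "doc3", "doc1"]
vector_ranking = ["doc3", "doc7", "doc5"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

RRF only needs ranks, not comparable scores, which makes it convenient when the keyword and vector scorers use different scales.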

Chroma API

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.EphemeralClient()    # in-memory; PersistentClient(path=...) for disk
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="...")
collection = client.get_or_create_collection(name="demo", embedding_function=embed_fn)

collection.add(ids=[...], documents=[...], metadatas=[{"category": "..."}, ...])
collection.query(query_texts=[query], n_results=3, where={"category": "tech"})
collection.upsert(...)
collection.delete(ids=[...])
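
To make the two filter arguments concrete, here is a pure-Python emulation of what an equality `where` filter and a `$contains` document filter select. It is not Chroma's implementation: it covers only these two forms, assumes case-sensitive matching, and uses hypothetical records.

```python
def matches_where(metadata, where):
    """Equality filter on metadata, e.g. {"category": "tech"}."""
    return all(metadata.get(key) == value for key, value in where.items())

def matches_where_document(document, where_document):
    """Content filter, e.g. {"$contains": "ramen"} (case-sensitive here)."""
    needle = where_document.get("$contains")
    return needle is None or needle in document

# Hypothetical records for illustration.
records = [
    {"id": "a", "document": "GPU prices keep falling", "metadata": {"category": "tech"}},
    {"id": "b", "document": "Best ramen in Osaka", "metadata": {"category": "food"}},
    {"id": "c", "document": "Tech workers love ramen", "metadata": {"category": "tech"}},
]

tech_ids = [r["id"] for r in records if matches_where(r["metadata"], {"category": "tech"})]
ramen_ids = [r["id"] for r in records
             if matches_where_document(r["document"], {"$contains": "ramen"})]
both_ids = [r["id"] for r in records
            if matches_where(r["metadata"], {"category": "tech"})
            and matches_where_document(r["document"], {"$contains": "ramen"})]
print(tech_ids, ramen_ids, both_ids)
```

In a real query these filters narrow the candidate set, and vector similarity then ranks whatever remains.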

Common pitfalls

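Duplicate ids in .add(): depending on the Chroma version this raises or skips the record, and it never overwrites the stored one. Use .upsert() or check .get()["ids"] first
Rebuilding the collection on each query: don't! With PersistentClient you index once and reuse the collection
n_results too high: with no reranker, a large k pulls in noise. 3-10 is typical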
Filter confusion: where={"category": "tech"} is metadata; where_document={"$contains": "..."} is content
Inconsistent embedding function: indexing with model A and querying with model B breaks retrieval. Chroma attaches embedding_function to the collection object so add and query share one model; when reopening a persisted collection, pass the same function again
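
For the duplicate-id pitfall, a batch can be checked before it is sent to the collection. The helper below is a hypothetical sketch in plain Python; `existing_ids` stands in for whatever `.get()["ids"]` returns from the collection.

```python
def split_batch(batch_ids, existing_ids):
    """Partition a batch into new ids, ids already stored, and in-batch repeats."""
    seen = set()
    new_ids, already_stored, duplicates_in_batch = [], [], []
    for doc_id in batch_ids:
        if doc_id in seen:
            duplicates_in_batch.append(doc_id)  # repeated within this batch
            continue
        seen.add(doc_id)
        if doc_id in existing_ids:
            already_stored.append(doc_id)       # candidate for upsert
        else:
            new_ids.append(doc_id)              # safe to add
    return new_ids, already_stored, duplicates_in_batch

existing_ids = {"d1", "d2"}  # hypothetical ids already in the collection
new_ids, already_stored, duplicates = split_batch(["d2", "d3", "d3", "d4"], existing_ids)
print(new_ids, already_stored, duplicates)
```

New ids can go through `.add()`, already stored ids through `.upsert()`, and in-batch repeats usually point to a bug in the loader.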

Production-ready alternatives

# Persistent
collection = build_collection(path="./chroma_db")

# Cloud embeddings (often higher quality; paid API, so no longer $0)
embed_fn = embedding_functions.OpenAIEmbeddingFunction(api_key=..., model_name="text-embedding-3-small")

Extensions

Metadata filter: collection.query(query_texts=[q], where={"category": "food"})
Hybrid search: BM25 + vector, using Chroma's built-in support where your version provides it, or external rank_bm25
Swap to Qdrant / Weaviate at production scale
Plug into Exercise 4: full RAG pipeline reuses this collection