Skip to content

Exercise 2: Vector DB (ChromaDB) + semantic vs keyword

Pairs with Stage 6 — Memory & RAG Exercise 2.

Task

Index 8 docs into Chroma; compare semantic (vector) vs keyword (substring) retrieval on the same query.

How to run

pip install -r requirements.txt
python starter.py   # auto-downloads embedding model on first run

Budget: $0. In-memory mode; released after process exits.

python test.py             # 5 tests for index/query/ranking
python test_anthropic.py   # Path B concept demo (same as starter)

When to use a vector DB

Scenario List + cosine ChromaDB
< 100 docs ✅ enough overkill
100-10K docs Slow (re-embed each query) ✅ persistent + indexed
10K+ docs No ✅ (consider Qdrant / Weaviate at huge scale)
Persistence Re-embed ✅ SQLite backend
Filter / metadata DIY ✅ where clause
Hybrid search DIY ✅ built-in BM25 + vector

Rule of thumb: experimentation = EphemeralClient; production = PersistentClient(path=...).

Semantic vs keyword

Query: "where to drink good coffee in Asian cities"

📝 Keyword (substring) → misses doc 3
    Query doesn't have the exact word "coffee"

🔍 Semantic (vector) → hits doc 3
    "Coffee shops in Taipei often serve pour-over..."
    Semantic alignment, not literal match
Dimension Keyword Semantic
Synonyms ("car" vs "auto") Miss Catch
Rephrasings Miss Catch
Typos Miss Catch (small)
Exact proper nouns Strong Occasionally confused
Negation (NOT) Easy Hard (embeddings don't grok negation)
Speed Fast Medium (need to embed the query)
Production BM25 + vector hybrid Same

Production takeaway: use both — hybrid search is best practice. Chroma 0.4+ has BM25 + vector built in.

Chroma API

client = chromadb.EphemeralClient()    # in-memory; PersistentClient(path=...) for disk
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="...")
collection = client.get_or_create_collection(name="demo", embedding_function=embed_fn)

collection.add(ids=[...], documents=[...], metadatas=[{"category": "..."}, ...])
collection.query(query_texts=[query], n_results=3, where={"category": "tech"})
collection.upsert(...)
collection.delete(ids=[...])

Common pitfalls

  • Duplicate ids in .add(): raises. Use .upsert() or check .get()["ids"] first
  • Rebuilding the collection each query: don't! PersistentClient indexes once
  • n_results too high: no reranker — large k pulls in noise. 3-10 typically
  • Filter confusion: where={"category": "tech"} is metadata; where_document={"$contains": "..."} is content
  • Inconsistent embedding function: indexing with model A and querying with model B breaks retrieval. Chroma binds embedding_function to the collection to prevent this

Production-ready alternatives

# Persistent
collection = build_collection(path="./chroma_db")

# Cloud embeddings (higher quality)
embed_fn = embedding_functions.OpenAIEmbeddingFunction(api_key=..., model_name="text-embedding-3-small")

Extensions

  • Metadata filter: collection.query(query_texts=[q], where={"category": "food"})
  • Hybrid search: BM25 + vector via Chroma 0.4+ or external rank_bm25
  • Swap to Qdrant / Weaviate at production scale
  • Plug into Exercise 4: full RAG pipeline reuses this collection