Exercise 1: Embeddings + Nearest Neighbors¶
Pairs with Stage 6 — Memory & RAG Exercise 1.
Task¶
Embed 100 sentences, then for a query find the top-k most similar. Observe what cosine similarity ranking means.
How to run — two paths¶
Path A (default, free, local)¶
pip install -r requirements.txt
python starter.py # downloads ~80 MB on first run
Budget: $0. sentence-transformers/all-MiniLM-L6-v2 runs on CPU, ~100 sentences in < 1 second.
Path B (cloud embedding, comparison, very cheap)¶
pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
python starter_anthropic.py
Budget: ~$0.00002 per run (text-embedding-3-small, 100 sentences).
💡 Anthropic doesn't provide an embedding API — they officially recommend Voyage AI. This demo uses OpenAI (most common); swapping to Voyage is a client swap.
Validate the logic¶
python test.py # mock SentenceTransformer, no download
python test_anthropic.py # mock OpenAI client, validate normalize
Core concepts¶
# 1. Encode → vector
sent_vecs = model.encode(sentences, normalize_embeddings=True) # 100 × 384 vec
q_vec = model.encode([query], normalize_embeddings=True)[0] # 384 vec
# 2. Cosine similarity = dot product (because normalized)
sims = sent_vecs @ q_vec # 100 similarity scores
# 3. Top-k
top_idx = np.argsort(-sims)[:top_k]
Why normalize: normalized vectors' dot product equals cosine similarity directly (range [-1, 1]) — no need to recompute norms. Standard vector DB trick.
Local vs cloud embedding¶
| Dimension | sentence-transformers (local) | OpenAI text-embedding-3-small (cloud) |
|---|---|---|
| Dims | 384 | 1536 |
| Speed (100 sents, CPU) | < 1s | 1-2s (incl. network) |
| Cost | $0 | $0.00002 / 100 sentences |
| Multilingual | OK (paraphrase-multilingual-MiniLM-L12-v2) |
Strong |
| Long context (>512 tokens) | Truncated | Strong |
| Determinism | 100% | 99% (API has minor noise) |
Bottom line: personal / small data / local experimentation — sentence-transformers is plenty. Heavy multilingual / long docs / SaaS — go cloud.
Common pitfalls¶
- No normalization: cosine ≠ dot product; compute
sim = dot(a,b) / (|a||b|)yourself - Mixed precision: sentence-transformers defaults to fp32; fp16 quantization (memory savings) shifts similarities 1-2%
- Don't compare vectors across models: MiniLM and OpenAI are different semantic spaces; cosines aren't comparable
- Tiny queries: 1-2 word queries embed poorly; use full sentences
Want better embeddings?¶
# Larger local model (better accuracy, slower)
# In starter.py change MODEL_NAME to:
# "sentence-transformers/all-mpnet-base-v2" # 768 dims, ↑ accuracy
# "sentence-transformers/paraphrase-multilingual-..." # multilingual
# Higher-quality cloud
EMBED_MODEL=text-embedding-3-large python starter_anthropic.py # 3072 dims, $$
Extensions¶
- BM25 + embedding hybrid: combine keyword and semantic — common in production
- Add a reranker: feed top-k to a cross-encoder (
cross-encoder/ms-marco-MiniLM-L-6-v2) — big precision lift - Plug into Exercise 2 vector DB: store in Chroma so you don't re-embed each run