Exercise 1: Embeddings + Nearest Neighbors

Pairs with Stage 6 — Memory & RAG Exercise 1.

Task

Embed 100 sentences, then retrieve the top-k sentences most similar to a query. Observe how ranking by cosine similarity orders the results.

How to run: two paths

Path A (default, free, local)

pip install -r requirements.txt
python starter.py   # downloads ~80 MB on first run

Budget: $0. The model, sentence-transformers/all-MiniLM-L6-v2, runs on CPU and encodes ~100 sentences in under a second.

Path B (cloud embedding, comparison, very cheap)

pip install -r requirements.txt
export OPENAI_API_KEY=sk-...
python starter_anthropic.py

Budget: ~$0.00002 per run (text-embedding-3-small, 100 sentences).

💡 Anthropic doesn't provide an embedding API — they officially recommend Voyage AI. This demo uses OpenAI (most common); swapping to Voyage is a client swap.
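For orientation, here is a minimal sketch of the kind of call Path B relies on. It is not the contents of starter_anthropic.py, and the helper name `embed_normalized` is hypothetical. It assumes the official `openai` Python client (v1 or later), whose `client.embeddings.create(model=..., input=[...])` returns an object with a `.data` list whose items carry `.embedding`. The client is passed in as an argument so the function can be exercised with a stub, without a key or network access.

```python
import math


def embed_normalized(client, texts, model="text-embedding-3-small"):
    """Embed texts with an OpenAI-style client and L2-normalize each vector.

    `client` is assumed to expose client.embeddings.create(model=..., input=...),
    as the official openai>=1.0 client does. The helper name is hypothetical.
    """
    resp = client.embeddings.create(model=model, input=list(texts))
    vectors = []
    for item in resp.data:
        vec = list(item.embedding)
        norm = math.sqrt(sum(x * x for x in vec))
        # Guard against an all-zero vector to avoid dividing by zero.
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vectors


# Real usage (needs OPENAI_API_KEY and network access):
#   from openai import OpenAI
#   vecs = embed_normalized(OpenAI(), ["hello world", "vector search"])
```

Normalizing on the client side keeps the downstream dot-product ranking identical for both paths, whatever the provider returns.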

Validate the logic

python test.py             # mock SentenceTransformer, no download
python test_anthropic.py   # mock OpenAI client, validate normalize

Core concepts

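import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# 1. Encode → unit-length vectors (sentences, query, top_k assumed defined earlier)
sent_vecs = model.encode(sentences, normalize_embeddings=True)  # shape (100, 384)
q_vec = model.encode([query], normalize_embeddings=True)[0]     # shape (384,)

# 2. Cosine similarity = dot product (because both sides are normalized)
sims = sent_vecs @ q_vec        # shape (100,): one score per sentence

# 3. Top-k: indices of the k highest scores, best first
top_idx = np.argsort(-sims)[:top_k]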

Why normalize: for unit-length vectors, the dot product equals cosine similarity directly (range [-1, 1]), so norms never need to be recomputed at query time. This is a standard vector DB trick.
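The equivalence can be checked by hand on a toy example. The sketch below uses only the standard library and made-up 3-dimensional vectors instead of real embeddings. The helper names are illustrative and not taken from starter.py.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length (what normalize_embeddings=True does per row)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Textbook cosine similarity: dot(a, b) / (|a| * |b|)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 3-d "embeddings" (made up for illustration; real ones have 384 dims).
docs = [[1.0, 2.0, 2.0], [3.0, 0.0, 4.0], [-1.0, -2.0, -2.0], [0.0, 5.0, 0.0]]
query = [2.0, 4.0, 4.0]

doc_units = [l2_normalize(d) for d in docs]
q_unit = l2_normalize(query)

# After normalization, a plain dot product reproduces the cosine formula.
sims = [dot(d, q_unit) for d in doc_units]
assert all(math.isclose(s, cosine(d, query), abs_tol=1e-12) for s, d in zip(sims, docs))

# Top-k: indices sorted by descending similarity (pure-Python argsort).
top_k = 2
top_idx = sorted(range(len(sims)), key=lambda i: -sims[i])[:top_k]
print(top_idx, [round(sims[i], 3) for i in top_idx])  # → [0, 1] [1.0, 0.733]
```

Document 0 points in exactly the same direction as the query, so it scores 1.0 even though it is half as long. Document 2 points the opposite way and scores -1.0.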

Local vs cloud embedding

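Aspect	sentence-transformers (local)	OpenAI text-embedding-3-small (cloud)
Dims	384	1536
Speed (100 sentences, CPU)	< 1 s	1-2 s (incl. network)
Cost	$0	~$0.00002 / 100 sentences
Multilingual	OK with a multilingual model (`paraphrase-multilingual-MiniLM-L12-v2`)	Strong
Long inputs	Truncated (all-MiniLM-L6-v2 defaults to 256 word pieces)	Accepts up to 8,191 tokens per input
Determinism	Deterministic for a fixed model, library version, and hardware	Nearly deterministic (repeated API calls can differ slightly)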

Bottom line: for personal projects, small datasets, and local experimentation, sentence-transformers is plenty. For heavy multilingual workloads, long documents, or a SaaS product, go cloud.

Common pitfalls

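No normalization: without unit-length vectors, the dot product is not cosine similarity; compute sim = dot(a, b) / (|a| |b|) yourself
Mixed precision: sentence-transformers defaults to fp32; casting embeddings to fp16 (for memory savings) shifts similarities by roughly 1-2%, so use one precision for both stored vectors and queries
Don't compare vectors across models: MiniLM and OpenAI embeddings live in different semantic spaces (and have different dimensions, 384 vs 1536), so their cosine scores aren't comparable
Tiny queries: 1-2 word queries tend to embed poorly; use full sentences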
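The first pitfall is easy to reproduce. The sketch below uses made-up 2-dimensional vectors and illustrative helper names to show that ranking by raw dot product and ranking by cosine can disagree when vectors are not normalized.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # sim = dot(a, b) / (|a| |b|), the formula from the first pitfall
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank(scores):
    """Indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Made-up, unnormalized vectors: doc 0 points the same way as the query,
# doc 1 points elsewhere but is much longer.
query = [1.0, 1.0]
docs = [[1.0, 1.0], [10.0, 0.0]]

raw_dot = [dot(d, query) for d in docs]   # [2.0, 10.0]
cos = [cosine(d, query) for d in docs]    # [~1.0, ~0.707]

print(rank(raw_dot))  # → [1, 0]  the long vector wins on magnitude alone
print(rank(cos))      # → [0, 1]  the aligned vector wins on direction
```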

Want better embeddings?

# Larger local model (better accuracy, slower)
# In starter.py change MODEL_NAME to:
#   "sentence-transformers/all-mpnet-base-v2"           # 768 dims, ↑ accuracy
#   "sentence-transformers/paraphrase-multilingual-..." # multilingual

# Higher-quality cloud
EMBED_MODEL=text-embedding-3-large python starter_anthropic.py   # 3072 dims, $$

Extensions

BM25 + embedding hybrid: combine keyword and semantic rankings; common in production
Add a reranker: feed the top-k results to a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) for a big precision lift
Plug into Exercise 2 vector DB: store the vectors in Chroma so you don't re-embed on each run
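As a starting point for the hybrid extension, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with an embedding ranking. The exercise does not prescribe a fusion method, so treat RRF as an assumption. The function name, the doc ids, and the two input rankings are hypothetical, and BM25 scoring itself is not shown.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not tuned here.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for a stable result.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of two retrievers for the same query (doc ids, best first).
bm25_ranking = [3, 1, 7]        # keyword match
embedding_ranking = [1, 5, 3]   # cosine similarity

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)  # → [1, 3, 5, 7]
```

RRF works on ranks rather than raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.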