Exercise 3: Chunking Comparison (fixed / paragraph / heading-aware)¶
Pairs with Stage 6 — Memory & RAG Exercise 3.
Task¶
Same document → 3 chunking strategies → 5 queries → see which yields the best retrieval.
| Strategy | How | Good for |
|---|---|---|
| fixed-length | N chars per chunk + overlap | Plain text (logs, chat history) |
| paragraph | Split on double newline | Prose / articles |
| heading-aware | Split on # / ## |
Structured docs (README, wiki, specs) |
How to run¶
pip install -r requirements.txt
python starter.py # first run downloads embedding model
Budget: $0 (fully local).
python test.py # 4 tests for chunking logic
python test_anthropic.py # Path B concept demo
Strategy comparison¶
Fixed-length¶
def chunk_fixed(text, chunk_size=200, overlap=40):
# 200 chars per chunk, 40-char overlap
# Simple but splits sentences
Pros: 1-line implementation, works on any text. Cons: breaks sentences mid-flow.
Paragraph-based¶
def chunk_paragraphs(text):
return text.split("\n\n")
Pros: preserves semantic units. Cons: paragraphs vary in length (10 chars vs 500 chars), throws off embeddings.
Heading-aware¶
def chunk_headings(text):
return re.split(r"\n(?=#{1,3} )", text)
Pros: each chunk includes its heading (extra retrieval signal), structure preserved. Cons: requires markdown / structured docs — won't work on plain text or PDFs without headings.
Observations¶
For query "How much does the MRT cost?":
- fixed-length may cut "EasyCard is the universal payment. A single ride..." across chunks (MRT cost detached from context)
- paragraph retrieves the full Transit paragraph including NT$20-65
- heading-aware retrieves the entire
## Transitsection — embedding for "MRT cost" lands closer to the heading
For "What food can I try?":
- fixed-length hits a chunk containing Food info but with bad boundaries
- paragraph hits one of the Food paragraphs
- heading-aware returns the whole
## Foodsection — Shilin / dan bing / coffee info in one shot
Punchline: chunking strategy caps RAG quality — content the retriever misses, no LLM can answer.
Production considerations¶
- Chunk size: empirically 200-1000 chars (depends on embedding model's token limit; MiniLM is 512)
- Overlap: 10-20% on fixed-length to avoid splitting sentences
- Reranker: retrieve top-20, rerank with a cross-encoder, keep top-5 — big precision boost
- Metadata: tag each chunk
{"section": "Transit"}for filter queries - Hybrid hierarchy: split by heading first, then sub-split each section by fixed-length
Common pitfalls¶
- Chunks too large: embedding averages out details, retrieval misses precise paragraphs
- Chunks too small: insufficient context, LLM can't form holistic answers — need more top-k
- Testing on toy queries: production queries are different. Always validate chunking with real queries
- PDF chunking with markdown logic: PDFs have columns / footnotes / tables — use
pdfplumberorunstructured - CJK text byte-sliced mid-character: UTF-8 byte slice breaks multi-byte chars. Use character-level slicing
Smarter chunking options¶
LangChain RecursiveCharacterTextSplitter: tries\n\n, then\n, then., then char — auto-fallbackLlamaIndex SemanticSplitter: uses embeddings to find semantic boundaries (most accurate, slowest)unstructured: full pipeline for PDF / DOCX / HTML
Extensions¶
- Grid-search chunk_size: across 5 queries, find the size with highest mean similarity
- Metadata filter: tag chunks by section, query "only in Food section"
- Plug into Exercise 4: feed retrieval results to an LLM for full RAG