The Engineering Chronicle
CHUNKING STRATEGIES
The Art and Science of Document Segmentation for AI Systems
"How documents are split determines how well AI systems understand them. Chunking is the invisible foundation upon which all retrieval quality rests—get it wrong, and no amount of sophisticated modeling can compensate."
WHY CHUNKING MATTERS
Embedding models have context limits. A document of 50,000 tokens cannot be embedded as a single unit. It must be divided—but how? This seemingly simple question determines the success or failure of retrieval-augmented systems.
Poor chunking creates orphaned context. A chunk that says "as mentioned above" without including what was mentioned becomes meaningless. A table split mid-row loses its data relationships. A legal clause severed from its definitions becomes uninterpretable.
The embedding vector represents the semantic meaning of the chunk. If the chunk itself lacks coherent meaning, the vector becomes noise—degrading retrieval precision and polluting LLM context windows with irrelevant or misleading content.
CHUNKING APPROACHES
Fixed-size chunking splits text every N tokens regardless of content. Simple to implement, but semantically naive. It serves as a baseline but rarely as a production solution.
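As a minimal sketch of the fixed-size approach, the function below splits every N "tokens" — using whitespace-separated words as a stand-in for real tokens, where a production pipeline would use a tokenizer such as tiktoken:

```python
def fixed_size_chunks(text, chunk_size=512):
    """Split text every `chunk_size` tokens, ignoring content boundaries.

    Words stand in for tokens here; a real pipeline would count
    tokens with a tokenizer such as tiktoken.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]
```

The simplicity is the point — and the problem: a sentence, table row, or clause that straddles a chunk boundary is cut without warning.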
Recursive character splitting improves on this by respecting paragraph and sentence boundaries. LangChain's implementation handles most general content adequately, but struggles with structured documents.
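The core idea can be sketched in a few lines. This is a simplified, character-length version of what LangChain's RecursiveCharacterTextSplitter does, not its actual implementation: try the coarsest separator first (paragraph breaks), and fall back to finer ones only when a piece is still too long.

```python
def recursive_split(text, max_len=512, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split on the coarsest separator that yields
    pieces under max_len (measured in characters for simplicity)."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            chunks, buf = [], ""
            for part in parts:
                candidate = buf + sep + part if buf else part
                if len(candidate) <= max_len:
                    buf = candidate
                    continue
                if buf:
                    chunks.append(buf)
                if len(part) > max_len:
                    # Still too long: recurse with the finer separators.
                    chunks.extend(recursive_split(part, max_len, separators))
                    buf = ""
                else:
                    buf = part
            if buf:
                chunks.append(buf)
            return chunks
    # No separator found at all: hard split as a last resort.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Because paragraph breaks are tried before sentence breaks, and sentence breaks before spaces, chunks tend to end at the most natural available boundary.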
Semantic chunking uses embedding similarity to detect topic boundaries. When consecutive sentences diverge semantically beyond a threshold, a new chunk begins. This preserves topical coherence but requires embedding computation during ingestion.
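A minimal sketch of the boundary-detection step: compare adjacent sentence embeddings and cut wherever similarity drops below the threshold. The bag-of-words embedder below is a toy stand-in for a real model such as sentence-transformers, and the sketch compares single adjacent sentences rather than the multi-sentence windows a production splitter would use.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, threshold=0.8):
    """Start a new chunk wherever consecutive sentences fall
    below `threshold` cosine similarity."""
    chunks, current = [], [sentences[0]]
    prev = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev = vec
    chunks.append(" ".join(current))
    return chunks

# Toy bag-of-words embedder over a fixed vocabulary -- a stand-in
# for a real embedding model.
VOCAB = ["revenue", "profit", "gpu", "cuda"]
def toy_embed(sentence):
    words = sentence.lower().split()
    return [words.count(w) for w in VOCAB]
```

The threshold does real work here: too high and every sentence becomes its own chunk; too low and topic shifts slip through undetected.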
Document-aware chunking understands structure. Headers define sections. Lists stay unified. Tables preserve row-column relationships. Code blocks maintain syntactic integrity. This requires format-specific parsers but delivers superior results.
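For markdown, the structure-aware idea reduces to a header-aware splitter: each section becomes one chunk, and the header path travels with it as metadata. A minimal sketch:

```python
import re

def markdown_chunks(md_text):
    """Split markdown at headers, keeping each section's body as one
    chunk and recording the header path as metadata."""
    header_re = re.compile(r"^(#{1,6})\s+(.*)")
    chunks, path, current = [], [], []
    for line in md_text.splitlines():
        m = header_re.match(line)
        if m:
            if current:
                chunks.append({"headers": list(path),
                               "text": "\n".join(current).strip()})
                current = []
            level = len(m.group(1))
            # Truncate the path to the parent level, then descend.
            path = path[:level - 1] + [m.group(2)]
        else:
            current.append(line)
    if current:
        chunks.append({"headers": list(path),
                       "text": "\n".join(current).strip()})
    return chunks
```

The header path in metadata is what lets a retrieved chunk answer "as mentioned above" questions: the context that would otherwise be severed travels with the chunk.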
OVERLAP STRATEGIES
Chunk overlap creates redundancy—intentionally. When consecutive chunks share boundary content, context that might otherwise be severed is preserved in at least one chunk.
The overlap percentage balances retrieval quality against storage costs. Too little overlap risks losing critical context. Too much creates redundant embeddings that inflate vector stores and can cause duplicate retrieval results.
Empirical testing across document types reveals optimal ranges. Technical documentation benefits from 15-20% overlap. Narrative content works well at 10%. Legal text often requires 25% or higher due to dense cross-referencing.
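Mechanically, a fixed overlap percentage is just a sliding window whose step is smaller than its width. A minimal sketch over a pre-tokenized sequence:

```python
def overlapping_chunks(tokens, chunk_size=512, overlap_pct=0.15):
    """Slide a window of `chunk_size` tokens, stepping by
    chunk_size * (1 - overlap_pct) so consecutive chunks share
    their boundary content."""
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With chunk_size=512 and overlap_pct=0.15, each chunk repeats roughly 77 tokens of its predecessor — the storage cost the percentages above are trading against retrieval quality.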
Adaptive overlap adjusts based on content density. Sections with many internal references get higher overlap automatically. The system learns document patterns during ingestion.
IMPLEMENTATION SPECIFICATIONS
Recursive Splitter Config
- Chunk Size: 512-1024 tokens (domain-dependent)
- Overlap: 10-25% of chunk size
- Separators: ["\n\n", "\n", ". ", " "]
- Length Function: tiktoken for accuracy
Semantic Splitter Config
- Embedding Model: sentence-transformers
- Similarity Threshold: 0.75-0.85
- Window Size: 3 sentences
- Min Chunk: 100 tokens
Document-Aware Parsing
- PDF: PyMuPDF with layout analysis
- HTML: BeautifulSoup + structure extraction
- Markdown: Custom header-aware splitter
- Code: AST-based function boundaries
Quality Metrics
- Coherence Score: GPT-4 evaluation
- Retrieval Precision: Ground truth testing
- Chunk Size Distribution: Variance analysis
- Overlap Effectiveness: Context coverage
STRATEGY COMPARISON
| Strategy | Best For | Limitations | Compute Cost |
|---|---|---|---|
| Fixed-Size | Homogeneous text, baseline testing | Ignores semantic boundaries | Minimal |
| Recursive | General documents, articles | Misses topic shifts within paragraphs | Low |
| Semantic | Topic-diverse content, research papers | Embedding cost, variable chunk sizes | Medium |
| Document-Aware | Structured docs, legal, technical | Format-specific implementation | Medium-High |
PATENT DOCUMENTS
Patent claims have strict structural requirements. Independent claims must stay whole. Dependent claims need references to their parents preserved. The custom chunker identifies claim boundaries using regex patterns and maintains hierarchical relationships in metadata.
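The claim-boundary logic can be sketched as follows. The regex patterns here are illustrative simplifications — claims typically read "1. A device comprising..." or "2. The device of claim 1, wherein..." — and the real chunker would need to handle multi-parent dependencies and formatting variants:

```python
import re

# Illustrative patterns: claim numbering at line start, and
# back-references of the form "claim N".
CLAIM_RE = re.compile(r"^(\d+)\.\s", re.MULTILINE)
DEPENDS_RE = re.compile(r"claim\s+(\d+)", re.IGNORECASE)

def chunk_claims(claims_text):
    """One chunk per claim, with the parent claim (if any) in metadata."""
    matches = list(CLAIM_RE.finditer(claims_text))
    chunks = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(claims_text)
        body = claims_text[m.start():end].strip()
        dep = DEPENDS_RE.search(body)
        chunks.append({
            "claim": int(m.group(1)),
            "parent": int(dep.group(1)) if dep else None,  # None = independent claim
            "text": body,
        })
    return chunks
```

Keeping the parent reference in metadata means a retrieved dependent claim can always be expanded with the independent claim it narrows.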
Result: 94% retrieval precision on claim-specific queries, compared to 67% with recursive splitting. The difference directly impacts patent search quality and office action response accuracy.
FINANCIAL REPORTS
Financial documents mix narrative text, tables, and figures. Standard chunking destroys table structure. The document-aware approach extracts tables as markdown, preserving row-column relationships while enabling semantic search over numerical data.
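The rendering step — once a parser such as PyMuPDF has extracted the cells — is straightforward; the sketch below assumes the table already arrives as a header row plus data rows:

```python
def table_to_markdown(header, rows):
    """Render an extracted table as a markdown chunk so row-column
    relationships survive chunking and remain searchable."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)
```

Because each row keeps its column headers in the same chunk, an embedding of the chunk carries both the label ("Revenue") and the value — which is what lets numerical queries retrieve the right rows.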
Result: Queries like "Q3 revenue growth" correctly retrieve the relevant table rows, with values intact and properly attributed. This was impossible with naive chunking approaches.
"Chunking is where information retrieval meets information preservation. The best chunk is one that could stand alone as a coherent unit of knowledge."
— RAG Systems Design Principles