Chunking

1. Overview¶

Chunking splits documents into smaller units before embedding and indexing. It is a critical design choice because it directly determines retrieval granularity, context relevance, latency, and cost. Poor chunking is one of the easiest ways to silently break a RAG system.

2 Why Chunks Must Be Sized Carefully¶

Embedding models have fixed input limits (typically 512 to 8192 tokens), so documents must be split. But chunk size affects quality in both directions:

Too small: Chunks lose semantic meaning; retrieval returns fragments that don't fully answer the question.
Too large: Chunks contain multiple unrelated topics; embedding quality degrades; context window fills with noise.

3. Chunking Strategies¶

Fixed-Size Chunking¶

Documents are split into consecutive windows of N tokens with optional overlap. Simple and fast, but ignores semantic boundaries.

Algorithm:

Tokenise the document.
Split into consecutive windows of size N.
Optionally overlap adjacent windows by M tokens.
Pros: Simple to implement; fast and scalable; good baseline.
Cons: Ignores semantic boundaries; may split sentences mid-thought.
Use when: Baseline systems; uniform document formats; large-scale indexing where simplicity matters.

Sentence-Based Chunking¶

Documents are split at sentence boundaries, accumulating sentences until a token threshold is reached.

Pros: Preserves sentence semantics; reduces mid-sentence splits.
Cons: Sentence lengths vary; ignores higher-level document structure.
Use when: Narrative text; QA over articles or reports.

Paragraph-Based Chunking¶

Chunks are formed at paragraph boundaries and merged if small; large paragraphs are split further if needed.

Pros: Preserves local topical coherence; aligns with human-written structure.
Cons: Paragraph length is highly inconsistent; formatting noise can affect quality.
Use when: Well-structured documentation; markdown or HTML content.

Recursive Chunking¶

Applies a hierarchy of split rules — sections → paragraphs → sentences → fixed-size fallback — only falling back to finer splits when the chunk exceeds the size limit.

Pros: Preserves document structure; produces semantically meaningful chunks; handles diverse formats.
Cons: More complex to implement; requires reliable document parsing.
Use when: Enterprise documents; PDFs with headings; mixed-format content. This is the most common production approach.

Semantic / Context-Aware Chunking¶

Adjacent text units are grouped based on embedding similarity rather than fixed boundaries.

Pros: High semantic coherence; reduces context fragmentation.
Cons: Computationally expensive — requires embedding during preprocessing; sensitive to similarity thresholds.
Use when: High-precision RAG; smaller corpora where quality matters most.

Proposition-Based Chunking¶

Use an LLM to extract atomic factual propositions from each paragraph, storing each as a micro-chunk. Example: a paragraph about a company becomes individual chunks like "Founded in 2010", "Headquartered in Berlin", "Has 500 employees."

Pros: Maximum semantic precision; each chunk answers exactly one question; retrieval is very accurate.
Cons: LLM inference required at index time (expensive); chunks lose surrounding narrative context.
Use when: High-precision QA over structured factual content; small corpora where index build cost is acceptable.

Late Chunking¶

(Jina AI, 2024) Embed the full document first using a long-context encoder, then chunk the resulting token embeddings rather than the raw text. Each chunk's embeddings carry context from the surrounding document.

Pros: Chunks retain cross-chunk context; resolves co-reference ("it", "the company") within chunks; better retrieval for narrative documents.
Cons: Requires a long-context embedding model; cannot be applied to corpora already indexed with standard chunking.
Use when: Documents where context from earlier/later paragraphs is needed to interpret individual chunks (legal contracts, research papers, long-form articles).

Sliding Window Chunking¶

Overlapping windows slide across the document (e.g., 512-token window, 256-token stride).

Pros: Preserves cross-boundary context; reduces information loss at chunk edges.
Cons: Doubles or more the index size; higher storage and retrieval cost; redundant embeddings.
Use when: Long-form documents; multi-hop reasoning tasks; cases where boundary loss is critical.

4 Chunk Size and Top-k Are Coupled¶

Changing chunk size almost always requires adjusting top-k. They must be tuned jointly.

Chunk Size	Typical Top-k	Behaviour
Small (100–300 tokens)	High (10–20)	High recall, lower precision — many fragments retrieved
Medium (300–700 tokens)	Medium (4–8)	Balanced — good default starting point
Large (700–1500 tokens)	Low (1–3)	High precision, risk of missing relevant info

Common failure patterns:

Small chunks + low top-k → missing required information
Large chunks + high top-k → context overload and noise
Large chunks + low top-k → partial coverage

5 Chunk Overlap¶

Overlap (sharing tokens between adjacent chunks) prevents information loss at chunk boundaries.

Typical settings:

Fixed-size chunking: 10–20% overlap
Sliding window: stride equals 50% of window size
Recursive chunking: overlap often unnecessary

Benefits: Improved recall; reduced boundary effects.

Costs: Larger index; higher storage and retrieval cost; redundant embeddings.

Overlap is a mitigation strategy, not a substitute for good chunking design.

6 Chunk Metadata and Filtering¶

Attaching structured metadata to each chunk enables filtering before or after similarity search — one of the highest-ROI improvements in a vanilla RAG system.

Common metadata fields: document ID, section heading, timestamp/version, author, content type, access permissions.

How it's used:

Pre-filter by document type, date, or access permission before ANN search (faster, but risks reducing recall if over-filtered).
Post-filter after ANN search (preserves recall, wastes compute on irrelevant candidates).

Example: Retrieve only chunks from documents created after a certain date, or from a specific product version.

7 Special Cases: Tables and Code¶

Text-centric chunking destroys the structure of tables and source code.

Tables:

Never split a table row across chunks.
Attach the table schema and column headers as metadata to every row-chunk.
Consider serialising rows to natural language for embedding.

Code:

Chunk at function or class boundaries — never split a function across chunks.
File-level chunking for small files is acceptable.
Long-range dependencies mean that smaller granularity (line-level) loses context.

8 Adaptive Chunking¶

Different queries require different granularity. A single static chunking strategy cannot optimally serve all query types.

Query Type	Preferred Chunking
Fact lookup	Small chunks
Concept explanation	Medium chunks
Procedural steps	Large chunks
Multi-hop reasoning	Overlapping or sliding window

Adaptive approach: Maintain multiple indexes with different chunk sizes and select based on query classification. Higher accuracy, but more system complexity.