Chunking, the process of splitting documents into smaller pieces for embedding and retrieval, is one of the most underappreciated factors in RAG system quality. Get chunking wrong, and your retrieval system will return irrelevant fragments that confuse the LLM. Get it right, and you enable precise, context-rich retrieval that produces accurate, well-grounded responses. This guide covers every major chunking strategy, from simple fixed-size splitting to advanced semantic and agentic approaches.

Why Chunking Matters

Documents are usually too long to embed as a single unit and too long to pass wholesale into an LLM's context window as retrieved context. Chunking solves this by breaking documents into smaller pieces that can be individually embedded, retrieved, and passed to the LLM. But the way you chunk has enormous implications for retrieval quality.

The fundamental trade-off is between chunk size and retrieval precision. Small chunks are more precise but may lack necessary context. Large chunks provide more context but are less focused and consume more of the LLM's context window. Finding the right balance requires understanding your documents, your queries, and your use case.

"Chunking is where information architecture meets retrieval engineering. The best chunking strategy preserves the semantic integrity of your content while creating units that are individually meaningful and retrievable."

Chunking Strategies

Fixed-Size Chunking

The simplest approach: split text into chunks of a fixed number of characters or tokens, with optional overlap between consecutive chunks. This is the default in most RAG tutorials and works reasonably well for homogeneous text.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # characters per chunk
    chunk_overlap=50,    # overlap between consecutive chunks
    separators=["\n\n", "\n", ". ", " ", ""]
)

The RecursiveCharacterTextSplitter is smarter than a naive fixed-size split because it tries to break at natural boundaries (paragraphs first, then sentences, then words) while respecting the size limit.
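To make the mechanics concrete, here is a minimal pure-Python sketch of fixed-size splitting with overlap, without the library's boundary-aware fallbacks. The function name `fixed_size_chunks` is chosen for illustration, not a library API:

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    # Step forward by (chunk_size - overlap) so each chunk shares
    # its first `overlap` characters with the end of the previous one.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = fixed_size_chunks(text)
# Three chunks: text[0:500], text[450:950], text[900:1200]
```

The last 50 characters of each chunk reappear at the start of the next, which is exactly the redundancy discussed under overlap below.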

Semantic Chunking

Semantic chunking splits text based on meaning rather than character count. It uses embeddings to identify natural topic boundaries: when the embedding similarity between consecutive sentences drops below a threshold, a new chunk starts. This produces chunks that are semantically coherent, each covering a single topic or idea.
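A sketch of the idea, assuming a toy `embed` function in place of a real embedding model (in practice you would call a sentence-embedding model here):

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def semantic_chunks(sentences, embed, threshold=0.5):
    # Start a new chunk whenever the similarity between consecutive
    # sentence embeddings drops below `threshold`.
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Toy stand-in for a real embedding model:
embed = lambda s: [1.0, 0.0] if "cat" in s.lower() else [0.0, 1.0]
semantic_chunks(["Cats purr.", "Cats nap.", "Dogs bark.", "Dogs fetch."], embed)
# → ["Cats purr. Cats nap.", "Dogs bark. Dogs fetch."]
```

The threshold is the main tuning knob: a lower value yields fewer, larger chunks; a higher value splits more aggressively.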

Document-Structure-Based Chunking

For structured documents like HTML, Markdown, or PDF with headers, you can chunk based on the document structure itself. Split at section headers, creating chunks that correspond to the natural sections of the document. This preserves the author's intended organization and produces chunks with clear topical boundaries.
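For Markdown, this can be as simple as splitting before each header line. A minimal sketch (the regex handles ATX-style `#` headers only; real documents may need richer parsing):

```python
import re

def split_by_headers(markdown):
    # Split before every line starting with 1-6 '#' characters,
    # keeping each header together with its section body.
    return [block.strip()
            for block in re.split(r"\n(?=#{1,6} )", markdown)
            if block.strip()]

doc = "# Intro\nHello.\n\n## Setup\nSteps here.\n## Usage\nRun it."
split_by_headers(doc)
# → ["# Intro\nHello.", "## Setup\nSteps here.", "## Usage\nRun it."]
```

Because each chunk carries its own header, the retrieved text tells the LLM which section it came from for free.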

Sentence-Window Chunking

A hybrid approach: embed individual sentences for precise retrieval, but when a sentence is retrieved, return it along with a window of surrounding sentences for context. This gives you the retrieval precision of sentence-level chunking with the context richness of larger chunks.
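The retrieval-time expansion step can be sketched in a few lines (the function name `sentence_window` is illustrative; `hit_index` is the position of the retrieved sentence in the original document):

```python
def sentence_window(sentences, hit_index, window=1):
    # Return the matched sentence plus `window` neighbors on each side,
    # clamped to the document boundaries.
    lo = max(0, hit_index - window)
    hi = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[lo:hi])

sents = ["S0.", "S1.", "S2.", "S3.", "S4."]
sentence_window(sents, 2, window=1)
# → "S1. S2. S3."
```

Only the individual sentences are embedded; the window is assembled at query time, so storage cost stays at the sentence level.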

Key Takeaway

There is no single best chunking strategy. The right choice depends on your document types, query patterns, and the trade-off between precision and context that your application requires.

Choosing the Right Chunk Size

What chunk size to use is the most frequently asked question about chunking. Here are practical guidelines:

  • 200-500 tokens: Good for precise, fact-based retrieval where specific details matter. Best for FAQ-style queries and factual lookup.
  • 500-1000 tokens: The most common range for general-purpose RAG systems. Provides a good balance of precision and context.
  • 1000-2000 tokens: Better for tasks requiring broader context, such as summarization, analysis, or questions that span multiple paragraphs.
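When sizing chunks in characters, a rough rule of thumb of about 4 characters per token for English text lets you sanity-check where a chunk falls in these ranges. A hedged sketch (for exact counts, use a real tokenizer such as tiktoken rather than this heuristic):

```python
def estimate_tokens(text, chars_per_token=4):
    # ~4 characters per token is a common rule of thumb for English
    # text; actual tokenizer counts will vary.
    return len(text) // chars_per_token

estimate_tokens("a" * 2000)  # → 500, i.e. mid-range for general-purpose RAG
```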

The Role of Overlap

Chunk overlap creates redundancy between consecutive chunks. A sentence at the end of chunk N also appears at the beginning of chunk N+1. This overlap serves two purposes: it prevents important information from being split across chunk boundaries where it might be lost, and it provides context continuity so that each chunk can stand alone more effectively.

Typical overlap ranges from 10% to 20% of the chunk size. More overlap means better boundary coverage but also more storage and slightly slower indexing. Less overlap is more efficient but risks losing information at boundaries.

Advanced Chunking Techniques

Parent-Child Chunking

Embed small chunks for precise retrieval, but store references to larger parent chunks. When a small chunk is retrieved, return the larger parent to the LLM for more context. This gives you the best of both worlds: precise retrieval with rich context.
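A sketch of the bookkeeping involved, assuming children are cut naively at fixed sizes (all names here are illustrative; frameworks such as LangChain ship this pattern as a "parent document retriever"):

```python
def build_parent_child(parents, child_size=100):
    # Split each parent chunk into small children and record which
    # parent each child came from.
    children, parent_of = [], []
    for p_idx, parent in enumerate(parents):
        for i in range(0, len(parent), child_size):
            children.append(parent[i:i + child_size])
            parent_of.append(p_idx)
    return children, parent_of

def retrieve_with_parent(children, parent_of, parents, hit_index):
    # Match on the small child, but hand the larger parent to the LLM.
    return parents[parent_of[hit_index]]
```

Only the children are embedded and searched; the parents are stored separately and looked up after a child matches.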

Agentic Chunking

Use an LLM to decide how to chunk each document. The LLM reads the document and identifies the natural boundaries based on topic shifts, argument structure, and content organization. This is the most sophisticated approach but also the most expensive to run.
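The control flow can be sketched as follows, with `ask_llm` standing in for a real model call (in production this would be an API request to your LLM provider, and the prompt would carry more context than two paragraphs):

```python
def agentic_chunks(paragraphs, ask_llm):
    # `ask_llm` is a placeholder: it takes a prompt string and returns
    # the model's answer. Here it judges whether consecutive paragraphs
    # belong to the same topic.
    chunks, current = [], [paragraphs[0]]
    for prev, para in zip(paragraphs, paragraphs[1:]):
        prompt = ("Do these two paragraphs belong to the same topic? "
                  f"Answer yes or no.\n1: {prev}\n2: {para}")
        if ask_llm(prompt).strip().lower().startswith("no"):
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)
    chunks.append("\n\n".join(current))
    return chunks
```

Note the cost implication stated above: this design makes one model call per paragraph boundary, which is why agentic chunking is the most expensive strategy to run.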

Multi-Level Chunking

Create multiple representations of the same document at different granularity levels: sentence-level, paragraph-level, and section-level. At query time, search across all levels and combine the results for comprehensive retrieval.
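A minimal sketch of building the three levels, using deliberately simple splitting rules (sections on triple newlines, paragraphs on blank lines, sentences on ". "; a real pipeline would use the structure-based and sentence splitters described earlier):

```python
def multi_level_index(document):
    # Index the same document at three granularities; each level is a
    # separate list of chunks that would be embedded independently.
    sections = [s for s in document.split("\n\n\n") if s.strip()]
    paragraphs = [p for s in sections for p in s.split("\n\n") if p.strip()]
    sentences = [t.strip() for p in paragraphs
                 for t in p.split(". ") if t.strip()]
    return {"section": sections, "paragraph": paragraphs,
            "sentence": sentences}
```

At query time you would search all three lists and merge the hits, e.g. with reciprocal-rank fusion or simple score normalization.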

Evaluating Your Chunking Strategy

  1. Manually inspect chunks: Read through a sample of your chunks. Do they make sense as standalone units? Do they contain complete thoughts?
  2. Test retrieval quality: Run your evaluation queries and check whether the retrieved chunks actually contain the information needed to answer them.
  3. Measure end-to-end performance: The ultimate test is whether better chunking produces better final answers from the LLM.
  4. A/B test strategies: Compare different chunking approaches on the same evaluation set to find the optimal strategy for your data.
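Steps 2 and 4 can share one harness: a hit-rate metric over a fixed evaluation set, run once per chunking strategy. A sketch with a toy keyword retriever standing in for embedding search (`hit_rate` and `keyword_retrieve` are names chosen here, not library functions):

```python
def hit_rate(chunks, eval_set, retrieve):
    # Fraction of eval queries whose top retrieved chunk contains the
    # expected answer text. `retrieve` maps (query, chunks) -> chunk.
    hits = sum(1 for query, expected in eval_set
               if expected in retrieve(query, chunks))
    return hits / len(eval_set)

def keyword_retrieve(query, chunks):
    # Toy retriever: pick the chunk sharing the most query words.
    # A real system would rank by embedding similarity instead.
    words = query.lower().split()
    return max(chunks, key=lambda c: sum(w in c.lower() for w in words))

chunks = ["Paris is the capital of France.",
          "Berlin is the capital of Germany."]
eval_set = [("capital of France", "Paris"),
            ("capital of Germany", "Berlin")]
hit_rate(chunks, eval_set, keyword_retrieve)  # → 1.0
```

To A/B test, hold `eval_set` and `retrieve` fixed and vary only how `chunks` was produced; the strategy with the higher hit rate wins on that data.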

Key Takeaway

Start with recursive character splitting at around 500 tokens with 50-token overlap (the earlier example uses the same numbers, measured in characters). This baseline works well for most use cases. Only invest in more sophisticated chunking strategies when evaluation shows that simple chunking is the bottleneck in your retrieval quality.