The Definitive Guide to Chunking Strategies for LLMs and RAG Systems

I fed an entire 80-page PDF to a language model. It hallucinated three facts, missed the key clause on page 47, and burned $2.30 in API costs. The model wasn't the problem. My chunking strategy was.

Sound familiar? I spent months blaming my prompts, my model choice, even my vector database before I realized the real bottleneck was one step earlier: how I was splitting my documents before they reached the model.

Here's what I learned, and what actually worked.

My Wake Up Call

I was building a RAG pipeline for a legal document assistant. The embeddings looked great. Retrieval hit rate was solid. But the generated answers kept skipping context, missing cross references, or, worst of all, inventing details with total confidence.

After a week of tuning prompts and swapping models, I finally added logging around what was actually being retrieved. The retrieved chunks were split mid-sentence. Key definitions were cut off from the clauses that used them. Section headers were in one chunk; their content was in a completely different one.

I wasn't feeding the model context. I was feeding it fragments.

That's when I built a real chunking system. Here's what changed everything.

What Chunking Actually Is (And Why It Matters)

Chunking means splitting large documents into smaller segments before embedding them into a vector store or passing them as context to a model. Context windows are finite and expensive. Even at 200K tokens, you don't want to dump an entire corpus into every query.

The goal is to surface the most relevant passage, not the whole document. The problem is that a poorly designed chunk can sever a sentence mid-clause, separate a concept from its definition, or strip the heading that gives a paragraph meaning. These small failures cascade directly into hallucinations and retrieval misses.

There are two failure modes to understand:

Context fragmentation: The chunk is too small or split at the wrong boundary, so the retrieved piece answers only part of the question.

Context dilution: The chunk is too large, so the right answer is buried in thousands of irrelevant tokens, degrading generation quality and burning budget.

The ideal chunk: a semantically complete unit. Large enough to carry full meaning, small enough to be precise.

The 5 Chunking Strategies That Changed Everything

1. Fixed Size Chunking: The Baseline You Need to Beat

Split every N characters or tokens, no questions asked. It's fast, predictable, and almost always wrong for anything that matters. Use this only for preprocessing unstructured logs or raw numeric data where sentence integrity doesn't matter. The version below splits on characters; pass it a token list instead if you need token-accurate sizes.

def fixed_chunk(text: str, size: int = 512) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

The trick: Treat this as your worst case baseline. Measure its performance. Then beat it with every other strategy on this list.
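
One cheap way to put a number on the baseline, before running any retrieval test, is to count how many chunks end mid-sentence. The sketch below is only a heuristic under stated assumptions: `broken_boundary_rate` is a hypothetical helper (not part of any library), and it treats a chunk as "clean" when its last non-space character is `.`, `!`, or `?`, which abbreviations, quotes, and lists will fool.

```python
def fixed_chunk(text: str, size: int = 512) -> list[str]:
    """Same fixed-size splitter as above, repeated so this block runs on its own."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def broken_boundary_rate(chunks: list[str]) -> float:
    """Fraction of chunks that do not end on sentence punctuation.

    Crude heuristic: only checks the last non-space character.
    """
    if not chunks:
        return 0.0
    broken = sum(1 for c in chunks if not c.rstrip().endswith((".", "!", "?")))
    return broken / len(chunks)


# Illustrative sample text, not real data.
sample = "The tenant pays rent monthly. Late fees apply after five days. " * 20
chunks = fixed_chunk(sample, size=100)
rate = broken_boundary_rate(chunks)
print(f"{rate:.0%} of {len(chunks)} chunks end mid-sentence")
```

Running the same check on the output of each later strategy gives a quick, retrieval-free signal of how much fragmentation you have removed.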

2. Sentence Boundary Chunking: The Obvious Upgrade

Split on natural language boundaries: sentences, then paragraphs. Group them up to a token limit. Instantly better than fixed size for almost any prose document. This is the default starting point for news articles, blog posts, and documentation.

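from nltk.tokenize import sent_tokenize  # requires the NLTK punkt tokenizer data

def sentence_chunk(text: str, max_tokens: int = 256) -> list[str]:
    """Group whole sentences into chunks of roughly max_tokens."""
    sentences = sent_tokenize(text)
    chunks, current, length = [], [], 0

    for s in sentences:
        n = len(s.split())  # word count as a rough proxy; use a real tokenizer for precision
        # Close the current chunk before it would exceed the limit.
        if length + n > max_tokens and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(s)
        length += n

    # Flush the final partial chunk.
    if current:
        chunks.append(" ".join(current))
    return chunks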

The trick: Replace len(s.split()) with a real tokenizer so your counts match the model you embed with: tiktoken for OpenAI models, or the embedding model's own tokenizer for everything else.
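
To make the tokenizer swappable, the counting function can be passed in as an argument. This is a minimal sketch, assuming the sentences are already split (for example with sent_tokenize as above). The names `pack_sentences`, `count_tokens`, and `word_count` are illustrative, and the whitespace counter is only a stand-in for a real tokenizer.

```python
from typing import Callable


def word_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_sentences(
    sentences: list[str],
    max_tokens: int = 256,
    count_tokens: Callable[[str], int] = word_count,
) -> list[str]:
    """Greedily group sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# With tiktoken installed, a real counter could be passed instead, e.g.:
#   enc = tiktoken.get_encoding("cl100k_base")
#   pack_sentences(sentences, 256, lambda s: len(enc.encode(s)))
sentences = ["One two three.", "Four five.", "Six seven eight nine."]
packed = pack_sentences(sentences, max_tokens=5)
print(packed)
```

Keeping the counter as a parameter means the same packing logic can be reused unchanged when you switch embedding models.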

3. Overlapping (Sliding Window) Chunks: The Reliability Fix

This one changed my retrieval quality overnight. Each chunk shares N tokens with the previous one, so information at a boundary isn't lost. I add overlap defensively to almost every pipeline now, even when using sentence boundary chunking.

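def overlapping_chunk(
    tokens: list,  # token IDs (ints from tiktoken) or token strings
    chunk_size: int = 512,
    overlap: int = 64
) -> list[list]:
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for i in range(0, len(tokens), stride):
        chunks.append(tokens[i:i + chunk_size])
        if i + chunk_size >= len(tokens):
            break  # this window reached the end; another would only repeat the overlap
    return chunks

# With tiktoken for precise token counting:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode(document_text)  # document_text: the raw string to chunk
chunked_tokens = overlapping_chunk(tokens, chunk_size=512, overlap=64)
text_chunks = [enc.decode(c) for c in chunked_tokens]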

The trick: Set overlap to 10–15% of your chunk size; 64 tokens on a 512-token chunk (12.5%) is a reliable default. More overlap than that creates retrieval noise without proportional accuracy gains.
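
Overlap can also be applied at the sentence level, so the shared text is made of whole sentences rather than a raw token window. The sketch below is one possible way to combine the two ideas, not a standard recipe: `overlap_sentences` is a hypothetical parameter, word counts stand in for tokens, and the carried sentences are dropped when they would not leave room for the next sentence.

```python
def sentence_chunk_with_overlap(
    sentences: list[str],
    max_tokens: int = 256,
    overlap_sentences: int = 1,
) -> list[str]:
    """Pack sentences into chunks, repeating the last few in the next chunk."""

    def n_tokens(text: str) -> int:
        return len(text.split())  # rough proxy; swap in a real tokenizer

    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sentence in sentences:
        n = n_tokens(sentence)
        if current and length + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            tail = current[-overlap_sentences:] if overlap_sentences > 0 else []
            tail_len = sum(n_tokens(t) for t in tail)
            if tail_len + n > max_tokens:
                tail, tail_len = [], 0  # no room: start the next chunk clean
            current, length = list(tail), tail_len
        current.append(sentence)
        length += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["A b c.", "D e f.", "G h i.", "J k l."]
overlapped = sentence_chunk_with_overlap(sentences, max_tokens=6, overlap_sentences=1)
print(overlapped)
```

Each chunk here starts with the last sentence of the previous one, so a fact that straddles a boundary appears intact in at least one chunk.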

4. Hierarchical Chunking: The Precision Upgrade for Structured Documents

For documents with clear sections (legal contracts, technical manuals, research papers), create at least two levels: a coarse summary or parent chunk per section and fine-grained child chunks within it. Retrieval finds the right section first, then drills into the exact passage.

LlamaIndex's HierarchicalNodeParser and AutoMergingRetriever make this production-ready in about 20 lines.

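from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# documents: a list of LlamaIndex Document objects loaded elsewhere

# Build coarse → fine chunk levels
parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]
)
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Keep every level in the docstore so parents can be looked up later,
# but embed and index only the leaf nodes
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# AutoMergingRetriever merges child nodes back into parents at retrieval time
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=12),
    storage_context,
    verbose=True
)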

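The trick: If you build the two-stage version yourself, embed the section summaries for coarse retrieval. They're your index: compact, signal-rich, and cheap to search. Only expand to child chunks once the right section is identified. Note that AutoMergingRetriever works bottom-up instead: it searches the leaf chunks and swaps in a parent chunk when enough of its children are retrieved.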
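
For a hand-rolled version of the summary-first idea, the two stages can be sketched without any framework. Everything here is illustrative: `Section`, `score`, and `two_stage_retrieve` are hypothetical names, the sample sections are invented, and word-overlap scoring is a toy stand-in for embedding similarity.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    summary: str
    children: list[str] = field(default_factory=list)


def score(query: str, text: str) -> float:
    """Jaccard word overlap; a toy stand-in for cosine similarity of embeddings."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def two_stage_retrieve(
    query: str,
    sections: list[Section],
    top_sections: int = 1,
    top_chunks: int = 2,
) -> list[tuple[str, str]]:
    # Stage 1: rank sections by how well their summary matches the query.
    ranked = sorted(sections, key=lambda s: score(query, s.summary), reverse=True)
    # Stage 2: rank only the child chunks of the winning sections.
    candidates = [(s.title, c) for s in ranked[:top_sections] for c in s.children]
    candidates.sort(key=lambda pair: score(query, pair[1]), reverse=True)
    return candidates[:top_chunks]


sections = [
    Section(
        "Termination",
        "termination notice period and early exit fees",
        ["Either party may terminate with 30 days notice.",
         "Early termination incurs a fee of one month rent."],
    ),
    Section(
        "Payment",
        "rent amount due dates and late payment fees",
        ["Rent is due on the first of each month.",
         "Late payment incurs a fee after five days."],
    ),
]
result = two_stage_retrieve("how many days notice to terminate", sections, top_chunks=1)
print(result)
```

The design point is that stage 2 never sees chunks from sections that lost stage 1, which is what keeps the fine-grained search small and focused.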

5. Semantic Chunking: Let the Embeddings Decide the Splits

The most powerful strategy for mixed-topic or user-generated content. Embed every sentence, then split wherever cosine similarity drops between adjacent sentences. That drop signals a topic boundary and a natural split point.

Variable chunk sizes are the tradeoff, but on mixed-topic content the quality is often better than with the rule-based strategies.

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunk(
    sentences: list[str],
    model_name: str = "all-MiniLM-L6-v2",
    threshold: float = 0.75
) -> list[str]:
    if not sentences:
        return []  # nothing to split; avoids indexing an empty list below

    model = SentenceTransformer(model_name)
    # Normalized embeddings make the dot product equal to cosine similarity.
    embs = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]

    for i in range(1, len(sentences)):
        sim = float(np.dot(embs[i - 1], embs[i]))
        if sim >= threshold:
            current.append(sentences[i])  # same topic: keep growing the chunk
        else:
            chunks.append(" ".join(current))  # similarity dropped: start a new chunk
            current = [sentences[i]]

    chunks.append(" ".join(current))
    return chunks

The trick: Add a max chunk size guard. Semantic chunking can produce huge chunks when a topic runs long; cap at 1024 tokens to keep retrieval fast and generation focused.
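
The size guard can be folded into the same loop: start a new chunk either when similarity drops or when the next sentence would push the chunk past the cap. A minimal sketch, assuming the embeddings are precomputed and unit-normalized (so a dot product is the cosine similarity) and using word counts as a token proxy. `semantic_chunk_capped` is an illustrative name, and the toy 2-D vectors below are not real embeddings.

```python
def semantic_chunk_capped(
    sentences: list[str],
    embeddings: list[list[float]],
    threshold: float = 0.75,
    max_tokens: int = 1024,
) -> list[str]:
    """Split on similarity drops, and also whenever max_tokens would be exceeded."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []

    def n_tokens(text: str) -> int:
        return len(text.split())  # rough proxy; swap in a real tokenizer

    chunks: list[str] = []
    current = [sentences[0]]
    length = n_tokens(sentences[0])

    for i in range(1, len(sentences)):
        # Dot product of unit vectors equals cosine similarity.
        sim = sum(a * b for a, b in zip(embeddings[i - 1], embeddings[i]))
        n = n_tokens(sentences[i])
        if sim >= threshold and length + n <= max_tokens:
            current.append(sentences[i])
            length += n
        else:
            chunks.append(" ".join(current))
            current, length = [sentences[i]], n

    chunks.append(" ".join(current))
    return chunks


sents = ["a b", "c d", "e f", "g h"]
vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy unit vectors
capped = semantic_chunk_capped(sents, vecs, threshold=0.75, max_tokens=4)
print(capped)
```

In this toy run the first three sentences are on the same "topic", but the cap forces a split after two of them, and the topic change forces the last split.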

My Recent Win

Last month, I rebuilt a RAG pipeline for 300 internal policy PDFs. Old me would've done fixed size chunking and called it done. Instead:

Day 1: Audited the corpus for structure. Found clear section headers and obvious hierarchy. Decision: hierarchical chunking.
Day 2: Added 64 token overlap to every child chunk as a defensive measure. Set chunk sizes to 2048 / 512 / 128 for coarse, mid, and fine levels.
Day 3: Ran 50 test queries. Measured retrieval precision@3 and answer faithfulness using an LLM as judge. Faithfulness jumped from 61% to 84% versus the fixed size baseline.
Day 4: Added metadata to every chunk: source document, section title, page number, chunk index. Made debugging fast and citations production ready.
Day 5: Shipped. No hallucinated policy clauses. No missed cross references. Answer quality matched the SLA the client expected.

Done. No heroics. Just the right strategy applied to the right document structure.
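
The Day 4 metadata step can be as small as one record type per chunk. This is a sketch under assumptions: `ChunkRecord` and `attach_metadata` are hypothetical names, the file name and section title are made up, and giving every chunk in a section the same page number is a simplification.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkRecord:
    text: str
    source: str          # source document
    section_title: str
    page_number: int
    chunk_index: int     # position of the chunk within its section

    def citation(self) -> str:
        """Human-readable pointer back to where the chunk came from."""
        return (
            f'{self.source}, p. {self.page_number}, '
            f'"{self.section_title}" (chunk {self.chunk_index})'
        )


def attach_metadata(
    chunks: list[str], source: str, section_title: str, page_number: int
) -> list[ChunkRecord]:
    """Wrap raw chunk strings with the fields needed for debugging and citations."""
    return [
        ChunkRecord(
            text=text,
            source=source,
            section_title=section_title,
            page_number=page_number,
            chunk_index=i,
        )
        for i, text in enumerate(chunks)
    ]


records = attach_metadata(
    ["Employees accrue leave monthly.", "Unused leave carries over."],
    source="leave-policy.pdf",      # hypothetical file name
    section_title="Eligibility",    # hypothetical section
    page_number=3,
)
print(records[1].citation())
print(asdict(records[0]))  # plain dict, ready to store next to the vector
```

Storing the dict form alongside each embedding is what lets you trace a bad answer back to the exact chunk that produced it.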

Image: A developer reviewing a chunked document pipeline on screen. Caption: The right chunking strategy isn't the smartest one; it's the one you've actually measured on your real data.

The Trade-Offs in Plain Terms

Fixed Size Chunking

Accuracy: ⭐⭐
Token Efficiency: ⭐⭐⭐⭐⭐
Speed: ⭐⭐⭐⭐⭐
Complexity: ⭐
Best for: Preprocessing logs, unstructured raw data

Sentence/Paragraph Chunking

Accuracy: ⭐⭐⭐
Token Efficiency: ⭐⭐⭐⭐
Speed: ⭐⭐⭐⭐
Complexity: ⭐⭐
Best for: News articles, blog posts, documentation

Overlapping Chunks

Accuracy: ⭐⭐⭐⭐
Token Efficiency: ⭐⭐⭐
Speed: ⭐⭐⭐⭐
Complexity: ⭐⭐
Best for: Adding to any strategy as a defensive measure

Hierarchical Chunking

Accuracy: ⭐⭐⭐⭐⭐
Token Efficiency: ⭐⭐⭐⭐
Speed: ⭐⭐⭐
Complexity: ⭐⭐⭐⭐
Best for: Legal documents, technical manuals, structured PDFs

Semantic Chunking

Accuracy: ⭐⭐⭐⭐⭐
Token Efficiency: ⭐⭐⭐
Speed: ⭐⭐⭐
Complexity: ⭐⭐⭐
Best for: Mixed topic content, user generated content

Key Trade-Offs:

Overlap grows the index by a factor of about 1 / (1 - overlap ratio): roughly 11–18% more chunks at 10–15% overlap, or about 14% for 64 tokens on a 512-token chunk. Retrieval latency doesn't meaningfully change at typical scales.
Hierarchical chunking requires a two-stage retrieval pipeline, which is more to build and debug, but the precision payoff for structured documents justifies the complexity.
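
How much overlap inflates the index depends on the overlap ratio, and it follows directly from the stride (chunk size minus overlap). The sketch below simply counts the sliding windows needed to cover a corpus. `chunk_count` is an illustrative helper and the corpus size is a made-up round number.

```python
def chunk_count(n_tokens: int, chunk_size: int, overlap: int = 0) -> int:
    """Number of sliding windows needed to cover n_tokens tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    remaining = n_tokens - chunk_size
    # Integer ceiling division: extra windows after the first one.
    return 1 + -(-remaining // stride)


corpus_tokens = 1_000_000  # hypothetical corpus size
base = chunk_count(corpus_tokens, chunk_size=512)
with_overlap = chunk_count(corpus_tokens, chunk_size=512, overlap=64)
growth = with_overlap / base - 1
print(f"{base} -> {with_overlap} chunks ({growth:.1%} more)")
```

For 64 tokens of overlap on 512-token chunks this comes out at about 14% more chunks, and it climbs quickly as the overlap ratio rises, which is one reason to keep overlap modest.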

How to Pick Your Strategy

Pick one document type that's producing bad answers. Then answer these three questions:

Does your document have clear headings and sections?

Yes: Use hierarchical chunking.
No: Use sentence boundary chunking as your starting point.

What's your primary constraint?

Cost: Keep chunk sizes large (512–1024 tokens), skip overlap, and measure faithfulness so you know when you've gone too far.
Accuracy: Use hierarchical or semantic chunking. Budget for the index-time embedding cost.
Latency: Precompute overlapping chunks, cache them in your vector store, and never chunk at query time.

What are you measuring?

Retrieval precision@k is necessary but not sufficient.
Always measure end-to-end answer faithfulness, not just hit rate.
A pipeline that retrieves correctly but generates poorly still ships bad answers.
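
The two retrieval metrics mentioned here are easy to compute once you have labeled which chunks are relevant to each test query. A minimal sketch, with hypothetical chunk IDs and illustrative function names. Faithfulness needs a judge model and is not shown.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the top-k retrieved chunks that are actually relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / k


def hit_rate(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Share of queries with at least one relevant chunk in the top k."""
    if not runs:
        return 0.0
    hits = sum(
        1
        for retrieved, relevant in runs
        if any(chunk_id in relevant for chunk_id in retrieved[:k])
    )
    return hits / len(runs)


# Each run: (ranked chunk IDs returned for a query, IDs labeled relevant).
# The IDs are made up for illustration.
runs = [
    (["c7", "c2", "c9"], {"c2", "c4"}),
    (["c1", "c3", "c5"], {"c8"}),
]
p3 = precision_at_k(*runs[0], k=3)
hr = hit_rate(runs, k=3)
print(f"precision@3 for query 1: {p3:.2f}, hit rate@3: {hr:.2f}")
```

Note that the first query counts as a full "hit" even though two of its three retrieved chunks are irrelevant, which is why hit rate alone can look healthy while the model is still reading mostly noise.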

A practical starting point for most projects:

Chunk size: 256–512 tokens per chunk
Overlap: 10–15% of chunk size
Strategy: Sentence or paragraph boundaries
Always include: Metadata on every chunk

Common Mistakes (I Made All of Them)

Chunking before cleaning. Boilerplate, nav menus, and footer noise fragment into dozens of junk chunks that pollute retrieval. Solution: Clean your documents first, always.

Using character count as a token proxy. Characters and words are not tokens, and different tokenizers give different counts for the same string. Solution: Use the tokenizer your embedding model actually uses. Use tiktoken for OpenAI models or the model's own tokenizer for everything else.

Skipping metadata. A chunk without a source, section, and index is nearly impossible to debug at scale. Solution: Store metadata from day one. Include source document, section title, page number, and chunk index.

Evaluating only retrieval, not generation. Retrieval precision tells you the chunk was found, but doesn't guarantee the answer is correct. Solution: Measure end-to-end faithfulness using an LLM as judge, not just hit rate.

One strategy for all document types. Different document types require different chunking approaches. Solution: Route structured PDFs to hierarchical chunking, raw web text to semantic, and logs to fixed size chunking.
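
Routing by document type can start as a plain lookup table. The sketch below assumes the document type is already known (detecting it is a separate problem). The type labels, `pick_strategy`, and `chunk_document` are hypothetical names, and the chunkers are passed in by the caller so any of the strategies above can be plugged in.

```python
from typing import Callable

Chunker = Callable[[str], list[str]]

# Mapping taken from the routing rule above; the keys are illustrative labels.
STRATEGY_BY_TYPE = {
    "structured_pdf": "hierarchical",
    "web_text": "semantic",
    "log": "fixed_size",
}
DEFAULT_STRATEGY = "sentence_boundary"  # fallback for anything unrecognized


def pick_strategy(doc_type: str) -> str:
    """Return the chunking strategy name for a document type."""
    return STRATEGY_BY_TYPE.get(doc_type, DEFAULT_STRATEGY)


def chunk_document(text: str, doc_type: str, chunkers: dict[str, Chunker]) -> list[str]:
    """Dispatch to the chunker registered for this document type's strategy."""
    strategy = pick_strategy(doc_type)
    if strategy not in chunkers:
        raise KeyError(f"no chunker registered for strategy '{strategy}'")
    return chunkers[strategy](text)


# Tiny stand-in chunker, only to show the dispatch.
chunkers = {"fixed_size": lambda t: [t[i:i + 3] for i in range(0, len(t), 3)]}
routed = chunk_document("abcdef", "log", chunkers)
print(pick_strategy("structured_pdf"), pick_strategy("email"), routed)
```

Keeping the table separate from the chunkers makes it easy to change a routing decision after measuring, without touching the chunking code.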

The Truth About Chunk Size

There is no universal optimal. Most practitioners start at 256–512 tokens and tune empirically.

Here's the protocol:

Sample 50–100 representative queries from real users.
Test chunk sizes: 128, 256, 512, and 1024 tokens.
Measure retrieval precision@k and answer faithfulness at each size.
Pick the smallest chunk size that maintains acceptable faithfulness scores for your domain.

That's it. Don't guess. Measure.
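
The last step of the protocol is a one-liner once the measurements exist. A sketch, assuming you have already produced one faithfulness score per chunk size. `pick_chunk_size` is an illustrative name, and the scores below are placeholder values used only to exercise the function, not measurements.

```python
from typing import Optional


def pick_chunk_size(
    faithfulness_by_size: dict[int, float], min_faithfulness: float
) -> Optional[int]:
    """Smallest chunk size whose faithfulness meets the bar, or None if none do."""
    passing = [
        size
        for size, score in faithfulness_by_size.items()
        if score >= min_faithfulness
    ]
    return min(passing) if passing else None


# Placeholder scores, NOT real results: replace with your own measurements.
scores = {128: 0.58, 256: 0.74, 512: 0.83, 1024: 0.85}
chosen = pick_chunk_size(scores, min_faithfulness=0.80)
print(chosen)
```

Returning None instead of a best guess is deliberate: if no size clears your bar, the fix is probably a different strategy, not a different number.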

FAQs

Q: What is the optimal chunk size for RAG?

There isn't a universal optimal size. Most practitioners start at 256–512 tokens and tune empirically based on their data. The right size depends on your document type, query distribution, and answer faithfulness targets, not a general rule.

Q: How much overlap should I add between chunks?

10–15% of your chunk size is a reliable default. For a 512 token chunk, add 50–75 tokens of overlap. More overlap than that creates retrieval noise without proportional accuracy gains.

Q: Does better chunking actually reduce hallucinations?

Yes, often significantly. When retrieved context is fragmented or irrelevant, models fill the gap by generating plausible-sounding but unsupported content. Coherent chunks give the model complete information to work from, which reduces confabulation.

Q: Can I use different strategies for different document types in the same pipeline?

Not only can you, you should. A production system typically routes documents based on type: structured PDFs with headings go to hierarchical chunking, raw web text to semantic, tabular exports to fixed size. One size fits all is a false economy.

Q: Should I chunk before or after embedding?

Always chunk first, then embed each chunk independently. Embedding an unchunked document produces a single vector that captures average meaning, which is too coarse for precise retrieval.

Key Takeaways

Pick one document type in your pipeline that's producing bad answers. Just one.

Implement sentence boundary chunking today.
Add 64 token overlap.
Run your test queries and measure faithfulness.
Don't architect the perfect system on day one. Iterate based on real data.

Because the best chunking strategy is the one you've actually measured on your real data.

Thanks for reading! Until next time, stay curious. ~ Vansh Garg