22.1 Chunking strategies that preserve meaning


Goal: chunk without destroying meaning

Chunking is how you turn a long document into pieces that can be:

  • retrieved later,
  • cited reliably,
  • summarized with less hallucination,
  • kept within a context budget.

The goal is not “split every N characters.” The goal is: each chunk should be understandable and useful on its own.

Principles of good chunks

  • Semantic coherence: keep one idea or subtopic per chunk.
  • Self-contained context: include definitions and prerequisites when possible.
  • Stable boundaries: chunk boundaries should not change wildly with minor doc edits.
  • Retrieval friendliness: include terms users will query for (headings help).
  • Minimal redundancy: overlap helps, but too much overlap wastes budget and confuses answers.

Avoid “character-count chunking” as a default

Splitting by raw length often cuts definitions from their usage, tables from their explanations, and policies from their exceptions.
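
A minimal illustration of the failure mode (plain Python; the sample text and the 80-character limit are made up for demonstration):

# hypothetical policy text: definition, rule, and exception in one passage
text = (
    "An API key is a secret token that identifies the caller. "
    "Rotate API keys every 90 days, except for service accounts, "
    "which follow a separate schedule."
)

# naive character-count chunking: split every 80 characters
naive_chunks = [text[i:i + 80] for i in range(0, len(text), 80)]
for chunk in naive_chunks:
    print(repr(chunk))
# The definition is severed from the rule that uses it, and the rule is
# severed from its exception, which is exactly the failure described above.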

Chunking methods you can actually use

1) Structure-first (headings/sections)

If your doc has headings, use them. This is the highest-leverage method (a minimal sketch follows the list):

  • split by h2/h3 sections (or markdown headings),
  • keep the heading path as metadata (e.g., “Security > Secrets > Rotation”),
  • merge tiny sections with a neighbor; split huge sections by paragraphs.
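
A sketch of the structure-first approach, assuming markdown input with #-style headings; the merge threshold and function names are illustrative, not a fixed recipe:

import re

def chunk_by_headings(markdown: str, min_chars: int = 200):
    """Split on h2/h3 headings; keep the heading path as metadata.

    Sections shorter than min_chars are merged into the previous chunk.
    Oversized sections can be split further by paragraphs downstream.
    """
    chunks = []
    title_path = []   # e.g. ["Security", "Secrets"]
    body_lines = []

    def flush():
        text = "\n".join(body_lines).strip()
        if not text:
            return
        if chunks and len(text) < min_chars:
            chunks[-1]["text"] += "\n\n" + text   # merge tiny section with neighbor
        else:
            chunks.append({"title_path": list(title_path), "text": text})

    for line in markdown.splitlines():
        match = re.match(r"^(#{2,3})\s+(.*)", line)   # h2/h3 headings only
        if match:
            flush()
            body_lines = []
            level = len(match.group(1))               # 2 or 3
            title_path = title_path[: level - 2] + [match.group(2).strip()]
        else:
            body_lines.append(line)
    flush()
    return chunks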

2) Sliding window with overlap (for messy text)

For logs, transcripts, or unstructured text, use a sliding window (a sketch follows the list):

  • choose a target chunk size,
  • include an overlap (10–20%) to preserve continuity,
  • preserve time ranges (for logs/transcripts) as metadata.
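
A sketch of the window logic. Sizes are in characters here, and the 15% overlap sits in the 10–20% range mentioned above; for logs or transcripts you would carry time ranges instead of character offsets:

def sliding_window(text: str, size: int = 1000, overlap_frac: float = 0.15):
    """Yield overlapping windows with start/end offsets as metadata."""
    step = max(1, int(size * (1 - overlap_frac)))   # advance 85% per window
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        yield {"start": start, "end": end, "text": text[start:end]}
        if end == len(text):
            break   # stop once the final window reaches the end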

3) Semantic chunking (split by topic shifts)

Use a model-assisted pass to find topic shifts:

  • ask the model to label paragraphs by topic,
  • group adjacent paragraphs with the same label,
  • use those groups as chunks.

This works well for long prose but requires careful validation, because the model can mislabel boundaries; the grouping step itself is deterministic once labels exist (see the sketch below).
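
A sketch of that grouping step, assuming a prior model pass has produced one label per paragraph (the sample labels below are made up):

from itertools import groupby

def group_by_topic(paragraphs, labels):
    """Merge adjacent paragraphs that share a model-assigned topic label."""
    pairs = zip(labels, paragraphs)
    return [
        {"topic": label, "text": "\n\n".join(p for _, p in run)}
        for label, run in groupby(pairs, key=lambda pair: pair[0])
    ]

paras = ["Rotate keys every 90 days.", "Service accounts are exempt.", "Use JSON logs."]
labels = ["rotation", "rotation", "logging"]   # spot-check these; models mislabel
print(group_by_topic(paras, labels))           # two chunks, not three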

Metadata: the difference between “text” and “knowledge”

Without metadata, you can’t ground answers. At minimum, store:

  • doc_id: which document this came from.
  • chunk_id: a stable id (don’t rely on array index alone; see the hashing sketch after the example record).
  • title_path: heading hierarchy (if available).
  • source location: page number, paragraph index, or byte range.
  • timestamp/version: doc version or last updated time.

Example chunk record (conceptual):

{
  "doc_id": "policy-security-v3",
  "chunk_id": "3.2-secrets-rotation",
  "title_path": ["Security", "Secrets", "Rotation"],
  "source": { "page": 12, "start_paragraph": 4, "end_paragraph": 9 },
  "text": "..."
}
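
One way to make chunk_id stable is to derive it from the heading path plus a hash of the chunk’s own text, so ids survive reordering and edits elsewhere in the document. A sketch; the id format is an assumption, not a standard:

import hashlib

def stable_chunk_id(doc_id: str, title_path: list[str], text: str) -> str:
    """Id changes only when this chunk's text changes, unlike an array index."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    slug = "-".join(part.lower().replace(" ", "-") for part in title_path)
    return f"{doc_id}:{slug}:{digest}"

print(stable_chunk_id("policy-security-v3",
                      ["Security", "Secrets", "Rotation"],
                      "Rotate keys every 90 days."))
# policy-security-v3:security-secrets-rotation:<12 hex chars>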

How to evaluate chunk quality

Use a quick checklist:

  • Can a human understand it alone? If not, it’s missing context.
  • Does it contain both rules and exceptions? Policy chunks should include caveats.
  • Does it include keywords a user would search for? Headings matter.
  • Is it too large? Huge chunks reduce retrieval precision and increase cost.
  • Is it too small? Tiny chunks lose meaning and increase retrieval noise.

Then test with a few representative queries: can you retrieve the chunk you expect?
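
A tiny smoke test makes that check repeatable. The keyword-overlap scorer below is a stand-in; swap in your real retriever (embedding search follows the same pattern):

def retrieval_smoke_test(chunks, expected):
    """chunks: [{"chunk_id": ..., "text": ...}]; expected: {query: chunk_id}.

    Returns the queries whose top-scoring chunk was not the expected one.
    """
    failures = []
    for query, expected_id in expected.items():
        terms = set(query.lower().split())
        best = max(chunks, key=lambda c: len(terms & set(c["text"].lower().split())))
        if best["chunk_id"] != expected_id:
            failures.append((query, expected_id, best["chunk_id"]))
    return failures   # empty means every query retrieved the expected chunk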

Copy-paste prompts

Prompt: propose chunk boundaries for a document

I have a long document. I want to chunk it for retrieval.

Requirements:
- Chunks should be semantically coherent and understandable on their own.
- Prefer splitting on headings. If headings are missing, split on topic shifts.
- Output chunk boundaries with stable ids.

Return JSON:
{
  "chunks": [{
    "chunk_id": string,
    "title_path": string[],
    "start_hint": string,
    "end_hint": string,
    "notes": string
  }]
}

Ask clarifying questions if needed (doc type, average length, intended queries).
