22.5 Long-context performance and cost tradeoffs
Goal: faster, cheaper, more reliable long-context work
Long context can make answers better, but it can also make your system slow and expensive.
Your job is to decide how much context to include, where to compress, and when to retrieve instead of paste.
A simple mental model for cost and latency
Most long-context costs come from two facts:
- You pay for what you send (input tokens) and what you get back (output tokens).
- Large prompts increase latency (more text to process and more opportunities for the model to get distracted).
Even if the price is acceptable, latency can kill UX. So “context budgeting” is both a cost and a product decision.
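This two-fact model can be turned into a quick back-of-envelope estimator. A minimal sketch, assuming placeholder prices and throughput figures (the `price_*` and `*_tps` defaults below are illustrative, not any provider's real rates):

```python
# Sketch of a context-budget estimator. Prices and tokens/sec figures
# are placeholders; plug in your provider's actual numbers.

def estimate_request(input_tokens: int, output_tokens: int,
                     price_in_per_1k: float = 0.003,   # placeholder $/1k input tokens
                     price_out_per_1k: float = 0.015,  # placeholder $/1k output tokens
                     prefill_tps: float = 5000.0,      # placeholder prompt-processing rate
                     decode_tps: float = 50.0) -> dict:
    """Rough cost and latency for one request under the two-fact model:
    you pay for what you send and what you get back, and big prompts are slow."""
    cost = (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k
    latency_s = input_tokens / prefill_tps + output_tokens / decode_tps
    return {"cost_usd": round(cost, 4), "latency_s": round(latency_s, 2)}

# Compare pasting a whole document vs. retrieving a few chunks:
full_doc = estimate_request(input_tokens=120_000, output_tokens=800)
retrieved = estimate_request(input_tokens=6_000, output_tokens=800)
```

Running both scenarios side by side makes the budgeting decision concrete: the retrieval path is cheaper and faster on every request, which compounds quickly for repeated workloads.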
When long context is worth it
Prefer long context when:
- the answer depends on understanding a long argument (policy, legal, reasoning),
- you need to preserve nuance across multiple sections,
- retrieval misses critical constraints and you need a fuller view of the document,
- you are doing a one-off analysis and can tolerate some cost.
Prefer retrieval when:
- users ask many questions over the same corpus,
- most questions only need a few sections,
- you need consistent grounding and citations,
- you care about predictable cost per request.
How to reduce context safely
Reduction techniques that preserve correctness:
- Chunk + retrieve: include only the most relevant chunks.
- Hierarchical summarization: summarize sections, then summarize summaries.
- Extract constraints first: pull out rules/definitions, then answer using those.
- Cache stable context: keep a “doc summary” or “policy constraints” artifact.
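The hierarchical-summarization technique above can be sketched in a few lines. Here `summarize` is a stub standing in for a real model call, so the control flow is runnable on its own:

```python
# Sketch of hierarchical summarization: summarize sections, then
# summarize batches of summaries until one summary remains.

def summarize(text: str, max_chars: int = 200) -> str:
    # Placeholder: a real implementation would call your model here,
    # with instructions to preserve rules, exceptions, and citations.
    return text[:max_chars]

def hierarchical_summary(sections: list[str], batch: int = 4) -> str:
    """Collapse section summaries level by level into one top summary."""
    level = [summarize(s) for s in sections]
    while len(level) > 1:
        level = [summarize("\n".join(level[i:i + batch]))
                 for i in range(0, len(level), batch)]
    return level[0]
```

The batch size bounds how much text any single summarization call sees, which is the point: no step ever needs the full document in context at once.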
Reduction techniques that often destroy correctness:
- summarizing without citations,
- compressing by removing definitions/exceptions,
- asking the model to “remember the document” across sessions without re-providing the sources.
For many doc sets, the key information is a small set of rules and exceptions. Extract that into a versioned artifact and reuse it.
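One way to make that artifact versioned is to hash its contents, so every answer can record exactly which rule set it used. A minimal sketch; the field names and the sample refund policy are illustrative, not a standard schema:

```python
# A minimal "extracted constraints" artifact. A content hash doubles as
# the version id: identical constraints always get the same id.
import hashlib
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ConstraintArtifact:
    source_doc: str
    rules: list[str]
    exceptions: list[str]
    definitions: dict[str, str] = field(default_factory=dict)

    def version(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

# Hypothetical example content:
artifact = ConstraintArtifact(
    source_doc="refund_policy_v3.pdf",
    rules=["Refunds allowed within 30 days of purchase."],
    exceptions=["Digital goods are non-refundable once downloaded."],
    definitions={"purchase date": "Date on the order confirmation."},
)
```

Because the version id is derived from the content, re-extracting an unchanged document yields the same id, and any edit to a rule or exception produces a new one, making stale-artifact bugs easy to detect.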
The context budget worksheet
When a long-context workflow feels slow or expensive, answer these:
- What is the minimum information needed to answer?
- How often will this run? (one-off vs repeated)
- What can be cached? (chunk index, doc summaries, extracted constraints)
- What can be retrieved? (relevant chunks only)
- What must be exact? (policy language, numbers, legal terms)
Then pick an approach:
- Prototype: paste a few chunks + citations.
- Scale: retrieval + citations + eval harness.
- Hybrid: retrieval + “long context on demand” for hard questions.
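The hybrid option can be expressed as a small router: answer from retrieved chunks by default, and escalate to full long context only when retrieval looks insufficient. A sketch, assuming your retriever reports per-chunk relevance scores (the threshold values are placeholders to tune):

```python
# Sketch of "long context on demand": route to cheap retrieval when enough
# relevant chunks exist, otherwise fall back to the full document.

def route(question: str, chunk_scores: list[float],
          min_score: float = 0.5, min_hits: int = 2) -> str:
    """Return 'retrieval' if at least `min_hits` chunks clear `min_score`,
    else 'long_context'."""
    hits = [s for s in chunk_scores if s >= min_score]
    return "retrieval" if len(hits) >= min_hits else "long_context"

# A narrow factual question with strong chunk matches stays on retrieval;
# a broad question with weak matches escalates to the full document.
route("What is the refund window?", [0.9, 0.7, 0.2])
route("Summarize the whole policy's argument", [0.3, 0.2])
```

The escalation path keeps the common case cheap and predictable while reserving the expensive full-document prompt for the questions that genuinely need it.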
Copy-paste prompts
Prompt: compress without losing constraints
I have a long document. Create a compressed version for repeated Q&A.
Rules:
- Preserve definitions, rules, and exceptions.
- Preserve any numbers, limits, and thresholds exactly.
- Output a structured summary with headings.
- Include citations to chunk ids for every section.
Return:
1) "constraints_summary"
2) "open_questions" (things not stated clearly)