29.4 Caching strategies (prompt+context caching)
Goal: reduce cost and latency safely
LLM apps are often limited by cost and latency, and caching is the highest-leverage tool for improving both.
But careless caching can break correctness and privacy.
The goal: cache what is safe and stable, and version your cache keys so you never serve stale or cross-tenant answers.
What you can cache (and what you should not)
High-value caching targets:
- Embeddings: chunk embeddings and query embeddings (see the sketch after this list).
- Retrieval results: top-k chunk ids for frequent queries (with filters/versioning).
- Reranking results: selected chunk ids for frequent queries.
- Prompt templates: rendered system prompts or “house rules.”
- Final answers: for repeated identical queries in the same context (careful!).
- Derived artifacts: doc summaries, constraint extracts, chunk indexes.
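Embeddings are the easiest win because they are deterministic for a fixed model: the same text and the same model version always produce the same vector. A minimal sketch, assuming a hypothetical `embed_fn` and an in-memory dict as the store:

```python
import hashlib

def embedding_key(text: str, model_version: str) -> tuple[str, str]:
    # Embeddings are deterministic for a fixed (model, text) pair, so a
    # content hash plus the embedding model version is a safe cache key.
    return (model_version, hashlib.sha256(text.encode("utf-8")).hexdigest())

def cached_embed(cache: dict, embed_fn, text: str, model_version: str):
    # embed_fn is a hypothetical call into your embedding provider.
    key = embedding_key(text, model_version)
    if key not in cache:
        cache[key] = embed_fn(text)
    return cache[key]
```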
Things you usually should not cache broadly:
- Answers containing sensitive information unless you have strong access control and isolation.
- Cross-tenant cached outputs (high leakage risk).
- Outputs without version keys (stale answers silently ship).
If you cache incorrectly, you can leak one user’s content to another. Treat cache design as a security problem, not just a performance trick.
Cache keys and correctness (version everything)
The cache key must include everything that changes the answer:
- User context: tenant, role, permissions filters.
- Prompt version: prompt template id/version.
- Model version: model name and settings that affect output.
- Corpus version: doc hash, index version, embedding version.
- Retrieval parameters: top-k, filters, reranker settings.
If you don’t include versions, you will serve stale outputs after updates.
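A minimal sketch of such a key, with hypothetical field names; the essential move is serializing every answer-affecting input deterministically and hashing the result:

```python
import hashlib
import json

def cache_key(*, tenant: str, role: str, filters: dict,
              prompt_version: str, model: str, model_params: dict,
              index_version: str, embedding_version: str,
              top_k: int, query: str) -> str:
    # json.dumps with sort_keys=True keeps the serialization stable
    # across runs, so identical inputs always hash to the same key.
    payload = {
        "tenant": tenant,                      # user context
        "role": role,
        "filters": filters,                    # permission filters
        "prompt_version": prompt_version,      # prompt template id/version
        "model": model,                        # model name
        "model_params": model_params,          # e.g. temperature, max tokens
        "index_version": index_version,        # corpus/index version
        "embedding_version": embedding_version,
        "top_k": top_k,                        # retrieval parameters
        "query": query,
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```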
Privacy and multi-tenant safety
Rules that reduce risk:
- Partition caches by tenant: separate namespaces or separate stores.
- Never cache secrets: redact before caching or avoid caching those outputs.
- Cache ids, not content: store chunk ids and retrieval decisions, not full text, when possible.
- Encrypt at rest: if caching contains sensitive derived data.
- Log access to sensitive caches: treat it like any other sensitive store.
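One way to make cross-tenant reads structurally impossible is to let the tenant id select the store, rather than merely prefixing keys. An in-memory sketch of that isolation rule (a production version would likely use separate Redis key prefixes, databases, or stores):

```python
class TenantCache:
    """Per-tenant cache namespaces backed by separate dicts."""

    def __init__(self):
        self._stores: dict[str, dict[str, object]] = {}

    def _store(self, tenant: str) -> dict[str, object]:
        # The tenant id selects the store, so a read can never cross
        # tenants: there is no code path that queries another namespace.
        return self._stores.setdefault(tenant, {})

    def get(self, tenant: str, key: str):
        return self._store(tenant).get(key)

    def put(self, tenant: str, key: str, value: object):
        self._store(tenant)[key] = value
```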
TTL and invalidation strategies
Two main approaches:
- TTL-based: cache expires after N minutes/hours.
- Version-based: cache key includes version; new versions naturally miss cache.
Version-based invalidation is usually safer for correctness. TTLs are useful for cost control and for bounding memory growth.
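The two approaches combine naturally: bake the version into the key so updates miss, and use a TTL to bound how long old entries linger. A sketch, assuming a single version string per entry:

```python
import time

class VersionedTTLCache:
    """Version-based invalidation plus a TTL for memory/cost control.

    The version is baked into the key, so bumping it after a re-index
    makes every old entry unreachable (a natural miss). The TTL only
    bounds how long unreachable or stale entries linger in memory.
    """

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._data: dict[str, tuple[float, object]] = {}

    def _key(self, base_key: str, version: str) -> str:
        return f"{version}:{base_key}"

    def get(self, base_key: str, version: str):
        entry = self._data.get(self._key(base_key, version))
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:  # TTL expired
            return None
        return value

    def put(self, base_key: str, version: str, value: object):
        self._data[self._key(base_key, version)] = (
            time.monotonic() + self.ttl, value)
```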
Caching patterns by pipeline stage
Retrieval caching
- Cache top-k chunk ids for frequent queries.
- Key includes: query, filters, top-k, index version, embedding version.
- Benefit: reduces vector DB load and latency.
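A sketch of this pattern, assuming a hypothetical `search_fn` that returns chunk ids. Only ids are stored, so chunk text is re-fetched (and re-authorized) at read time:

```python
import hashlib
import json

def retrieval_key(query: str, filters: dict, index_version: str,
                  embedding_version: str, top_k: int) -> str:
    blob = json.dumps(
        {"q": query, "f": filters, "iv": index_version,
         "ev": embedding_version, "k": top_k},
        sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def cached_retrieve(cache: dict, search_fn, *, query, filters,
                    index_version, embedding_version, top_k):
    # search_fn is a hypothetical retriever returning a list of chunk ids.
    key = retrieval_key(query, filters, index_version,
                        embedding_version, top_k)
    if key not in cache:
        cache[key] = search_fn(query=query, filters=filters, top_k=top_k)
    return cache[key]
```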
Generation caching
- Cache final answers for repeated identical prompts in identical contexts.
- Key includes: prompt version, sources/chunk ids, model version, user context.
- Benefit: large cost savings for repeated questions.
For grounded systems, keying the cache on the retrieved sources set is often safer than keying only on the question string: if the corpus changes, retrieval returns different chunk ids and the old answer naturally misses.
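A sketch of a sources-set key, with hypothetical fields. Sorting the chunk ids makes the key order-independent, so the same evidence set hits the same entry even if retrieval order differs:

```python
import hashlib
import json

def answer_key(question: str, chunk_ids: list[str], prompt_version: str,
               model: str, tenant: str) -> str:
    # Any corpus change that alters retrieval results changes chunk_ids,
    # which changes this key: stale answers miss instead of being served.
    blob = json.dumps(
        {"q": question, "src": sorted(chunk_ids), "pv": prompt_version,
         "m": model, "t": tenant},
        sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```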
Artifact caching
- Cache doc summaries, constraint extracts, chunk indexes.
- Invalidate when doc_hash changes.
- Benefit: reduces repeated long-context processing.
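A sketch of artifact caching keyed on a content hash, assuming a hypothetical `summarize_fn` standing in for your long-context summarizer:

```python
import hashlib

def doc_hash(doc_bytes: bytes) -> str:
    return hashlib.sha256(doc_bytes).hexdigest()

def cached_summary(cache: dict, summarize_fn, doc_bytes: bytes) -> str:
    # Keyed by content hash: when the document changes, its hash changes,
    # so the old summary is never served for the new content.
    key = ("summary", doc_hash(doc_bytes))
    if key not in cache:
        cache[key] = summarize_fn(doc_bytes)
    return cache[key]
```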