29.1 Timeouts, retries, and idempotency
Goal: keep the app responsive under model variability
Model calls can be slow, flaky, rate-limited, or temporarily unavailable.
Your goal is not “make the model never fail.” Your goals are:
- the app never hangs,
- failures are bounded and recoverable,
- retries don’t create duplicate side effects,
- users get a useful fallback when the model can’t respond.
Timeouts: what to time out (everything)
Every external step should have a timeout:
- Retrieval timeout: vector search, reranking, keyword search.
- Model timeout: completion call (including streaming).
- Tool/API timeout: any downstream API calls initiated by the app.
- End-to-end timeout: total time budget for the user request.
Practical rules:
- Set an end-to-end budget first: e.g., 8s for “interactive,” 30s for “analysis.”
- Allocate per-stage budgets: retrieval 1s, rerank 1s, model 5s, validation 0.2s (example).
- Reserve time for fallbacks: don’t spend 100% of the budget on retries.
A timeout that doesn’t cancel work still burns cost. Use cancellation signals where supported and ensure background work is stopped or ignored safely.
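The budgeting rules above can be sketched with asyncio, which cancels the awaited work on timeout rather than letting it run on. This is a minimal sketch: `retrieve` and `generate` are hypothetical stand-ins for your real retrieval and model calls, and the stage budgets (1s retrieval, 5s model, 0.5s reserved for fallback) are the example numbers from the rules above.

```python
import asyncio
import time

# Hypothetical stage functions; substitute your real retrieval and model calls.
async def retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.01)
    return [f"chunk for {query}"]

async def generate(query: str, chunks: list[str]) -> str:
    await asyncio.sleep(0.01)
    return f"answer using {len(chunks)} chunk(s)"

async def run_request(query: str, total_budget_s: float = 8.0) -> str:
    """Cap each stage by its own budget AND the remaining end-to-end budget.
    asyncio.wait_for cancels the stage's task if it overruns."""
    deadline = time.monotonic() + total_budget_s

    def remaining() -> float:
        return deadline - time.monotonic()

    # Retrieval: 1s stage budget, never more than the overall time left.
    chunks = await asyncio.wait_for(retrieve(query), timeout=min(1.0, remaining()))
    # Model: 5s stage budget; keep 0.5s in reserve for the fallback path.
    return await asyncio.wait_for(
        generate(query, chunks), timeout=min(5.0, remaining() - 0.5)
    )
```

Because `wait_for` cancels the underlying task, a stage that blows its budget stops consuming tokens or connections instead of finishing invisibly in the background.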
Retries: when they help and when they hurt
Retries are helpful when failures are transient:
- network hiccups,
- rate-limit responses (after waiting),
- temporary provider errors.
Retries are harmful when failures are persistent or logical:
- invalid requests,
- schema violations caused by prompt design,
- permissions problems,
- bad retrieval returning irrelevant chunks.
Practical retry rules:
- Retry only a small number of times: 1–2 is often enough.
- Use exponential backoff + jitter: avoid synchronized retry storms.
- Differentiate errors: retry on 429/5xx/timeouts, not on 4xx “bad request.”
- Retry with a modified strategy: fewer chunks, stricter schema reminder, or a smaller model.
Good retries change something: wait longer, reduce prompt size, switch model, or fall back. Repeating the same call often repeats the same failure.
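The retry rules above fit in a small helper. This is a sketch under assumptions: `TransientError` is a hypothetical exception carrying an HTTP-style status code from your client, and the retryable set follows the 429/5xx rule. Backoff uses “full jitter” (a random wait up to the exponential cap) to avoid synchronized retry storms.

```python
import random
import time

# Rate limits and transient server errors are worth retrying; 4xx are not.
RETRYABLE = {429, 500, 502, 503, 504}

class TransientError(Exception):
    """Hypothetical wrapper for a failed call, carrying its status code."""
    def __init__(self, status: int):
        super().__init__(f"status {status}")
        self.status = status

def call_with_retries(call, max_attempts: int = 3, base_delay: float = 0.5):
    """Retry only transient failures, with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError as err:
            if err.status not in RETRYABLE or attempt == max_attempts - 1:
                raise  # persistent error, or out of attempts: fail fast
            # Full jitter: sleep a random amount up to base * 2^attempt.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

To “retry with a modified strategy,” pass a different `call` closure on each attempt (fewer chunks, smaller model) instead of replaying the identical request.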
Idempotency: safe retries without duplicate side effects
Idempotency answers one question: “If we repeat this request, will its side effects happen twice?”
For pure generation (text output), retries are usually safe. For actions (tool calls, DB writes), retries can be dangerous.
Idempotency patterns:
- Idempotency key: attach a unique request id to downstream writes so duplicates are rejected.
- Read vs write split: allow retries for reads; require special handling for writes.
- Propose-then-execute: generate a proposal first, then require explicit confirmation before executing it.
- At-least-once vs exactly-once: design your system to tolerate duplicates when exactly-once is hard.
If you stream partial output to users, you need policies for mid-stream failure: restart from scratch, resume, or fall back to a summary. Decide this upfront.
Budgets: max retries, max tokens, max latency
Reliability requires hard limits:
- Max retries: bound how many attempts you make.
- Max tokens: cap output size and total token usage per request.
- Max context: cap number of retrieved chunks and total context included.
- Max time: end-to-end deadline that includes retries.
Budgets keep your system stable during outages and protect you from surprise bills.
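One way to keep these limits consistent is a single budget object threaded through every request. This is a sketch; the field names and the “interactive”/“analysis” presets (taken from the example numbers earlier in this section) are assumptions, not a prescribed API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestBudget:
    """Hard limits for one user request; frozen so stages can't loosen them."""
    max_retries: int = 2
    max_output_tokens: int = 1024
    max_chunks: int = 8          # cap on retrieved context
    max_latency_s: float = 8.0   # end-to-end deadline, including retries

INTERACTIVE = RequestBudget()
ANALYSIS = RequestBudget(max_output_tokens=4096, max_chunks=20, max_latency_s=30.0)
```

Passing one immutable budget down the call stack means an outage can exhaust a request's limits, but never the system's.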
Practical patterns (wrappers and policies)
Most teams end up with a wrapper like:
- compose prompt (with budgets),
- call model (with timeout),
- validate output (schema, citations),
- retry with stricter prompt or reduced context,
- fallback to not_found / needs_clarification / degraded response.
Make this wrapper consistent across the codebase. Reliability comes from standardization.
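The wrapper loop above can be sketched as one function. The hooks (`compose`, `call_model`, `validate`) are hypothetical injection points for your real prompt builder, model client, and schema/citation validator; the shape of the loop is the point, not the names.

```python
def answer(query, compose, call_model, validate, budget_retries: int = 2):
    """compose -> call -> validate, retrying with a stricter prompt,
    then degrading to a safe fallback instead of raising to the user."""
    strict = False
    for _ in range(budget_retries + 1):
        prompt = compose(query, strict=strict)  # strict=True reduces context,
        try:                                    # adds schema reminders, etc.
            output = call_model(prompt)
        except TimeoutError:
            strict = True   # timed out: retry with a cheaper, stricter prompt
            continue
        if validate(output):
            return {"status": "ok", "answer": output}
        strict = True       # validation failed: tighten and try again
    return {"status": "degraded", "answer": None}  # not_found / clarify path
```

Because every endpoint uses the same loop, the fallback statuses (`ok`, `degraded`) become a contract the rest of the app can rely on.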