29.1 Timeouts, retries, and idempotency
Goal: keep the app responsive under model variability
Model calls can be slow, flaky, rate-limited, or temporarily unavailable.
Your goal is not “make the model never fail.” Your goals are:
- the app never hangs,
- failures are bounded and recoverable,
- retries don’t create duplicate side effects,
- users get a useful fallback when the model can’t respond.
Timeouts: what to time out (everything)
Every external step should have a timeout:
- Retrieval timeout: vector search, reranking, keyword search.
- Model timeout: completion call (including streaming).
- Tool/API timeout: any downstream API calls initiated by the app.
- End-to-end timeout: total time budget for the user request.
Practical rules:
- Set an end-to-end budget first: e.g., 8s for “interactive,” 30s for “analysis.”
- Allocate per-stage budgets: retrieval 1s, rerank 1s, model 5s, validation 0.2s (example).
- Reserve time for fallbacks: don’t spend 100% of the budget on retries.
A timeout that doesn’t cancel work still burns cost. Use cancellation signals where supported and ensure background work is stopped or ignored safely.
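The budgeting rules above can be sketched with asyncio, which cancels the awaited work on timeout rather than letting it run on. This is a minimal sketch: `retrieve` and `generate` are hypothetical stand-ins for your real retrieval and model calls, and the stage budgets (1s retrieval, 5s model, 0.5s reserved for fallback) are the example numbers from the rules above.

```python
import asyncio
import time

# Hypothetical stage functions; substitute your real retrieval and model calls.
async def retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.01)
    return [f"chunk for {query}"]

async def generate(query: str, chunks: list[str]) -> str:
    await asyncio.sleep(0.01)
    return f"answer using {len(chunks)} chunk(s)"

async def run_request(query: str, total_budget_s: float = 8.0) -> str:
    """Cap each stage by its own budget AND the remaining end-to-end budget.
    asyncio.wait_for cancels the stage's task if it overruns."""
    deadline = time.monotonic() + total_budget_s

    def remaining() -> float:
        return deadline - time.monotonic()

    # Retrieval: 1s stage budget, never more than the overall time left.
    chunks = await asyncio.wait_for(retrieve(query), timeout=min(1.0, remaining()))
    # Model: 5s stage budget; keep 0.5s in reserve for the fallback path.
    return await asyncio.wait_for(
        generate(query, chunks), timeout=min(5.0, remaining() - 0.5)
    )
```

Because `wait_for` cancels the underlying task, a stage that blows its budget stops consuming tokens or connections instead of finishing invisibly in the background.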
Retries: when they help and when they hurt
Retries are helpful when failures are transient:
- network hiccups,
- rate-limit responses (after waiting),
- temporary provider errors.
Retries are harmful when failures are persistent or logical:
- invalid requests,
- schema violations caused by prompt design,
- permissions problems,
- bad retrieval returning irrelevant chunks.
Practical retry rules:
- Retry only a small number of times: 1–2 is often enough.
- Use exponential backoff + jitter: avoid synchronized retry storms.
- Differentiate errors: retry on 429/5xx/timeouts, not on 4xx “bad request.”
- Retry with a modified strategy: fewer chunks, stricter schema reminder, or a smaller model.
Good retries change something: wait longer, reduce prompt size, switch model, or fall back. Repeating the same call often repeats the same failure.
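The retry rules above fit in a small helper. This is a sketch under assumptions: `TransientError` is a hypothetical exception carrying an HTTP-style status code from your client, and the retryable set follows the 429/5xx rule. Backoff uses “full jitter” (a random wait up to the exponential cap) to avoid synchronized retry storms.

```python
import random
import time

# Rate limits and transient server errors are worth retrying; 4xx are not.
RETRYABLE = {429, 500, 502, 503, 504}

class TransientError(Exception):
    """Hypothetical wrapper for a failed call, carrying its status code."""
    def __init__(self, status: int):
        super().__init__(f"status {status}")
        self.status = status

def call_with_retries(call, max_attempts: int = 3, base_delay: float = 0.5):
    """Retry only transient failures, with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError as err:
            if err.status not in RETRYABLE or attempt == max_attempts - 1:
                raise  # persistent error, or out of attempts: fail fast
            # Full jitter: sleep a random amount up to base * 2^attempt.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

To “retry with a modified strategy,” pass a different `call` closure on each attempt (fewer chunks, smaller model) instead of replaying the identical request.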
Idempotency: safe retries without duplicate side effects
Idempotency answers one question: “If we repeat this request, will its side effects happen twice?”
For pure generation (text output), retries are usually safe. For actions (tool calls, DB writes), retries can be dangerous.
Idempotency patterns:
- Idempotency key: attach a unique request id to downstream writes so duplicates are rejected.
- Read vs write split: allow retries for reads; require special handling for writes.
- Propose-then-execute: generate a proposal first, then require explicit confirmation before executing it.
- At-least-once vs exactly-once: design your system to tolerate duplicates when exactly-once is hard.
If you stream partial output to users, you need policies for mid-stream failure: restart from scratch, resume, or fall back to a summary. Decide this upfront.
Budgets: max retries, max tokens, max latency
Reliability requires hard limits:
- Max retries: bound how many attempts you make.
- Max tokens: cap output size and total token usage per request.
- Max context: cap number of retrieved chunks and total context included.
- Max time: end-to-end deadline that includes retries.
Budgets keep your system stable during outages and protect you from surprise bills.
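One way to keep these limits consistent is a single budget object threaded through every request. This is a sketch; the field names and the “interactive”/“analysis” presets (taken from the example numbers earlier in this section) are assumptions, not a prescribed API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestBudget:
    """Hard limits for one user request; frozen so stages can't loosen them."""
    max_retries: int = 2
    max_output_tokens: int = 1024
    max_chunks: int = 8          # cap on retrieved context
    max_latency_s: float = 8.0   # end-to-end deadline, including retries

INTERACTIVE = RequestBudget()
ANALYSIS = RequestBudget(max_output_tokens=4096, max_chunks=20, max_latency_s=30.0)
```

Passing one immutable budget down the call stack means an outage can exhaust a request's limits, but never the system's.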
Practical patterns (wrappers and policies)
Most teams end up with a wrapper like:
- compose prompt (with budgets),
- call model (with timeout),
- validate output (schema, citations),
- retry with stricter prompt or reduced context,
- fallback to not_found / needs_clarification / degraded response.
Make this wrapper consistent across the codebase. Reliability comes from standardization.
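The wrapper loop above can be sketched as one function. The hooks (`compose`, `call_model`, `validate`) are hypothetical injection points for your real prompt builder, model client, and schema/citation validator; the shape of the loop is the point, not the names.

```python
def answer(query, compose, call_model, validate, budget_retries: int = 2):
    """compose -> call -> validate, retrying with a stricter prompt,
    then degrading to a safe fallback instead of raising to the user."""
    strict = False
    for _ in range(budget_retries + 1):
        prompt = compose(query, strict=strict)  # strict=True reduces context,
        try:                                    # adds schema reminders, etc.
            output = call_model(prompt)
        except TimeoutError:
            strict = True   # timed out: retry with a cheaper, stricter prompt
            continue
        if validate(output):
            return {"status": "ok", "answer": output}
        strict = True       # validation failed: tighten and try again
    return {"status": "degraded", "answer": None}  # not_found / clarify path
```

Because every endpoint uses the same loop, the fallback statuses (`ok`, `degraded`) become a contract the rest of the app can rely on.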