16.4 Tool error handling (when APIs fail)
Goal: tools fail safely and predictably
Tools will fail because networks fail, APIs rate limit, credentials expire, and inputs are messy. Your goal is to make tool failure:
- categorized: retryable vs non-retryable
- bounded: no infinite retries
- auditable: you can see what was attempted
- user-friendly: clear recovery paths
If tools fail unpredictably, your app will feel flaky. With a clear taxonomy and policies, tool failures become manageable.
Tool error taxonomy
Use a simple taxonomy with explicit retryability:
- invalid_input: schema validation failed (non-retryable)
- not_found: requested resource doesn’t exist (non-retryable)
- auth: missing/expired credentials (non-retryable until fixed)
- rate_limit: throttled (retryable with backoff)
- timeout: tool took too long (sometimes retryable)
- transient: network blips (retryable)
- unknown: unexpected failures (retry at most once)
Make “retryable” explicit in tool responses so your system doesn’t guess.
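The taxonomy above can be sketched as a small lookup plus an error type that carries its own retryability. The names here (`RETRYABILITY`, `ToolError`) are illustrative, not a fixed API:

```python
from dataclasses import dataclass

# Each category carries an explicit retryability flag so callers never guess.
RETRYABILITY = {
    "invalid_input": False,
    "not_found": False,
    "auth": False,        # non-retryable until credentials are fixed
    "rate_limit": True,   # retry with backoff
    "timeout": True,      # sometimes retryable; retry cautiously
    "transient": True,
    "unknown": True,      # retry at most once
}

@dataclass
class ToolError(Exception):
    category: str
    message: str

    @property
    def retryable(self) -> bool:
        # Categories outside the taxonomy default to non-retryable: fail safe.
        return RETRYABILITY.get(self.category, False)
```

Defaulting unknown categories to non-retryable is the conservative choice: a retry you skip is cheaper than a retry that duplicates work.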
Retry rules (read vs write tools)
Read-only tools
Read tools can usually be retried safely: cap the attempt count and back off between attempts.
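A minimal sketch of that policy, assuming a `ToolError` exception that carries an explicit `retryable` flag (both names are illustrative):

```python
import random
import time

class ToolError(Exception):
    """Illustrative: a tool failure carrying an explicit retryable flag."""
    def __init__(self, category: str, retryable: bool):
        super().__init__(category)
        self.retryable = retryable

def call_with_retry(tool, *, max_attempts: int = 3, base_delay: float = 0.5):
    """Call a READ-ONLY tool; retry retryable failures with capped
    exponential backoff plus full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ToolError as err:
            if not err.retryable or attempt == max_attempts:
                raise  # non-retryable, or attempts exhausted
            # Full jitter: sleep anywhere in [0, base * 2^(attempt - 1)].
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

The jitter matters: if many clients retry on the same schedule, they hit the recovering API in synchronized waves.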
Write tools
Write tools should not be auto-retried unless you have idempotency. Otherwise retries can duplicate actions.
Safe approach:
- require an idempotency key
- retry only when the system can prove the operation is safe to retry
- otherwise surface a clear error and require human review
Without idempotency, retries can create duplicate tickets, duplicate payments, or unintended updates.
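The idempotency-key mechanism can be sketched as follows. This is a toy with an in-memory store and a hypothetical `create_ticket` write tool; a real system would persist completed keys durably:

```python
import uuid

# Illustrative in-memory store; production would use durable storage.
_completed: dict = {}

def create_ticket(payload: dict, idempotency_key: str) -> dict:
    """Write-tool sketch: a retried call with the same key replays the
    stored result instead of creating a duplicate ticket."""
    if idempotency_key in _completed:
        return _completed[idempotency_key]
    result = {"ticket_id": str(uuid.uuid4()), "payload": payload}
    _completed[idempotency_key] = result
    return result
```

With this in place, a retry is provably safe: the second call with the same key returns the first call's result rather than performing the write again.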
Fallback patterns
Fallbacks keep the user experience alive when a tool fails:
- ask for clarification: if required input is missing
- degrade gracefully: answer with partial information and a caveat
- alternate tool: if one data source is down, use another read-only source
- human escalation: for internal tools, hand off to a human with context
Fallback choice should depend on tool type and the user’s needs.
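The alternate-source pattern for read-only tools can be sketched in a few lines; flagging the result as degraded lets the caller attach the caveat mentioned above:

```python
def read_with_fallback(primary, alternate):
    """Sketch: try the primary read-only source; on failure, use the
    alternate and mark the answer as degraded so the caller can caveat it."""
    try:
        return {"data": primary(), "degraded": False}
    except Exception:
        return {"data": alternate(), "degraded": True}
```

Restrict this to read-only sources: falling back between write paths reintroduces the duplicate-action risk discussed above.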
Circuit breakers and backpressure (practical)
If a tool or API is failing repeatedly, you should stop hammering it.
- Circuit breaker: after N failures, stop calling the tool for a cooldown window.
- Backpressure: return a “try later” response instead of queueing unbounded work.
- Queue limits: cap concurrency and reject overflow.
Even simple versions of these patterns dramatically improve stability.
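A minimal circuit breaker is small enough to show in full. This sketch tracks consecutive failures and a cooldown; thresholds and the half-open behavior are illustrative choices:

```python
import time

class CircuitBreaker:
    """Minimal sketch: after `max_failures` consecutive failures, refuse
    calls for `cooldown` seconds, then allow a single trial call."""
    def __init__(self, max_failures: int = 5, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: half-open, permit one trial call.
            self.opened_at = None
            self.failures = self.max_failures - 1  # one failure re-opens
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```

The caller checks `allow()` before each tool call and reports the outcome back; a refused call becomes a “try later” response instead of another request against a failing API.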
User-facing behavior under tool failure
Users don’t care which tool failed. They care what to do next. Provide:
- a clear message: “We couldn’t fetch X right now.”
- a next step: “Try again” / “Try later” / “Provide order id”
- a request id for support
Avoid exposing internal errors verbatim.
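One way to enforce that is a fixed translation table from internal category to user message plus next step; the wording and names below are illustrative:

```python
# Map internal error categories to a safe message and a recovery hint;
# internal errors never reach the user verbatim.
USER_MESSAGES = {
    "not_found": ("We couldn't find that.", "Double-check the id and try again."),
    "rate_limit": ("We couldn't fetch that right now.", "Please try again in a minute."),
    "invalid_input": ("We couldn't process that request.", "Check the details and retry."),
}
DEFAULT_MESSAGE = ("Something went wrong on our side.", "Please try again later.")

def user_facing(category: str, request_id: str) -> str:
    """Build the user-visible error: message, next step, and a request id
    the user can quote to support."""
    message, next_step = USER_MESSAGES.get(category, DEFAULT_MESSAGE)
    return f"{message} {next_step} (request id: {request_id})"
```

Because every category funnels through the table, an unexpected internal error falls back to the generic message rather than leaking a stack trace.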
Logging and audit trails
For tool calls, log:
- request id
- tool name
- sanitized parameters (no secrets)
- latency
- outcome category + retryable
- idempotency key (for writes)
This enables debugging and accountability without leaking data.
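The fields above can be assembled into one structured log line per tool call. This sketch uses a hypothetical deny-list for secret parameters; real sanitization should match your own credential fields:

```python
import json

SECRET_FIELDS = {"api_key", "token", "password"}  # illustrative deny-list

def tool_call_log_line(request_id, tool_name, params, category, retryable,
                       latency_ms, idempotency_key=None) -> str:
    """Build one structured, secret-free log line per tool call (sketch)."""
    record = {
        "request_id": request_id,
        "tool": tool_name,
        "params": {k: "[redacted]" if k in SECRET_FIELDS else v
                   for k, v in params.items()},
        "latency_ms": latency_ms,
        "outcome": category,
        "retryable": retryable,
    }
    if idempotency_key is not None:  # writes only
        record["idempotency_key"] = idempotency_key
    return json.dumps(record)
```

Structured (JSON) lines keep the log queryable: you can count failures per tool and category, or pull every attempt for one request id.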
Copy-paste templates
Template: tool error envelope
{
  "ok": false,
  "error": {
    "category": "rate_limit",
    "message": "Tool rate limited",
    "retryable": true
  }
}
Template: tool retry policy text
Tool retry policy:
- Retryable: rate_limit, transient, some timeouts
- Max attempts: 3
- Backoff: exponential + jitter
- Do not auto-retry write tools unless idempotency is guaranteed
- Log tool name, attempt count, category, latency