16.4 Tool error handling (when APIs fail)

Goal: tools fail safely and predictably

Tools will fail because networks fail, APIs rate limit, credentials expire, and inputs are messy. Your goal is to make tool failure:

categorized: retryable vs non-retryable
bounded: no infinite retries
auditable: you can see what was attempted
user-friendly: clear recovery paths

Tool error handling is product reliability

If tools fail unpredictably, your app will feel flaky. With a clear taxonomy and policies, tool failures become manageable.

Tool error taxonomy

Use a simple taxonomy with explicit retryability:

invalid_input: schema validation failed (non-retryable)
not_found: requested resource doesn’t exist (non-retryable)
auth: missing/expired credentials (non-retryable until fixed)
rate_limit: throttled (retryable with backoff)
timeout: tool took too long (retryable for reads; for writes only with an idempotency key, because the call may have completed)
transient: network blips (retryable)
unknown: unexpected failures (retry at most once, then surface the error)

Make “retryable” explicit in tool responses so your system doesn’t guess.
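One way to make retryability explicit is to attach a default flag to each category and let individual tools override it. The sketch below is illustrative only: the names `ErrorCategory`, `DEFAULT_RETRYABLE`, and `ToolError` are assumptions, and the defaults chosen for `timeout` and `unknown` are judgment calls rather than fixed rules.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCategory(str, Enum):
    INVALID_INPUT = "invalid_input"
    NOT_FOUND = "not_found"
    AUTH = "auth"
    RATE_LIMIT = "rate_limit"
    TIMEOUT = "timeout"
    TRANSIENT = "transient"
    UNKNOWN = "unknown"


# Default retryability per category, following the taxonomy above.
# TIMEOUT and UNKNOWN are assumptions; a tool may override them per call.
DEFAULT_RETRYABLE = {
    ErrorCategory.INVALID_INPUT: False,
    ErrorCategory.NOT_FOUND: False,
    ErrorCategory.AUTH: False,
    ErrorCategory.RATE_LIMIT: True,
    ErrorCategory.TIMEOUT: True,
    ErrorCategory.TRANSIENT: True,
    ErrorCategory.UNKNOWN: True,
}


@dataclass(frozen=True)
class ToolError:
    category: ErrorCategory
    message: str
    retryable: bool

    @classmethod
    def of(cls, category, message, retryable=None):
        """Build an error, using the category default unless overridden."""
        if retryable is None:
            retryable = DEFAULT_RETRYABLE[category]
        return cls(category, message, retryable)
```

A write tool that times out could pass `retryable=False` explicitly, so the caller never has to infer the answer from the category alone.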

Retry rules (read vs write tools)

Read-only tools

Read tools have no side effects, so they can usually be retried safely, provided you cap the number of attempts and back off between them.
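A capped retry loop with exponential backoff and jitter might look like the following. This is a minimal sketch: `ToolCallError` and `call_with_retry` are hypothetical names, and the default values for attempts and delays are assumptions to tune for your own tools.

```python
import random
import time


class ToolCallError(Exception):
    """Hypothetical exception raised by a tool; carries the retryable flag."""

    def __init__(self, category, retryable):
        super().__init__(category)
        self.category = category
        self.retryable = retryable


def call_with_retry(tool, max_attempts=3, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep):
    """Call a read-only tool, retrying retryable failures up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ToolCallError as err:
            # Stop on non-retryable errors or when attempts are exhausted.
            if not err.retryable or attempt == max_attempts:
                raise
            # Exponential backoff with full jitter: wait a random time
            # between 0 and min(max_delay, base_delay * 2^(attempt - 1)).
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the delay can be swapped out in tests; production callers would leave it at the default.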

Write tools

Write tools should not be auto-retried unless the operation is idempotent. Otherwise a retry can repeat an action that already succeeded.

Safe approach:

require an idempotency key,
retry only when the system can prove the operation is safe to retry,
otherwise surface a clear error and require human review.

Automatic retries + side effects = danger

Without idempotency, retries can create duplicate tickets, duplicate payments, or unintended updates.
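To illustrate how an idempotency key makes a write safe to retry, here is a small in-memory sketch. `TicketService` and `create_ticket` are hypothetical; a real service would persist the key-to-result mapping and expire old keys.

```python
import uuid


class TicketService:
    """Hypothetical write tool that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first call
        self.tickets = []

    def create_ticket(self, title, idempotency_key):
        # A repeated key returns the stored result instead of writing again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        ticket = {"id": len(self.tickets) + 1, "title": title}
        self.tickets.append(ticket)
        self._results[idempotency_key] = ticket
        return ticket


service = TicketService()
# Generate the key once per logical operation and reuse it on every retry.
key = str(uuid.uuid4())

first = service.create_ticket("Refund order", key)
retry = service.create_ticket("Refund order", key)  # e.g. after a timeout
```

Because both calls share one key, the retry returns the original ticket and only one ticket exists afterwards.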

Fallback patterns

Fallbacks keep the user experience alive when a tool fails:

ask for clarification: if required input is missing
degrade gracefully: answer with partial information and a caveat
alternate tool: if one data source is down, use another read-only source
human escalation: for internal tools, hand off to a human with context

Fallback choice should depend on tool type and the user’s needs.
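The first three fallbacks can be chained in a single lookup function. The sketch below assumes two interchangeable read-only sources passed in as callables; `lookup_with_fallback` and the shape of its return value are illustrative, not a prescribed interface.

```python
def lookup_with_fallback(order_id, primary, secondary):
    """Try the primary read tool, then an alternate, then degrade gracefully."""
    # Ask for clarification when required input is missing.
    if not order_id:
        return {"status": "needs_input",
                "ask": "Which order id should I look up?"}

    # Alternate tool: fall through to the next read-only source on failure.
    for name, source in (("primary", primary), ("secondary", secondary)):
        try:
            return {"status": "ok", "source": name, "data": source(order_id)}
        except Exception:
            # Broad catch keeps the sketch short; real code should log the
            # failure and catch only the errors it expects.
            continue

    # Degrade gracefully: no data, but a clear caveat for the user.
    return {"status": "degraded", "data": None,
            "caveat": "Order details are temporarily unavailable."}
```

Human escalation is left out here because it usually involves a separate hand-off channel rather than a return value.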

Circuit breakers and backpressure (practical)

If a tool or API is failing repeatedly, you should stop hammering it.

Circuit breaker: after N consecutive failures, stop calling the tool for a cooldown window.
Backpressure: return a “try later” response instead of queueing unbounded work.
Queue limits: cap concurrency and reject overflow.

Even simple versions of these patterns keep one failing dependency from dragging down the rest of the system.
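A simple circuit breaker fits in a few dozen lines. The following is a minimal, single-threaded sketch: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, the threshold and cooldown defaults are assumptions, and it counts consecutive failures only.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the tool is not called."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, tool):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                # Still cooling down: fail fast without touching the tool.
                raise CircuitOpenError("tool unavailable, try later")
            # Cooldown elapsed: allow one trial call. A single failure
            # during this trial reopens the circuit immediately.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = tool()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success resets the failure count
        return result
```

Callers can translate `CircuitOpenError` into a "try later" response, which is the backpressure behavior described above. A multi-threaded service would also need a lock around the counters.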

User-facing behavior under tool failure

Users don’t care which tool failed. They care about what to do next. Provide:

a clear message: “We couldn’t fetch X right now.”
a next step: “Try again” / “Try later” / “Provide order id”
a request id for support

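Avoid exposing internal errors verbatim; stack traces and upstream messages can leak implementation details and rarely tell the user what to do.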
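One way to keep internal details out of user messages is to map each error category to fixed, pre-written text. In this sketch the wording, the `USER_MESSAGES` table, and the `user_facing_error` function are all illustrative assumptions.

```python
# Category -> (message, next step). The internal error text is never used.
USER_MESSAGES = {
    "invalid_input": ("We need a bit more information.",
                      "Provide the missing details"),
    "not_found": ("We couldn't find that item.",
                  "Check the id and try again"),
    "auth": ("We couldn't access your account data.", "Sign in again"),
    "rate_limit": ("We couldn't fetch that right now.", "Try later"),
    "timeout": ("That took longer than expected.", "Try again"),
    "transient": ("We couldn't fetch that right now.", "Try again"),
}
DEFAULT_MESSAGE = ("Something went wrong on our side.", "Try later")


def user_facing_error(category, request_id):
    """Build the user-visible payload from the category alone."""
    message, next_step = USER_MESSAGES.get(category, DEFAULT_MESSAGE)
    return {"message": message, "next_step": next_step,
            "request_id": request_id}
```

Unrecognized categories fall back to the generic message, so a new internal error type cannot leak through by accident.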

Logging and audit trails

For tool calls, log:

request id
tool name
sanitized parameters (no secrets)
latency
attempt count
outcome category and retryable flag
idempotency key (for writes)

This enables debugging and accountability without leaking data.
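The fields above map naturally onto one structured log record per tool call. The sketch below uses Python's standard `logging` and `json` modules; the `SECRET_KEYS` set, the field names, and the `build_log_record` helper are assumptions, and the sanitizer only inspects top-level keys.

```python
import json
import logging
import time

logger = logging.getLogger("tool_calls")

# Assumed list of parameter names that must never reach the logs.
SECRET_KEYS = {"api_key", "token", "password", "authorization"}


def sanitize(params):
    """Redact secret-looking values (top-level keys only in this sketch)."""
    return {k: "[REDACTED]" if k.lower() in SECRET_KEYS else v
            for k, v in params.items()}


def build_log_record(request_id, tool_name, params, started_at,
                     category, retryable, idempotency_key=None):
    """Assemble one log record; started_at comes from time.monotonic()."""
    return {
        "request_id": request_id,
        "tool": tool_name,
        "params": sanitize(params),
        "latency_ms": round((time.monotonic() - started_at) * 1000),
        "category": category,
        "retryable": retryable,
        "idempotency_key": idempotency_key,  # None for read tools
    }


started = time.monotonic()
record = build_log_record(
    "req-123", "get_order",
    {"order_id": "A1", "api_key": "s3cr3t"},
    started, "timeout", True,
)
logger.info(json.dumps(record))
```

Emitting the record as a single JSON line keeps it easy to search by request id or category later.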

Copy-paste templates

Template: tool error envelope

{
  "ok": false,
  "error": {
    "category": "rate_limit",
    "message": "Tool rate limited",
    "retryable": true
  }
}
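As a rough illustration of how a caller might produce and consume this envelope, consider the sketch below. The helper names `error_envelope` and `should_retry` are hypothetical, and treating a missing `retryable` flag as non-retryable is a conservative assumption.

```python
def error_envelope(category, message, retryable):
    """Build an error envelope in the shape of the template above."""
    return {
        "ok": False,
        "error": {
            "category": category,
            "message": message,
            "retryable": retryable,
        },
    }


def should_retry(envelope, attempt, max_attempts=3):
    """Decide from the envelope alone whether another attempt is allowed."""
    if envelope.get("ok"):
        return False
    error = envelope.get("error", {})
    # A missing flag is treated as non-retryable rather than guessed.
    return bool(error.get("retryable", False)) and attempt < max_attempts
```

With this in place, the retry decision depends only on the tool's own response and the attempt count.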

Template: tool retry policy text

Tool retry policy:
- Retryable: rate_limit, transient, some timeouts
- Max attempts: 3
- Backoff: exponential + jitter
- Do not auto-retry write tools unless idempotency is guaranteed
- Log tool name, attempt count, category, latency
