16.4 Tool error handling (when APIs fail)

Goal: tools fail safely and predictably

Tools will fail because networks fail, APIs rate limit, credentials expire, and inputs are messy. Your goal is to make tool failure:

  • categorized: retryable vs non-retryable
  • bounded: no infinite retries
  • auditable: you can see what was attempted
  • user-friendly: clear recovery paths

Tool error handling is product reliability

If tools fail unpredictably, your app will feel flaky. With a clear taxonomy and policies, tool failures become manageable.

Tool error taxonomy

Use a simple taxonomy with explicit retryability:

  • invalid_input: schema validation failed (non-retryable)
  • not_found: requested resource doesn’t exist (non-retryable)
  • auth: missing/expired credentials (non-retryable until fixed)
  • rate_limit: throttled (retryable with backoff)
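  • timeout: tool took too long (retryable for read tools; for write tools the call may have completed, so retry only with idempotency)
  • transient: network blips (retryable)
  • unknown: unexpected failures (retry at most once, then surface the error)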

Make “retryable” explicit in tool responses so your system doesn’t guess.
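One way to keep retryability explicit is to attach a default flag to each category and stamp it into every error response. The sketch below is illustrative only: `ErrorCategory`, `DEFAULT_RETRYABLE`, and `error_envelope` are hypothetical names, and the defaults for `timeout` and `unknown` are judgment calls that a caller may override.

```python
from enum import Enum


class ErrorCategory(str, Enum):
    INVALID_INPUT = "invalid_input"
    NOT_FOUND = "not_found"
    AUTH = "auth"
    RATE_LIMIT = "rate_limit"
    TIMEOUT = "timeout"
    TRANSIENT = "transient"
    UNKNOWN = "unknown"


# Default retryability per category (assumed defaults, not a standard).
# "timeout" and "unknown" are only sometimes retryable, so callers can override.
DEFAULT_RETRYABLE = {
    ErrorCategory.INVALID_INPUT: False,
    ErrorCategory.NOT_FOUND: False,
    ErrorCategory.AUTH: False,
    ErrorCategory.RATE_LIMIT: True,
    ErrorCategory.TIMEOUT: True,
    ErrorCategory.TRANSIENT: True,
    ErrorCategory.UNKNOWN: True,
}


def error_envelope(category, message, retryable=None):
    """Build an error response that always carries an explicit retryable flag."""
    if retryable is None:
        retryable = DEFAULT_RETRYABLE[category]
    return {
        "ok": False,
        "error": {
            "category": category.value,
            "message": message,
            "retryable": retryable,
        },
    }
```

A write tool that times out could call `error_envelope(ErrorCategory.TIMEOUT, "Tool timed out", retryable=False)` so that the caller never has to infer the flag from the category.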

Retry rules (read vs write tools)

Read-only tools

Read tools have no side effects, so they can usually be retried safely, as long as attempts are capped and spaced out with backoff.
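A capped retry loop with exponential backoff and jitter might look like the sketch below. It assumes the tool is a zero-argument callable that returns a dict with an `ok` flag and, on failure, an `error` object carrying `retryable`. The function name `call_read_tool` and its default values are hypothetical.

```python
import random
import time


def call_read_tool(tool, max_attempts=3, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call tool() up to max_attempts times, retrying only retryable failures.

    Assumes max_attempts >= 1 and that tool() returns a dict with an "ok" key.
    """
    result = None
    for attempt in range(1, max_attempts + 1):
        result = tool()
        if result.get("ok"):
            return result
        if not result["error"].get("retryable", False):
            return result  # non-retryable: give up immediately
        if attempt < max_attempts:
            # Exponential backoff with full jitter: a random wait between 0 and
            # base_delay * 2^(attempt - 1), never exceeding max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    return result  # still failing after the last allowed attempt
```

The `sleep` parameter exists so that tests can pass a stub instead of actually waiting.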

Write tools

Write tools should not be auto-retried unless the operation is idempotent. Otherwise a retry can repeat an action that already succeeded.

Safe approach:

  • require an idempotency key,
  • retry only when the system can prove the operation is safe to retry,
  • otherwise surface a clear error and require human review.

Automatic retries + side effects = danger

Without idempotency, retries can create duplicate tickets, duplicate payments, or unintended updates.
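To illustrate how an idempotency key prevents duplicates, the sketch below wraps a write function so that a repeated key returns the stored result instead of running the write again. `IdempotentWriter` and `new_idempotency_key` are hypothetical names. The in-memory dict is an assumption made for brevity; a real system would need a durable, shared store and would have to handle concurrent calls.

```python
import uuid


def new_idempotency_key():
    """Generate one key per logical operation, reused across its retries."""
    return uuid.uuid4().hex


class IdempotentWriter:
    """Runs write_fn at most once per idempotency key (in-memory sketch)."""

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._results = {}  # idempotency key -> result of the completed write

    def execute(self, idempotency_key, payload):
        if idempotency_key in self._results:
            # Same logical operation seen before: return the earlier result
            # without repeating the side effect.
            return self._results[idempotency_key]
        result = self._write_fn(payload)
        self._results[idempotency_key] = result
        return result
```

If `write_fn` raises, nothing is stored, so a retry runs the write again. That is only safe if the failure happened before the side effect, which is why ambiguous failures still warrant human review.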

Fallback patterns

Fallbacks keep the user experience alive when a tool fails:

  • ask for clarification: if required input is missing
  • degrade gracefully: answer with partial information and a caveat
  • alternate tool: if one data source is down, use another read-only source
  • human escalation: for internal tools, hand off to a human with context

Fallback choice should depend on tool type and the user’s needs.
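As a rough illustration, fallback selection can be written as a small decision function. The ordering below is one possible policy, not a rule from this guide, and `choose_fallback` with its parameters and return values is hypothetical.

```python
def choose_fallback(category, is_write, has_alternate_source=False):
    """Pick a fallback pattern from the error category and the tool type."""
    if category == "invalid_input":
        # Missing or malformed input: only the user can fix it.
        return "ask_clarification"
    if is_write:
        # Never improvise around a failed side effect; hand off with context.
        return "human_escalation"
    if has_alternate_source:
        # Read tool with a second read-only source available.
        return "alternate_tool"
    # Read tool with no alternative: answer with what is known, plus a caveat.
    return "degrade_gracefully"
```

For example, `choose_fallback("transient", is_write=False, has_alternate_source=True)` selects the alternate read-only source.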

Circuit breakers and backpressure (practical)

If a tool or API is failing repeatedly, you should stop hammering it.

  • Circuit breaker: after N failures, stop calling the tool for a cooldown window.
  • Backpressure: return a “try later” response instead of queueing unbounded work.
  • Queue limits: cap concurrency and reject overflow.

Even simple versions of these patterns can noticeably improve stability, because a failing dependency stops receiving traffic it cannot serve.
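A minimal circuit breaker, assuming consecutive failures are counted and a single cooldown window applies, could be sketched as follows. The class name, thresholds, and the injectable `clock` are illustrative assumptions.

```python
import time


class CircuitBreaker:
    """Stops calls to a tool after repeated consecutive failures."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0      # consecutive failures since the last success
        self._opened_at = None  # time the breaker opened, or None if closed

    def allow(self):
        """Return True if a call may be attempted right now."""
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, let a trial call through.
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open the breaker, or restart the cooldown after a failed trial call.
            self._opened_at = self._clock()
```

The caller checks `allow()` before each tool call, returns a "try later" response when it is `False`, and reports the outcome with `record_success()` or `record_failure()`.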

User-facing behavior under tool failure

Users don’t care which tool failed. They care what to do next. Provide:

  • a clear message: “We couldn’t fetch X right now.”
  • a next step: “Try again” / “Try later” / “Provide order id”
  • a request id for support

Avoid exposing internal errors verbatim.
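The three parts above can be assembled from the error category alone, so that internal details never reach the user. In this sketch the wording of each next step and the names `NEXT_STEP_BY_CATEGORY` and `user_facing_error` are hypothetical.

```python
import uuid

# Hypothetical next-step wording per category; adjust to the product's voice.
NEXT_STEP_BY_CATEGORY = {
    "invalid_input": "Check the details you entered and try again.",
    "not_found": "Check the id you provided.",
    "auth": "Sign in again.",
    "rate_limit": "Try again in a minute.",
    "timeout": "Try again.",
    "transient": "Try again.",
}


def user_facing_error(category, subject, request_id=None):
    """Build a user-safe error: clear message, next step, and request id."""
    return {
        # The message names what the user asked for, never the tool or stack trace.
        "message": f"We couldn't fetch {subject} right now.",
        "next_step": NEXT_STEP_BY_CATEGORY.get(category, "Try again later."),
        "request_id": request_id or uuid.uuid4().hex,
    }
```

Passing the same `request_id` that appears in the logs lets support match a user report to the underlying tool call.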

Logging and audit trails

For tool calls, log:

  • request id
  • tool name
  • sanitized parameters (no secrets)
  • latency
  • outcome category and retryable flag
  • idempotency key (for writes)

This enables debugging and accountability without leaking data.
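A structured log record covering these fields might be built as in the sketch below. The `SECRET_KEYS` list, the function names, and the logger name are assumptions. The sanitizer only checks top-level parameter names, so nested secrets would need extra handling.

```python
import json
import logging

logger = logging.getLogger("tool_calls")  # hypothetical logger name

# Parameter names treated as secrets (an assumed list; extend as needed).
SECRET_KEYS = {"password", "token", "api_key", "authorization", "secret"}


def sanitize(params):
    """Redact secret-looking top-level parameters before logging."""
    return {
        key: "[REDACTED]" if key.lower() in SECRET_KEYS else value
        for key, value in params.items()
    }


def log_tool_call(request_id, tool_name, params, latency_ms, category, retryable,
                  idempotency_key=None):
    """Emit one JSON log line per tool call and return the record."""
    record = {
        "request_id": request_id,
        "tool": tool_name,
        "params": sanitize(params),
        "latency_ms": latency_ms,
        "category": category,    # for example "ok" or an error category
        "retryable": retryable,
    }
    if idempotency_key is not None:
        record["idempotency_key"] = idempotency_key  # write tools only
    logger.info(json.dumps(record))
    return record
```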

Copy-paste templates

Template: tool error envelope

{
  "ok": false,
  "error": {
    "category": "rate_limit",
    "message": "Tool rate limited",
    "retryable": true
  }
}

Template: tool retry policy text

Tool retry policy:
- Retryable: rate_limit, transient, some timeouts
- Max attempts: 3
- Backoff: exponential + jitter
- Do not auto-retry write tools unless idempotency is guaranteed
- Log tool name, attempt count, category, latency
