13.3 Basic request/response wrapper architecture

Goal: isolate model calls behind one boundary

As soon as your app makes model calls, you need a boundary so the rest of your code doesn’t turn into “prompt glue.”

The goal of this chapter is to build a single module (or small package) that is the only place that:

  • knows how to call the model API,
  • knows about retries/timeouts,
  • knows about schemas and output validation,
  • knows how to log model-call metadata safely.

Everything else should treat the model call like a normal dependency with a small interface.

This is how you keep architectural control

If model calls are scattered across handlers and controllers, your app becomes hard to debug, hard to test, and expensive to change. A wrapper boundary prevents that.

Why wrapper architecture matters

LLM calls behave differently from normal library functions:

  • they can be slow,
  • they can fail transiently (rate limits, timeouts),
  • they can fail “semantically” (invalid JSON, wrong schema),
  • they can be blocked/refused (safety behavior),
  • they can be expensive.

If you don’t isolate those behaviors, they leak into every part of your codebase.
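
The transient failures on this list are the ones worth retrying; the semantic ones usually are not. Below is a minimal sketch of that split in Python, assuming the wrapper translates provider errors into exception types of its own (TransientLLMError, call_with_retries, and the backoff values are illustrative, not from any SDK):

import random
import time

class TransientLLMError(Exception):
    """Rate limits and timeouts: worth retrying with backoff."""

class InvalidOutputError(Exception):
    """Semantic failures (bad JSON, wrong schema): retrying blindly rarely helps."""

def call_with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry only transient failures, with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientLLMError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))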

A practical layering model

A simple, durable architecture:

  • UI / entrypoints: CLI, HTTP handlers, jobs. They handle input/output and UX.
  • Domain layer: your product logic (what the app is trying to do).
  • LLM adapter: the wrapper client that calls the model and returns structured results.
  • Infrastructure: config, logging, persistence, caching.

The key is that domain logic should not contain raw prompt strings or provider-specific API calls.
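
As a small illustration, here is what that separation can look like in Python; Summarizer and build_digest are illustrative names, not part of any framework:

from typing import Protocol

class Summarizer(Protocol):
    """The narrow interface the domain layer depends on."""
    def summarize(self, text: str) -> str: ...

def build_digest(articles: list[str], summarizer: Summarizer) -> list[str]:
    """Domain logic: no prompt strings, no provider SDK imports."""
    return [summarizer.summarize(article) for article in articles]

The real implementation of Summarizer lives in the LLM adapter; the domain layer only ever sees the interface.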

Avoid “prompt soup”

When prompts are embedded inline across the app, you can’t version them, test them, or evolve them safely. Centralize prompt usage.

Design the wrapper interface (inputs/outputs)

Your wrapper should expose a small, typed interface. The interface you choose is more important than the provider API details.

Inputs you typically want

  • task name / prompt id: which prompt template to use,
  • task inputs: the user data (text, params),
  • output mode: free text vs structured (schema),
  • options: timeout, retries, temperature overrides, model override.

Outputs you typically want

Prefer returning a structured response object rather than raw text. A practical response includes:

  • status: ok / blocked / invalid_output / timeout / rate_limit / error
  • result: parsed structured data (or text)
  • raw_text: optional (for debugging in dev)
  • metadata: request id, prompt version, model, latency, token estimates
  • error: categorized error info safe to log and show
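
A minimal Python sketch of that contract, mirroring the wrapper-interface template at the end of this section (field names and types are illustrative):

from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class LLMRequest:
    prompt_id: str
    prompt_version: str
    inputs: dict[str, Any]
    output_schema: Optional[dict] = None                    # set when structured output is required
    options: dict[str, Any] = field(default_factory=dict)   # timeout, retries, model override

@dataclass
class LLMResponse:
    status: str                                             # ok / blocked / invalid_output / timeout / rate_limit / error
    result: Any = None                                      # parsed, validated data (or plain text)
    raw_text: Optional[str] = None                          # keep only for debugging in dev
    metadata: dict[str, Any] = field(default_factory=dict)  # request_id, prompt_version, model, latency, token estimate
    error: Optional[dict] = None                            # categorized error info safe to log and show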

Treat the model call like an RPC

Define an explicit request/response contract for your model boundary. That’s what makes the system testable and resilient.

Prompt packaging (versions, metadata, structure)

Prompts should not be assembled ad hoc in scattered places across the app. The wrapper should:

  • load prompt templates from files,
  • inject task inputs into a template safely,
  • include prompt version ids in logs and responses,
  • separate “system/house rules” from “task spec” (Part IV Section 10).

This is how you prevent prompt drift and make outputs reproducible.
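
A sketch of the loading step, assuming templates live under src/llm/prompts/ as in the file structure below and use $-style placeholders; both are assumptions you can swap for your own conventions:

from pathlib import Path
from string import Template

PROMPT_DIR = Path("src/llm/prompts")

def render_prompt(prompt_id: str, version: str, inputs: dict[str, str]) -> tuple[str, str]:
    """Load a versioned template from disk and fill in task inputs.

    Template.substitute raises KeyError on a missing placeholder instead of
    silently shipping a broken prompt."""
    prompt_version = f"{prompt_id}_{version}"                  # e.g. "summarize_v1"
    text = (PROMPT_DIR / f"{prompt_version}.md").read_text()
    return Template(text).substitute(inputs), prompt_version   # version id travels with the call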

Validation: treat outputs as untrusted

Even if the model is “usually right,” outputs are untrusted input:

  • validate JSON if you asked for JSON,
  • validate schemas (required fields, enums, types),
  • handle partial/invalid output gracefully (retry or fallback),
  • never assume “it will always follow instructions.”

Structured output and validation get a full chapter later (Part V Section 15), but your wrapper should be designed to support it from day one.

Validation belongs at the boundary

Don’t push validation into every caller. Do it once in the wrapper and return a typed, validated result or a clear error.
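
A minimal boundary check in Python; it uses hand-rolled type checks, which a JSON Schema validator can replace once your schemas/*.json files exist (function and field names are illustrative):

import json

def parse_and_validate(raw_text: str, required: dict[str, type]) -> tuple[str, dict | None]:
    """Treat model output as untrusted: return ("ok", data) or ("invalid_output", None)."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return "invalid_output", None
    if not isinstance(data, dict):
        return "invalid_output", None
    for name, expected_type in required.items():
        if not isinstance(data.get(name), expected_type):
            return "invalid_output", None
    return "ok", data

# Example: a summarize task must return {"summary": str, "tags": list}
status, result = parse_and_validate('{"summary": "ok", "tags": []}',
                                    {"summary": str, "tags": list})
assert status == "ok"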

Testing: dependency injection and fakes

The fastest way to make AI apps testable is to make the LLM adapter injectable.

  • Your domain logic depends on an interface like Summarizer, not on an API client.
  • In tests, you use a fake summarizer that returns deterministic outputs.
  • In production, you bind the real LLM client implementation.

This lets you test your app without making real model calls (faster, cheaper, deterministic).
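
A sketch of that wiring, reusing the Summarizer/build_digest names from the layering sketch earlier (build_digest is repeated so the example runs on its own):

def build_digest(articles, summarizer):
    """Same domain function as in the layering sketch."""
    return [summarizer.summarize(article) for article in articles]

class FakeSummarizer:
    """Deterministic stand-in for the LLM-backed implementation."""
    def summarize(self, text: str) -> str:
        return f"summary:{len(text)}"

def test_build_digest_summarizes_each_article():
    digest = build_digest(["first article", "second"], FakeSummarizer())
    assert digest == ["summary:13", "summary:6"]

In production, the binding goes the other way: the real adapter in src/llm/client.py provides the same summarize method, and nothing in the domain layer or the tests has to change.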

Don’t put real model calls in unit tests

They’re slow, flaky, and expensive. Save real calls for manual testing and evaluation harnesses.

Recommended file structure

A small, scalable structure:

src/
  app/                # domain logic (no provider imports)
    summarize.py
  llm/                # model boundary
    client.py
    prompts/
      system.md
      summarize_v1.md
    schemas/
      summarize_v1.json
  config.py
  logging.py
tests/
  test_app_summarize.py   # uses a fake LLM client
  test_llm_schema.py      # validates schema enforcement

Copy-paste templates

Template: wrapper interface sketch

LLMRequest:
- prompt_id
- prompt_version
- inputs (task data)
- output_schema (optional)
- options (timeout, retries, model override)

LLMResponse:
- status (ok/blocked/invalid_output/timeout/rate_limit/error)
- result (typed) or null
- raw_text (optional)
- metadata (request_id, model, latency, token_estimate)
- error (category + message)

Template: boundary rule (paste into prompts)

Architecture rule:
- All model/API calls must go through `src/llm/client.*`.
- No other modules may call the provider SDK directly.
- Callers must handle `status != ok` outcomes explicitly.
