Part IX — Testing, Evaluation, and Quality (The Adult Supervision Layer)
What this part is for
Part IX is the “adult supervision layer.” It’s where you stop treating LLM behavior as a mysterious craft and start treating it like any other production dependency:
- it fails in predictable ways,
- it needs tests,
- it needs evaluation,
- it needs reliability engineering,
- and it needs observability.
If your product depends on model output, quality is not a nice-to-have. It’s the difference between “cool demo” and “something users trust.”
For AI features, correctness is rarely “provable” for all inputs. Your job is to define what can be tested deterministically, what must be evaluated statistically, and how to detect regressions fast.
The mindset shift: from “prompting” to “quality systems”
Prompting is part of the solution, but it’s not the solution.
A robust AI feature has:
- contracts: schemas, invariants, and “not found” policies (a minimal check is sketched after this list),
- tests: deterministic checks for what must not break,
- evals: small curated sets that represent real use,
- reliability patterns: timeouts, retries, fallbacks, circuit breakers,
- observability: logs and traces that make failures debuggable.
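To make “contracts” concrete, here is a minimal sketch in Python using only the standard library. The field names, the NOT_FOUND sentinel, and the parse_summary function are illustrative assumptions, not a prescribed schema:

```python
import json

# Hypothetical contract for a structured summarization feature: the model must
# return JSON with exactly these fields, and must use the literal "NOT_FOUND"
# sentinel instead of guessing when the source does not contain the answer.
REQUIRED_FIELDS = {"title": str, "summary": str, "source_url": str}
NOT_FOUND = "NOT_FOUND"

class ContractViolation(Exception):
    """Raised when model output breaks the agreed contract."""

def parse_summary(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation(f"invalid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ContractViolation("top-level value must be a JSON object")

    # Invariant: exactly the agreed keys -- no missing fields, no hallucinated extras.
    extra, missing = set(data) - set(REQUIRED_FIELDS), set(REQUIRED_FIELDS) - set(data)
    if extra or missing:
        raise ContractViolation(f"extra keys {sorted(extra)}, missing keys {sorted(missing)}")

    # Invariant: every field has the agreed type.
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data[field], expected):
            raise ContractViolation(f"{field!r} must be a {expected.__name__}")

    # "Not found" policy: missing information is the explicit sentinel, never "".
    if data["summary"].strip() == "":
        raise ContractViolation(f"empty summary; use the {NOT_FOUND} sentinel instead")

    return data
```

A check like this sits at the boundary of the feature: the rest of your code either gets data that satisfies the contract or an exception with a reason you can log and test against.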
This part gives you practical patterns you can apply immediately—without needing a giant ML platform.
The failure modes you are designing against
Most AI feature failures fall into a few categories:
- Format failure: invalid JSON, missing fields, wrong types, hallucinated keys.
- Constraint failure: ignores instructions, exceeds length, breaks policy rules.
- Content failure: wrong facts, wrong reasoning, wrong math, missing exceptions.
- Grounding failure: claims not supported by sources (especially in RAG).
- Reliability failure: timeouts, flaky outputs, rate limits, upstream outages.
- Regression failure: prompt/model changes silently change behavior.
The themes: detect early, fail safely, and make the system explain itself.
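One way to act on those themes, sketched under assumptions (the callables passed in stand for your model client and your contract check; nothing here is a specific SDK): validate at the boundary, degrade to a safe fallback instead of crashing, and record which failure mode you hit.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("ai_feature")

# Safe default shown to users when the model fails; the shape mirrors the
# illustrative contract above (title/summary/source_url are assumed names).
FALLBACK = {"title": "", "summary": "NOT_FOUND", "source_url": ""}

@dataclass
class FeatureResult:
    ok: bool
    data: dict                    # validated payload, or the fallback
    failure_mode: Optional[str]   # "format", "reliability", or None when ok

def summarize_safely(
    document: str,
    call_model: Callable[[str], str],   # placeholder for your model client
    parse: Callable[[str], dict],       # e.g. a contract check like parse_summary above
) -> FeatureResult:
    """Detect failures early, fail safely, and leave an explanation behind."""
    try:
        raw = call_model(document)
    except (TimeoutError, ConnectionError) as exc:
        logger.warning("model call failed: %s", exc)
        return FeatureResult(ok=False, data=FALLBACK, failure_mode="reliability")

    try:
        data = parse(raw)               # raises on format/contract failures
    except Exception as exc:
        logger.warning("contract violation: %s", exc)
        return FeatureResult(ok=False, data=FALLBACK, failure_mode="format")

    return FeatureResult(ok=True, data=data, failure_mode=None)
```

The failure_mode field is what later makes dashboards and alerts possible: you can count format failures separately from reliability failures instead of staring at a single error rate.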
What you’ll be able to do after Part IX
- Choose the right testing strategy for a given AI feature (unit tests vs evals vs both).
- Build “golden tests” for structured outputs that lock in contracts (see the pytest sketch after this list).
- Use property-based and fuzz testing to harden against weird inputs and prompt injection attempts.
- Build a small evaluation harness that you can run on every prompt version.
- Score outputs with rubrics and pairwise comparisons to make improvements measurable.
- Design reliability patterns (timeouts, retries, fallbacks) that keep your app usable when models fail.
- Instrument your app so you can debug “why did it do that?” without guessing.
- Establish “ship points” for AI features that are based on evidence, not vibes.
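As an example of the “golden tests” item above, here is one possible pytest shape; the golden/ directory layout and the run_extraction pipeline function are hypothetical names, not part of any framework:

```python
# test_golden_extraction.py -- a minimal golden-test sketch; all names illustrative.
# Each golden file pairs a recorded input with the output we have agreed to lock in.
import json
from pathlib import Path

import pytest

from my_app.pipeline import run_extraction   # hypothetical: your prompt + parse + validate entry point

GOLDEN_DIR = Path(__file__).parent / "golden"   # one JSON file per case
CASES = sorted(GOLDEN_DIR.glob("*.json"))

@pytest.mark.parametrize("case_path", CASES, ids=lambda p: p.stem)
def test_output_matches_golden(case_path: Path) -> None:
    case = json.loads(case_path.read_text())
    actual = run_extraction(case["input"])

    # Lock in the contract: same keys, same critical values. If a prompt or model
    # change alters these, the test fails and the change needs a deliberate review
    # (and an intentional update of the golden file).
    assert actual == case["expected"]
```

In practice you either record model responses for these cases or point the test at the deterministic parsing-and-validation layer so it stays stable; either way, contract-level drift becomes a failing test instead of a surprise.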
The quality toolbox (tests, evals, reliability)
Use this toolbox as your mental model:
- Tests: deterministic checks for invariants (schema, safety constraints, critical behaviors).
- Evals: statistical or rubric-based measurement for quality and usefulness.
- Reliability: patterns that handle failure gracefully (timeouts, retries, circuit breakers, caching).
Most mature systems use all three.
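To make the reliability pillar concrete, here is one minimal shape for a timeout-retry-fallback wrapper, using only the standard library; the backoff schedule and the callable-based interface are assumptions for illustration, not a recommended library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(
    call: Callable[[], T],        # e.g. lambda: client.generate(prompt, timeout=10)
    fallback: Callable[[], T],    # cheap, always-available degraded answer
    max_attempts: int = 3,
    base_delay: float = 0.5,      # seconds; doubles each attempt, with jitter
) -> T:
    """Retry transient failures with exponential backoff, then degrade gracefully."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                # Out of retry budget: keep the app usable instead of surfacing an error page.
                return fallback()
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            time.sleep(delay)
    return fallback()   # defensive: only reached if max_attempts < 1
```

A circuit breaker adds one more step on top of this: after repeated failures it skips the call entirely for a cooldown window, so a struggling upstream isn’t hammered further.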
Teams over-invest in prompts and under-invest in evals. The result: changes feel like progress until a user finds the regression. Evals are how you stop flying blind.
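An evaluation harness can start smaller than most teams expect. The sketch below assumes illustrative names (EvalCase, run_evals, a stubbed generate function) and a baseline pass rate recorded from the last accepted prompt version:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt_input: str
    checks: list[Callable[[str], bool]]   # each check returns True if the output passes

def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every curated case and return the overall pass rate."""
    passed = 0
    for case in cases:
        output = generate(case.prompt_input)
        if all(check(output) for check in case.checks):
            passed += 1
        else:
            print(f"FAIL {case.name}")
    return passed / len(cases)

# Example usage with a stubbed generate() and a baseline pass rate stored
# wherever your team keeps eval artifacts.
if __name__ == "__main__":
    cases = [
        EvalCase(
            name="refund_policy_question",
            prompt_input="What is the refund window?",
            checks=[lambda out: "30 days" in out,
                    lambda out: "guarantee" not in out.lower()],
        ),
    ]
    baseline = 1.0   # pass rate of the currently shipped prompt version
    pass_rate = run_evals(generate=lambda s: "Refunds are accepted within 30 days.", cases=cases)
    if pass_rate < baseline:
        raise SystemExit(f"regression: pass rate {pass_rate:.0%} below baseline {baseline:.0%}")
```

The important property is that this runs on every prompt or model change, so a drop in pass rate shows up in CI before a user finds the regression.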
Part IX map (Sections 27–29)
- 27. Testing AI Features Like a Real Engineer
- 28. Evaluation Harnesses (Small to Serious)
- 29. Reliability Engineering for LLM Apps