Part IX — Testing, Evaluation, and Quality (The Adult Supervision Layer)
What this part is for
Part IX is the “adult supervision layer.” It’s where you stop treating LLM behavior as a mysterious craft and start treating it like any other production dependency:
- it fails in predictable ways,
- it needs tests,
- it needs evaluation,
- it needs reliability engineering,
- and it needs observability.
If your product depends on model output, quality is not a nice-to-have. It’s the difference between “cool demo” and “something users trust.”
For AI features, correctness is rarely “provable” for all inputs. Your job is to define what can be tested deterministically, what must be evaluated statistically, and how to detect regressions fast.
The mindset shift: from “prompting” to “quality systems”
Prompting is part of the solution, but it’s not the solution.
A robust AI feature has:
- contracts: schemas, invariants, and “not found” policies (a minimal check is sketched after this list),
- tests: deterministic checks for what must not break,
- evals: small curated sets that represent real use,
- reliability patterns: timeouts, retries, fallbacks, circuit breakers,
- observability: logs and traces that make failures debuggable.
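To make “contracts” concrete, here is a minimal sketch in Python using only the standard library. The field names, the NOT_FOUND sentinel, and the parse_summary function are illustrative assumptions, not a prescribed schema:

```python
import json

# Hypothetical contract for a structured summarization feature: the model must
# return JSON with exactly these fields, and must use the literal "NOT_FOUND"
# sentinel instead of guessing when the source does not contain the answer.
REQUIRED_FIELDS = {"title": str, "summary": str, "source_url": str}
NOT_FOUND = "NOT_FOUND"

class ContractViolation(Exception):
    """Raised when model output breaks the agreed contract."""

def parse_summary(raw: str) -> dict:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation(f"invalid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ContractViolation("top-level value must be a JSON object")

    # Invariant: exactly the agreed keys -- no missing fields, no hallucinated extras.
    extra, missing = set(data) - set(REQUIRED_FIELDS), set(REQUIRED_FIELDS) - set(data)
    if extra or missing:
        raise ContractViolation(f"extra keys {sorted(extra)}, missing keys {sorted(missing)}")

    # Invariant: every field has the agreed type.
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data[field], expected):
            raise ContractViolation(f"{field!r} must be a {expected.__name__}")

    # "Not found" policy: missing information is the explicit sentinel, never "".
    if data["summary"].strip() == "":
        raise ContractViolation(f"empty summary; use the {NOT_FOUND} sentinel instead")

    return data
```

A check like this sits at the boundary of the feature: the rest of your code either gets data that satisfies the contract or an exception with a reason you can log and test against.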
This part gives you practical patterns you can apply immediately—without needing a giant ML platform.
The failure modes you are designing against
Most AI feature failures fall into a few categories:
- Format failure: invalid JSON, missing fields, wrong types, hallucinated keys.
- Constraint failure: ignores instructions, exceeds length, breaks policy rules.
- Content failure: wrong facts, wrong reasoning, wrong math, missing exceptions.
- Grounding failure: claims not supported by sources (especially in RAG).
- Reliability failure: timeouts, flaky outputs, rate limits, upstream outages.
- Regression failure: prompt/model changes silently change behavior.
The themes: detect early, fail safely, and make the system explain itself.
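One way to act on those themes, sketched under assumptions (the callables passed in stand for your model client and your contract check; nothing here is a specific SDK): validate at the boundary, degrade to a safe fallback instead of crashing, and record which failure mode you hit.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("ai_feature")

# Safe default shown to users when the model fails; the shape mirrors the
# illustrative contract above (title/summary/source_url are assumed names).
FALLBACK = {"title": "", "summary": "NOT_FOUND", "source_url": ""}

@dataclass
class FeatureResult:
    ok: bool
    data: dict                    # validated payload, or the fallback
    failure_mode: Optional[str]   # "format", "reliability", or None when ok

def summarize_safely(
    document: str,
    call_model: Callable[[str], str],   # placeholder for your model client
    parse: Callable[[str], dict],       # e.g. a contract check like parse_summary above
) -> FeatureResult:
    """Detect failures early, fail safely, and leave an explanation behind."""
    try:
        raw = call_model(document)
    except (TimeoutError, ConnectionError) as exc:
        logger.warning("model call failed: %s", exc)
        return FeatureResult(ok=False, data=FALLBACK, failure_mode="reliability")

    try:
        data = parse(raw)               # raises on format/contract failures
    except Exception as exc:
        logger.warning("contract violation: %s", exc)
        return FeatureResult(ok=False, data=FALLBACK, failure_mode="format")

    return FeatureResult(ok=True, data=data, failure_mode=None)
```

The failure_mode field is what later makes dashboards and alerts possible: you can count format failures separately from reliability failures instead of staring at a single error rate.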
What you’ll be able to do after Part IX
- Choose the right testing strategy for a given AI feature (unit tests vs evals vs both).
- Build “golden tests” for structured outputs that lock in contracts (see the pytest sketch after this list).
- Use property-based and fuzz testing to harden against weird inputs and prompt injection attempts.
- Build a small evaluation harness that you can run on every prompt version.
- Score outputs with rubrics and pairwise comparisons to make improvements measurable.
- Design reliability patterns (timeouts, retries, fallbacks) that keep your app usable when models fail.
- Instrument your app so you can debug “why did it do that?” without guessing.
- Establish “ship points” for AI features that are based on evidence, not vibes.
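As an example of the “golden tests” item above, here is one possible pytest shape; the golden/ directory layout and the run_extraction pipeline function are hypothetical names, not part of any framework:

```python
# test_golden_extraction.py -- a minimal golden-test sketch; all names illustrative.
# Each golden file pairs a recorded input with the output we have agreed to lock in.
import json
from pathlib import Path

import pytest

from my_app.pipeline import run_extraction   # hypothetical: your prompt + parse + validate entry point

GOLDEN_DIR = Path(__file__).parent / "golden"   # one JSON file per case
CASES = sorted(GOLDEN_DIR.glob("*.json"))

@pytest.mark.parametrize("case_path", CASES, ids=lambda p: p.stem)
def test_output_matches_golden(case_path: Path) -> None:
    case = json.loads(case_path.read_text())
    actual = run_extraction(case["input"])

    # Lock in the contract: same keys, same critical values. If a prompt or model
    # change alters these, the test fails and the change needs a deliberate review
    # (and an intentional update of the golden file).
    assert actual == case["expected"]
```

In practice you either record model responses for these cases or point the test at the deterministic parsing-and-validation layer so it stays stable; either way, contract-level drift becomes a failing test instead of a surprise.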
The quality toolbox (tests, evals, reliability)
Use this toolbox as your mental model:
- Tests: deterministic checks for invariants (schema, safety constraints, critical behaviors).
- Evals: statistical or rubric-based measurement for quality and usefulness.
- Reliability: patterns that handle failure gracefully (timeouts, retries, circuit breakers, caching).
Most mature systems use all three.
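To make the reliability pillar concrete, here is one minimal shape for a timeout-retry-fallback wrapper, using only the standard library; the backoff schedule and the callable-based interface are assumptions for illustration, not a recommended library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(
    call: Callable[[], T],        # e.g. lambda: client.generate(prompt, timeout=10)
    fallback: Callable[[], T],    # cheap, always-available degraded answer
    max_attempts: int = 3,
    base_delay: float = 0.5,      # seconds; doubles each attempt, with jitter
) -> T:
    """Retry transient failures with exponential backoff, then degrade gracefully."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                # Out of retry budget: keep the app usable instead of surfacing an error page.
                return fallback()
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            time.sleep(delay)
    return fallback()   # defensive: only reached if max_attempts < 1
```

A circuit breaker adds one more step on top of this: after repeated failures it skips the call entirely for a cooldown window, so a struggling upstream isn’t hammered further.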
Teams over-invest in prompts and under-invest in evals. The result: changes feel like progress until a user finds the regression. Evals are how you stop flying blind.
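An evaluation harness can start smaller than most teams expect. The sketch below assumes illustrative names (EvalCase, run_evals, a stubbed generate function) and a baseline pass rate recorded from the last accepted prompt version:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt_input: str
    checks: list[Callable[[str], bool]]   # each check returns True if the output passes

def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every curated case and return the overall pass rate."""
    passed = 0
    for case in cases:
        output = generate(case.prompt_input)
        if all(check(output) for check in case.checks):
            passed += 1
        else:
            print(f"FAIL {case.name}")
    return passed / len(cases)

# Example usage with a stubbed generate() and a baseline pass rate stored
# wherever your team keeps eval artifacts.
if __name__ == "__main__":
    cases = [
        EvalCase(
            name="refund_policy_question",
            prompt_input="What is the refund window?",
            checks=[lambda out: "30 days" in out,
                    lambda out: "guarantee" not in out.lower()],
        ),
    ]
    baseline = 1.0   # pass rate of the currently shipped prompt version
    pass_rate = run_evals(generate=lambda s: "Refunds are accepted within 30 days.", cases=cases)
    if pass_rate < baseline:
        raise SystemExit(f"regression: pass rate {pass_rate:.0%} below baseline {baseline:.0%}")
```

The important property is that this runs on every prompt or model change, so a drop in pass rate shows up in CI before a user finds the regression.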
Part IX map (Sections 27–29)
- 27. Testing AI Features Like a Real Engineer
- 28. Evaluation Harnesses (Small to Serious)
- 29. Reliability Engineering for LLM Apps