28. Evaluation Harnesses (Small to Serious)
Overview and links for this section of the guide.
What this section is for
Section 28 teaches you how to build evaluation harnesses that make AI feature quality measurable.
An eval harness answers questions like:
- “Did this prompt change make things better or worse?”
- “Which model is better for this workload?”
- “Are we regressing on important edge cases?”
- “Do humans agree our outputs are acceptable?”
Evals are not “big ML infra”
A great eval harness can be a folder of JSON files + a script + a rubric. What matters is that it represents real usage and you run it consistently.
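To make the "folder of JSON files + a script" shape concrete, here is a minimal loader sketch in Python. The folder layout and field names (`id`, `input`, `must_mention`, `tags`) are illustrative assumptions, not a required schema:

```python
import json
from pathlib import Path

# Assumed layout: one eval case per JSON file under evals/cases/, for example:
#   {
#     "id": "billing-refund-001",
#     "input": "Can I get a refund after 30 days?",
#     "must_mention": ["30-day window", "refund policy page"],
#     "tags": ["billing", "edge-case"]
#   }

def load_cases(case_dir: str = "evals/cases") -> list[dict]:
    """Load every JSON eval case, sorted by filename so runs are reproducible."""
    return [json.loads(p.read_text()) for p in sorted(Path(case_dir).glob("*.json"))]
```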
What “eval” means in practice
For most products, eval means:
- collect a small set of representative prompts/inputs,
- run the system (prompt + retrieval + model + validators),
- score outputs using a rubric (human, model-assisted, or both),
- track results over time and detect regressions.
It’s closer to product QA than it is to “training.”
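A minimal version of that loop, in Python, might look like the sketch below. `run_system` and `score_output` stand in for your own pipeline and rubric; they are assumptions, not a fixed API:

```python
import json
import time
from pathlib import Path

def run_eval(cases: list[dict], run_system, score_output,
             out_dir: str = "evals/runs") -> list[dict]:
    """Run the full system on each case, score it with the rubric, and store the run."""
    results = []
    for case in cases:
        output = run_system(case["input"])       # prompt + retrieval + model + validators
        scores = score_output(case, output)      # e.g. {"correctness": 1, "faithfulness": 1}
        results.append({"id": case["id"], "output": output, "scores": scores})
    # Persist the whole run so later runs can be diffed against it.
    run_path = Path(out_dir) / f"run-{int(time.time())}.json"
    run_path.parent.mkdir(parents=True, exist_ok=True)
    run_path.write_text(json.dumps(results, indent=2))
    return results
```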
A simple eval workflow you can start today
- Write 25 eval cases that represent real questions.
- Define a rubric (correctness, faithfulness, clarity, safety).
- Run your system for every case and store outputs.
- Review diffs when you change prompts/models (a sketch follows this list).
- Promote failures into regression cases you never want to repeat.
That is enough to make iterative improvement real.
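For the diff-review step, comparing two stored runs case by case can start as small as this sketch (it assumes the run format from the loop above):

```python
def diff_runs(baseline: list[dict], candidate: list[dict],
              metric: str = "correctness") -> None:
    """Print every case whose score changed between two runs of the same eval set."""
    base_by_id = {r["id"]: r for r in baseline}
    for result in candidate:
        old = base_by_id.get(result["id"])
        if old is None:
            continue  # new case, nothing to compare against yet
        before, after = old["scores"].get(metric), result["scores"].get(metric)
        if before != after:
            print(f"{result['id']}: {metric} {before} -> {after}")
```

Cases that get worse are the ones you promote into permanent regression cases.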
Ingredients of a good eval harness
- Eval set: small, curated, representative.
- Scoring: rubrics, pairwise comparisons, and/or automated checks.
- Versioning: prompt versions, model versions, corpus versions.
- Regression detection: alerts when key metrics change (sketched after this list).
- Human review loop: efficient review that doesn’t burn the team.
- Artifacts: stored outputs and reasons so you can learn from failures.
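For the regression-detection ingredient, a first version can be a simple threshold check against a baseline run's aggregate score. The metric name and the 0.05 threshold are assumptions to adjust for your rubric:

```python
def mean_score(results: list[dict], metric: str) -> float:
    """Average one rubric metric across all cases in a run."""
    values = [r["scores"][metric] for r in results if metric in r["scores"]]
    return sum(values) / len(values) if values else 0.0

def check_regression(baseline: list[dict], candidate: list[dict],
                     metric: str = "correctness", max_drop: float = 0.05) -> bool:
    """Return True and report when the candidate run drops more than the allowed amount."""
    before, after = mean_score(baseline, metric), mean_score(candidate, metric)
    if before - after > max_drop:
        print(f"REGRESSION: {metric} fell from {before:.2f} to {after:.2f}")
        return True
    return False
```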
Section 28 map (28.1–28.5)
- 28.1 Build a tiny eval set that matters
- 28.2 Scoring outputs with rubrics
- 28.3 Pairwise comparisons for model/prompt tuning
- 28.4 Regression detection across prompt versions
- 28.5 Human review workflows that don’t waste time
Where to start
Start with 28.1 and build a tiny eval set of real questions, then layer on rubric scoring (28.2) and regression detection (28.4) as the set proves useful.