28. Evaluation Harnesses (Small to Serious)
Overview and links for this section of the guide.
What this section is for
Section 28 teaches you how to build evaluation harnesses that make AI feature quality measurable.
An eval harness answers questions like:
- “Did this prompt change make things better or worse?”
- “Which model is better for this workload?”
- “Are we regressing on important edge cases?”
- “Do humans agree our outputs are acceptable?”
Evals are not “big ML infra”
A great eval harness can be a folder of JSON files + a script + a rubric. What matters is that it represents real usage and you run it consistently.
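To make the "folder of JSON files + a script" shape concrete, here is a minimal loader sketch in Python. The folder layout and field names (`id`, `input`, `must_mention`, `tags`) are illustrative assumptions, not a required schema:

```python
import json
from pathlib import Path

# Assumed layout: one eval case per JSON file under evals/cases/, for example:
#   {
#     "id": "billing-refund-001",
#     "input": "Can I get a refund after 30 days?",
#     "must_mention": ["30-day window", "refund policy page"],
#     "tags": ["billing", "edge-case"]
#   }

def load_cases(case_dir: str = "evals/cases") -> list[dict]:
    """Load every JSON eval case, sorted by filename so runs are reproducible."""
    return [json.loads(p.read_text()) for p in sorted(Path(case_dir).glob("*.json"))]
```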
What “eval” means in practice
For most products, eval means:
- collect a small set of representative prompts/inputs,
- run the system (prompt + retrieval + model + validators),
- score outputs using a rubric (human, model-assisted, or both),
- track results over time and detect regressions.
It’s closer to product QA than it is to “training.”
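A minimal version of that loop, in Python, might look like the sketch below. `run_system` and `score_output` stand in for your own pipeline and rubric; they are assumptions, not a fixed API:

```python
import json
import time
from pathlib import Path

def run_eval(cases: list[dict], run_system, score_output,
             out_dir: str = "evals/runs") -> list[dict]:
    """Run the full system on each case, score it with the rubric, and store the run."""
    results = []
    for case in cases:
        output = run_system(case["input"])       # prompt + retrieval + model + validators
        scores = score_output(case, output)      # e.g. {"correctness": 1, "faithfulness": 1}
        results.append({"id": case["id"], "output": output, "scores": scores})
    # Persist the whole run so later runs can be diffed against it.
    run_path = Path(out_dir) / f"run-{int(time.time())}.json"
    run_path.parent.mkdir(parents=True, exist_ok=True)
    run_path.write_text(json.dumps(results, indent=2))
    return results
```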
A simple eval workflow you can start today
- Write 25 eval cases that represent real questions.
- Define a rubric (correctness, faithfulness, clarity, safety).
- Run your system for every case and store outputs.
- Review diffs when you change prompts/models (a sketch follows this list).
- Promote failures into regression cases you never want to repeat.
That is enough to make iterative improvement real.
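For the diff-review step, comparing two stored runs case by case can start as small as this sketch (it assumes the run format from the loop above):

```python
def diff_runs(baseline: list[dict], candidate: list[dict],
              metric: str = "correctness") -> None:
    """Print every case whose score changed between two runs of the same eval set."""
    base_by_id = {r["id"]: r for r in baseline}
    for result in candidate:
        old = base_by_id.get(result["id"])
        if old is None:
            continue  # new case, nothing to compare against yet
        before, after = old["scores"].get(metric), result["scores"].get(metric)
        if before != after:
            print(f"{result['id']}: {metric} {before} -> {after}")
```

Cases that get worse are the ones you promote into permanent regression cases.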
Ingredients of a good eval harness
- Eval set: small, curated, representative.
- Scoring: rubrics, pairwise comparisons, and/or automated checks.
- Versioning: prompt versions, model versions, corpus versions.
- Regression detection: alerts when key metrics change (sketched after this list).
- Human review loop: efficient review that doesn’t burn the team.
- Artifacts: stored outputs and reasons so you can learn from failures.
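For the regression-detection ingredient, a first version can be a simple threshold check against a baseline run's aggregate score. The metric name and the 0.05 threshold are assumptions to adjust for your rubric:

```python
def mean_score(results: list[dict], metric: str) -> float:
    """Average one rubric metric across all cases in a run."""
    values = [r["scores"][metric] for r in results if metric in r["scores"]]
    return sum(values) / len(values) if values else 0.0

def check_regression(baseline: list[dict], candidate: list[dict],
                     metric: str = "correctness", max_drop: float = 0.05) -> bool:
    """Return True and report when the candidate run drops more than the allowed amount."""
    before, after = mean_score(baseline, metric), mean_score(candidate, metric)
    if before - after > max_drop:
        print(f"REGRESSION: {metric} fell from {before:.2f} to {after:.2f}")
        return True
    return False
```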
Section 28 map (28.1–28.5)
- 28.1 Build a tiny eval set that matters
- 28.2 Scoring outputs with rubrics
- 28.3 Pairwise comparisons for model/prompt tuning
- 28.4 Regression detection across prompt versions
- 28.5 Human review workflows that don’t waste time
Where to start
Start with 28.1 and build a tiny eval set of real questions, then layer on rubric scoring (28.2) and regression detection (28.4) as the set proves useful.