28.5 Human review workflows that don't waste time

Goal: human review that scales without burning out the team

Human review is expensive. If it’s painful, teams stop doing it. If they stop doing it, quality drifts and users find failures.

Your goal is a workflow that is:

  • efficient: minimal time per decision,
  • consistent: reviewers align on what “good” means,
  • actionable: failures become fixes and eval cases,
  • safe: privacy and permissions are respected.

Why humans are still required

Humans are still needed for:

  • correctness in nuanced domains: policy interpretation, edge cases.
  • faithfulness judgments: does the citation truly support the claim?
  • product usefulness: would a user find this helpful?
  • tone and brand constraints: especially for customer-facing outputs.
  • high-risk decisions: anything with safety, legal, or security impact.

Automated checks can catch structural problems and obvious failures. Humans catch the subtle ones.

Design principles for efficient review

  • Review diffs, not raw runs: focus attention where behavior changed (see the sketch after this list).
  • Review a sample: you don’t need to review everything every time.
  • Use a short rubric: 3–5 dimensions; anchor definitions.
  • Make “unclear” a valid outcome: reviewers can request clarifications or label as “needs product decision.”
  • Close the loop: every review produces an action (ship, iterate, add case, update docs).
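
A minimal sketch of that selection step, assuming each run is a dict keyed by case id with "output" and "verdict" fields; the field names and the sample rate are illustrative, not a fixed format.

```python
import random

def select_for_review(old_run, new_run, sample_rate=0.1, seed=0):
    """Pick cases for human review: everything whose behavior changed between
    two runs, plus a small random sample of unchanged cases as a drift check.
    old_run / new_run: dict of case_id -> {"output": ..., "verdict": ...}."""
    changed, unchanged = [], []
    for case_id, new in new_run.items():
        old = old_run.get(case_id)
        if old is None or old["output"] != new["output"] or old.get("verdict") != new.get("verdict"):
            changed.append(case_id)      # new case, new output, or new automated verdict
        else:
            unchanged.append(case_id)
    rng = random.Random(seed)
    k = min(len(unchanged), max(1, round(len(unchanged) * sample_rate)))
    return changed + rng.sample(unchanged, k)
```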

Review should create artifacts

The output of review is not a feeling. It’s a decision and a set of labeled failures that become regression cases.
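
As a concrete shape for that artifact, here is a minimal sketch of a per-case record, assuming you serialize it next to the run; the field names, decision values, and example data are assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ReviewArtifact:
    """One reviewed case: the decision plus everything that should outlive the review."""
    case_id: str
    decision: str                               # e.g. "ship", "iterate", "rollback", "needs product decision"
    failure_labels: list[str] = field(default_factory=list)   # e.g. ["unfaithful citation"]
    reviewer_notes: str = ""
    promote_to_eval_set: bool = False           # worst failures become regression cases

artifact = ReviewArtifact(
    case_id="case-0412",
    decision="iterate",
    failure_labels=["unfaithful citation"],
    reviewer_notes="Citation does not support the numeric claim.",
    promote_to_eval_set=True,
)
record = asdict(artifact)   # ready to log, or to append to the regression set
```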

A practical review workflow (queue + rubric + decisions)

One reliable pattern:

  1. Queue creation: collect evaluation runs or production samples into a review queue.
  2. Pre-filter: automated gates mark outputs as “invalid” (schema/citation failures).
  3. Human scoring: reviewers score remaining outputs with a short rubric.
  4. Decision stage: decide ship/rollback/iterate based on results and gates.
  5. Promotion stage: add the worst failures to the eval set and fuzz corpus.
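
A minimal sketch of steps 2–5 as one pass over a queue, assuming you already have an automated gate and a rubric-scoring step; the function names, score scale, and threshold below are placeholders, not a specific library.

```python
def run_review_queue(candidates, automated_gate, score_with_rubric, min_avg_score=1.5):
    """candidates: output records from an eval run or production sample.
    automated_gate(case) -> (passed, reason); score_with_rubric(case) -> dict of
    rubric dimension -> 0/1/2. Both are stand-ins for whatever you already use."""
    invalid, queue = [], []
    # 2. Pre-filter: schema/citation failures never reach a human.
    for case in candidates:
        passed, reason = automated_gate(case)
        (queue if passed else invalid).append((case, reason))
    # 3. Human scoring with the short rubric.
    failures = []
    for case, _ in queue:
        scores = score_with_rubric(case)          # e.g. {"correct": 2, "faithful": 1, "useful": 2}
        if sum(scores.values()) / len(scores) < min_avg_score:
            failures.append(case)                 # threshold is illustrative, not prescriptive
    # 4. Decision stage: gates plus rubric results drive ship/iterate.
    decision = "ship" if not invalid and not failures else "iterate"
    # 5. Promotion stage: these become regression and fuzz candidates.
    return decision, [case for case, _ in invalid] + failures
```

The gate and the rubric step are whatever you already run; the point is that invalid outputs never consume reviewer time and every failure has a path into the eval set.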

To reduce time, prioritize:

  • cases that regressed,
  • high-risk cases,
  • cases with low confidence or conflicts,
  • cases flagged by automated checks.
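
One way to encode that ordering is a simple sort key over per-case flags; the flag names are assumptions about what your run metadata records.

```python
def review_priority(case):
    """Sort key for the queue: regressions first, then risk, then uncertainty, then flags.
    Each field is a boolean the run metadata is assumed to carry."""
    return (
        case.get("regressed", False),        # passed last run, fails now
        case.get("high_risk", False),        # safety / legal / security impact
        case.get("low_confidence", False) or case.get("reviewer_conflict", False),
        bool(case.get("auto_flags")),        # tripped an automated check
    )

# queue.sort(key=review_priority, reverse=True)   # highest-priority cases reviewed first
```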

Roles and responsibilities

Small teams can combine roles, but the responsibilities still exist:

  • Owner: decides what ships; owns the rubric and gates.
  • Reviewers: score outputs consistently and record failure notes.
  • Engineer: turns failures into fixes (prompt, retrieval, validation, UX policies).
  • Domain expert (optional): reviews high-risk or nuanced cases (policy/legal/security).

Calibration and inter-reviewer consistency

Review quality improves with calibration:

  • Calibration session: reviewers score the same 5 cases, then discuss differences.
  • Anchor examples: keep a few “this is a 0/1/2” examples per rubric dimension.
  • Disagreement handling: use a tie-breaker or escalate to owner for high-impact decisions.
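
A minimal sketch for checking how closely two reviewers line up on the shared calibration cases, assuming scores are recorded as dimension -> 0/1/2 per case; at this scale a plain agreement rate is usually enough.

```python
def agreement_rate(scores_a, scores_b):
    """Fraction of (case, dimension) pairs where two reviewers gave the same score.
    scores_a / scores_b: dict of case_id -> dict of dimension -> score."""
    matches = total = 0
    for case_id, dims_a in scores_a.items():
        dims_b = scores_b.get(case_id, {})
        for dim, score in dims_a.items():
            if dim in dims_b:
                total += 1
                matches += (score == dims_b[dim])
    return matches / total if total else 0.0

# Two reviewers on the same calibration cases:
a = {"c1": {"correct": 2, "faithful": 1}, "c2": {"correct": 1, "faithful": 1}}
b = {"c1": {"correct": 2, "faithful": 2}, "c2": {"correct": 1, "faithful": 1}}
print(agreement_rate(a, b))   # 0.75 -> discuss the dimension that diverged
```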

Prefer pairwise for subjective decisions

If reviewers struggle with absolute scores, use pairwise comparisons (28.3). It’s often faster and more consistent.
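
The bookkeeping for pairwise review can stay light. Here is a sketch of win-rate aggregation over recorded votes; the record shape is an assumption, not the format described in 28.3.

```python
from collections import Counter

def win_rates(votes):
    """votes: list of (case_id, winner) where winner is "A", "B", or "tie"
    for a comparison of variants A and B on the same input."""
    counts = Counter(winner for _, winner in votes)
    decided = counts["A"] + counts["B"]
    return {
        "A": counts["A"] / decided if decided else 0.0,
        "B": counts["B"] / decided if decided else 0.0,
        "tie_share": counts["tie"] / len(votes) if votes else 0.0,
    }

print(win_rates([("c1", "A"), ("c2", "A"), ("c3", "tie"), ("c4", "B")]))
# A wins 2 of 3 decided comparisons; a quarter of the votes were ties.
```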

Privacy and sensitive data handling

Human review can accidentally spread sensitive data. Build safeguards:

  • Redaction: remove PII/secrets before cases enter review.
  • Access control: reviewers must have permission to see the data.
  • Retention: keep review artifacts only as long as needed.
  • Auditability: log access to sensitive review queues when required.
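
A minimal sketch of a redaction pass before cases enter the queue; the patterns are illustrative placeholders, not a substitute for a dedicated PII/secret scanner.

```python
import re

# Illustrative patterns only; a real deployment should use a proper scanner.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD_NUMBER]"),
    (re.compile(r"(?i)\b(api[_-]?key|secret|token)\b\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
]

def redact(text: str) -> str:
    """Apply each pattern in order so reviewers never see the raw values."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

print(redact("Contact jane@example.com, api_key=sk-123abc"))
# Contact [EMAIL], api_key=[REDACTED]
```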

Where to go next