28.5 Human review workflows that don't waste time

Goal: human review that scales without burning out the team

Human review is expensive. If it’s painful, teams stop doing it. If they stop doing it, quality drifts and users find failures.

Your goal is a workflow that is:

  • efficient: minimal time per decision,
  • consistent: reviewers align on what “good” means,
  • actionable: failures become fixes and eval cases,
  • safe: privacy and permissions are respected.

Why humans are still required

Humans are still needed for:

  • correctness in nuanced domains: policy interpretation, edge cases.
  • faithfulness judgments: does the citation truly support the claim?
  • product usefulness: would a user find this helpful?
  • tone and brand constraints: especially for customer-facing outputs.
  • high-risk decisions: anything with safety, legal, or security impact.

Automated checks can catch structural problems and obvious failures. Humans catch the subtle ones.

Design principles for efficient review

  • Review diffs, not raw runs: focus attention where behavior changed (see the sketch after this list).
  • Review a sample: you don’t need to review everything every time.
  • Use a short rubric: 3–5 dimensions; anchor definitions.
  • Make “unclear” a valid outcome: reviewers can request clarifications or label as “needs product decision.”
  • Close the loop: every review produces an action (ship, iterate, add case, update docs).
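
A minimal sketch of that selection step, assuming each run is a dict keyed by case id with "output" and "verdict" fields; the field names and the sample rate are illustrative, not a fixed format.

```python
import random

def select_for_review(old_run, new_run, sample_rate=0.1, seed=0):
    """Pick cases for human review: everything whose behavior changed between
    two runs, plus a small random sample of unchanged cases as a drift check.
    old_run / new_run: dict of case_id -> {"output": ..., "verdict": ...}."""
    changed, unchanged = [], []
    for case_id, new in new_run.items():
        old = old_run.get(case_id)
        if old is None or old["output"] != new["output"] or old.get("verdict") != new.get("verdict"):
            changed.append(case_id)      # new case, new output, or new automated verdict
        else:
            unchanged.append(case_id)
    rng = random.Random(seed)
    k = min(len(unchanged), max(1, round(len(unchanged) * sample_rate)))
    return changed + rng.sample(unchanged, k)
```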

Review should create artifacts

The output of review is not a feeling. It’s a decision and a set of labeled failures that become regression cases.
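
As a concrete shape for that artifact, here is a minimal sketch of a per-case record, assuming you serialize it next to the run; the field names, decision values, and example data are assumptions, not a fixed schema.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ReviewArtifact:
    """One reviewed case: the decision plus everything that should outlive the review."""
    case_id: str
    decision: str                               # e.g. "ship", "iterate", "rollback", "needs product decision"
    failure_labels: list[str] = field(default_factory=list)   # e.g. ["unfaithful citation"]
    reviewer_notes: str = ""
    promote_to_eval_set: bool = False           # worst failures become regression cases

artifact = ReviewArtifact(
    case_id="case-0412",
    decision="iterate",
    failure_labels=["unfaithful citation"],
    reviewer_notes="Citation does not support the numeric claim.",
    promote_to_eval_set=True,
)
record = asdict(artifact)   # ready to log, or to append to the regression set
```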

A practical review workflow (queue + rubric + decisions)

One reliable pattern:

  1. Queue creation: collect evaluation runs or production samples into a review queue.
  2. Pre-filter: automated gates mark outputs as “invalid” (schema/citation failures).
  3. Human scoring: reviewers score remaining outputs with a short rubric.
  4. Decision stage: decide ship/rollback/iterate based on results and gates.
  5. Promotion stage: add the worst failures to the eval set and fuzz corpus.
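
A minimal sketch of steps 2–5 as one pass over a queue, assuming you already have an automated gate and a rubric-scoring step; the function names, score scale, and threshold below are placeholders, not a specific library.

```python
def run_review_queue(candidates, automated_gate, score_with_rubric, min_avg_score=1.5):
    """candidates: output records from an eval run or production sample.
    automated_gate(case) -> (passed, reason); score_with_rubric(case) -> dict of
    rubric dimension -> 0/1/2. Both are stand-ins for whatever you already use."""
    invalid, queue = [], []
    # 2. Pre-filter: schema/citation failures never reach a human.
    for case in candidates:
        passed, reason = automated_gate(case)
        (queue if passed else invalid).append((case, reason))
    # 3. Human scoring with the short rubric.
    failures = []
    for case, _ in queue:
        scores = score_with_rubric(case)          # e.g. {"correct": 2, "faithful": 1, "useful": 2}
        if sum(scores.values()) / len(scores) < min_avg_score:
            failures.append(case)                 # threshold is illustrative, not prescriptive
    # 4. Decision stage: gates plus rubric results drive ship/iterate.
    decision = "ship" if not invalid and not failures else "iterate"
    # 5. Promotion stage: these become regression and fuzz candidates.
    return decision, [case for case, _ in invalid] + failures
```

The gate and the rubric step are whatever you already run; the point is that invalid outputs never consume reviewer time and every failure has a path into the eval set.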

To reduce time, prioritize:

  • cases that regressed,
  • high-risk cases,
  • cases with low confidence or conflicts,
  • cases flagged by automated checks.
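
One way to encode that ordering is a simple sort key over per-case flags; the flag names are assumptions about what your run metadata records.

```python
def review_priority(case):
    """Sort key for the queue: regressions first, then risk, then uncertainty, then flags.
    Each field is a boolean the run metadata is assumed to carry."""
    return (
        case.get("regressed", False),        # passed last run, fails now
        case.get("high_risk", False),        # safety / legal / security impact
        case.get("low_confidence", False) or case.get("reviewer_conflict", False),
        bool(case.get("auto_flags")),        # tripped an automated check
    )

# queue.sort(key=review_priority, reverse=True)   # highest-priority cases reviewed first
```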

Roles and responsibilities

Small teams can combine roles, but the responsibilities still exist:

  • Owner: decides what ships; owns the rubric and gates.
  • Reviewers: score outputs consistently and record failure notes.
  • Engineer: turns failures into fixes (prompt, retrieval, validation, UX policies).
  • Domain expert (optional): reviews high-risk or nuanced cases (policy/legal/security).

Calibration and inter-reviewer consistency

Review quality improves with calibration:

  • Calibration session: reviewers score the same 5 cases, then discuss differences.
  • Anchor examples: keep a few “this is a 0/1/2” examples per rubric dimension.
  • Disagreement handling: use a tie-breaker or escalate to owner for high-impact decisions.
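
A minimal sketch for checking how closely two reviewers line up on the shared calibration cases, assuming scores are recorded as dimension -> 0/1/2 per case; at this scale a plain agreement rate is usually enough.

```python
def agreement_rate(scores_a, scores_b):
    """Fraction of (case, dimension) pairs where two reviewers gave the same score.
    scores_a / scores_b: dict of case_id -> dict of dimension -> score."""
    matches = total = 0
    for case_id, dims_a in scores_a.items():
        dims_b = scores_b.get(case_id, {})
        for dim, score in dims_a.items():
            if dim in dims_b:
                total += 1
                matches += (score == dims_b[dim])
    return matches / total if total else 0.0

# Two reviewers on the same calibration cases:
a = {"c1": {"correct": 2, "faithful": 1}, "c2": {"correct": 1, "faithful": 1}}
b = {"c1": {"correct": 2, "faithful": 2}, "c2": {"correct": 1, "faithful": 1}}
print(agreement_rate(a, b))   # 0.75 -> discuss the dimension that diverged
```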

Prefer pairwise for subjective decisions

If reviewers struggle with absolute scores, use pairwise comparisons (28.3). It’s often faster and more consistent.
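
The bookkeeping for pairwise review can stay light. Here is a sketch of win-rate aggregation over recorded votes; the record shape is an assumption, not the format described in 28.3.

```python
from collections import Counter

def win_rates(votes):
    """votes: list of (case_id, winner) where winner is "A", "B", or "tie"
    for a comparison of variants A and B on the same input."""
    counts = Counter(winner for _, winner in votes)
    decided = counts["A"] + counts["B"]
    return {
        "A": counts["A"] / decided if decided else 0.0,
        "B": counts["B"] / decided if decided else 0.0,
        "tie_share": counts["tie"] / len(votes) if votes else 0.0,
    }

print(win_rates([("c1", "A"), ("c2", "A"), ("c3", "tie"), ("c4", "B")]))
# A wins 2 of 3 decided comparisons; a quarter of the votes were ties.
```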

Privacy and sensitive data handling

Human review can accidentally spread sensitive data. Build safeguards:

  • Redaction: remove PII/secrets before cases enter review.
  • Access control: reviewers must have permission to see the data.
  • Retention: keep review artifacts only as long as needed.
  • Auditability: log access to sensitive review queues when required.
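
A minimal sketch of a redaction pass before cases enter the queue; the patterns are illustrative placeholders, not a substitute for a dedicated PII/secret scanner.

```python
import re

# Illustrative patterns only; a real deployment should use a proper scanner.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD_NUMBER]"),
    (re.compile(r"(?i)\b(api[_-]?key|secret|token)\b\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
]

def redact(text: str) -> str:
    """Apply each pattern in order so reviewers never see the raw values."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text

print(redact("Contact jane@example.com, api_key=sk-123abc"))
# Contact [EMAIL], api_key=[REDACTED]
```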

Where to go next