28.1 Build a tiny eval set that matters
Goal: 25 great eval cases (not 10,000 mediocre ones)
A tiny eval set is the fastest path to measurable improvement.
The goal is a set of cases that:
- represents real user intent,
- covers your riskiest failure modes,
- is small enough to run frequently,
- is strong enough to catch regressions.
A set of 25 is large enough to cover real variety and small enough that people will actually run it and review the diffs. Scale up later.
Principles of a high-leverage eval set
- Realistic: cases look like actual user requests (language, ambiguity, messiness).
- High-signal: each case teaches you something; avoid filler.
- Risk-weighted: include high-impact cases (where wrong answers are costly).
- Edge-aware: include tricky cases (exceptions, conflicts, negations, “not found”).
- Stable: cases don’t change every week; stability makes regressions detectable.
- Actionable: failures point to a fixable layer (retrieval, prompt, validator, UX policy).
Where eval cases come from (real > invented)
Best sources:
- Production logs: anonymized user queries (with consent and redaction).
- Support tickets: what people actually ask and what confused them.
- Internal stakeholders: “top 10 questions we always get.”
- Known incidents: failures you never want to repeat.
If you must invent cases, use a constraint: each invented case must correspond to a plausible real user scenario and be labeled as “synthetic.”
If eval cases are written in perfect prompt-engineer language, they won’t catch real failures. Real inputs are messy.
Coverage map: what you must include
Use a coverage map to ensure variety. For most AI features, include:
- Happy path: common cases that should work well.
- Ambiguity: questions that should trigger clarification.
- Not found: questions outside the corpus/task scope.
- Edge constraints: long inputs, short inputs, unusual formatting.
- High-risk: policy/compliance/security sensitive cases.
- Adversarial: injection-like attempts and format breakers.
- Conflicts (RAG): contradictory sources that should be surfaced.
If your system has multiple modes, include a few cases per mode (summarize, extract, answer-with-sources, etc.).
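A few lines of script can keep you honest about coverage before you finalize the 25. Below is a minimal sketch, assuming each case carries one category label matching the map above; the `coverage_gaps` helper and the snake_case labels are illustrative choices, not something this guide prescribes.

```python
# Coverage map from the list above; labels are illustrative assumptions.
COVERAGE_MAP = [
    "happy_path", "ambiguity", "not_found", "edge_constraints",
    "high_risk", "adversarial", "conflict",
]

def coverage_gaps(case_categories: list[str]) -> list[str]:
    """Return coverage-map categories that have no eval case assigned yet."""
    seen = set(case_categories)
    return [cat for cat in COVERAGE_MAP if cat not in seen]

# Example: a draft set that skips adversarial and conflict cases gets flagged.
print(coverage_gaps(["happy_path", "ambiguity", "not_found",
                     "edge_constraints", "high_risk"]))
# -> ['adversarial', 'conflict']
```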
Case format (what to store per case)
Store enough information to reproduce behavior. A practical case format includes:
- id: stable identifier.
- input: user query or task input.
- context: user role/tenant, app mode, constraints (if relevant).
- expected outcome type: answered / not_found / needs_clarification / conflict / refused.
- rubric focus: which dimensions matter (correctness, faithfulness, clarity).
- notes: why this case exists; what it’s testing.
- optional labels: expected relevant doc_ids/chunk_ids (for retrieval eval).
For RAG cases, you may also store:
- a frozen set of source chunks (for prompt-level eval), or
- expected relevant chunk ids (for retrieval-level eval).
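If you keep cases in code, the fields above map onto a small schema. Here is a minimal sketch using a Python dataclass; the class and field names (`EvalCase`, `expected_outcome`, `frozen_chunks`, and so on) are illustrative assumptions, not a required format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalCase:
    id: str                          # stable identifier
    input: str                       # user query or task input
    expected_outcome: str            # answered / not_found / needs_clarification / conflict / refused
    rubric_focus: list[str]          # e.g. ["correctness", "faithfulness"]
    notes: str                       # why this case exists; what it's testing
    context: Optional[dict] = None   # user role/tenant, app mode, constraints
    expected_chunk_ids: list[str] = field(default_factory=list)  # retrieval-level eval (RAG)
    frozen_chunks: list[str] = field(default_factory=list)       # prompt-level eval (RAG)
```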
Workflow: build, run, refine
- Collect 50 candidates: pull them from real sources (logs, tickets).
- Cluster them: group by intent and risk.
- Select 25: maximize coverage and risk-weighting.
- Write short notes: “what this catches.”
- Run your system: capture outputs and failures.
- Refine cases: remove redundant ones, add missing edges, keep the set small.
The point is iteration: run, inspect failures, adjust, and run again, so the eval set becomes your product’s “quality map.”
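Running the set can be as simple as a script that replays every case and writes one record per output, which makes diffs between runs easy to review. A minimal sketch follows, reusing the `EvalCase` fields sketched above; `run_system` is a placeholder for your own entry point, not a real API.

```python
import json
from datetime import datetime, timezone

def run_eval_set(cases: list, run_system, out_path: str = "eval_run.jsonl") -> None:
    """Replay every eval case through the system and capture one record per output."""
    with open(out_path, "w") as f:
        for case in cases:
            output = run_system(case.input, context=case.context)  # your own entry point
            record = {
                "case_id": case.id,
                "expected_outcome": case.expected_outcome,
                "output": output,
                "ran_at": datetime.now(timezone.utc).isoformat(),
            }
            f.write(json.dumps(record) + "\n")
```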
Maintaining the eval set over time
Rules that keep the eval set useful:
- Keep it curated: don’t let it grow without a reason.
- Promote failures: any serious production failure becomes an eval case.
- Retire obsolete cases: when product scope changes, remove cases that no longer apply.
- Track versions: record the prompt version, model version, and corpus version with every run; otherwise results from different runs can’t be compared. A minimal stamping sketch follows this list.
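One lightweight way to track versions is to stamp every result record with the versions that could explain a change. The field names and values below are assumptions for illustration, assuming one JSON record per result as in the runner sketch above.

```python
# Versions that can explain a change in results; values here are placeholders.
RUN_METADATA = {
    "prompt_version": "prompt-v12",
    "model_version": "model-2025-05",
    "corpus_version": "corpus-snapshot-2025-05",
    "eval_set_version": "eval-set-v3",
}

def stamp(record: dict, meta: dict = RUN_METADATA) -> dict:
    """Attach version metadata to a single eval result record."""
    return {**record, "meta": meta}
```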
Copy-paste prompts
Prompt: propose eval cases from requirements
Help me build a small eval set (25 cases) for an AI feature.
Feature description: ...
Primary user tasks: ...
Risks if wrong: ...
Possible system states: answered / not_found / needs_clarification / conflict / refused
Task:
1) Propose 25 eval cases as a table with:
- id
- input
- expected outcome type
- why it matters (risk/edge case)
2) Ensure coverage across: happy path, ambiguity, not found, edge constraints, high-risk, adversarial, and conflicts.