27.2 Golden tests for structured outputs

Goal: lock in contracts with “golden” examples

Golden tests are “known good” examples that you keep forever. They prevent accidental drift when you change prompts, models, validators, or formatting.

For structured outputs, golden tests are extremely high leverage because they anchor:

  • schema shape,
  • field presence,
  • normalization rules,
  • edge-case behavior.

What golden tests are (and what they’re not)

Golden tests are:

  • fixtures: representative inputs (and sometimes sources) stored as files.
  • expected outputs: the canonical JSON (or normalized form) you expect (a minimal sketch follows these lists).
  • diff-driven review: when something changes, you review the diff intentionally.

Golden tests are not:

  • a replacement for evaluation (they cover invariants and key cases, not overall quality),
  • an excuse to snapshot huge blobs without reviewing them.
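
To make the fixture/expected-output pairing concrete, here is a minimal sketch in Python. The directory layout and the extract_invoice call are placeholders for wherever you store cases and however you invoke your pipeline.

```python
# A golden case is just two files kept side by side, for example:
#   goldens/invoice_happy_path/input.txt      -- a representative messy input
#   goldens/invoice_happy_path/expected.json  -- the canonical expected output
import json
from pathlib import Path

def test_invoice_happy_path():
    case_dir = Path("goldens/invoice_happy_path")
    raw_input = (case_dir / "input.txt").read_text()
    expected = json.loads((case_dir / "expected.json").read_text())

    # extract_invoice stands in for whatever your pipeline entry point is.
    actual = extract_invoice(raw_input)

    # Exact comparison assumes a deterministic pipeline; tolerant comparisons
    # are covered under "Handling non-determinism" below.
    assert actual == expected
```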

Goldens encode your product decisions

When you choose how to handle missing fields, conflicts, “not found,” or error outputs, capture it in golden cases. That’s how the decision survives future changes.

Why golden tests work well for LLM features

LLM behavior can drift due to:

  • prompt tweaks,
  • model upgrades,
  • retrieval changes,
  • validator changes,
  • post-processing changes.

Golden tests catch “we didn’t mean to change this” problems early.

Designing golden tests for structured output

A good golden test suite covers (a loading sketch follows this list):

  • happy path: common input that should work.
  • missing data: expected null fields and missing_info behavior.
  • invalid input: empty strings, garbage text, wrong language, huge input.
  • constraint edges: max bullet counts, max length, required citations.
  • refusal/escalation: restricted content or unsafe requests map to safe outputs.
  • conflicts: contradictory sources produce conflict-aware outputs.
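
One way to keep all of these categories in a single suite is to store each case as a small file and parametrize one test over the directory. A sketch with pytest, assuming each file holds an input block and an expected block; run_pipeline and the file names are illustrative.

```python
import json
from pathlib import Path

import pytest

# One JSON file per case: happy_path.json, missing_data.json, invalid_input.json,
# constraint_edges.json, refusal.json, conflicts.json, ...
GOLDEN_DIR = Path("goldens")
CASES = sorted(GOLDEN_DIR.glob("*.json"))

@pytest.mark.parametrize("case_path", CASES, ids=lambda p: p.stem)
def test_golden_case(case_path):
    case = json.loads(case_path.read_text())
    actual = run_pipeline(case["input"])   # run_pipeline is a placeholder for your system
    assert actual == case["expected"]      # or a tolerant comparison, as discussed below
```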

For RAG, each golden case often includes:

  • question,
  • retrieved sources (chunk_id + text),
  • expected output JSON (including citations).

This makes your grounding contract testable.
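
As an illustration, such a case might be stored as a single JSON file along these lines (shown here as a Python literal; field names like used_chunk_ids and citations should follow whatever your own contract defines):

```python
# goldens/refund_window.json, shown as a Python literal for readability.
golden_case = {
    "question": "What is the refund window for annual plans?",
    "sources": [
        {"chunk_id": "pricing-doc#3",
         "text": "Annual plans can be refunded within 30 days of purchase."},
        {"chunk_id": "pricing-doc#7",
         "text": "Monthly plans are non-refundable after the billing date."},
    ],
    "expected": {
        "not_found": False,
        "answer": "Annual plans can be refunded within 30 days of purchase.",
        "used_chunk_ids": ["pricing-doc#3"],
        "citations": [
            {"chunk_id": "pricing-doc#3",
             "quote": "refunded within 30 days of purchase"},
        ],
    },
}
```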

Handling non-determinism (tolerant assertions)

If your system is truly deterministic (temperature 0, stable model, stable retrieval), you can compare full JSON outputs exactly.

In practice, you often need tolerance (a sketch of these techniques follows the list):

  • Canonicalization: sort keys, sort arrays by stable fields, normalize whitespace.
  • Partial matching: assert required fields and invariants, not exact phrasing.
  • Field-level assertions: compare only certain fields (e.g., not_found, used_chunk_ids).
  • Schema-first: validate schema strictly and then do targeted value checks.
  • Citation checks: verify cited chunk ids exist and quotes appear in chunk text.
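
A minimal sketch of canonicalization plus field-level assertions, assuming an output shape with answer, not_found, used_chunk_ids, and citations fields; adapt the helpers to your own schema.

```python
import json

def canonicalize(out: dict) -> dict:
    """Normalize fields so semantically equal outputs compare (and diff) cleanly."""
    out = dict(out)
    if "used_chunk_ids" in out:
        out["used_chunk_ids"] = sorted(out["used_chunk_ids"])          # stable array order
    if "citations" in out:
        out["citations"] = sorted(out["citations"], key=lambda c: c["chunk_id"])
    if isinstance(out.get("answer"), str):
        out["answer"] = " ".join(out["answer"].split())                # normalize whitespace
    return out

def dump_golden(out: dict) -> str:
    # Sorted keys keep on-disk goldens stable and their diffs easy to review.
    return json.dumps(canonicalize(out), sort_keys=True, indent=2)

def assert_contract_fields(actual: dict, expected: dict) -> None:
    actual, expected = canonicalize(actual), canonicalize(expected)
    # Field-level assertions: pin the contract fields, tolerate phrasing differences.
    assert actual["not_found"] == expected["not_found"]
    assert actual["used_chunk_ids"] == expected["used_chunk_ids"]
```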

Don’t tolerate away the meaning

If your tolerance is so broad that a wrong answer still “passes,” you’re not testing. Use golden tests for contracts and critical behaviors; use evals for quality.

A safe update workflow (avoid “approve everything”)

Golden tests only work if updates are intentional.

A safe update workflow (a pytest sketch for step 1 follows the steps):

  1. Changes produce diffs: golden outputs change in a PR.
  2. Reviewer must justify changes: why is it better, and what tradeoff changed?
  3. Update requires a ticket or note: tie changes to a decision (prompt change, bug fix, new requirement).
  4. Keep goldens small: avoid giant outputs that no one reviews.
  5. Pin the change: when you accept a new behavior, update the spec/rubric too.
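
One way to enforce step 1 is an explicit opt-in flag: tests always compare against the files on disk, and regeneration only happens when a developer asks for it. A pytest sketch; the --update-goldens flag, run_pipeline, and the paths are illustrative, and dump_golden comes from the earlier canonicalization sketch.

```python
import pytest
from pathlib import Path

# In conftest.py: an explicit opt-in flag, so goldens never update silently.
def pytest_addoption(parser):
    parser.addoption(
        "--update-goldens",
        action="store_true",
        default=False,
        help="Rewrite expected golden files instead of asserting against them.",
    )

@pytest.fixture
def update_goldens(request):
    return request.config.getoption("--update-goldens")

# In a test module: regeneration happens only on request, so every accepted
# change lands as a reviewable diff on the golden files in the PR.
def test_refund_window(update_goldens):
    case_dir = Path("goldens/refund_window")
    actual = run_pipeline((case_dir / "input.txt").read_text())   # placeholder pipeline call
    expected_path = case_dir / "expected.json"
    if update_goldens:
        expected_path.write_text(dump_golden(actual))
    else:
        assert dump_golden(actual) == expected_path.read_text()
```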

Golden tests are “diff-only” discipline for behavior

If you force small, reviewable golden diffs, you prevent accidental drift and keep the system stable under iteration.

Example golden test cases (patterns)

Pattern: JSON extraction

  • Input: messy text.
  • Expected: valid JSON with required keys; unknown fields are null.
  • Assertions: parse + schema + required keys + enum constraints.
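
Those assertions might look like the following sketch, using the jsonschema package for strict validation; the schema, field names, and placeholders are illustrative.

```python
import json
from jsonschema import validate  # pip install jsonschema

EXTRACTION_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["customer_name", "status", "amount"],
    "properties": {
        "customer_name": {"type": ["string", "null"]},   # unknown fields are null, never omitted
        "status": {"enum": ["active", "pending", "cancelled"]},
        "amount": {"type": ["number", "null"]},
    },
}

def test_extraction_from_messy_text():
    raw = run_extraction(MESSY_INPUT)                     # placeholders for your pipeline and fixture
    parsed = json.loads(raw)                              # 1. output parses as JSON
    validate(instance=parsed, schema=EXTRACTION_SCHEMA)   # 2. schema, required keys, enum constraints
```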

Pattern: grounded answer with citations

  • Input: question + 3 source chunks.
  • Expected: each claim cites chunk ids; quotes exist in chunk text.
  • Assertions: citation presence + quote containment + not-found behavior when evidence missing.
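
The citation checks reduce to membership and substring tests against the case's own sources. A sketch, assuming the output shape from the RAG fixture above.

```python
def assert_grounded(output: dict, sources: list[dict]) -> None:
    chunks = {s["chunk_id"]: s["text"] for s in sources}
    if output["not_found"]:
        # Not-found behavior: no evidence means no answer and no citations.
        assert output["citations"] == []
        return
    assert output["citations"], "grounded answers must cite at least one chunk"
    for cite in output["citations"]:
        assert cite["chunk_id"] in chunks                    # cited ids exist in the sources
        assert cite["quote"] in chunks[cite["chunk_id"]]     # quotes appear in the chunk text
```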

Pattern: refusal/escalation

  • Input: unsafe or restricted request.
  • Expected: status="refused" or "restricted" with safe alternative.
  • Assertions: no secrets, no disallowed content, correct routing fields populated.
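
A sketch of the corresponding assertions; the status values match the pattern above, while safe_alternative, route_to, and the secret patterns are placeholders for your own contract and policy checks.

```python
import json
import re

# Crude placeholder patterns; real suites usually pair this with a blocklist
# or a separate policy check for disallowed content.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
    re.compile(r"BEGIN (RSA|EC) PRIVATE KEY"),
]

def assert_safe_refusal(output: dict) -> None:
    assert output["status"] in {"refused", "restricted"}
    assert output.get("safe_alternative")                  # a safe alternative is offered
    assert output.get("route_to")                          # routing fields are populated
    serialized = json.dumps(output)
    assert not any(p.search(serialized) for p in SECRET_PATTERNS)   # nothing secret leaks through
```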
