25.4 Evaluation: measuring answer quality and faithfulness

Goal: measure quality and catch regressions

Without evaluation, you can’t improve a RAG system reliably.

RAG evaluation has three layers:

  • Retrieval: did we fetch the right evidence?
  • Answer quality: did we answer the question clearly and correctly?
  • Faithfulness: are claims actually supported by citations?

Faithfulness is the core promise of RAG

A RAG system that isn’t faithful is worse than plain chat: it gives users false confidence while claiming it has “sources.”

Quality dimensions: retrieval, answer, faithfulness

Track these as separate signals:

  • Retrieval recall: at least one relevant chunk appears in top-k.
  • Answer correctness: the answer matches what the docs say.
  • Citation correctness: citations actually support each claim.
  • Abstention correctness: “not found” triggers when evidence is missing.
  • Latency/cost: performance regressions are real regressions.
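
To make the first signal concrete, here is a minimal sketch of per-question and aggregate retrieval recall, assuming each eval item carries labeled relevant chunk ids and you log the retrieved ids for every question (the field names are illustrative):

def retrieval_recall_at_k(relevant_ids: set[str], retrieved_ids: list[str], k: int = 5) -> bool:
    # Per-question recall@k: did at least one labeled-relevant chunk
    # make it into the top-k retrieved results?
    return any(cid in relevant_ids for cid in retrieved_ids[:k])


def recall_rate(items: list[dict], k: int = 5) -> float:
    # Aggregate over the eval set; each item carries "relevant_ids" and
    # the "retrieved_ids" logged for that question.
    hits = sum(
        retrieval_recall_at_k(set(item["relevant_ids"]), item["retrieved_ids"], k)
        for item in items
    )
    return hits / len(items)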

Building an eval set that matters

Start with 25–50 questions. Each item should include:

  • question: phrased like a real user,
  • expected outcome: answerable vs not found,
  • notes: what makes it tricky (exceptions, conflicts, ambiguity),
  • optional labels: known relevant doc_ids or chunk_ids.
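
For example, one answerable item and one "not found" item might look like this. This is a sketch only; the questions and field names are illustrative, not a required schema:

eval_set = [
    {
        "question": "Can I expense a hotel upgrade on a client trip?",
        "expected": "answerable",
        "notes": "Policy has an exception for client-facing travel.",
        "relevant_chunk_ids": ["travel-policy-012"],  # optional label
    },
    {
        "question": "What is the per-diem rate in Antarctica?",
        "expected": "not_found",  # the system should abstain
        "notes": "No document covers this; tests abstention.",
    },
]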

Include variety:

  • common questions (high frequency),
  • high-risk questions (high cost of wrong answer),
  • edge cases (negations, contradictions, version changes),
  • “not found” questions (must abstain).

Automated checks (fast wins)

You can automate a lot without perfect labels:

  • Schema validation: every output parses and matches schema.
  • Citation presence: every claim has a citation.
  • Chunk id validity: cited ids exist in the retrieved set.
  • Quote containment: cited quote text appears in the chunk text (strong faithfulness signal).
  • Not found rate: track how often the system abstains; sudden changes are suspicious.
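
A minimal sketch of the first three checks, assuming each answer is emitted as JSON with a list of claims shaped like {"text", "chunk_id", "quote"} (the field names are assumptions, not a fixed schema):

import json

def automated_checks(raw_output: str, retrieved_chunks: dict[str, str]) -> list[str]:
    # retrieved_chunks maps chunk_id -> chunk text for this question's retrieval results.
    failures = []

    # Schema validation: the output must parse and contain the expected fields.
    try:
        answer = json.loads(raw_output)
        claims = answer["claims"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return ["schema"]

    for claim in claims:
        # Citation presence: every claim must cite a chunk.
        if not claim.get("chunk_id"):
            failures.append("citation_missing")
        # Chunk id validity: the cited id must exist in the retrieved set.
        elif claim["chunk_id"] not in retrieved_chunks:
            failures.append("invalid_chunk_id")
    return failures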

Use “quote containment” as a cheap integrity check

If you require the model to include short quotes, you can verify those quotes exist in the chunk text. It doesn’t prove correctness, but it catches many hallucinations.
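
A sketch of that check, with light case and whitespace normalization so formatting differences don't cause false alarms:

def quote_is_contained(quote: str, chunk_text: str) -> bool:
    # Cheap faithfulness signal: the cited quote must appear (ignoring case
    # and whitespace differences) inside the chunk it points to.
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(quote) in normalize(chunk_text)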

Human review workflow (efficient, not painful)

Human review is necessary for correctness and nuance. Make it cheap:

  • Sample: review 10–20 answers per change, not everything.
  • Focus on faithfulness: do citations support claims?
  • Use a rubric: correct / partially correct / incorrect; faithful / unfaithful; clear / unclear.
  • Record failures: add them to the eval set as regression cases.
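
One way to keep the rubric lightweight is a flat record per sampled answer, appended to a log that feeds regression cases back into the eval set (a sketch; the fields are illustrative):

review_record = {
    "question_id": "q-017",               # which eval item was sampled
    "correctness": "partially correct",   # correct / partially correct / incorrect
    "faithfulness": "unfaithful",         # faithful / unfaithful
    "clarity": "clear",                   # clear / unclear
    "notes": "Right document cited, but the answer overstated the exception.",
    "add_to_eval_set": True,              # promote this failure to a regression case
}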

Regression testing across changes

Changes that should trigger a run of the eval set:

  • chunking rule changes,
  • embedding model changes,
  • retrieval/ranking logic changes,
  • prompt template changes,
  • corpus refreshes (large doc updates).

Treat each as a code change: run evals, review diffs, and only then ship.
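
A minimal sketch of the "review diffs" step, assuming each eval run produces a mapping from question id to pass/fail:

def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    # Questions that passed on the baseline run but fail on the candidate run.
    return [qid for qid, passed in baseline.items() if passed and not candidate.get(qid, False)]

# Example gate: block the change if any previously passing question now fails.
# failed = regressions(baseline_results, candidate_results)
# assert not failed, f"Regressions on: {failed}"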

Copy-paste prompts (grading and labeling)

Prompt: grade faithfulness and correctness

Grade the following answer for a grounded Q&A system.

You will receive:
- Question
- Answer (with citations and quotes)
- Source chunks (chunk_id + text)

Task:
1) Rate correctness: correct / partially correct / incorrect.
2) Rate faithfulness: faithful / partially faithful / unfaithful.
3) Identify any claims not supported by the cited quotes.
4) Suggest the smallest fix (retrieval, chunking, prompt) to prevent this failure.

Return a short rubric report.
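
If you want to run this grader programmatically, one way to assemble its input is sketched below. Here call_llm stands in for whatever model client you already use (a hypothetical placeholder, not a real API), and GRADING_PROMPT holds the prompt text above:

GRADING_PROMPT = "..."  # paste the grading prompt shown above

def grade_answer(call_llm, question: str, answer: str, chunks: dict[str, str]) -> str:
    # Bundle the question, the cited answer, and the retrieved chunks into a
    # single grading input, then ask the judge model for a rubric report.
    sources = "\n\n".join(f"[{cid}]\n{text}" for cid, text in chunks.items())
    grading_input = (
        f"{GRADING_PROMPT}\n\n"
        f"Question:\n{question}\n\n"
        f"Answer (with citations and quotes):\n{answer}\n\n"
        f"Source chunks:\n{sources}"
    )
    return call_llm(grading_input)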
