26.5 Auditing: storing which chunks influenced answers


Goal: make every answer explainable later

If a user asks “why did it say that?”, you should be able to answer with evidence:

  • which sources were retrieved,
  • which were included in the prompt,
  • which were cited in the answer,
  • which model/prompt versions were used.

This is how you debug issues and build trust.

Why auditing matters (beyond compliance)

Auditing is valuable even if you don’t have compliance requirements:

  • Debugging: reproduce failures and fix the right layer (retrieval vs prompt vs corpus).
  • Quality improvement: find common “not found” gaps and update docs.
  • Regression detection: identify when a new prompt version changed behavior.
  • Security: detect suspicious queries or attempted prompt injection patterns.

Audit logs can become a privacy liability

Prefer logging identifiers and hashed values over raw content, and be deliberate about whether you store raw user questions and source text at all.
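
For example, here is a minimal sketch of keyed hashing for identifiers before they reach the audit store; hash_identifier and the AUDIT_ID_PEPPER secret are illustrative names, not part of any particular framework:

import hashlib
import hmac
import os

# Assumed setup: a secret "pepper" kept outside the audit store (e.g. in a
# secrets manager), so hashed identifiers cannot be reversed from the logs.
AUDIT_ID_PEPPER = os.environ.get("AUDIT_ID_PEPPER", "").encode()

def hash_identifier(raw_id: str) -> str:
    """Return a stable, non-reversible token for a user or tenant id."""
    return hmac.new(AUDIT_ID_PEPPER, raw_id.encode(), hashlib.sha256).hexdigest()

# The audit record stores the token, never the raw id.
actor = {"user_id": hash_identifier("user-123"), "tenant_id": hash_identifier("acme")}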

What to log (minimum viable)

At minimum, log:

  • request metadata: timestamp, request id, user/tenant id (or hashed), environment.
  • question: raw or redacted question text.
  • retrieval: query variants, filters used, top-k chunk ids and scores.
  • prompt version: identifier for the prompt template and rules.
  • model version: model name, parameters (temperature), safety settings.
  • answer: final JSON output plus validation status.
  • citations: which chunk ids were cited.

Optionally log:

  • token usage and latency breakdown (a timing sketch follows this list),
  • reranking decisions,
  • “not found” missing_info fields,
  • conflict detection outputs.
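
If you do record latency, a per-stage breakdown is more useful than a single total. A minimal sketch, with placeholder sleeps standing in for real retrieval, reranking, and generation calls:

import time
from contextlib import contextmanager

timings_ms: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = round((time.perf_counter() - start) * 1000, 1)

# Placeholder stages; replace the sleeps with your own pipeline calls.
with timed("retrieval"):
    time.sleep(0.05)
with timed("rerank"):
    time.sleep(0.02)
with timed("generation"):
    time.sleep(0.30)

print(timings_ms)  # e.g. {"retrieval": 51.2, "rerank": 20.4, "generation": 301.0}

The timings_ms dict can then be attached to the optional part of the audit record.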

Audit log schema (practical)

{
  "request_id": string,
  "timestamp": string,
  "actor": { "user_id": string|null, "tenant_id": string|null },
  "question": { "text": string, "redacted": boolean },
  "retrieval": {
    "filters": object,
    "queries": string[],
    "candidates": [{ "chunk_id": string, "score": number, "doc_version": string|null }]
  },
  "generation": {
    "model": string,
    "model_version": string|null,
    "prompt_version": string,
    "temperature": number|null
  },
  "result": {
    "status": "answered" | "not_found" | "needs_clarification" | "conflict" | "restricted" | "refused" | "error",
    "validated": boolean,
    "answer_json": object|null,
    "cited_chunk_ids": string[]
  }
}
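
To make the schema concrete, here is a minimal sketch that assembles one record and appends it to a JSON-lines file. The write_audit_record name and the audit.log.jsonl path are illustrative, not a prescribed API; a production system would write to a secured store with access control instead:

import json
import uuid
from datetime import datetime, timezone

AUDIT_LOG_PATH = "audit.log.jsonl"  # illustrative; use your own secure store

def write_audit_record(question, candidates, generation, result, actor=None):
    """Append one audit record (matching the schema above) as a JSON line."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor or {"user_id": None, "tenant_id": None},
        "question": {"text": question, "redacted": False},
        "retrieval": {
            "filters": {},
            "queries": [question],
            "candidates": candidates,  # [{"chunk_id", "score", "doc_version"}]
        },
        "generation": generation,  # {"model", "model_version", "prompt_version", "temperature"}
        "result": result,          # {"status", "validated", "answer_json", "cited_chunk_ids"}
    }
    with open(AUDIT_LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record["request_id"]

Returning the request_id lets you surface it to the user or attach it to error reports, so a later "why did it say that?" question maps straight to one record.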

Privacy and retention considerations

Decide:

  • What’s logged: raw text vs redacted vs hashed identifiers.
  • Where it’s stored: secure store with access control and audit.
  • How long it’s retained: short by default; longer only with policy.
  • Deletion: ability to delete logs when required.

For sensitive corpora, consider storing:

  • chunk ids and doc versions only,
  • not the raw chunk text, which is stored elsewhere under stricter controls (a sketch follows this list).
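
A minimal sketch of that minimization step, assuming each retrieval candidate is a dict with chunk_id, score, doc_version, and text fields (field names are illustrative):

def minimize_candidates(candidates):
    """Keep only non-content fields so the audit log never holds chunk text."""
    return [
        {
            "chunk_id": c["chunk_id"],
            "score": c["score"],
            "doc_version": c.get("doc_version"),
        }
        for c in candidates
    ]

# The raw text stays in the stricter document store; the audit log keeps just
# enough to re-fetch it by chunk_id if an investigation needs the content.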

Debug workflow: “why did it answer that?”

When an answer is wrong or suspicious, debug from logs:

  1. Check retrieval: were the right chunks retrieved? were filters correct?
  2. Check context packing: were the best chunks included in the prompt?
  3. Check citations: do citations match claims? are quotes accurate?
  4. Check versions: was the corpus stale? did doc versions change?
  5. Check prompt/model changes: did a new version alter behavior?

This workflow prevents random prompt tweaking and forces root-cause fixes.
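
As a sketch of steps 1, 3, and 5 run directly against a JSON-lines audit log like the one written earlier (load_record, explain, and the file path are assumptions, not a fixed interface):

import json

def load_record(request_id, path="audit.log.jsonl"):
    """Find one audit record by request_id in a JSON-lines log."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["request_id"] == request_id:
                return record
    return None

def explain(request_id):
    record = load_record(request_id)
    if record is None:
        print("no audit record found")
        return
    retrieved = {c["chunk_id"] for c in record["retrieval"]["candidates"]}
    cited = set(record["result"]["cited_chunk_ids"])
    print("retrieved chunks:", sorted(retrieved))
    print("cited chunks:   ", sorted(cited))
    # Step 3: citations should only point at chunks that were actually retrieved.
    if cited - retrieved:
        print("WARNING: cited chunks not in the retrieved set:", sorted(cited - retrieved))
    # Step 5: version identifiers that may explain a behavior change.
    print("prompt_version:", record["generation"]["prompt_version"],
          "| model:", record["generation"]["model"])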
