26.5 Auditing: storing which chunks influenced answers


Goal: make every answer explainable later

If a user asks “why did it say that?”, you should be able to answer with evidence:

  • which sources were retrieved,
  • which were included in the prompt,
  • which were cited in the answer,
  • which model/prompt versions were used.

This is how you debug issues and build trust.

Why auditing matters (beyond compliance)

Auditing is valuable even if you don’t have compliance requirements:

  • Debugging: reproduce failures and fix the right layer (retrieval vs prompt vs corpus).
  • Quality improvement: find common “not found” gaps and update docs.
  • Regression detection: identify when a new prompt version changed behavior.
  • Security: detect suspicious queries or attempted prompt injection patterns.

Audit logs can become a privacy liability

Prefer logging identifiers and hashed values over raw content, and be deliberate about whether you store raw user questions and source text at all.
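
For example, here is a minimal sketch of keyed hashing for identifiers before they reach the audit store; hash_identifier and the AUDIT_ID_PEPPER secret are illustrative names, not part of any particular framework:

import hashlib
import hmac
import os

# Assumed setup: a secret "pepper" kept outside the audit store (e.g. in a
# secrets manager), so hashed identifiers cannot be reversed from the logs.
AUDIT_ID_PEPPER = os.environ.get("AUDIT_ID_PEPPER", "").encode()

def hash_identifier(raw_id: str) -> str:
    """Return a stable, non-reversible token for a user or tenant id."""
    return hmac.new(AUDIT_ID_PEPPER, raw_id.encode(), hashlib.sha256).hexdigest()

# The audit record stores the token, never the raw id.
actor = {"user_id": hash_identifier("user-123"), "tenant_id": hash_identifier("acme")}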

What to log (minimum viable)

At minimum, log:

  • request metadata: timestamp, request id, user/tenant id (or hashed), environment.
  • question: raw or redacted question text.
  • retrieval: query variants, filters used, top-k chunk ids and scores.
  • prompt version: identifier for the prompt template and rules.
  • model version: model name, parameters (temperature), safety settings.
  • answer: final JSON output plus validation status.
  • citations: which chunk ids were cited.

Optionally log:

  • token usage and latency breakdown (a timing sketch follows this list),
  • reranking decisions,
  • “not found” missing_info fields,
  • conflict detection outputs.
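
If you do record latency, a per-stage breakdown is more useful than a single total. A minimal sketch, with placeholder sleeps standing in for real retrieval, reranking, and generation calls:

import time
from contextlib import contextmanager

timings_ms: dict[str, float] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = round((time.perf_counter() - start) * 1000, 1)

# Placeholder stages; replace the sleeps with your own pipeline calls.
with timed("retrieval"):
    time.sleep(0.05)
with timed("rerank"):
    time.sleep(0.02)
with timed("generation"):
    time.sleep(0.30)

print(timings_ms)  # e.g. {"retrieval": 51.2, "rerank": 20.4, "generation": 301.0}

The timings_ms dict can then be attached to the optional part of the audit record.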

Audit log schema (practical)

{
  "request_id": string,
  "timestamp": string,
  "actor": { "user_id": string|null, "tenant_id": string|null },
  "question": { "text": string, "redacted": boolean },
  "retrieval": {
    "filters": object,
    "queries": string[],
    "candidates": [{ "chunk_id": string, "score": number, "doc_version": string|null }]
  },
  "generation": {
    "model": string,
    "model_version": string|null,
    "prompt_version": string,
    "temperature": number|null
  },
  "result": {
    "status": "answered" | "not_found" | "needs_clarification" | "conflict" | "restricted" | "refused" | "error",
    "validated": boolean,
    "answer_json": object|null,
    "cited_chunk_ids": string[]
  }
}
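
To make the schema concrete, here is a minimal sketch that assembles one record and appends it to a JSON-lines file. The write_audit_record name and the audit.log.jsonl path are illustrative, not a prescribed API; a production system would write to a secured store with access control instead:

import json
import uuid
from datetime import datetime, timezone

AUDIT_LOG_PATH = "audit.log.jsonl"  # illustrative; use your own secure store

def write_audit_record(question, candidates, generation, result, actor=None):
    """Append one audit record (matching the schema above) as a JSON line."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor or {"user_id": None, "tenant_id": None},
        "question": {"text": question, "redacted": False},
        "retrieval": {
            "filters": {},
            "queries": [question],
            "candidates": candidates,  # [{"chunk_id", "score", "doc_version"}]
        },
        "generation": generation,  # {"model", "model_version", "prompt_version", "temperature"}
        "result": result,          # {"status", "validated", "answer_json", "cited_chunk_ids"}
    }
    with open(AUDIT_LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record["request_id"]

Returning the request_id lets you surface it to the user or attach it to error reports, so a later "why did it say that?" question maps straight to one record.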

Privacy and retention considerations

Decide:

  • What’s logged: raw text vs redacted vs hashed identifiers.
  • Where it’s stored: secure store with access control and audit.
  • How long it’s retained: short by default; longer only with policy.
  • Deletion: ability to delete logs when required.

For sensitive corpora, consider storing:

  • chunk ids and doc versions only,
  • not the raw chunk text, which is stored elsewhere under stricter controls (a sketch follows this list).
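
A minimal sketch of that minimization step, assuming each retrieval candidate is a dict with chunk_id, score, doc_version, and text fields (field names are illustrative):

def minimize_candidates(candidates):
    """Keep only non-content fields so the audit log never holds chunk text."""
    return [
        {
            "chunk_id": c["chunk_id"],
            "score": c["score"],
            "doc_version": c.get("doc_version"),
        }
        for c in candidates
    ]

# The raw text stays in the stricter document store; the audit log keeps just
# enough to re-fetch it by chunk_id if an investigation needs the content.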

Debug workflow: “why did it answer that?”

When an answer is wrong or suspicious, debug from logs:

  1. Check retrieval: were the right chunks retrieved? were filters correct?
  2. Check context packing: were the best chunks included in the prompt?
  3. Check citations: do citations match claims? are quotes accurate?
  4. Check versions: was the corpus stale? did doc versions change?
  5. Check prompt/model changes: did a new version alter behavior?

This workflow prevents random prompt tweaking and forces root-cause fixes.
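
As a sketch of steps 1, 3, and 5 run directly against a JSON-lines audit log like the one written earlier (load_record, explain, and the file path are assumptions, not a fixed interface):

import json

def load_record(request_id, path="audit.log.jsonl"):
    """Find one audit record by request_id in a JSON-lines log."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["request_id"] == request_id:
                return record
    return None

def explain(request_id):
    record = load_record(request_id)
    if record is None:
        print("no audit record found")
        return
    retrieved = {c["chunk_id"] for c in record["retrieval"]["candidates"]}
    cited = set(record["result"]["cited_chunk_ids"])
    print("retrieved chunks:", sorted(retrieved))
    print("cited chunks:   ", sorted(cited))
    # Step 3: citations should only point at chunks that were actually retrieved.
    if cited - retrieved:
        print("WARNING: cited chunks not in the retrieved set:", sorted(cited - retrieved))
    # Step 5: version identifiers that may explain a behavior change.
    print("prompt_version:", record["generation"]["prompt_version"],
          "| model:", record["generation"]["model"])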
