21.2 Extracting structured data from images (carefully)

Goal: extraction you can trust

Extracting data from images is tempting because it feels “direct”: upload a screenshot of a receipt, get structured fields back.

The reality: image extraction is lossy. Making it reliable means treating it like engineering:

  • define a schema,
  • allow unknowns,
  • capture confidence and evidence,
  • verify with sampling and automated checks.

Never force the model to guess missing fields

In extraction, guessing is corruption. Prefer null + a reason.
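
One way to make "null + a reason" concrete is a small record type that refuses a missing value without an explanation. This is only a minimal sketch: `ExtractedField` and its `reason` attribute are illustrative names, not part of any library or of the prompts later in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtractedField:
    """One extracted value, or None plus the reason it is missing.

    Hypothetical type, used here for illustration only.
    """
    value: Optional[str]
    reason: Optional[str] = None  # required whenever value is None

    def __post_init__(self) -> None:
        if self.value is None and not self.reason:
            raise ValueError("a null value must carry a reason")


# The total is cut off in the screenshot, so record that instead of guessing.
total = ExtractedField(value=None, reason="bottom of receipt is cropped")
date = ExtractedField(value="2024-03-05")
```

Downstream code can then tell "not on the page" apart from "on the page but unreadable", which a bare null cannot express.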

Why image extraction fails in predictable ways

Most failures fall into a few buckets:

  • Legibility: low resolution, blur, compression artifacts, small fonts.
  • Ambiguity: similar-looking characters (0/O, 1/l), unclear decimals, cut-off text.
  • Layout confusion: multi-column documents, tables, dense receipts, wrapped lines.
  • Missing context: currency/unit not visible, date format ambiguous, partial screenshot.
  • Invented structure: the model outputs “the kind of fields this document usually has,” not what’s actually shown.

Your prompts should directly counter these failure modes.

Define an extraction contract (schema + rules)

An extraction contract has two parts:

  1. Schema: exact JSON fields, types, and allowed values.
  2. Rules: what to do when information is missing or unclear.

Good contract rules include:

  • No guessing: use null when not visible.
  • Capture uncertainty: per-field confidence (e.g., high|medium|low).
  • Capture evidence: include the exact text you read, quoted verbatim and kept short.
  • Normalize carefully: preserve original formatting in raw fields when in doubt.
  • Include an “unparsed” bucket: leftover text that didn’t fit the schema.

Two-pass extraction is often better

Pass 1: identify document type and what fields are present. Pass 2: extract with the exact schema for that document type.
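
The two passes can be sketched roughly as follows. Everything named here is an assumption for illustration: `ask_model` stands in for whatever model API you call (it takes a prompt and image bytes and returns a JSON string), and the `SCHEMAS` table is a placeholder for your real per-type contracts.

```python
import json
from typing import Callable

# Hypothetical per-type field lists; a real contract would also give types and rules.
SCHEMAS = {
    "receipt": ["date", "total", "currency"],
    "invoice": ["invoice_number", "due_date", "total"],
}


def two_pass_extract(image: bytes, ask_model: Callable[[str, bytes], str]) -> dict:
    """Pass 1 classifies the document; pass 2 extracts with that type's schema.

    `ask_model(prompt, image)` is an assumed wrapper around your model API
    that returns the model's reply as a JSON string.
    """
    # Pass 1: document type only, restricted to types we have a schema for.
    pass1 = json.loads(ask_model(
        f'Classify this image as one of {sorted(SCHEMAS)} or "unknown". '
        'Reply as JSON: {"document_type": "..."}',
        image,
    ))
    doc_type = pass1.get("document_type")
    if doc_type not in SCHEMAS:
        # No schema means no extraction: stop instead of inventing structure.
        return {"document_type": "unknown", "fields": {}}

    # Pass 2: extract only the fields defined for this document type.
    fields = SCHEMAS[doc_type]
    pass2 = json.loads(ask_model(
        f"Extract exactly these fields as a JSON object: {fields}. "
        "Use null for anything not clearly visible.",
        image,
    ))
    # Keys the schema does not define are dropped; missing keys become None.
    return {"document_type": doc_type, "fields": {k: pass2.get(k) for k in fields}}
```

Passing the model call in as a parameter keeps the control flow testable with a stub, without touching a real API.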

Tables, charts, and dense layouts

For tables and dense receipts, you want to prevent “helpful reformatting”:

  • Preserve ordering: represent rows as arrays; keep row order as seen.
  • Separate raw vs parsed: store the exact string and a normalized numeric value (if possible).
  • Validate arithmetic: totals should equal the sum of line items (when applicable).
  • Explicitly define units: currency, time zone, measurement units.
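
The "raw vs parsed" and "preserve ordering" points above might look like this in code. It is a deliberately conservative sketch under stated assumptions: `parse_amount` and `to_rows` are hypothetical helpers, "." is assumed to be the decimal separator and "," the thousands separator, and only a leading $, € or £ is stripped. Anything else stays unparsed rather than being "cleaned up".

```python
import re
from decimal import Decimal
from typing import Optional

# Plain digits, or digits grouped in thousands, with an optional decimal part.
_AMOUNT = re.compile(r"-?[0-9]{1,3}(,[0-9]{3})*(\.[0-9]+)?|-?[0-9]+(\.[0-9]+)?")


def parse_amount(raw: str) -> Optional[Decimal]:
    """Return a Decimal for unambiguous strings like '$1,234.50', else None."""
    cleaned = raw.strip().lstrip("$€£").strip()
    if not _AMOUNT.fullmatch(cleaned):
        return None  # ambiguous or misread (e.g. a letter O for zero): do not guess
    return Decimal(cleaned.replace(",", ""))


def to_rows(raw_rows: list[tuple[str, str]]) -> list[dict]:
    """Keep rows in the order seen; store the raw string next to the parsed value."""
    return [
        {"description": desc, "amount_raw": raw, "amount": parse_amount(raw)}
        for desc, raw in raw_rows
    ]
```

Because the raw string is always kept, a reviewer can see that an amount such as "4.5O" failed to parse and why, instead of receiving a silently "corrected" number.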

For charts, be especially careful: reading values off axes and tick marks from pixels is error-prone. If the task is important, prefer a tool-based approach (OCR + chart parsing) and use the model to validate or interpret the result, not to extract every tick value.

Verification loop and quality checks

To make extraction production-worthy, add checks:

  • Sampling review: manually inspect a small set of extractions per batch.
  • Schema validation: reject malformed JSON and retry with a stricter prompt.
  • Consistency checks: totals, date formats, currency formats.
  • Confidence gating: route low-confidence fields for human review.
  • Golden set: maintain 25–100 labeled examples and track regression over time.
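
Two of these checks, confidence gating and golden-set scoring, are simple enough to sketch. The function names are hypothetical, and the field shape (`{"value": ..., "confidence": ...}`) is assumed to match the per-field objects described earlier in this section.

```python
def fields_for_review(fields: dict, auto_accept=("high",)) -> list[str]:
    """Names of fields whose confidence is not in `auto_accept`.

    A missing or unrecognized confidence value is treated as needing review.
    """
    return [
        name for name, field in fields.items()
        if field.get("confidence") not in auto_accept
    ]


def golden_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of labeled fields whose extracted value matches the label exactly."""
    if not expected:
        raise ValueError("golden example has no labeled fields")
    hits = sum(
        1 for name, want in expected.items()
        if predicted.get(name, {}).get("value") == want
    )
    return hits / len(expected)
```

Averaging `golden_accuracy` over the whole golden set after every prompt or model change gives one number to compare between runs. Exact matching is strict on purpose, so a correct null counts as a hit and a plausible guess does not.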

Copy-paste prompts

Prompt: strict JSON extraction with evidence

Extract data from the attached image. Output MUST be valid JSON.

Rules:
- Do not guess. If a value is not clearly visible, use null.
- For each extracted field, include a confidence: "high" | "medium" | "low".
- Include the exact raw text snippet you used as evidence.

Return JSON with this schema:
{
  "document_type": string,
  "fields": {
    "date": { "value": string|null, "confidence": string, "evidence": string|null },
    "total": { "value": number|null, "confidence": string, "evidence": string|null, "raw": string|null },
    "currency": { "value": string|null, "confidence": string, "evidence": string|null }
  },
  "line_items": [{
    "description": { "value": string|null, "confidence": string, "evidence": string|null },
    "amount": { "value": number|null, "confidence": string, "evidence": string|null, "raw": string|null }
  }],
  "unparsed_text": string[]
}
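
A reply to the prompt above can be checked mechanically before anything downstream uses it. The sketch below uses only the standard library; `check_field` and `validate_extraction` are hypothetical helpers. One rule is an added assumption rather than something the schema states: a non-null value with no evidence is treated as a violation.

```python
import json

CONFIDENCE = {"high", "medium", "low"}


def check_field(name: str, field, numeric: bool) -> list[str]:
    """Return contract violations for one {value, confidence, evidence} object."""
    if not isinstance(field, dict):
        return [f"{name}: expected an object"]
    errors = []
    value = field.get("value")
    if field.get("confidence") not in CONFIDENCE:
        errors.append(f"{name}: confidence must be one of {sorted(CONFIDENCE)}")
    if value is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if numeric and not is_number:
            errors.append(f"{name}: value must be a number or null")
        if not numeric and not isinstance(value, str):
            errors.append(f"{name}: value must be a string or null")
        # Assumption: a value without evidence is treated as a guess.
        if not field.get("evidence"):
            errors.append(f"{name}: non-null value has no evidence")
    return errors


def validate_extraction(text: str) -> list[str]:
    """Parse the model's reply and list every way it breaks the schema."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be an object"]
    errors = []
    if not isinstance(doc.get("document_type"), str):
        errors.append("document_type: expected a string")
    fields = doc.get("fields")
    if not isinstance(fields, dict):
        errors.append("fields: expected an object")
        fields = {}
    for name in ("date", "total", "currency"):
        errors += check_field(name, fields.get(name), numeric=(name == "total"))
    line_items = doc.get("line_items")
    if not isinstance(line_items, list):
        errors.append("line_items: expected an array")
        line_items = []
    for i, item in enumerate(line_items):
        item = item if isinstance(item, dict) else {}
        errors += check_field(f"line_items[{i}].description", item.get("description"), numeric=False)
        errors += check_field(f"line_items[{i}].amount", item.get("amount"), numeric=True)
    unparsed = doc.get("unparsed_text")
    if not (isinstance(unparsed, list) and all(isinstance(s, str) for s in unparsed)):
        errors.append("unparsed_text: expected an array of strings")
    return errors
```

An empty list means the reply satisfies the contract. A non-empty list can be appended to the retry prompt so the model sees exactly what to fix.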

Prompt: arithmetic validation

You extracted line items and a total from the image.

Task:
1) Check whether sum(line_items.amount) matches total (within 0.01).
2) If it doesn’t match, list the most likely causes (missing item, tax, unreadable amount).
3) Propose the smallest follow-up question or re-crop needed to resolve.
Return a short checklist.
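
The comparison in step 1 is deterministic, so it can also run in code before (or instead of) asking the model. This is a minimal sketch, assuming amounts arrive as decimal strings, with None for anything unreadable; `check_total` is a hypothetical helper.

```python
from decimal import Decimal
from typing import Optional

TOLERANCE = Decimal("0.01")  # same tolerance as the prompt above


def check_total(amounts: list[Optional[str]], total: Optional[str]) -> dict:
    """Compare the sum of line-item amounts with the stated total.

    Amounts are decimal strings (or None when unreadable) so that
    currency values are not distorted by binary floating point.
    """
    unreadable = [i for i, a in enumerate(amounts) if a is None]
    if total is None or unreadable:
        # Cannot verify: report what is missing rather than a pass or fail.
        return {"status": "unverifiable", "unreadable_items": unreadable,
                "total_missing": total is None}
    diff = Decimal(total) - sum((Decimal(a) for a in amounts), Decimal("0"))
    status = "match" if abs(diff) <= TOLERANCE else "mismatch"
    return {"status": status, "difference": diff}
```

Only a "mismatch" or "unverifiable" result needs the follow-up prompt. A positive `difference` means the stated total exceeds the line items, which is consistent with the tax or missing-item causes listed above.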