33.1 Cutting prompt size without losing accuracy

This section covers how to shrink prompts to cut latency without degrading answer quality.

Identifying Prompt Bloat

Big prompts are slow. Prefill (the stage where the model reads the prompt) is not free: its cost grows with prompt length, because the model must attend over every token, so a huge prompt delays the first output token. Long prompts also raise the odds of a long-winded response, which compounds the latency.
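To see how much you are actually sending, measure prompts in tokens rather than characters. A minimal sketch using the tiktoken library (the encoding name is an assumption; match it to your target model):

```python
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens as an OpenAI-style tokenizer would split them."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

before = "I was wondering if you could help me write a function to parse dates."
after = "Write a function to parse dates."
print(count_tokens(before), count_tokens(after))  # compare before/after trimming
```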

Common sources of bloat:

  • Copy-pasted entire files when only a function signature was needed (see the sketch after this list).
  • Excessive XML tags used for structure (JSON is tighter).
  • Over-polite instructions ("Please, if you would be so kind...").
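The first source is usually the biggest win: when the model only needs an interface, send signatures instead of bodies. A minimal sketch of that idea using Python's ast module (big_module.py is a hypothetical filename):

```python
import ast

def extract_signatures(source: str) -> list[str]:
    """Return 'def name(args)' lines instead of the full file body."""
    sigs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args})")
    return sigs

with open("big_module.py") as f:  # hypothetical file to summarize
    print("\n".join(extract_signatures(f.read())))
```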

Compression Techniques

  1. Remove "Chatter": Delete polite phrases. "Write a function to..." is better than "I was wondering if you could help me write a function to..."
  2. Use Reference IDs: Instead of repeating a filename 10 times, say "File A" and define it once.
  3. Ask for Brevity: Explicitly instruct: "Do not explain. Return code only." Every generated token costs decode time, so trimming the answer cuts latency directly. All three techniques are sketched below.
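A minimal sketch that combines all three techniques; the phrase list, the compress_prompt helper, and the file path are illustrative assumptions, not a standard API:

```python
import re

# Illustrative chatter patterns; extend for your own prompts.
CHATTER = [
    r"^i was wondering if you could help me\s*",
    r"^please,? if you would be so kind,?\s*",
]

def compress_prompt(prompt: str, files: dict[str, str]) -> str:
    """Strip chatter, alias long file names, and append a brevity instruction."""
    # 1. Remove chatter.
    for pattern in CHATTER:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    # 2. Define each long file name once, then refer to the short alias.
    header_lines = []
    for alias, path in files.items():
        header_lines.append(f"{alias} = {path}")
        prompt = prompt.replace(path, alias)
    # 3. Ask for brevity.
    return "\n".join(header_lines) + f"\n{prompt}\nDo not explain. Return code only."

print(compress_prompt(
    "I was wondering if you could help me fix the bug in "
    "src/services/payments/stripe_webhook_handler.py",
    files={"File A": "src/services/payments/stripe_webhook_handler.py"},
))
```

Aliasing pays off most when a long path appears many times: a single definition line amortizes across every later mention.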
The "Code Only" Rule

The fastest way to reduce latency is to stop the model from explaining itself. If you just need the diff, ask for the diff: every extra paragraph of explanation is hundreds of tokens of decode time.
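A sketch of the rule in practice, assuming the OpenAI Python SDK; the model name and max_tokens value are placeholders:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever model you target
    messages=[
        {"role": "system",
         "content": "Return only a unified diff. No explanation, no prose."},
        {"role": "user",
         "content": "Rename the `fetch` parameter to `fetch_fn` in File A."},
    ],
    max_tokens=512,  # hard cap on output length as a safety net
)
print(response.choices[0].message.content)
```

The max_tokens cap is a backstop: even if the model ignores the instruction, the response cannot run long.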
