39.5 Prompt compression and distillation

This section covers practical ways to shrink prompts: summarizing chat history, writing denser instructions, and stripping unused data before it reaches the model.

The 100k Token Problem

Context is expensive: RAG pipelines routinely retrieve more than the model needs, and chat history grows without bound. Long prompts cost money, add latency, and bury the signal the model should attend to.

Compression Techniques

  • Auto-Summarization: Every 10 turns, ask a cheap model to condense the history into one paragraph, then replace the raw history with that paragraph.
  • Lingua Franca: Use specific, dense language. Instead of "Please write a function that takes a string...", use "def parse(s: str) -> dict:". Models speak code better than English.
  • Filter irrelevant keys: If you are processing a JSON API response, delete all keys you don't use before putting it in the prompt.
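The auto-summarization bullet can be sketched as a small history-folding function. This is a minimal sketch: `cheap_summarize` here is a placeholder that just joins and truncates text, standing in for a real call to an inexpensive model, and the `every=10` threshold and message shape are illustrative assumptions.

```python
def cheap_summarize(messages):
    # Placeholder for a call to an inexpensive model.
    # A real implementation would send the messages to a small
    # instruct model and return its one-paragraph summary.
    text = " ".join(m["content"] for m in messages)
    return text[:200]

def compress_history(history, every=10):
    """Once the history exceeds `every` turns, fold everything
    except the latest turn into a single summary message."""
    if len(history) <= every:
        return history
    summary = cheap_summarize(history[:-1])
    return [
        {"role": "system", "content": f"Summary of earlier turns: {summary}"},
        history[-1],  # keep the most recent turn verbatim
    ]
```

One design choice worth noting: keeping the latest turn verbatim avoids summarizing the message the model is about to answer.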
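The dense-language point is easy to see by comparing a verbose English request with a code-shaped one. The `rough_tokens` helper below is a crude whitespace-based proxy for token count, used only for illustration; real tokenizers count differently, but the relative gap holds.

```python
# Verbose natural-language instruction vs. a code-shaped prompt
# that conveys the same intent in far fewer tokens.
verbose = (
    "Please write a function that takes a string containing JSON "
    "and returns a Python dictionary with the parsed contents."
)
dense = "Complete: def parse(s: str) -> dict:"

def rough_tokens(text):
    # Crude proxy for token count: whitespace-separated words.
    return len(text.split())
```

The function signature carries the input type, output type, and task in one line, which is exactly what the verbose version spends a sentence describing.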
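Key filtering can be done with a small recursive helper before the response is serialized into the prompt. This is a sketch under assumed data: the `response` payload and the `wanted` key set are made-up examples, not a real API's schema.

```python
import json

def keep_keys(obj, wanted):
    """Recursively drop every dict key not in `wanted`;
    lists are filtered element-wise, scalars pass through."""
    if isinstance(obj, dict):
        return {k: keep_keys(v, wanted) for k, v in obj.items() if k in wanted}
    if isinstance(obj, list):
        return [keep_keys(v, wanted) for v in obj]
    return obj

# Hypothetical API response with fields the prompt never uses.
response = {
    "id": "abc123",
    "name": "Widget",
    "price": 9.99,
    "metadata": {"etag": "xyz", "cache_hint": "private"},
}
slim = keep_keys(response, {"name", "price"})
```

Serializing `slim` instead of `response` with `json.dumps` is what actually saves the tokens.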

Where to go next