39.5 Prompt compression and distillation

This section covers practical ways to shrink prompts: summarizing chat history, writing denser instructions, and stripping unused data before it reaches the model.

The 100k Token Problem

Context is expensive: RAG pipelines routinely retrieve more than the model needs, and chat history grows without bound. Long prompts cost money, add latency, and bury the signal the model should attend to.

Compression Techniques

  • Auto-Summarization: Every 10 turns, ask a cheap model to condense the history into one paragraph, then replace the raw history with that paragraph.
  • Lingua Franca: Use specific, dense language. Instead of "Please write a function that takes a string...", use "def parse(s: str) -> dict:". Models speak code better than English.
  • Filter irrelevant keys: If you are processing a JSON API response, delete all keys you don't use before putting it in the prompt.
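The auto-summarization bullet can be sketched as a small history-folding function. This is a minimal sketch: `cheap_summarize` here is a placeholder that just joins and truncates text, standing in for a real call to an inexpensive model, and the `every=10` threshold and message shape are illustrative assumptions.

```python
def cheap_summarize(messages):
    # Placeholder for a call to an inexpensive model.
    # A real implementation would send the messages to a small
    # instruct model and return its one-paragraph summary.
    text = " ".join(m["content"] for m in messages)
    return text[:200]

def compress_history(history, every=10):
    """Once the history exceeds `every` turns, fold everything
    except the latest turn into a single summary message."""
    if len(history) <= every:
        return history
    summary = cheap_summarize(history[:-1])
    return [
        {"role": "system", "content": f"Summary of earlier turns: {summary}"},
        history[-1],  # keep the most recent turn verbatim
    ]
```

One design choice worth noting: keeping the latest turn verbatim avoids summarizing the message the model is about to answer.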
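The dense-language point is easy to see by comparing a verbose English request with a code-shaped one. The `rough_tokens` helper below is a crude whitespace-based proxy for token count, used only for illustration; real tokenizers count differently, but the relative gap holds.

```python
# Verbose natural-language instruction vs. a code-shaped prompt
# that conveys the same intent in far fewer tokens.
verbose = (
    "Please write a function that takes a string containing JSON "
    "and returns a Python dictionary with the parsed contents."
)
dense = "Complete: def parse(s: str) -> dict:"

def rough_tokens(text):
    # Crude proxy for token count: whitespace-separated words.
    return len(text.split())
```

The function signature carries the input type, output type, and task in one line, which is exactly what the verbose version spends a sentence describing.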
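Key filtering can be done with a small recursive helper before the response is serialized into the prompt. This is a sketch under assumed data: the `response` payload and the `wanted` key set are made-up examples, not a real API's schema.

```python
import json

def keep_keys(obj, wanted):
    """Recursively drop every dict key not in `wanted`;
    lists are filtered element-wise, scalars pass through."""
    if isinstance(obj, dict):
        return {k: keep_keys(v, wanted) for k, v in obj.items() if k in wanted}
    if isinstance(obj, list):
        return [keep_keys(v, wanted) for v in obj]
    return obj

# Hypothetical API response with fields the prompt never uses.
response = {
    "id": "abc123",
    "name": "Widget",
    "price": 9.99,
    "metadata": {"etag": "xyz", "cache_hint": "private"},
}
slim = keep_keys(response, {"name", "price"})
```

Serializing `slim` instead of `response` with `json.dumps` is what actually saves the tokens.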

Where to go next