32.4 Batch processing vs interactive mode
Interactive Mode: Optimize for Speed
When a human is waiting (chatbot, autocomplete), latency is king. You pay a premium for immediate availability.
- Use streaming so the user sees the first tokens immediately (see the sketch after this list).
- Use smaller, faster models where quality allows.
- Keep the context window tight to reduce prefill time.
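A minimal streaming sketch using the google-genai Python SDK; the model name and prompt are illustrative, so check the current docs before relying on them:

```python
# Streaming sketch with the google-genai SDK. Model name is illustrative.
from google import genai

client = genai.Client()  # reads the API key from the environment

# Print tokens as they arrive so the user isn't staring at a spinner
# while the full completion is generated.
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",  # a smaller, faster model for interactive use
    contents="Summarize this meeting note: ...",
):
    print(chunk.text, end="", flush=True)
```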
Batch Mode: Optimize for Throughput
Many AI tasks don't need to happen now. They just need to happen today.
- Summarizing yesterday's meeting logs.
- Tagging a backlog of 1,000 support tickets.
- Generating unit tests for an entire legacy codebase.
For these, use Batch Processing. You send a file with 10,000 requests, go to sleep, and wake up with the results.
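That file is typically JSONL: one self-contained request per line. The sketch below follows the general shape of the Gemini Batch API request format, but treat the exact field names as assumptions and verify against your provider's docs:

```python
# Sketch: writing a batch input file, one JSON request per line (JSONL).
# Field names ("key", "request", "contents") are assumptions based on the
# Gemini Batch API's general shape.
import json

tickets = [f"Support ticket #{i}: ..." for i in range(10_000)]  # placeholder data

with open("batch_requests.jsonl", "w") as f:
    for i, ticket in enumerate(tickets):
        request = {
            "key": f"ticket-{i}",  # lets you match results back to inputs later
            "request": {
                "contents": [{"parts": [{"text": f"Tag this ticket: {ticket}"}]}]
            },
        }
        f.write(json.dumps(request) + "\n")
```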
Using the Batch API
Google AI Studio and Vertex AI both offer a "Batch API" (deferred processing) mode. The benefits are massive:
- 50% Lower Cost: Batch requests are typically priced at roughly half the real-time rate, because the provider can schedule them on spare capacity during off-peak hours. If real-time inference costs $1.00 per million tokens, the same job runs at about $0.50 per million in batch.
- Higher Rate Limits: You can queue up far more tokens than your per-minute quota would allow.
- Reliability: The platform manages retries and queueing for you.
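Putting it together, submitting the JSONL file from the earlier sketch might look like this with the google-genai SDK. The method names follow the SDK's batch interface as best I know it; verify them against current docs, and treat the polling loop and state names as assumptions:

```python
# Sketch: submitting a JSONL file as a batch job via the google-genai SDK.
# Method and state names are assumptions; check the current SDK docs.
import time
from google import genai

client = genai.Client()

# Upload the request file, then enqueue a batch job against it.
batch_file = client.files.upload(file="batch_requests.jsonl")
job = client.batches.create(
    model="gemini-2.0-flash",
    src=batch_file.name,
    config={"display_name": "ticket-tagging-backlog"},
)

# Batch jobs are queued for hours, so poll lazily rather than hot.
while job.state.name not in ("JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED"):
    time.sleep(60)
    job = client.batches.get(name=job.name)

print("Batch finished with state:", job.state.name)
```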
Night Shift
If you are building a "Vibe Coding" tool that refactors code, don't make the developer watch it write. Design a "nightly refactor" agent that runs in batch mode and opens a PR in the morning.
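As a sketch of the design, the agent splits into two halves: a night job that builds and submits the batch, and a morning job that applies results and opens the PR. Every name below is hypothetical, and the request schema reuses the assumptions from the earlier sketches:

```python
# Skeleton of a hypothetical nightly-refactor agent. Nothing here is a
# real library API; build_refactor_request and nightly_run are illustrative.
import json
import pathlib

def build_refactor_request(path: pathlib.Path) -> dict:
    # One batch request per source file; schema assumptions as above.
    return {
        "key": str(path),
        "request": {
            "contents": [{"parts": [{"text": f"Refactor:\n{path.read_text()}"}]}]
        },
    }

def nightly_run(repo_root: str) -> None:
    files = list(pathlib.Path(repo_root).rglob("*.py"))
    with open("refactor_batch.jsonl", "w") as f:
        for path in files:
            f.write(json.dumps(build_refactor_request(path)) + "\n")
    # ...submit the batch job (as in the earlier sketch) and exit. A second
    # scheduled job fetches results in the morning, writes the changed files,
    # and opens a PR, e.g. by shelling out to `gh pr create`.
```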