33.3 Parallelizing retrieval and preprocessing
The Serial Trap
A naive RAG app often does this:
1. Receive the user query
2. (Wait) Generate an embedding for the query
3. (Wait) Search the vector DB
4. (Wait) Fetch the full documents
5. (Wait) Call the LLM
6. Return the response
This "waterfall" kills performance.
Parallelizing the Pipeline
You can do many of these things at once; each technique is sketched in code after this list:
- Speculative Retrieval: Start searching your docs as soon as the user stops typing (debounce), before they even hit enter.
- Parallel Chunks: If you need to summarize 5 documents, send 5 separate requests to the model in parallel (map-reduce), rather than asking it to summarize them one by one.
- Hybrid Search: Run your keyword search (Elasticsearch) and vector search (Pinecone) at the same time, then merge results.
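Speculative retrieval hinges on starting work before the user commits. Below is an illustrative sketch using a debounce timer; `SpeculativeRetriever` and its methods are invented names, and it reuses the `embed_query` / `vector_search` stubs from the serial example. Since `asyncio.create_task` needs a running event loop, `on_keystroke` must be called from an async context (e.g. a websocket handler).

```python
class SpeculativeRetriever:
    """Illustrative only: start retrieval once the user pauses typing."""

    def __init__(self, debounce_s=0.3):
        self._debounce_s = debounce_s
        self._pending = None

    def on_keystroke(self, partial_query):
        # Every keystroke cancels the previous timer and restarts it,
        # so the search only fires after a pause (the debounce).
        if self._pending is not None:
            self._pending.cancel()
        self._pending = asyncio.create_task(
            self._search_after_pause(partial_query)
        )

    async def _search_after_pause(self, query):
        await asyncio.sleep(self._debounce_s)  # wait out the typing pause
        return await vector_search(await embed_query(query))

    async def results(self):
        # Called when the user presses Enter. If they paused while
        # typing, the search is already done and this returns instantly.
        return await self._pending
```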
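The other two techniques compose naturally with `asyncio.gather`. This sketch reuses the stubs from the serial example and adds hypothetical `keyword_search` and `summarize` stand-ins; the merge step here is a simple dedupe, where a real system might use reciprocal rank fusion.

```python
async def keyword_search(q):
    await asyncio.sleep(0.06)      # stand-in for an Elasticsearch query
    return ["doc2", "doc3"]

async def summarize(doc):
    await asyncio.sleep(0.50)      # stand-in for one LLM summarization call
    return f"summary of {doc}"

async def answer_parallel(query):
    async def dense_search(q):
        # Embedding + vector search still have to run serially internally.
        return await vector_search(await embed_query(q))

    # Hybrid search: keyword search doesn't need the embedding, so it
    # runs while the embedding is generated (~0.15 s total, not 0.21 s).
    dense_ids, sparse_ids = await asyncio.gather(
        dense_search(query),
        keyword_search(query),
    )
    merged = list(dict.fromkeys(dense_ids + sparse_ids))  # dedupe, keep order

    docs = await fetch_documents(merged)

    # Map-reduce: all summaries run concurrently, so N documents cost
    # roughly one summary's worth of wall-clock time, not N.
    summaries = await asyncio.gather(*(summarize(d) for d in docs))
    return await call_llm(query, context=summaries)
```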
The "Optimistic" UI
If you know the user will likely need a specific tool (e.g., they opened the "SQL Editor" tab), start pre-loading the schema context in the background so it's ready when they ask a question.
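A sketch of that pre-loading pattern, under the same assumptions as above: `SqlEditorSession` and its methods are invented names, `call_llm` is the stub from the serial example, and `on_tab_opened` must be called from within a running event loop.

```python
class SqlEditorSession:
    """Illustrative only: prefetch schema context when the tab opens."""

    def __init__(self):
        self._schema_task = None

    def on_tab_opened(self):
        # Fire-and-forget: the schema fetch proceeds in the background
        # while the user is still looking at the empty editor.
        self._schema_task = asyncio.create_task(self._load_schema())

    async def _load_schema(self):
        await asyncio.sleep(0.30)  # stand-in for a real catalog query
        return "CREATE TABLE users (id INT, email TEXT);"

    async def ask(self, question):
        # By the time the user asks, the prefetch has usually finished,
        # so this await returns immediately instead of costing 300 ms.
        schema = await self._schema_task
        return await call_llm(question, context=[schema])
```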