24.3 Embeddings 101 for builders
Goal: understand embeddings well enough to debug retrieval
You don’t need to be a machine learning researcher to build RAG, but you do need a working mental model of embeddings.
The goal of this page is to make you dangerous enough to answer questions like:
- “Why are we retrieving irrelevant chunks?”
- “Why does the system miss obvious matches?”
- “Why did retrieval get worse after we changed preprocessing?”
- “How do we evaluate retrieval without guessing?”
- “When should we use hybrid search or reranking?”
What embeddings are (builder-friendly)
An embedding model converts text into a vector (a long list of numbers).
The practical idea: texts with similar meaning end up “near” each other in vector space.
- Chunk embedding: vector representing a document chunk.
- Query embedding: vector representing the user’s question.
- Similarity: a score that estimates how close the vectors are.
Retrieval usually means: embed the query, then find the nearest chunk vectors.
Similarity search can return text that is semantically close but factually irrelevant. That’s why ranking and guardrails matter.
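To make the mental model concrete, here is a minimal sketch: embed a query and a few chunks, then rank the chunks by cosine similarity. The `embed` function below is a random-vector placeholder standing in for whatever embedding API you actually call, so the ranking it prints is meaningless; only the shape of the flow matters.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Placeholder: swap in your embedding provider here.
    # A real model returns one meaningful vector per text; this one is random
    # so the sketch runs end to end without external services.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))

def cosine_similarity(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    # Normalize, then take dot products: one similarity score per chunk.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ q

chunks = ["Refunds are issued within 14 days.", "Our office is closed on public holidays."]
chunk_vecs = embed(chunks)
query_vec = embed(["How long do refunds take?"])[0]

scores = cosine_similarity(query_vec, chunk_vecs)
for chunk, score in sorted(zip(chunks, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {chunk}")
```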
Similarity search in practice
Most systems do:
- Store chunk vectors in a vector index.
- Compute a query vector at runtime.
- Run nearest-neighbor search (top-k).
- Optionally rerank top candidates with a stronger model.
- Include the best chunks in the prompt.
Important practical details:
- Distance metric: cosine similarity, dot product, or L2 (Euclidean) distance; match whatever your embedding model and vector store expect.
- Approximate search: most vector stores use approximate nearest-neighbor (ANN) search for speed, so results are “close enough” rather than exact.
- Filtering: apply metadata filters (permissions, doc types) before ranking.
- Top-k tuning: retrieval k is not the same as “chunks to include in prompt.”
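Here is a sketch of that flow with a small in-memory index. It uses exact cosine similarity instead of ANN, and the metadata fields (`doc_type`, `allowed_roles`) are made up for illustration; a production vector store does the same steps through its own API.

```python
import numpy as np

def top_k(query_vec, chunk_vecs, chunk_meta, k=5, doc_type=None, user_role=None):
    """Metadata filter first, then rank the surviving chunks by cosine similarity."""
    # 1) Pre-filter on metadata (permissions, doc type) before any ranking.
    keep = [
        i for i, meta in enumerate(chunk_meta)
        if (doc_type is None or meta["doc_type"] == doc_type)
        and (user_role is None or user_role in meta["allowed_roles"])
    ]
    if not keep:
        return []  # over-filtering: log this case, it often explains "nothing retrieved"

    # 2) Exact cosine similarity on the filtered subset (a vector store would use ANN here).
    subset = chunk_vecs[keep]
    q = query_vec / np.linalg.norm(query_vec)
    sims = (subset / np.linalg.norm(subset, axis=1, keepdims=True)) @ q

    # 3) Retrieval top-k; how many of these end up in the prompt is a separate decision.
    order = np.argsort(-sims)[:k]
    return [(keep[i], float(sims[i])) for i in order]
```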
Common pitfalls that break retrieval
Embedding mismatch
- Different models: chunk embeddings created with one model, query embeddings with another.
- Different preprocessing: chunk text is normalized differently from query text.
- Different languages: your embedding model may be weaker for non-English content.
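A cheap way to rule out this class of bug is to force every embedding through one code path, so chunks and queries always see the same normalization and the same pinned model. A sketch, where `embed_fn` stands in for your provider call and the model name is only an example:

```python
EMBEDDING_MODEL = "example-embed-v1"  # illustrative name; pin and record the real one

def normalize(text: str) -> str:
    # The ONE normalization step, shared by chunk text and query text.
    return " ".join(text.strip().lower().split())

def embed_for_index(chunks: list[str], embed_fn):
    # embed_fn: (texts, model) -> list of vectors, provided by your embedding client
    return embed_fn([normalize(c) for c in chunks], model=EMBEDDING_MODEL)

def embed_for_query(query: str, embed_fn):
    return embed_fn([normalize(query)], model=EMBEDDING_MODEL)[0]
```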
Bad chunk content
- Missing keywords: the chunk lacks the terms users search for (for example, section headings were stripped during chunking).
- Too much boilerplate: repeated headers/footers dominate embeddings.
- Overlapping duplicates: overlap creates many near-identical chunks that crowd out diversity.
Query issues
- Too short: “refunds?” provides little semantic signal.
- Too specific in the wrong way: includes irrelevant details that bias similarity.
- Ambiguous intent: multiple plausible meanings without disambiguation.
Missing or wrong metadata
- No doc type tags: you retrieve tickets when you needed canonical policy.
- No permissions tags: you either leak data or over-filter and retrieve nothing.
- No versioning: you retrieve outdated chunks after an update.
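Most of these metadata gaps are easier to catch if every chunk carries an explicit record alongside its vector. The fields below are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    chunk_id: str                 # stable id, so citations and audits keep working
    text: str                     # the exact raw chunk text, kept next to the vector
    doc_id: str
    doc_type: str                 # e.g. "policy", "ticket", "faq"
    doc_version: str              # lets you drop or re-embed outdated chunks
    allowed_roles: list[str] = field(default_factory=list)  # permission filtering
    embedding_model: str = "example-embed-v1"  # record what produced the vector
```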
Prompting can improve faithfulness to retrieved text, but it can’t invent the missing source. Fix retrieval first.
Practical tips for embedding pipelines
- Store raw text: always keep the exact chunk text alongside the embedding.
- Deduplicate boilerplate: remove repeated headers/footers before embedding.
- Keep chunk ids stable: citations and audits depend on stable references.
- Log retrieval results: store top-k chunk ids and scores per query for debugging.
- Batch embedding: embed chunks in batches and retry failures safely.
- Version embeddings: record embedding model name/version and re-embed intentionally.
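A sketch of the batching and retry tip, with `embed_fn` standing in for your provider’s batch call; the batch size and backoff numbers are arbitrary defaults, not recommendations.

```python
import time

def embed_in_batches(texts, embed_fn, batch_size=64, max_retries=3):
    """Embed texts in batches, retrying a failed batch with exponential backoff.

    embed_fn: takes a list of strings, returns a list of vectors of the same length.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_fn(batch))
                break
            except Exception:  # rate limits / transient network errors in practice
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)
    return vectors
```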
How to measure retrieval quality
Evaluation starts with an eval set of questions.
For each question, you can label:
- relevant chunks (ideal), or
- relevant documents (good enough), or
- answerability (“should be answerable from the corpus” vs “not found”).
Useful retrieval metrics:
- Recall@k: is at least one relevant chunk in the top k?
- MRR (mean reciprocal rank): how high in the ranking does the first relevant chunk appear, averaged over questions?
- Precision@k: how many of the top k are actually relevant?
You don’t need perfect labels to get signal. Even coarse labels catch regressions.
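These metrics are small enough to compute with plain Python over your logged retrieval results. In the sketch below, `retrieved_ids` is the ranked list of chunk ids you logged for one question and `relevant_ids` is the set you labeled.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """1.0 if at least one relevant chunk appears in the top k, else 0.0."""
    return float(any(cid in relevant_ids for cid in retrieved_ids[:k]))

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top k that are labeled relevant."""
    top = retrieved_ids[:k]
    return sum(cid in relevant_ids for cid in top) / max(len(top), 1)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (retrieved_ids, relevant_ids) pairs, one per question."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / max(len(runs), 1)
```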
Copy-paste prompts
Prompt: rewrite queries for better retrieval
Rewrite this user question to improve document retrieval.
Rules:
- Keep the user intent the same.
- Expand acronyms and include likely keywords/synonyms.
- Output 3 rewritten queries: one short, one medium, one explicit.
Question: [user question]
Prompt: label retrieved chunks (quick relevance audit)
I will give you a question and 10 retrieved chunks (with ids).
Task:
1) Label each chunk as: relevant / partially relevant / irrelevant.
2) Explain why (1 sentence each).
3) Recommend how to improve retrieval (query rewrite, metadata filter, chunking fix).
Return as a table.