When RAG Pulls the Wrong Chunks, It Is Your Pipeline

If you've ever watched a retrieval-augmented generation (RAG) pipeline hallucinate confidently because it pulled the wrong chunks, or seen a semantic search return results that are technically similar but contextually useless, you've felt the gap between understanding embeddings in theory and deploying them well in practice. That gap is almost always an implementation problem, not a model problem.

Embeddings convert text (and increasingly images, audio, and structured data) into high-dimensional numeric vectors that encode meaning. Vector search finds the nearest neighbors to a query vector in that space. Together they power semantic search, RAG systems, recommendation engines, and duplicate detection—applications that are now table stakes for competitive agencies. The problem is that most teams treat this infrastructure as a black box they stand up once and forget.

This checklist changes that. It's organized as a working tool: skim the headers to find your current bottleneck, read the justification to understand the stakes, then act. Whether you're scoping a new project or auditing a system that's quietly degrading, every item here addresses a real failure mode observed across production deployments.

Choose the Right Embedding Model

The model you pick determines the ceiling on retrieval quality. No amount of downstream tuning recovers from a weak encoder.

Match the model to your domain and modality

General-purpose text models (OpenAI text-embedding-3-large, Cohere embed-v3, open-source bge-large-en) work well for mixed enterprise content. Expect strong performance on English prose, weaker on highly technical jargon, code, or non-English languages without explicit multilingual support.
Multilingual models (e.g., multilingual-e5-large, Cohere's multilingual variant) are necessary if your corpus or users operate across languages—don't assume an English-dominant model will generalize.
Domain-specific or fine-tuned models pay off when your content is specialized (legal, medical, financial) and your query vocabulary diverges significantly from general web text. Fine-tuning on even 1,000–5,000 labeled pairs can move retrieval recall by 10–20 percentage points in favorable cases; confirm the gain on your own evaluation set before committing.
Multimodal models (CLIP and similar joint image-text encoders) are required if you're indexing or querying across images and text together.

Check dimension count and latency trade-offs

Larger dimensions (1,536, 3,072) capture more nuance but cost more to store and search. Some models are trained with Matryoshka Representation Learning (MRL), which lets you keep only the leading dimensions of each vector at inference time; this is useful when you need to trade precision for speed at scale. Test at your actual query volume before committing to a dimension size.
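To make the truncation idea concrete, here is a minimal sketch of shortening a vector and re-normalizing it so cosine and dot-product scores stay comparable. It assumes the embedding came from a model that was actually trained with MRL; truncating vectors from other models will usually hurt quality. The function name and the toy vector are illustrative, not part of any library.

```python
import math

def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    if dims <= 0 or dims > len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]

# Toy 4-dimensional unit vector standing in for a real embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_normalize(full, 2)
```

Every vector in the index and every query vector would need the same truncation length, so it is worth re-running your evaluation set at each candidate size.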

Preprocess and Chunk Your Source Content Thoughtfully

The vector for a badly constructed chunk is a bad vector, regardless of the model.

Define chunk strategy before you index

Fixed-size chunking (e.g., 256 or 512 tokens with overlap) is fast to implement but breaks semantic units arbitrarily. Use it as a baseline, not a final answer.
Sentence or paragraph chunking respects natural boundaries and usually outperforms fixed-size on retrieval quality.
Hierarchical chunking indexes both sentence-level and document-level chunks, then uses parent-chunk retrieval to return more context after a sentence chunk matches. This is the current best practice for RAG systems with long documents.
Overlap (typically 10–20% of chunk size) prevents information from falling through the seams between chunks. More overlap means more storage and more candidates to re-rank; tune this deliberately.
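As a baseline, fixed-size chunking with overlap can be sketched in a few lines. This version assumes the text has already been tokenized; splitting on whitespace is only a stand-in for your embedding model's real tokenizer, and the default sizes are illustrative.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

# Whitespace split is an assumption for the demo, not a real tokenizer.
words = "the quick brown fox jumps over the lazy dog again".split()
demo = chunk_tokens(words, chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 1, each chunk starts with the last token of the previous one, which is the "seam" protection described above.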

Clean before you embed

Strip navigation boilerplate, repeated headers, footers, and low-information markup before embedding. Noise in the text becomes noise in the vector. For scanned PDFs, verify OCR quality—embedding garbled text produces vectors that are confidently wrong.
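One simple heuristic for repeated headers and footers is to drop any line that appears on most pages of the same document. The sketch below assumes you already have the text split per page; the 60% threshold and the function name are illustrative choices, and a real pipeline would likely add rules for navigation markup as well.

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.6):
    """Remove lines that occur on at least min_fraction of the pages."""
    if len(pages) < 2:
        return list(pages)  # repetition cannot be detected on a single page
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [
            line for line in page.splitlines()
            if line.strip() and line.strip() not in boilerplate
        ]
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "ACME Corp\nIntro text\nPage footer",
    "ACME Corp\nMore text\nPage footer",
    "ACME Corp\nLast text\nPage footer",
]
cleaned = strip_repeated_lines(pages)
```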

Design Your Metadata Schema

Embeddings capture semantics; metadata handles everything else.

Always store source document ID, chunk index, timestamp, and content type as filterable fields alongside every vector. This lets you apply hard filters before vector search runs, which is faster and more reliable than hoping semantic similarity handles it.
Add access-control metadata (user role, tenant ID, document classification) if your system serves multiple clients or permission levels. Filtering on these fields during retrieval is the correct mechanism—not post-hoc result suppression.
Capture version or update timestamps so you can identify and re-embed stale chunks when source content changes. Stale embeddings are a silent accuracy killer.
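A schema along these lines might look like the following sketch, which also shows a hard filter applied before any similarity scoring. The field names, the `ChunkRecord` class, and the in-memory search are assumptions for illustration; a vector database would express the same filter through its own query API.

```python
import math
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    doc_id: str
    chunk_index: int
    tenant_id: str
    content_type: str
    updated_at: str  # ISO 8601 timestamp of the source content
    vector: list

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filtered_search(records, query_vector, tenant_id, content_type=None, k=5):
    """Apply metadata filters first, then rank only the surviving candidates."""
    candidates = [
        r for r in records
        if r.tenant_id == tenant_id
        and (content_type is None or r.content_type == content_type)
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vector, r.vector), reverse=True)
    return ranked[:k]

records = [
    ChunkRecord("doc-1", 0, "tenant-a", "faq", "2025-01-10T00:00:00Z", [1.0, 0.0]),
    ChunkRecord("doc-2", 0, "tenant-b", "faq", "2025-01-11T00:00:00Z", [0.0, 1.0]),
    ChunkRecord("doc-3", 0, "tenant-a", "policy", "2025-01-12T00:00:00Z", [0.6, 0.8]),
]
hits = filtered_search(records, [0.0, 1.0], tenant_id="tenant-a")
```

Even though the query vector is identical to tenant-b's chunk, that chunk never becomes a candidate, which is the behavior you want from access-control filtering.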

Select and Configure Your Vector Database

The vector database is infrastructure, not an afterthought.

Evaluate on your actual workload

Key dimensions to evaluate: p99 query latency, indexing throughput, filtered search accuracy, cost per million vectors stored, and ease of hybrid search (combining vector similarity with keyword BM25 scores). Common production options as of 2025–2026 include Pinecone, Weaviate, Qdrant, pgvector (for teams already on PostgreSQL), and Milvus. Each has a different sweet spot; there is no universal winner.

Configure ANN index parameters deliberately

Most vector databases use approximate nearest neighbor (ANN) algorithms like HNSW. The parameters that matter most fall into two groups:

`ef_construction` / `M` (HNSW, build-time): Higher values mean better recall but slower indexing and more memory. Don't use defaults in production without benchmarking.
`ef` (search-time, named `ef_search` in some systems): Controls the recall-speed trade-off at query time. Tune this after measuring your baseline recall on a held-out evaluation set.

Target at least 95% recall@10, measured against exact (brute-force) nearest-neighbor results on the same data, for most retrieval applications before going live.
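Index recall in this sense compares what the ANN index returns with what an exact search would have returned for the same queries. A minimal sketch, assuming you have already collected both result lists as chunk IDs (the function names are illustrative):

```python
def ann_recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbors that the ANN index also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)

def mean_ann_recall(approx_results, exact_results, k=10):
    """Average recall@k over a batch of queries."""
    scores = [ann_recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores) if scores else 0.0

# Two toy queries: the first misses one true neighbor, the second misses none.
approx = [[1, 2, 3, 9], [5, 6, 7, 8]]
exact = [[1, 2, 3, 4], [8, 7, 6, 5]]
score = mean_ann_recall(approx, exact, k=4)
```

Sweeping the search-time parameter and plotting this number against latency gives you the trade-off curve to pick an operating point from.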

Implement Hybrid Search

Pure vector search has a well-documented weakness: it misses exact keyword matches that a user clearly intends. A query for a specific product SKU, a person's name, or a regulatory code will often retrieve semantically related but factually wrong results from a vector-only system.

Combine BM25 (sparse) + vector (dense) scores using Reciprocal Rank Fusion (RRF) or a learned linear combination. Most major vector databases now support this natively.
Use keyword search as a hard filter or a boost, not a replacement. The goal is complementarity, not competition.
Re-rank after retrieval: A cross-encoder re-ranker (Cohere Rerank, bge-reranker-large, or a fine-tuned model) applied to the top 20–50 candidates consistently improves final top-5 quality by meaningful margins. The cross-encoder is too slow to run over the full index but is fast enough over a small candidate set.
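Reciprocal Rank Fusion is simple enough to sketch directly: each document earns 1 / (k + rank) from every ranked list it appears in, and the sums decide the final order. The constant k = 60 is the value conventionally used with RRF; the document IDs below are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list using RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["sku-123", "doc-a", "doc-b"]   # sparse / keyword results
dense_ranking = ["doc-a", "doc-c", "sku-123"]  # vector results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, it avoids the problem of BM25 scores and cosine similarities living on different scales. The fused list is then what you would hand to a cross-encoder re-ranker.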

This hybrid approach is one of the most impactful interventions in The How Generative AI Works Checklist for 2026 as well, because retrieval quality directly gates generation quality.

Build an Evaluation Pipeline Before You Optimize

Teams that skip evaluation end up optimizing intuitions, not outcomes. The retrieval system that "feels better" after a change often isn't.

Create a labeled evaluation set

Collect 100–500 query/relevant-document pairs that represent your real user questions. This doesn't require annotation software—a spreadsheet works.
Measure Recall@K (are the right chunks in the top K results?), MRR (Mean Reciprocal Rank), and NDCG (normalized discounted cumulative gain) depending on how much rank ordering matters to your use case.
Run evaluation before and after every significant change: chunking strategy, model swap, index parameter tuning, or re-ranker addition.
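The three metrics can be computed from nothing more than the retrieved chunk IDs and the labeled relevant IDs for each query. The sketch below assumes binary relevance (a chunk is either relevant or not) and that each retrieved list contains no duplicate IDs; MRR is the mean of `reciprocal_rank` across all queries.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Share of the relevant chunks that appear in the top k results."""
    rel = set(relevant)
    if not rel:
        return 0.0
    return len(set(retrieved[:k]) & rel) / len(rel)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    rel = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: rewards relevant results more the higher they rank."""
    rel = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in rel
    )
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# One labeled query from a hypothetical evaluation spreadsheet.
retrieved = ["c1", "c7", "c3"]
relevant = ["c3", "c9"]
```

For this query, recall@3 is 0.5 (one of two relevant chunks found) and the reciprocal rank is 1/3 (the first relevant chunk sits in position three).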

Monitor in production

Log query vectors, retrieved chunk IDs, and (when possible) user feedback signals. A sudden drop in user engagement or an uptick in downstream LLM hallucinations is often a retrieval problem, not a generation problem. See 7 Common Mistakes with How Generative AI Works (and How to Avoid Them) for a detailed breakdown of how retrieval failures cascade into generation failures.

Manage Index Lifecycle and Re-Embedding

Embeddings are not static artifacts. They degrade as your content changes and as better models become available.

Establish an update policy: When source documents are added or edited, re-embed affected chunks within a defined SLA (same-day for high-velocity content, weekly for stable corpora).
Plan for model migration: When you upgrade your embedding model, you must re-embed the entire corpus—you cannot mix vectors from different models in the same index without breaking similarity calculations. Build the re-indexing pipeline before you need it, not after.
Archive old indexes rather than deleting them immediately. If a model migration introduces a regression, you need a rollback path.
Monitor embedding drift: If your source content vocabulary shifts significantly over time (new product lines, regulatory changes, shifting user language), even a good model's embeddings may lose relevance. Periodic evaluation set refreshes catch this.
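One way to drive both the update policy and a model migration is to store a content hash and the embedding model name alongside each vector, then compare them against the current source. This is a sketch under that assumption; the record layout and function names are hypothetical.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_reembed(source_chunks, index_records, current_model):
    """Return IDs of chunks that are new, edited, or embedded with another model.

    source_chunks: {chunk_id: current text}
    index_records: {chunk_id: {"hash": str, "model": str}} as stored in the index
    """
    stale = []
    for chunk_id, text in source_chunks.items():
        record = index_records.get(chunk_id)
        if (
            record is None
            or record["hash"] != content_hash(text)
            or record["model"] != current_model
        ):
            stale.append(chunk_id)
    return sorted(stale)

source = {"a": "hello", "b": "world", "c": "new"}
index = {
    "a": {"hash": content_hash("hello"), "model": "model-v1"},
    "b": {"hash": content_hash("old text"), "model": "model-v1"},
}
queue = chunks_to_reembed(source, index, "model-v1")
```

Here chunk "b" is queued because its text changed and "c" because it was never indexed; switching `current_model` to a new name would queue every chunk, which matches the full re-embed that a model migration requires.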

Secure Your Vector Infrastructure

Vector databases often receive less security scrutiny than traditional databases, which is a mistake.

Treat embeddings as sensitive data: Vectors can be inverted—partially reconstructed back toward the original text—with sufficient effort. Store them with the same access controls as the source content.
Apply tenant isolation at the collection or namespace level, not just at query-filter time. A misconfigured filter is a data leak; architectural separation is not.
Rotate API keys and audit access logs on the same schedule as the rest of your AI infrastructure. If your vector database vendor offers VPC peering or private endpoints, use them for production workloads.
For agency operators building client-facing systems, review the security architecture against your client's data handling requirements before deployment. How Generative AI Works: Best Practices That Actually Work covers this operational layer in more depth.

Frequently Asked Questions

What's the difference between embeddings and vector search?

Embeddings are the numeric representations—vectors—that encode the meaning of a piece of content. Vector search is the process of finding stored vectors that are closest (most similar) to a query vector. You need both: embeddings without search infrastructure are representations with nowhere to go; vector search without good embeddings is fast retrieval of the wrong things.

How many chunks do I need before a vector database is worth using?

Practical thresholds vary, but below roughly 50,000–100,000 chunks, a simple cosine similarity search over an in-memory array (using NumPy or similar) is often fast enough and avoids operational overhead. Above that scale, or if you need filtered search, persistence, or high concurrency, a dedicated vector database earns its place.
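For a sense of how little code the small-scale case needs, here is an exact cosine search over an in-memory list. It is written in plain Python so it runs without dependencies; at tens of thousands of chunks you would vectorize the same logic with NumPy, as suggested above. The names and toy vectors are illustrative.

```python
import math

def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def top_k(query, vectors, k=3):
    """Exact (brute-force) cosine search; returns (index, score) pairs, best first."""
    q = normalize(query)
    scored = []
    for i, vector in enumerate(vectors):
        v = normalize(vector)
        scored.append((sum(a * b for a, b in zip(q, v)), i))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.1], vectors, k=2)
```

Because this search is exact, it also doubles as the ground truth when you later measure the recall of an approximate index.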

Can I use the same embedding model for both indexing and querying?

Yes, and you must. Mixing models—indexing with one encoder and querying with another—produces vectors in incompatible spaces. Similarity scores become meaningless. If you change your embedding model, re-index everything. This is non-negotiable.

How do I handle content that updates frequently?

Build a change-detection layer that watches your source content (via webhook, database trigger, or scheduled diff) and queues affected chunks for re-embedding. Most vector databases support upsert operations, so you can update individual chunks without rebuilding the full index. How Generative AI Works: Real-World Examples and Use Cases includes examples of content-heavy deployments that handle this at scale.

Is hybrid search always better than pure vector search?

For most enterprise retrieval tasks, yes. Pure vector search underperforms on queries that contain exact identifiers, proper nouns, or rare terms. The cost of adding BM25 is low relative to the recall gains, especially for customer-facing applications where a single bad result is visible and damaging.

What causes retrieval to degrade silently over time?

Three main culprits: stale embeddings from updated source content that wasn't re-indexed, vocabulary drift where new terms appear in queries that weren't represented in the training distribution, and index fragmentation in some ANN implementations that accumulates as you do many incremental updates. Regular evaluation runs—not just launch-time testing—catch all three.

Key Takeaways

Match your embedding model to your domain, language, and modality before anything else; the model choice sets your quality ceiling.
Chunk strategy matters as much as the model—hierarchical chunking with overlap is the current best practice for RAG.
Store rich, filterable metadata alongside every vector; semantic similarity alone cannot handle access control, recency, or exact-match requirements.
Configure ANN index parameters explicitly; defaults are rarely optimized for production recall or latency requirements.
Hybrid search (dense + sparse + re-ranking) consistently outperforms pure vector search for real-world enterprise queries.
Build a labeled evaluation set before you start optimizing; without ground truth, improvements are guesses.
Treat embedding infrastructure as a living system: update policies, model migration plans, and security controls must be designed in, not bolted on.

Agency Script Editorial
