Most Retrieval Pipelines Are Right Only 70% of the Time

Agency Script Editorial

Editorial Team

May 17, 2026 · 11 min read

Embeddings and vector search have crossed from research curiosity into production infrastructure fast enough that most teams skipped the fundamentals. The result is a graveyard of retrieval pipelines that return the right documents about 70% of the time, frustrate users the other 30%, and whose owners have no idea why. If you're building anything with retrieval-augmented generation, semantic search, or AI-powered recommendation, the quality of your embedding strategy will determine whether the system feels intelligent or broken.

This article covers the practices that actually move the needle — not the obvious "chunk your documents" advice, but the specific decisions around model selection, chunking strategy, index configuration, and retrieval logic that separate mediocre systems from ones that earn user trust. The reasoning behind each practice matters as much as the rule itself, because every production system has constraints that require judgment, not just a checklist.

Understanding why embeddings behave the way they do requires a mental model of what they are. An embedding is a dense numerical vector — typically 384 to 3,072 dimensions — that encodes meaning by positioning text in a geometric space where similar meaning means closer proximity. Vector search is the machinery that finds nearest neighbors in that space efficiently. The failure modes are almost always semantic, not technical: you get back vectors that are mathematically close but contextually wrong because the model didn't understand your domain, your chunks broke concepts apart, or your query phrasing didn't match how the documents were written.
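To make the geometric picture concrete, here is a minimal sketch of nearest-neighbor search by cosine similarity. The three-dimensional vectors and document names are made-up toy values, not output from any real embedding model, and the search is exact brute force rather than the approximate search a vector database would use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, doc_vecs, k=2):
    """Exact (brute-force) top-k search: score every document, sort, truncate."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy vectors (real embeddings have hundreds of dimensions).
docs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "returns-faq": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
top = nearest(query, docs)  # the two refund-related documents rank first
```

The point of the sketch is that "closest" is purely geometric: whether the closest vectors are also the right documents depends on the model, the chunks, and the query.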

Choose Your Embedding Model Based on Domain, Not Benchmarks

The standard mistake is picking a model because it tops the MTEB leaderboard, then wondering why retrieval quality is disappointing in production. Benchmarks measure average performance across heterogeneous datasets. Your use case is not average.

Match the Model to the Task Type

Embedding models are trained with different objectives. Some optimize for semantic similarity (two sentences meaning roughly the same thing score high). Others optimize for retrieval (a short query matching a long, information-dense passage). These are related but not identical tasks, and using the wrong type costs you 10–20 percentage points of recall in typical deployments.

  • Asymmetric retrieval (short query, long document): Use a model trained with bi-encoder retrieval objectives: bge-large-en-v1.5 or e5-large-v2 (each with the query/passage prefixes or instructions its model card specifies), or OpenAI's text-embedding-3-large, which takes no prefix.
  • Symmetric similarity (comparing passages of similar length): General-purpose models like all-mpnet-base-v2 work well.
  • Domain-specific content (legal, medical, financial, code): Fine-tuned or domain-adapted models outperform general models by meaningful margins. If you're embedding medical records and can't fine-tune, at minimum test bge-large-en-v1.5 against text-embedding-ada-002 on a sample of your actual data before committing.

The Evaluation Harness You Actually Need

Don't evaluate models on generic benchmarks. Build a golden set: 50–100 representative queries paired with the documents each should retrieve. Score models by recall@5 and recall@10. This takes half a day to build and will save weeks of debugging downstream. Run every candidate model through this harness before deployment, and re-run it whenever you swap models or significantly change your corpus.
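A golden-set harness can be very small. The sketch below assumes a hypothetical search_fn that takes a query string and returns a ranked list of document IDs; the fake index and document IDs are placeholders for illustration only.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def evaluate(golden_set, search_fn, ks=(5, 10)):
    """Average recall@k over the golden set for each k.

    golden_set: list of (query, relevant_ids) pairs.
    search_fn: hypothetical callable, query -> ranked list of document IDs.
    """
    totals = {k: 0.0 for k in ks}
    for query, relevant_ids in golden_set:
        ranked = search_fn(query)
        for k in ks:
            totals[k] += recall_at_k(ranked, relevant_ids, k)
    return {k: total / len(golden_set) for k, total in totals.items()}

# Toy stand-in for a real retriever, for illustration only.
fake_index = {
    "reset password": ["doc7", "doc2", "doc9"],
    "cancel subscription": ["doc4", "doc1", "doc3"],
}
golden = [("reset password", ["doc2"]), ("cancel subscription", ["doc8"])]
scores = evaluate(golden, lambda q: fake_index[q], ks=(1, 3))
```

Swapping in a different model only means swapping search_fn, which is what makes side-by-side comparisons cheap to repeat.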

Get Chunking Right — It's More Consequential Than the Model

Chunking strategy has more impact on retrieval quality than model choice in most real-world systems. The reason is straightforward: a great model can't retrieve a concept if that concept was split across two chunks.

The Chunk Size Trade-off

  • Too small (under ~100 tokens): Chunks lose context. The model encodes local phrasing but not the surrounding idea. Retrieval returns fragments that score well semantically but are useless without what came before or after.
  • Too large (over ~600 tokens): The embedding averages over too much information. A chunk discussing five topics will be a mediocre match for queries about any single one of them.
  • Sweet spot for most prose: 200–400 tokens with 15–20% overlap between adjacent chunks. The overlap ensures concepts straddling a boundary exist fully in at least one chunk.
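As a rough illustration of the overlap idea, the sketch below slides a fixed-size window over a token list. Whitespace splitting stands in for a real tokenizer, and the default sizes are simply values from the range above, not tuned recommendations.

```python
def chunk_with_overlap(tokens, chunk_size=300, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that share a fraction of tokens."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be positive and overlap_ratio in [0, 1)")
    overlap = round(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # always advance by at least one token
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Whitespace splitting is a crude stand-in for a real tokenizer.
tokens = ("word " * 1000).split()
chunks = chunk_with_overlap(tokens, chunk_size=300, overlap_ratio=0.15)
```

With these settings each window starts 255 tokens after the previous one, so the last 45 tokens of one chunk are repeated at the start of the next.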

Structural Chunking Beats Arbitrary Token Splitting

Split on semantic units — paragraphs, sections, sentences grouped by topic — not on fixed token counts. A document with clear headings should be chunked at heading boundaries, not mid-paragraph at token 512. Many production failures trace back to naive text splitting that ignores document structure entirely.
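One simple way to respect structure is to pack whole paragraphs into chunks up to a token budget. This is a sketch under two assumptions: paragraphs are separated by blank lines, and whitespace word counts are an acceptable proxy for tokens.

```python
def chunk_by_paragraph(text, max_tokens=400):
    """Pack whole paragraphs into chunks without exceeding max_tokens.

    Paragraphs are separated by blank lines. A paragraph longer than
    max_tokens becomes its own oversized chunk instead of being cut
    mid-sentence. Token counts use whitespace splitting as an approximation.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_len = len(para.split())
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "Intro paragraph one.\n\nSecond paragraph here.\n\nA third, separate idea."
chunks = chunk_by_paragraph(doc, max_tokens=6)
```

The same packing logic applies to heading-delimited sections: only the split step changes.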

For structured sources like product catalogs, support documentation, or FAQs: chunk by record or by question-answer pair. A 40-token FAQ entry embedded as a unit retrieves dramatically better than the same content merged into a larger block and averaged away.

The Parent-Child Retrieval Pattern

One of the most effective patterns in production: embed small child chunks for precision, but return larger parent chunks to the LLM for context. Retrieve at the sentence or short-paragraph level (high specificity), then look up the containing section (enough context for the model to answer). This separates the retrieval precision problem from the context adequacy problem.
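The pattern can be sketched in a few lines. Here sentences are the child chunks and sections are the parents; the word-overlap score is a toy stand-in for embedding similarity, and the section names and contents are invented for the example.

```python
def build_parent_child_index(sections):
    """Map each child chunk (here, a sentence) back to its parent section ID."""
    children = []  # list of (child_text, parent_id)
    for parent_id, text in sections.items():
        for sentence in text.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                children.append((sentence, parent_id))
    return children

def retrieve_parents(query, children, sections, score_fn, k=1):
    """Rank child chunks, then return the distinct parent sections they belong to."""
    ranked = sorted(children, key=lambda c: score_fn(query, c[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == k:
            break
    return [sections[pid] for pid in parent_ids]

def word_overlap(query, text):
    """Toy relevance score; a real system would compare embeddings instead."""
    return len(set(query.lower().split()) & set(text.lower().split()))

sections = {
    "billing": "Invoices are sent monthly. Refunds take five business days.",
    "security": "SSO is available on enterprise plans. Passwords are hashed.",
}
children = build_parent_child_index(sections)
context = retrieve_parents("how long do refunds take", children, sections, word_overlap)
```

The match happens on the single refund sentence, but the LLM receives the whole billing section.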

Metadata Filtering Is Not Optional

Pure vector search — finding the top-k nearest vectors globally — breaks down in any corpus with natural subcategories. If you have product documentation for 50 different software versions, a query about version 3.2 should not compete for retrieval slots against identical-sounding passages from version 4.1.

Build a Hybrid Filter Architecture

Every serious vector database (Pinecone, Weaviate, Qdrant, and pgvector via ordinary SQL WHERE clauses) supports filtering on structured metadata alongside vector similarity. Use it. Attach metadata at index time: document type, date, source, category, version, author, language. At query time, pre-filter to the relevant subset before running the vector search.

The trap is over-filtering: narrow your pre-filter too aggressively and you don't have enough candidates for the similarity search to find genuinely useful results. The practical rule: ensure your filtered candidate set contains at least 20× your top-k value before applying semantic ranking. If top-k is 5, you need 100+ candidates after filtering.
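The 20× rule can be enforced in code. The sketch below is one possible policy, not the only one: if a filter combination leaves too few candidates, it relaxes filters one at a time, starting with the last one listed. The record layout and the "version" and "language" fields are assumptions made for the example.

```python
def prefilter(records, filters):
    """Keep records whose metadata matches every key/value in filters."""
    return [r for r in records
            if all(r["meta"].get(k) == v for k, v in filters.items())]

def filtered_candidates(records, filters, top_k=5, min_ratio=20):
    """Apply filters, relaxing them one at a time if too few candidates remain.

    Filters are dropped from last to first, so callers should list the most
    important filter first. min_ratio=20 mirrors the 20x rule of thumb.
    """
    active = dict(filters)
    candidates = prefilter(records, active)
    while len(candidates) < top_k * min_ratio and active:
        dropped_key = list(active)[-1]
        del active[dropped_key]
        candidates = prefilter(records, active)
    return candidates, active

# Hypothetical corpus: 150 chunks for version 3.2, only 10 of them in German.
records = [{"id": i, "meta": {"version": "3.2", "language": "de" if i < 10 else "en"}}
           for i in range(150)]
candidates, used = filtered_candidates(records, {"version": "3.2", "language": "de"})
```

Whether to relax a filter silently, log it, or return fewer results is a product decision; some filters, such as the version in the example above, should probably never be dropped.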

Reranking Changes the Game

Initial vector retrieval is recall-oriented: get plausible candidates fast. Reranking is precision-oriented: among those candidates, identify what's actually most relevant. Treating your top-k vector results as final is leaving significant quality on the table.

Cross-Encoder Rerankers

Cross-encoders (like Cohere Rerank, ms-marco-MiniLM-L-6-v2, or bge-reranker-large) take a query-document pair and score relevance directly, rather than comparing independent vectors. They're slower and more expensive than bi-encoder retrieval, but their relevance scores are substantially more accurate. The typical production pattern:

  1. Retrieve top 20–50 candidates via vector search (fast, cheap)
  2. Rerank with a cross-encoder (slower, applied to only 20–50 pairs)
  3. Pass the top 3–5 reranked results to the LLM

This two-stage architecture costs roughly 3–5× more compute than vector-only retrieval but consistently improves answer quality in RAG systems by a meaningful margin — typical gains in precision@3 run 15–30% depending on the domain.
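The three steps above can be sketched with pluggable functions. Both first_stage and rerank_score are hypothetical callables: in a real system the first would query a vector index and the second would call a cross-encoder. The toy versions below exist only so the example runs without any model.

```python
def retrieve_then_rerank(query, first_stage, rerank_score, n_candidates=30, n_final=3):
    """Two-stage retrieval: cheap candidate fetch, then precise rescoring.

    first_stage: hypothetical callable (query, n) -> list of candidate texts.
    rerank_score: hypothetical callable (query, text) -> relevance score,
                  standing in for a cross-encoder.
    """
    candidates = first_stage(query, n_candidates)
    reranked = sorted(candidates, key=lambda text: rerank_score(query, text),
                      reverse=True)
    return reranked[:n_final]

# Toy stand-ins so the sketch runs without any model.
corpus = [
    "Refund requests are processed within five days.",
    "Our refund policy covers annual plans only.",
    "Shipping is free on orders over fifty dollars.",
    "Contact support to change your billing address.",
]

def toy_first_stage(query, n):
    return corpus[:n]  # pretend these came back from vector search

def toy_rerank_score(query, text):
    query_words = set(query.lower().split())
    return sum(1 for word in text.lower().rstrip(".").split() if word in query_words)

top = retrieve_then_rerank("refund policy for annual plans", toy_first_stage,
                           toy_rerank_score, n_candidates=4, n_final=2)
```

Note that the reranker only reorders what the first stage returned, so it cannot recover a document that vector search missed.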

Indexing Decisions That Bite You Later

Understanding how generative AI works at a mechanical level clarifies why index configuration matters: the retrieval layer feeds the generation layer, and garbage in means garbage out regardless of model quality.

Approximate Nearest Neighbor Trade-offs

Most vector databases use approximate nearest neighbor (ANN) algorithms — HNSW being the dominant choice — rather than exact search. The approximation trades some recall for speed. The parameters you set at index time (for HNSW: ef_construction and M) control this trade-off:

  • Higher ef_construction and M: better recall, more memory, slower indexing
  • Lower values: faster, cheaper, but you're missing relevant results

Default settings are tuned for average cases. If your corpus is under 1M vectors, you can usually afford higher-quality settings even when queries are latency-sensitive. If you're at 10M+ vectors and need sub-100ms response, you're making a genuine recall trade-off. Know which situation you're in before accepting defaults.
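One way to find out which situation you're in is to measure how much recall the approximate index gives up relative to exact search on a sample of queries. In this sketch the approximate result is simulated; in practice it would be whatever your vector database returns at a given parameter setting.

```python
import random

def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbors by squared Euclidean distance."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(query, v))
    return sorted(range(len(vectors)), key=lambda i: dist(vectors[i]))[:k]

def ann_recall(exact_ids, approx_ids):
    """Share of the true nearest neighbors that the approximate index returned."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(200)]
query = [random.random() for _ in range(8)]

truth = exact_top_k(query, vectors, k=10)
# Stand-in for an ANN index response: simulate one that misses two true hits.
approx = truth[:8] + [i for i in range(200) if i not in truth][:2]
recall = ann_recall(truth, approx)  # 8 of the 10 true neighbors were found
```

Averaging this number over a few hundred sample queries at each candidate setting turns the recall-versus-latency trade-off into something you can plot rather than guess.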

Index Everything at Once, Or Pay Later

Adding embeddings incrementally without periodic re-indexing degrades ANN index quality over time in many implementations. Schedule regular full re-index runs if your corpus changes significantly. This is operational discipline, not a nice-to-have.

Query-Side Engineering Is Underrated

Most teams spend 80% of their effort on document processing and 20% on query handling. The split should be closer to 60/40. A pattern from the common mistakes in AI system design shows up clearly here: teams assume the user's raw query is the right unit of retrieval.

Hypothetical Document Embeddings (HyDE)

When a user query is too short or abstract to match document phrasing well, generate a hypothetical answer first, then embed that hypothetical answer for retrieval. The embedding of "a document that would answer this question" often retrieves better than the embedding of the question itself. This works because documents and queries live in subtly different parts of the embedding space for asymmetric retrieval tasks.
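The data flow is short enough to show in full. In this sketch, generate_fn, embed_fn, and search_fn are hypothetical callables standing in for an LLM call, an embedding model, and a vector index lookup; the fake versions record their calls so the order of operations is visible.

```python
def hyde_search(query, generate_fn, embed_fn, search_fn, k=5):
    """HyDE: embed a generated hypothetical answer instead of the raw query.

    generate_fn, embed_fn, and search_fn are hypothetical callables standing in
    for an LLM call, an embedding model, and a vector index lookup.
    """
    hypothetical_doc = generate_fn(query)
    vector = embed_fn(hypothetical_doc)
    return search_fn(vector, k)

# Toy stand-ins that make the data flow visible without calling any service.
calls = []

def fake_generate(query):
    calls.append(("generate", query))
    return "Refunds are issued within five business days of the request."

def fake_embed(text):
    calls.append(("embed", text))
    return [float(len(text))]  # not a real embedding

def fake_search(vector, k):
    calls.append(("search", vector, k))
    return ["doc-refunds"][:k]

results = hyde_search("refund time?", fake_generate, fake_embed, fake_search, k=3)
```

The hypothetical answer does not need to be factually correct; it only needs to resemble the documents you hope to retrieve. The trade-off is an extra LLM call on every query.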

Query Expansion and Decomposition

For complex queries, decompose into sub-queries and retrieve against each independently before merging results. A question like "What are the pricing differences between the enterprise and starter tiers, and which includes SSO?" contains two distinct retrieval needs. Embedding the full question averages across both and may retrieve well for neither.
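A minimal sketch of the merge step follows. It assumes the sub-queries have already been produced (by an LLM or a rule-based splitter, neither shown here) and that search_fn is a hypothetical callable returning ranked document IDs. Results are interleaved by rank so each sub-query gets a fair share of the final list.

```python
def retrieve_decomposed(sub_queries, search_fn, k=3):
    """Retrieve for each sub-query, then interleave results and drop duplicates.

    sub_queries would normally come from an LLM or a rule-based splitter;
    search_fn is a hypothetical callable, query -> ranked list of document IDs.
    """
    result_lists = [search_fn(q)[:k] for q in sub_queries]
    merged, seen = [], set()
    for rank in range(k):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

# Toy index for illustration only.
fake_results = {
    "pricing differences between enterprise and starter tiers": ["pricing", "plans", "faq"],
    "which tier includes SSO": ["sso-guide", "plans"],
}
merged = retrieve_decomposed(list(fake_results), fake_results.get, k=2)
```

Here the top hit for each sub-query lands at the front of the merged list, and the shared "plans" document appears only once.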

Query expansion — appending synonyms or related terms before embedding — is less effective with dense embeddings than it was with BM25, but it still helps in narrow-domain corpora where specific terminology varies (users saying "login" versus "authentication" versus "sign-in").

Hybrid Search: Dense Plus Sparse

Pure vector search misses exact keyword matches that users expect to work. "What is the refund policy for order #88421" contains a specific token (#88421) that semantic search will happily ignore in favor of general refund policy documents. BM25 and other sparse retrieval methods handle exact token matching precisely where dense search fails.

The practical guidance on generative AI applications makes clear that hybrid approaches consistently outperform single-method retrieval across real-world use cases. The implementation pattern:

  • Run dense vector search and sparse BM25 in parallel
  • Merge with Reciprocal Rank Fusion (RRF), which is parameter-light and robust
  • Pass the merged top-k to a reranker

This adds engineering complexity but is worth it for any customer-facing system where retrieval failures are visible. For internal tools with lower stakes, pure dense search is often sufficient.
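Reciprocal Rank Fusion itself is only a few lines. In the sketch below each document's score is the sum of 1 / (k + rank) across the ranked lists it appears in, with k = 60 as the commonly used constant; the document IDs are invented to mirror the order-number example.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists.

    k=60 is the constant commonly used for RRF; ranks start at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ordered[:top_n]

# Hypothetical outputs of the two retrievers for the order-number query.
dense_results = ["refund-policy", "returns-faq", "order-88421"]
sparse_results = ["order-88421", "refund-policy", "shipping"]
fused = reciprocal_rank_fusion([dense_results, sparse_results], top_n=3)
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers. In this example the order-specific document, ranked last by dense search, rises to second place after fusion.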

Frequently Asked Questions

What embedding model should I start with if I'm new to this?

Start with text-embedding-3-small from OpenAI if you want managed infrastructure, or bge-base-en-v1.5 if you want a capable open-source option you can self-host. Both perform well on general English text. Build your evaluation harness first, run both through it on your actual data, and pick based on results — not marketing copy.

How do I know if my retrieval quality is good enough?

Define "good enough" with a golden test set before you build anything else. Track recall@5 (does the right document appear in the top 5 results?) as your primary metric. A baseline above 80% is functional; above 90% is strong for most production applications. Below 70%, something is fundamentally wrong — usually chunking or model mismatch — and adding more infrastructure won't fix it.

Is vector search a replacement for traditional keyword search?

No, and treating it as one is a recurring failure mode. Dense vector search handles semantic and conceptual matching better than keyword search. Keyword search handles exact terms, IDs, codes, and rare proper nouns better than vector search. Hybrid architectures that combine both consistently outperform either alone, which is why Elasticsearch, OpenSearch, and most dedicated vector databases now support hybrid retrieval natively.

How much does embedding and vector search infrastructure cost at scale?

Costs vary widely based on corpus size, query volume, and latency requirements. Rough ranges: embedding generation runs $0.02–$0.13 per million tokens depending on model. Managed vector database hosting runs $70–$300/month for moderate production workloads (1–10M vectors, moderate query volume). Self-hosted options on cloud VMs can be cheaper at scale but add operational overhead. The biggest hidden cost is re-embedding when you change models — an argument for getting model selection right early.
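The embedding side of that estimate is simple arithmetic. The corpus size below is a made-up example; the two prices are the ends of the per-million-token range quoted above.

```python
def embedding_cost_usd(num_chunks, avg_tokens_per_chunk, price_per_million_tokens):
    """One-off cost of embedding a corpus at a given per-million-token price."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# Hypothetical corpus: 2 million chunks averaging 300 tokens each (600M tokens).
low = embedding_cost_usd(2_000_000, 300, 0.02)   # about $12
high = embedding_cost_usd(2_000_000, 300, 0.13)  # about $78
```

Every model change repeats this cost for the full corpus, and chunk overlap increases the token count beyond the raw document size.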

Should I fine-tune my embedding model?

Fine-tuning pays off when your domain has specialized vocabulary or phrasing that general models haven't seen, and when you have at least a few thousand labeled query-document pairs to train on. Legal, biomedical, and highly technical domains are the clearest cases. For most business applications, a well-configured general model with good chunking and reranking will outperform a poorly fine-tuned one. Don't fine-tune to avoid fixing your chunking strategy.

What's the most common reason retrieval pipelines fail in production?

Chunking strategy that ignores document structure. Text splitters that cut at fixed token counts mid-sentence, mid-table, or mid-list create fragments that embed poorly and retrieve worse. The second most common failure is no reranking step — teams retrieve top-5 with pure vector similarity and pass all five to the LLM, including the two that are semantically adjacent but contextually irrelevant.

Key Takeaways

  • Evaluate embedding models against your actual data with a golden test set, not leaderboard benchmarks.
  • Chunk on semantic units (paragraphs, sections, records), not fixed token counts; 200–400 tokens with 15–20% overlap is a reliable starting range for prose.
  • Attach structured metadata at index time and use pre-filtering at query time — but keep filtered candidate sets large enough for meaningful vector search.
  • Add a cross-encoder reranking step between retrieval and generation; it reliably improves precision at modest additional cost.
  • Run dense and sparse (BM25) retrieval in parallel and merge with Reciprocal Rank Fusion for customer-facing systems where retrieval failures are visible.
  • Invest in query-side engineering: HyDE, decomposition, and expansion address cases where the user's raw query doesn't match document phrasing.
  • ANN index parameters are a deliberate recall-vs-latency trade-off, not an afterthought — set them based on your actual scale and latency requirements.
  • The gap between a retrieval pipeline that works 70% of the time and one that works 90% of the time is almost always chunking, reranking, or query handling — not the choice of vector database.
