Vector Search Fails Subtly, in Seven Recognizable Ways

Agency Script Editorial

Editorial Team

·May 18, 2026·11 min read

Embeddings and vector search are the quiet engine behind retrieval-augmented generation, semantic search, recommendation systems, and a growing slice of enterprise AI infrastructure. When they work, they feel like magic: a user types a vague question, and the system surfaces exactly the right document. When they fail, the failure is subtle and hard to diagnose — you get results that look plausible but aren't useful, or a system that degrades slowly over time without anyone noticing why.

Most teams run into trouble not because the technology is inherently complex, but because the decisions that matter most happen before the first line of application code is written. Which model generates your embeddings? How do you chunk your documents? What does "similar" actually mean for your use case? Skip past those questions too quickly, and you build on sand. This article names seven real failure modes, explains why each one happens, what it costs you, and what the corrective practice looks like.

If you're newer to the underlying mechanics of how language models encode meaning, The Complete Guide to How Generative AI Works provides a strong foundation before diving into the specifics below.

Mistake 1: Using the Wrong Embedding Model for Your Domain

The most common mistake is treating embedding models as interchangeable. Teams grab text-embedding-ada-002 or a default sentence-transformers model, embed their corpus, and wonder why precision is poor.

Embedding models are trained on specific distributions of text. A general-purpose model trained on web data will encode relationships between common English words and concepts reasonably well. It will encode relationships between medical billing codes, legal contract clauses, or semiconductor manufacturing terminology poorly — because those patterns are underrepresented or absent in its training data.

Why it happens

Speed and convenience. General-purpose models are available via a single API call. Domain-specific alternatives require research, evaluation, and sometimes fine-tuning, none of which show up on a sprint board.

The cost

Retrieval relevance degrades quietly. You won't see a crash. You'll see users rephrasing queries repeatedly, downstream LLM outputs that miss the point, and eventually a perception that "AI search doesn't really work" — which poisons broader adoption.

The fix

Before you embed anything at production scale, run a retrieval evaluation. Build a small test set of 50–100 query/expected-document pairs that represent real user intent. Score candidate models by recall@k (did the right document appear in the top k results?). For specialized domains, evaluate domain-tuned models (for example, models built on Bio-ClinicalBERT for clinical text, or legal-specific variants for contract work) against general models on your own data. A well-matched model can outperform the default by a wide margin, on the order of 15–40 percentage points of recall in favorable cases, which is a meaningful difference when you're feeding retrieved context into a generation step.
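As a rough illustration of that comparison, the sketch below scores two candidate models by recall@k over a tiny labeled test set. The query ids, document ids, and both result sets are hypothetical placeholders; in practice the ranked lists would come from running each model against your own corpus.

```python
from typing import Dict, List


def recall_at_k(results: Dict[str, List[str]], expected: Dict[str, str], k: int) -> float:
    """Fraction of queries whose expected document appears in the top k results."""
    hits = sum(
        1 for query, doc_id in expected.items()
        if doc_id in results.get(query, [])[:k]
    )
    return hits / len(expected)


# Hypothetical test set: query id -> id of the document a human judged correct.
expected = {"q1": "doc_a", "q2": "doc_b", "q3": "doc_c", "q4": "doc_d"}

# Hypothetical ranked results from two candidate embedding models.
general_model = {
    "q1": ["doc_a", "doc_x", "doc_y"],
    "q2": ["doc_x", "doc_y", "doc_z"],
    "q3": ["doc_y", "doc_c", "doc_x"],
    "q4": ["doc_z", "doc_x", "doc_y"],
}
domain_model = {
    "q1": ["doc_a", "doc_x", "doc_y"],
    "q2": ["doc_y", "doc_b", "doc_z"],
    "q3": ["doc_c", "doc_y", "doc_x"],
    "q4": ["doc_x", "doc_y", "doc_d"],
}

scores = {
    "general": recall_at_k(general_model, expected, k=3),
    "domain": recall_at_k(domain_model, expected, k=3),
}
print(scores)
```

With only four queries the numbers are illustrative, not evidence; the 50–100 pair test set is what makes a comparison like this trustworthy.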

Mistake 2: Chunking Documents Arbitrarily

Chunking, the step of splitting source documents into segments before embedding, is often treated as a technical afterthought. It is actually one of the highest-leverage decisions in the entire pipeline.

Why it happens

Most tutorials default to fixed-size chunking (e.g., 512 tokens, hard split). It's simple to implement. The problem is that fixed-size splits routinely cut through sentences, paragraphs, and logical units of meaning. The embedding model then has to represent a fragment that starts mid-argument and ends mid-example.

The cost

Chunks that straddle semantic boundaries produce embeddings that don't represent any single idea coherently. The retrieved chunk lands in the LLM's context window without enough surrounding information to be useful, leading to hallucinated or incomplete answers. This is one of the main contributors to RAG systems that "kind of work" but can't be trusted with anything precise.

The fix

Use semantic chunking strategies:

  • Sentence or paragraph boundaries: Split at natural language boundaries first. Most documents are already structured this way.
  • Section-aware splitting: If your documents have headers or structured sections (legal agreements, technical specs, support articles), split at section boundaries and include the header as metadata or as the opening line of each chunk.
  • Overlapping windows: For dense informational text, a 20–25% overlap between adjacent chunks ensures that context near a boundary isn't lost.
  • Chunk size calibration: Match chunk size to your retrieval context budget. If your LLM prompt allows 8k tokens and you're retrieving 5 chunks, each chunk can be ~1,200 tokens, which uses about 6,000 tokens and leaves roughly 2,000 for instructions and the question itself. If you're retrieving 10, they need to be about half that size. Calculate this deliberately.

There's no universal correct chunk size. 256–512 tokens works well for factual Q&A over dense reference material; 512–1,024 works better for narrative or explanatory text where context accumulates over several sentences.
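The budget arithmetic and boundary-aware splitting described above can be sketched as follows. This is a minimal illustration under stated assumptions: token counts are approximated by whitespace-separated words (a real pipeline would use the embedding model's tokenizer), overlap is expressed as whole paragraphs, not a percentage, and the function names are hypothetical.

```python
from typing import List


def chunk_budget(context_tokens: int, reserved_tokens: int, num_chunks: int) -> int:
    """Tokens available per retrieved chunk after reserving room for the rest of the prompt."""
    return (context_tokens - reserved_tokens) // num_chunks


def chunk_paragraphs(text: str, max_tokens: int, overlap_paragraphs: int = 1) -> List[str]:
    """Group whole paragraphs into chunks, repeating the trailing paragraph(s) in the next chunk.

    A single paragraph longer than max_tokens is kept intact as an oversized
    chunk; splitting it further (for example by sentence) is left out here.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        size_if_added = sum(len(p.split()) for p in current + [para])
        if current and size_if_added > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward so boundary context is not lost.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


per_chunk = chunk_budget(context_tokens=8000, reserved_tokens=2000, num_chunks=5)
sample = "a b c\n\nd e f\n\ng h i\n\nj k l"
pieces = chunk_paragraphs(sample, max_tokens=6, overlap_paragraphs=1)
print(per_chunk, pieces)
```

Because overlap is carried as whole paragraphs, no chunk ever begins mid-sentence, at the cost of less precise control over the overlap ratio.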

Mistake 3: Ignoring Metadata and Filtering

Pure vector similarity is powerful but blunt. It finds semantically similar content across your entire corpus without regard to recency, source authority, access permissions, or categorical relevance.

Why it happens

Metadata filtering requires more schema design upfront and slightly more complex query logic. Teams optimizing for time-to-demo skip it.

The cost

A customer support agent queries for help with a product issue, and the system retrieves a deprecated troubleshooting guide from three years ago — because it's semantically similar to the query. Or an employee query surfaces documents from a department they don't have authorization to access.

The fix

Every chunk should carry structured metadata: document date, source type, author, category, access tier, and any domain-specific tags relevant to your use case. Vector databases like Pinecone, Weaviate, Qdrant, and Chroma all support pre-filtering or post-filtering on metadata. Pre-filtering (restricting the candidate set before similarity search) is generally more reliable for hard constraints like access control. Post-filtering (reranking or discarding after similarity search) works better for soft preferences like recency weighting. Use both in combination when your requirements demand it.
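To make the pre-filtering idea concrete, here is a small in-memory sketch that applies hard constraints (access tier and a minimum document year) before scoring anything by cosine similarity. It does not use any particular vector database's API; the `Chunk` class, the metadata keys, and the two-dimensional vectors are all hypothetical stand-ins.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence, Set, Tuple


@dataclass
class Chunk:
    chunk_id: str
    vector: Sequence[float]
    metadata: Dict[str, Any]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    chunks: List[Chunk],
    query_vector: Sequence[float],
    allowed_tiers: Set[str],
    min_year: int,
    k: int,
) -> List[Tuple[str, float]]:
    # Pre-filter: hard constraints remove candidates before any similarity scoring.
    candidates = [
        c for c in chunks
        if c.metadata["access_tier"] in allowed_tiers and c.metadata["year"] >= min_year
    ]
    scored = [(c.chunk_id, cosine(query_vector, c.vector)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


chunks = [
    Chunk("guide_2021", [1.0, 0.0], {"access_tier": "public", "year": 2021}),
    Chunk("guide_2025", [0.9, 0.1], {"access_tier": "public", "year": 2025}),
    Chunk("hr_memo", [0.95, 0.05], {"access_tier": "restricted", "year": 2025}),
    Chunk("faq_2024", [0.2, 0.8], {"access_tier": "public", "year": 2024}),
]
top = filtered_search(chunks, [1.0, 0.0], allowed_tiers={"public"}, min_year=2024, k=2)
print([chunk_id for chunk_id, _ in top])
```

Note that `guide_2021` is the closest vector to the query and `hr_memo` is second closest, yet neither is returned: one is stale and the other is restricted. That is exactly the behavior pure similarity search cannot give you.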

Mistake 4: Embedding Queries and Documents Inconsistently

Semantic search depends on queries and documents living in the same vector space. That sounds obvious, but it breaks in practice more often than you'd expect.

Why it happens

Teams embed their document corpus with one model and then, months later, switch models for cost or performance reasons without re-embedding the corpus. Or they use a fine-tuned model for documents but the base model for queries. Or they change preprocessing (lowercasing, punctuation stripping) inconsistently between index time and query time.

The cost

Cosine similarity scores become meaningless. When the two models share the same vector dimension, the query still runs, but it compares coordinates from unrelated vector spaces, so the nearest neighbors are semantically arbitrary. This is extremely difficult to debug because the scores still look reasonable; they're just measuring the wrong thing.

The fix

Treat the embedding model as part of your infrastructure contract, not a configuration option. Version your embedding model alongside your index. When you upgrade models, re-embed and re-index the entire corpus. Never mix embedding models within a single index. Document your preprocessing pipeline and apply it identically at index time and query time. This is operational discipline, not engineering complexity.
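One way to enforce that contract is to store a fingerprint of the embedding configuration alongside the index and refuse to query when the query-side configuration differs. The sketch below is an assumption about how this could look, not a feature of any specific vector database; `EmbeddingConfig`, `ConfigMismatchError`, and the model name `example-embedder` are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    """Everything that determines which vector space a piece of text lands in."""
    model_name: str
    model_version: str
    lowercase: bool
    strip_punctuation: bool

    def fingerprint(self) -> str:
        # Stable short hash of the model identity plus preprocessing switches.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class ConfigMismatchError(RuntimeError):
    """Raised when a query would be embedded differently from the indexed documents."""


def assert_compatible(index_fingerprint: str, query_config: EmbeddingConfig) -> None:
    if query_config.fingerprint() != index_fingerprint:
        raise ConfigMismatchError(
            "Query embedding config differs from the config used to build the index; "
            "re-embed and re-index before querying."
        )


# At index build time: persist the fingerprint next to the index.
index_config = EmbeddingConfig("example-embedder", "v1", lowercase=True, strip_punctuation=False)
stored_fingerprint = index_config.fingerprint()

# At query time: check before embedding the query.
assert_compatible(stored_fingerprint, index_config)
```

Failing loudly at query time turns a silent relevance problem into an explicit deployment error, which is far easier to diagnose.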

Mistake 5: Neglecting Index Maintenance as Your Corpus Evolves

A vector index is a snapshot. Most production use cases involve corpora that change: new documents are added, old ones are revised or retired, and the distribution of content shifts over time.

Why it happens

The initial build feels complete. Index maintenance doesn't show up as a visible feature. Teams set it up once and move on.

The cost

Stale indexes return outdated information with high confidence. Index bloat from accumulated deleted-but-not-removed vectors degrades both query latency and relevance. Over 6–12 months, an unmaintained index on a dynamic corpus can silently degrade to the point of being unreliable. This is the kind of technical debt that's invisible until a user catches a significant error in a consequential context — a compliance issue, a customer-facing mistake, a missed update to critical policy documentation.

The fix

Establish index hygiene as a routine operational process:

  • Upsert on change: When source documents are updated, re-embed and upsert (update or insert) the affected chunks. Don't just add new versions alongside old ones.
  • Hard deletes: Remove vectors associated with retired or invalidated documents. Most vector databases support this; make it part of your document lifecycle management.
  • Scheduled audits: Periodically sample retrieved results for active query patterns and verify they're still accurate and current.
  • Index versioning: Maintain the ability to roll back to a previous index snapshot if a batch update introduces regressions.
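The upsert and hard-delete practices above can be sketched with a small in-memory index. This is a simplified illustration: `ChunkIndex` and `fake_embed` are hypothetical, and a production system would call the vector database's own upsert and delete operations. Content hashes are used so that unchanged chunks are not re-embedded.

```python
import hashlib
from typing import Callable, Dict, List, Sequence


class ChunkIndex:
    def __init__(self, embed: Callable[[str], Sequence[float]]) -> None:
        self._embed = embed
        self._vectors: Dict[str, Sequence[float]] = {}  # chunk_id -> vector
        self._hashes: Dict[str, str] = {}               # chunk_id -> content hash
        self._by_doc: Dict[str, List[str]] = {}         # doc_id -> chunk_ids

    def _drop(self, chunk_id: str) -> None:
        self._vectors.pop(chunk_id, None)
        self._hashes.pop(chunk_id, None)

    def upsert_document(self, doc_id: str, chunks: List[str]) -> int:
        """Replace the chunks for doc_id; returns how many chunks were re-embedded."""
        new_ids = [f"{doc_id}#{i}" for i in range(len(chunks))]
        # Remove chunk ids that no longer exist in the revised document.
        for stale_id in set(self._by_doc.get(doc_id, [])) - set(new_ids):
            self._drop(stale_id)
        embedded = 0
        for chunk_id, text in zip(new_ids, chunks):
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self._hashes.get(chunk_id) != digest:  # new or changed content
                self._vectors[chunk_id] = self._embed(text)
                self._hashes[chunk_id] = digest
                embedded += 1
        self._by_doc[doc_id] = new_ids
        return embedded

    def delete_document(self, doc_id: str) -> None:
        """Hard delete: remove every vector that belongs to a retired document."""
        for chunk_id in self._by_doc.pop(doc_id, []):
            self._drop(chunk_id)

    def __len__(self) -> int:
        return len(self._vectors)


def fake_embed(text: str) -> List[float]:
    # Placeholder for a real embedding call.
    return [float(len(text))]


index = ChunkIndex(fake_embed)
first = index.upsert_document("policy", ["intro", "rules", "appendix"])
second = index.upsert_document("policy", ["intro", "revised rules"])
print(first, second, len(index))
```

The second call re-embeds only the changed chunk and removes the chunk that no longer exists, so old versions never sit alongside new ones.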

Mistake 6: Relying Solely on Cosine Similarity Without Reranking

Vector similarity search is a fast first-pass filter. It is not a precision ranking system. Treating the top-k cosine similarity results as the final ranked list consistently underperforms.

Why it happens

It's the default behavior of every vector database. Adding a reranking step requires additional infrastructure or API calls, and its value isn't obvious until you're chasing the last 20 points of precision.

The cost

The most relevant document for a specific query is often not the one with the highest cosine similarity. Cosine similarity captures broad semantic overlap. A short, vague document that loosely matches many queries may consistently outscore a precise, authoritative document that closely matches fewer. This matters most for high-stakes retrieval tasks: legal research, medical information, technical troubleshooting. This is also relevant to broader patterns seen across generative AI failures — precision problems often emerge at the retrieval layer, not the generation layer.

The fix

Add a cross-encoder reranker as a second pass. Cross-encoders (models like Cohere Rerank, or open-source options from the sentence-transformers library) take each query/document pair and score them jointly, which is more computationally expensive but far more accurate than embedding comparison. The standard pattern is: retrieve top-20 or top-50 candidates via vector search, then rerank with a cross-encoder and pass the top-3 to top-5 to your LLM. The latency cost is typically 100–400ms, which is acceptable in most applications and pays for itself in answer quality.
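The retrieve-then-rerank pattern can be sketched independently of any specific reranker. In the example below, `pair_scorer` is where a cross-encoder would plug in; `token_overlap_scorer` is only a toy stand-in so the sketch runs without a model, and the candidate documents are hypothetical.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    candidates: List[Tuple[str, str]],
    pair_scorer: Callable[[str, str], float],
    final_k: int,
) -> List[str]:
    """Rerank first-pass candidates by scoring each (query, document) pair jointly.

    `candidates` holds (doc_id, text) pairs in the order returned by vector
    search, typically the top 20 to 50.
    """
    scored = [(doc_id, pair_scorer(query, text)) for doc_id, text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:final_k]]


def token_overlap_scorer(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Hypothetical first-pass results, listed in vector-search order.
first_pass = [
    ("overview", "general account help and settings"),
    ("reset_guide", "how to reset the admin password step by step"),
    ("billing", "update billing password reminder"),
]
final = retrieve_then_rerank("reset admin password", first_pass, token_overlap_scorer, final_k=2)
print(final)
```

Keeping the scorer as a parameter means the vector-search stage and the reranking stage can be evaluated and swapped independently.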

Mistake 7: Skipping Retrieval Evaluation Entirely

Teams build retrieval pipelines and evaluate the end-to-end system — did the LLM give a good answer? — without ever instrumenting the retrieval step independently.

Why it happens

Evaluation infrastructure is unglamorous and doesn't ship features. And because LLMs are good at improvising around retrieval gaps (they'll use their parametric knowledge to fill holes), end-to-end quality can look acceptable even when retrieval is poor.

The cost

You can't improve what you don't measure. Retrieval problems get masked by model capability, then suddenly surface when you change models, expand your corpus, or encounter a query type you haven't seen before. More broadly, understanding where things break is essential to building AI systems you can trust — a theme that runs through how generative AI works at every level.

The fix

Build a retrieval-specific evaluation harness early. You need:

  • A labeled dataset of query/relevant-document pairs (50–200 examples to start, covering your main query categories)
  • Metrics: recall@k (did the right doc appear in top k?), MRR (mean reciprocal rank), and NDCG for ranked relevance
  • A consistent process for updating this test set as your use cases evolve

Run this evaluation whenever you change chunking strategy, swap embedding models, or significantly expand your corpus. It takes a few hours to set up and will surface problems that months of end-to-end testing misses.
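For the ranking-sensitive metrics, a minimal sketch of MRR and NDCG@k follows. It assumes each query has a set of relevant document ids (for MRR) or a mapping from document id to graded relevance (for NDCG); the example ids are hypothetical, and the NDCG variant shown uses the common linear-gain, log2-discount form.

```python
import math
from typing import Dict, List, Set, Tuple


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant document, or 0.0 if none was retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant ids) pairs."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """Discounted cumulative gain of the ranking, normalized by the ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(position + 1)
        for position, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(position + 1) for position, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg else 0.0


runs = [
    (["a", "b"], {"a"}),  # relevant doc at rank 1
    (["x", "b"], {"b"}),  # relevant doc at rank 2
    (["x", "y"], {"z"}),  # relevant doc not retrieved
]
mrr = mean_reciprocal_rank(runs)
ndcg = ndcg_at_k(["x", "a"], {"a": 1.0}, k=2)
print(mrr, ndcg)
```

Tracking these numbers per query category, not just as a single average, tends to show which kinds of questions retrieval is failing on.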

Frequently Asked Questions

What is the difference between embeddings and vector search?

Embeddings are numerical representations of text (or other data) that encode semantic meaning as high-dimensional vectors. Vector search is the process of finding stored vectors that are most similar to a query vector, typically using cosine similarity or dot product. They work together: you embed your documents at index time and embed the user's query at search time, then find the closest matches.

How do I choose the right chunk size for my documents?

There's no universal answer, but a practical starting point is 256–512 tokens for dense factual content and 512–1,024 for explanatory or narrative text. The right size depends on your retrieval budget (how many chunks you pass to the LLM), the structure of your source documents, and how self-contained each chunk needs to be to answer a question without surrounding context. Always test with a retrieval evaluation set before committing to a chunk size in production.

Can I mix different embedding models in the same vector index?

No. All vectors in a single index must come from the same model with identical preprocessing. Different models produce incomparable vector spaces, so cosine similarity between a vector from Model A and a vector from Model B is meaningless. If you upgrade your embedding model, re-embed and rebuild your entire index.

What vector database should I use?

The right choice depends on your scale, infrastructure, and latency requirements. Pinecone is a fully managed option with low operational overhead. Weaviate and Qdrant offer strong self-hosted or cloud options with good metadata filtering. pgvector works well if you're already on PostgreSQL and your scale doesn't demand a dedicated vector store. Evaluate based on your query volume, index size, filtering needs, and your team's operational capacity, not on what's trending.

How often should I re-embed my corpus?

Whenever your source documents change materially. For static corpora, once is enough. For dynamic content — product catalogs, policy documents, support knowledge bases — establish a synchronization process triggered by document updates. Additionally, re-embed fully whenever you change your embedding model or chunking strategy.

Is reranking always necessary?

Not always, but often. For simple, high-volume queries where precision requirements are moderate (e.g., e-commerce search), vector similarity alone may be sufficient. For high-stakes or complex queries where the difference between the right document and a plausible-but-wrong document matters — technical support, legal, compliance, medical — reranking delivers meaningful improvements and the latency cost is usually justified.

Key Takeaways

  • Match your embedding model to your domain — general-purpose models underperform significantly on specialized corpora.
  • Chunk at semantic boundaries, not arbitrary token counts. Chunk size should be calibrated to your retrieval context budget.
  • Attach structured metadata to every chunk and use it for filtering, especially when recency, access control, or source authority matters.
  • Never mix embedding models within a single index. Treat the model and its preprocessing as infrastructure with versioning requirements.
  • Build index maintenance into your operational process from day one — stale vectors silently degrade relevance over time.
  • Add a cross-encoder reranking pass after vector retrieval for any use case where precision matters.
  • Instrument retrieval independently with recall@k, MRR, and NDCG. End-to-end evaluation masks retrieval failures.


