Semantic search used to require a custom machine-learning team, six months of runway, and a tolerance for ambiguity. Today, a competent developer can have a working embeddings pipeline in production within a week. The gap between "this sounds interesting" and "this is running in our product" has collapsed — but the gap between running and working well remains wide. Most teams that struggle with embeddings don't struggle with the technology. They struggle with sequencing: doing the right things in the right order, with the right owners accountable at each stage.
This playbook closes that gap. It treats embeddings and vector search not as a research topic but as an operational discipline with discrete plays, clear triggers, and defined handoffs. Whether you're building a semantic search feature for clients, a knowledge retrieval layer for an internal AI assistant, or a recommendation engine for a content-heavy product, the same underlying sequence applies. The specifics shift; the structure doesn't.
Before diving into plays, one framing note: embeddings are not magic retrieval dust you sprinkle on a problem. They are a learned representation of meaning, and every decision downstream — which model produces them, how you chunk your content, how you index and query — compounds. Getting the foundation right matters more than optimizing any single layer.
Play 1 — Define the Retrieval Problem Before Touching a Model
The most expensive mistake teams make is embedding content before they understand what queries will look like. Embeddings optimize for semantic similarity between a query and a document. If your queries are short and your documents are long, the similarity geometry works differently than if both are roughly paragraph-length. If your users ask procedural questions ("how do I...") but your documents are reference-style ("the specification states..."), the cosine distances will mislead you.
The Problem Definition Checklist
Before writing a single line of embedding code, answer these four questions:
- What does a successful retrieval look like? Write three to five real examples of a query paired with the ideal result. If you can't do this, your use case isn't defined yet.
- Who generates the queries? Customers typing natural language, internal staff using jargon, or an LLM generating structured lookups — each has a different query distribution.
- What is the acceptable latency? Vector search at sub-100ms requires different infrastructure than batch retrieval that can run in seconds.
- What is the cost of a bad retrieval? If wrong results produce wrong answers in a customer-facing chatbot, you need tighter precision controls than if this is an internal knowledge-search tool with a human in the loop.
Owner: Product lead or whoever owns the use case. This play must not be delegated to an engineer until the answers exist in writing.
Trigger: Run this before any technical spike.
Play 2 — Choose Your Embedding Model With Intention
There is no universal best embedding model. The right choice depends on your content domain, your latency constraints, and whether you need multilingual support. The practical decision tree has three branches.
General-Purpose vs. Domain-Specific
General-purpose models — OpenAI's text-embedding-3-small, Cohere's embed-english-v3.0, or open-source options like bge-m3 from BAAI — perform well across most professional text use cases. Domain-specific fine-tuned models outperform them on specialized corpora (medical, legal, financial) by meaningful margins, but they require labeled data and maintenance.
For most agency and professional deployments: start general-purpose, measure retrieval quality against your real query-document pairs, and only pursue fine-tuning if you have a documented quality gap and the labeled data to close it.
Dimensionality and Cost Trade-offs
Embedding dimensions typically range from 384 to 3072. Higher dimensions can capture more nuance but cost more to store and search. At 1536 dimensions stored as 32-bit floats, one million vectors take roughly 6 GB before index overhead, which is manageable. At ten million vectors, roughly 60 GB, the storage and query costs become a genuine budget line. The index holds one vector per chunk, not per document, so estimate your scale in chunks before locking in a model.
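To put rough numbers on the storage side of that trade-off, here is a back-of-the-envelope sketch. It assumes 32-bit floats (4 bytes per value) and counts raw vectors only; real indexes add overhead for graph links and metadata, and the function name `raw_vector_storage_gb` is purely illustrative.

```python
def raw_vector_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead and metadata.

    Assumes float32 values (4 bytes each) unless told otherwise.
    """
    return num_vectors * dims * bytes_per_value / 1e9


# Compare a small and a large model dimension at two corpus sizes.
for dims in (384, 1536):
    for num_vectors in (1_000_000, 10_000_000):
        size = raw_vector_storage_gb(num_vectors, dims)
        print(f"{num_vectors:>10,} vectors x {dims:>4} dims: {size:6.2f} GB")
```

Count vectors in chunks rather than documents: a corpus of one million documents that averages five chunks each is a five-million-vector index.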
The Multilingual Question
If your users or content span multiple languages, a multilingual model is not optional — it's a prerequisite. Querying a French phrase against English embeddings produces unreliable similarity scores. Models like multilingual-e5-large or Cohere's multilingual embed handle cross-lingual retrieval reasonably well out of the box.
Owner: Engineering lead, with sign-off from the product lead on the domain requirements gathered in Play 1.
Trigger: After Play 1 is complete. Do not run model comparisons until you have real query-document pairs from the problem definition.
Play 3 — Chunk Your Content for the Query, Not for Convenience
Chunking is where most implementations quietly fail. The default behavior — split every 512 tokens, overlap by 50, ship it — produces mediocre retrieval quality because it optimizes for processing convenience rather than semantic coherence.
Chunking Strategies and When to Use Them
- Fixed-size chunking: Fast to implement, predictable. Works acceptably for homogeneous content (e.g., FAQ entries of similar length). Poor for structured documents where a 512-token window cuts through an argument mid-thought.
- Semantic chunking: Split on meaning boundaries — paragraphs, sections, logical units. Requires more preprocessing but produces dramatically better retrieval for long-form content. If your documents are reports, articles, or contracts, semantic chunking is worth the effort.
- Hierarchical chunking: Embed both a summary chunk and fine-grained child chunks. At query time, retrieve by the summary, then fetch the relevant child. Useful when documents are long and queries are narrow.
- Sentence-window chunking: Embed at the sentence level, but return a surrounding window of sentences as the retrieved context. Helps LLMs that need a coherent passage, not a fragment.
Chunk size interacts with your embedding model's context window. Many models accept at most 512 input tokens, and longer input is typically truncated or rejected; many newer models accept up to 8,192 tokens. Longer chunks can preserve more context but dilute similarity scores: a two-sentence query against a 2,000-token chunk often loses signal, because the chunk's single vector has to represent everything in it.
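Two of the strategies above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production splitter: it assumes paragraphs are separated by blank lines, uses whitespace-separated words as a rough stand-in for model tokens, and splits sentences with a naive regular expression. The function names are hypothetical.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized
    chunk rather than being cut mid-thought.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def sentence_windows(text: str, window: int = 1) -> list[dict[str, str]]:
    """Pair each sentence (the text to embed) with its surrounding window (the text to return)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    records = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        records.append({"embed_text": sentence, "context": " ".join(sentences[lo:hi])})
    return records
```

In the sentence-window case, only `embed_text` is embedded and indexed; `context` is stored alongside it and handed back at query time.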
Owner: Engineering, informed by the content audit from Play 1.
Trigger: After model selection. Chunking strategy must align with the model's input characteristics.
Play 4 — Index Correctly and Understand the Approximate Trade-off
Vector databases (Pinecone, Weaviate, Qdrant, pgvector, Chroma) don't return exact nearest neighbors at scale — they return approximate nearest neighbors (ANN). This is a deliberate trade-off: exact search over millions of high-dimensional vectors is prohibitively slow; ANN algorithms like HNSW (Hierarchical Navigable Small World) return results in milliseconds with recall rates typically between 90% and 99%, configurable based on your precision-latency requirements.
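One way to check the recall your index actually delivers is to compare its results against a brute-force exact search over a sample of queries. The sketch below uses plain Python and toy two-dimensional vectors; the `approx_ids` list is a hand-written stand-in for whatever your vector store returns, not the output of a real ANN index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute force: score every stored vector and return the ids of the k most similar."""
    ranked = sorted(vectors, key=lambda vid: cosine(query, vectors[vid]), reverse=True)
    return ranked[:k]


def recall_vs_exact(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the true nearest neighbors that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.0, 1.0]}
query = [1.0, 0.0]

exact_ids = exact_top_k(query, vectors, k=3)
approx_ids = ["a", "c", "d"]  # stand-in for an ANN result that missed "b"
print(exact_ids, recall_vs_exact(approx_ids, exact_ids))
```

Brute force is too slow to serve traffic at scale, but it is fine as an offline yardstick on a few hundred sampled queries.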
Choosing a Vector Store
| Scenario | Pragmatic choice |
| --- | --- |
| Prototype or small-scale (<100k vectors) | Chroma or pgvector |
| Production SaaS, managed infrastructure | Pinecone or Weaviate Cloud |
| Self-hosted, full control | Qdrant or Weaviate OSS |
| Existing Postgres stack | pgvector with IVFFlat or HNSW indexes |
Metadata Filtering: The Often-Missed Lever
Every document chunk should carry structured metadata: source document ID, content type, date, author, access permissions. At query time, pre-filtering on metadata before vector search dramatically improves precision. "Find similar content to this query from documents published in the last 90 days" is far more useful than an unfiltered semantic sweep. Build your metadata schema before you ingest. Retrofitting it is painful.
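The "last 90 days" example can be sketched as a filter-then-rank function. This is an in-memory illustration only: the `Chunk` fields and the `search_recent` name are assumptions made for the example, and a real vector store would apply the filter inside its own query call using its own filter syntax.

```python
import math
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Chunk:
    chunk_id: str
    vector: list[float]
    source_doc_id: str
    published: date


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search_recent(query_vec: list[float], chunks: list[Chunk], today: date,
                  max_age_days: int = 90, k: int = 5) -> list[Chunk]:
    """Pre-filter on publication date, then rank only the survivors by similarity."""
    cutoff = today - timedelta(days=max_age_days)
    candidates = [c for c in chunks if c.published >= cutoff]
    return sorted(candidates, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:k]


chunks = [
    Chunk("old-policy", [1.0, 0.0], "doc-1", date(2023, 1, 1)),   # best match, but stale
    Chunk("new-policy", [0.8, 0.6], "doc-2", date(2024, 5, 20)),
    Chunk("new-memo", [0.0, 1.0], "doc-3", date(2024, 6, 1)),
]
results = search_recent([1.0, 0.0], chunks, today=date(2024, 6, 15), k=2)
print([c.chunk_id for c in results])
```

The stale chunk is the closest vector to the query, yet it never reaches the ranking step. That is the point of filtering before the search rather than after it.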
Owner: Infrastructure or backend engineering.
Trigger: After chunking strategy is defined. Index schema depends on chunk metadata design.
Play 5 — Implement Hybrid Search Before Declaring the System Good
Pure vector search handles semantic similarity well. It handles exact term matching poorly. A user searching for a specific product code, a person's name, or a regulatory clause number gets better results from BM25 keyword search than from cosine similarity. The best production systems combine both.
Hybrid search runs a vector query and a keyword query in parallel, then merges the two result lists in a fusion step. The usual choice is Reciprocal Rank Fusion (RRF), which combines rank positions and so does not need the two scoring scales to be comparable; a learned reranker model is the heavier alternative. In practice, hybrid search outperforms either approach alone across a wide range of query types, and the implementation overhead is manageable because many modern vector databases support it natively.
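Reciprocal Rank Fusion is simple enough to sketch directly. Each document earns `1 / (k + rank)` from every list it appears in, and the fused ranking sorts by the summed score. The constant `k = 60` is a commonly used default, treated here as an assumption to tune rather than a requirement; the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists into one ranking using RRF.

    Ranks start at 1. A document missing from a list contributes nothing for that list.
    """
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["d1", "d2", "d3"]    # ranked by embedding similarity
keyword_hits = ["d3", "d1", "d4"]   # ranked by BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists (`d1`, `d3`) rise above documents that appear in only one, which is the behavior you want when a query mixes semantic intent with an exact term.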
This is also the right moment to introduce a reranker if your use case involves feeding retrieved chunks into an LLM. Cross-encoder rerankers (like Cohere Rerank or the open-source cross-encoder/ms-marco-MiniLM-L-6-v2) re-score the top-k retrieved results with higher accuracy than the bi-encoder similarity used for initial retrieval. The cost is latency: a cross-encoder scores each query-candidate pair with its own model inference, so reranking 50 candidates adds 50 pair evaluations to every query. Budget for it.
Owner: Engineering. Product lead should be looped in to understand precision-recall trade-offs.
Trigger: After basic vector retrieval is working and you have a baseline evaluation.
Play 6 — Evaluate with Real Queries, Not Vibes
"It seems to be returning relevant results" is not an evaluation methodology. You need a retrieval evaluation harness before you ship and before you iterate.
Building Your Evaluation Set
Gather 50 to 200 real or representative queries. For each, manually identify the ideal retrieved chunk(s). This is your ground truth. Then measure:
- Recall@k: Of the ideal chunks, how many appear in the top-k retrieved results?
- MRR (Mean Reciprocal Rank): How highly ranked is the first relevant result?
- Precision@k: Of the top-k returned results, what fraction are relevant?
You don't need a sophisticated framework to start. A spreadsheet with queries, expected results, and retrieved results, scored manually, tells you whether you're making progress or just making changes.
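Once the spreadsheet outgrows manual scoring, the three metrics are only a few lines each. The sketch below assumes each query has a ranked list of retrieved chunk ids and a set of ideal chunk ids; the ids and the two-query evaluation set are invented for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of the ideal chunks, the fraction that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of the top k results, the fraction that are relevant."""
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


eval_runs = [
    (["c1", "c7", "c3"], {"c3", "c9"}),  # first relevant hit at rank 3, one ideal chunk missed
    (["c5", "c2", "c8"], {"c5"}),        # relevant hit at rank 1
]
for retrieved, relevant in eval_runs:
    print(recall_at_k(retrieved, relevant, 3), precision_at_k(retrieved, relevant, 3))
print(mean_reciprocal_rank(eval_runs))
```

Record these numbers with every change to chunking, model, or index settings so that each run is compared against the same baseline.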
Owner: Engineering runs the harness; product lead or domain expert validates relevance judgments.
Trigger: Before and after every significant change to chunking, model, or indexing strategy.
Play 7 — Operationalize: Monitoring, Refresh, and Access Control
A vector index is not a one-time artifact. Content changes, embedding models get updated, and query distributions shift. Treating the index as static is how systems degrade invisibly.
The Operational Checklist
- Stale content detection: When a source document is updated or deleted, the corresponding chunks must be re-embedded and re-indexed. Build a content-change webhook or scheduled diff job from day one.
- Model versioning: If you update your embedding model, you must re-embed the entire corpus. Mixing embeddings from different model versions in the same index produces nonsensical similarity scores. Version your index alongside your model.
- Access control at retrieval time: If different users should see different content subsets, enforce permissions as metadata filters at query time. Do not rely on post-retrieval filtering — it leaks information about document existence. This connects directly to the risks covered in The Hidden Risks of How Generative AI Works (and How to Manage Them).
- Query logging: Log queries and top retrieved results (with user consent where required). This corpus is your most valuable source of ground truth for future evaluation and fine-tuning.
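The scheduled diff job from the first checklist item can start as a content-hash comparison. This sketch assumes you store a hash of each source document's text at indexing time; the function names and the in-memory dictionaries stand in for your document store and index metadata.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(source_docs: dict[str, str],
                 indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare current source documents against the hashes recorded at indexing time.

    Returns (ids to re-embed, ids whose chunks must be deleted from the index).
    """
    to_embed = [doc_id for doc_id, text in source_docs.items()
                if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return to_embed, to_delete


indexed_hashes = {
    "policy-a": content_hash("Refunds within 30 days."),
    "policy-b": content_hash("Shipping takes 5 days."),
    "policy-z": content_hash("Retired policy."),
}
source_docs = {
    "policy-a": "Refunds within 14 days.",   # edited since indexing
    "policy-b": "Shipping takes 5 days.",    # unchanged
    "policy-c": "New loyalty program.",      # added since indexing
}
to_embed, to_delete = plan_reindex(source_docs, indexed_hashes)
print(to_embed, to_delete)
```

When a document is re-embedded, delete its old chunks first; otherwise the retired text stays retrievable next to the new version.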
Owner: Engineering owns the operational infrastructure. Legal or compliance owns the access control sign-off.
Trigger: Before go-live. These are not post-launch improvements — they are launch criteria.
Play 8 — Integrate with the Broader AI Stack
Embeddings and vector search rarely operate in isolation. In most agency and professional deployments, they serve as the retrieval layer for a Retrieval-Augmented Generation (RAG) system, feeding retrieved chunks as context to an LLM. The retrieval quality ceiling determines the generation quality ceiling. A better prompt cannot compensate for wrong retrieved documents.
As you build this integration, align with how your team understands the generative layer. If team members are still building intuitions about how LLMs process context, Rolling Out How Generative AI Works Across a Team is a useful companion for the change management side. And if you're encountering skepticism about whether retrieval-augmented systems are meaningfully different from "just asking ChatGPT," How Generative AI Works: Myths vs Reality addresses the underlying misconceptions directly.
The integration handoff point — where retrieved chunks become LLM context — deserves its own documented interface: how many chunks, what format, what metadata gets passed, and how conflicting information across chunks is handled. Leaving this implicit produces inconsistent behavior that's hard to debug.
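One way to make that interface explicit is a small typed structure plus a single formatting function that every caller uses. The field names, the numbered-citation format, and the default of four chunks are all assumptions for illustration, not a recommended standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    text: str
    source_doc_id: str
    score: float


def build_context(chunks: list[RetrievedChunk], max_chunks: int = 4) -> str:
    """Format the highest-scoring chunks as numbered, source-tagged context for an LLM prompt."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:max_chunks]
    return "\n\n".join(
        f"[{i}] (source: {chunk.source_doc_id})\n{chunk.text}"
        for i, chunk in enumerate(top, start=1)
    )


retrieved = [
    RetrievedChunk("Refunds are issued within 14 days.", "policy-a", 0.81),
    RetrievedChunk("Shipping takes 5 business days.", "policy-b", 0.64),
    RetrievedChunk("Loyalty points expire yearly.", "policy-c", 0.92),
]
print(build_context(retrieved, max_chunks=2))
```

Keeping the chunk count, ordering, and metadata in one function means a behavior change in the generation layer can be traced to a specific interface change rather than to scattered prompt edits.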
Owner: Engineering, with alignment from whoever owns the generative AI layer.
Trigger: After retrieval evaluation shows acceptable performance. Don't integrate a weak retrieval layer into generation — you'll conflate retrieval failures with generation failures.
Frequently Asked Questions
What's the difference between embeddings and traditional keyword search?
Keyword search matches on exact or stemmed terms; it doesn't understand meaning. Embeddings represent meaning as a point in high-dimensional space, so a query for "vehicle maintenance" can retrieve documents that use "car servicing" or "fleet upkeep" without those terms appearing in the query. The trade-off is that keyword search is more precise for exact matches, which is why hybrid approaches outperform either method alone.
How many documents do I need before embeddings are worth the effort?
There's no hard floor, but the ROI on semantic search increases with corpus size and query diversity. For a knowledge base under a few hundred documents with predictable queries, a well-structured keyword search often performs comparably at lower cost. Once your corpus exceeds a few thousand documents or your queries are open-ended and unpredictable, the case for embeddings becomes clear.
Can I use the same embedding model for queries and documents?
Yes, and for most bi-encoder retrieval systems, you must. The embedding space is model-specific — a query embedded by Model A cannot be meaningfully compared to a document embedded by Model B. Some specialized models use asymmetric encoding (different encoders for queries vs. documents) to optimize each direction, but they are explicitly designed for this and document it clearly.
How do I handle embedding model updates without breaking my index?
Version your index. When you update your embedding model, treat re-embedding as a migration: spin up a new index with the new model, re-embed the full corpus, validate retrieval quality against your evaluation set, then cut over. Never mix embeddings from different model versions in the same index. Keep the old index live until the new one passes validation.
What's the biggest operational mistake teams make post-launch?
Treating the vector index as static. Content changes, but the index doesn't unless you build the update pipeline. This produces a system that confidently retrieves stale or deleted information. The fix is straightforward — a content-change detection mechanism and re-indexing pipeline — but teams routinely skip it because it's not exciting to build. It becomes very exciting to explain to a client why the AI is citing a policy that was retired eight months ago.
Is vector search the same as semantic search?
Vector search is the underlying mechanism; semantic search is the application goal. Semantic search means retrieval based on meaning rather than keywords. Vector search — finding nearest neighbors in embedding space — is the dominant implementation technique for semantic search today. You can have vector search without meaningful semantics (if your embeddings are poor) and you can have rough semantic search without vectors (via older techniques like LSA). In practice, the terms are used interchangeably in most professional contexts.
Key Takeaways
- Define the retrieval problem — real query examples, user type, latency budget, and failure cost — before selecting any model or tool.
- Embedding model choice should match your domain, language requirements, and scale. Start general-purpose; fine-tune only when you have documented quality gaps and labeled data.
- Chunking strategy is a primary driver of retrieval quality. Optimize chunk boundaries for semantic coherence, not processing convenience.
- Metadata filtering at query time is the most underused precision lever in most production deployments.
- Hybrid search (vector + keyword) outperforms pure vector retrieval across most real-world query distributions. Add a reranker if you're feeding results to an LLM.
- Evaluate with a structured query set and recall/precision metrics before and after every material change. Vibes are not a measurement.
- Build content refresh, model versioning, and access control into your launch criteria, not your backlog.
- The retrieval quality ceiling is the generation quality ceiling. Fix retrieval before optimizing prompts.