Embeddings and vector search are the plumbing behind a wide class of AI products—semantic search engines, RAG pipelines, recommendation systems, duplicate detection tools—and yet most teams build them ad hoc. They choose an embedding model during a late-night prototype session, pick a vector database because someone saw it on Twitter, and then hand off a system that nobody else fully understands. When retrieval quality degrades six weeks later, nobody knows where to start debugging.
This article turns that mess into a documented, repeatable process. The goal is a workflow you can hand to a new team member on a Tuesday and have them running productively by Thursday—without losing the nuance that separates a working system from a good one. The payoff is not just better retrieval quality; it's operational control: the ability to test changes, catch regressions, and improve systematically instead of guessing.
Whether you are building a knowledge base search tool, wiring up a retrieval-augmented generation (RAG) system, or helping a client make sense of their content library, the same core stages apply. The specifics shift; the structure does not.
What Embeddings Actually Do (and Why the Workflow Starts Here)
An embedding is a list of numbers—typically 256 to 3,072 floating-point values—that represents the meaning of a piece of text in a geometric space. Text with similar meaning lands close together in that space; text with different meaning lands far apart. Vector search exploits this geometry: instead of asking "does this document contain these keywords?", it asks "is this document close to this query in meaning-space?"
That distinction is the entire value proposition. It also creates the first workflow decision: what counts as "similar" for your use case?
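To make "close in meaning-space" concrete, here is a minimal sketch of cosine similarity over toy vectors. The three-dimensional vectors are invented for illustration and do not come from any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", hand-picked for the example.
toy_vectors = {
    "attorney": [0.90, 0.10, 0.05],
    "lawyer": [0.85, 0.15, 0.10],
    "banana": [0.05, 0.20, 0.95],
}

query = toy_vectors["attorney"]
# Rank the other entries by similarity to the query, most similar first.
ranked = sorted(
    (name for name in toy_vectors if name != "attorney"),
    key=lambda name: cosine_similarity(query, toy_vectors[name]),
    reverse=True,
)
print(ranked)
```

Vector search does the same ranking at scale, using an index so it does not have to compare the query against every stored vector.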
Similarity is domain-specific
A general-purpose embedding model trained on web text will treat "attorney" and "lawyer" as close, which is correct. It may also treat "interest rate" and "interest in the topic" as closer than you want, because in general web text those phrases appear in similar contexts. For a financial services client, that ambiguity is a retrieval bug. Knowing this upfront changes which model you choose and how you evaluate results.
Stage 1: Define the Retrieval Task Before Touching Any Code
The most common failure mode in an embeddings and vector search workflow is skipping task definition. Teams embed all their content first and ask "what can we do with this?" afterward. That produces systems optimized for nothing in particular.
A proper task definition answers four questions:
- Query type: What does a real user actually submit? Free-form natural language? Short keyword phrases? Structured fields? A mix?
- Document type: What are you retrieving? Paragraphs, whole pages, product records, support tickets? Do documents vary wildly in length?
- Relevance criteria: What makes a result "good"? Topical overlap? Exact-match on a key field? Recency? Combinations?
- Latency and scale constraints: How many documents? How many queries per second? What is the acceptable response time?
Write these down. One page is enough. This document becomes the acceptance criteria for every subsequent decision.
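One way to keep the task definition next to the code is a small typed record. This is only a sketch: the class name, field names, and every value below are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RetrievalTask:
    """The four task-definition answers, captured as a reviewable record."""
    query_type: str
    document_type: str
    relevance_criteria: str
    max_documents: int
    peak_queries_per_second: float
    latency_budget_ms: int

# Example values are invented for illustration.
task = RetrievalTask(
    query_type="free-form natural language questions",
    document_type="support articles, chunked by section",
    relevance_criteria="topical overlap; prefer recent articles",
    max_documents=200_000,
    peak_queries_per_second=20.0,
    latency_budget_ms=300,
)
print(asdict(task))
```

Checking a record like this into version control gives later stages something concrete to test against, for example a latency assertion in an integration test.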
Stage 2: Chunk and Prepare Your Source Content
Embedding models have token limits—typically 512 to 8,192 tokens depending on the model. More importantly, retrieval quality degrades when you embed very long passages, because the embedding averages over more content and becomes less specific.
The chunking decision
Common strategies, with their trade-offs:
- Fixed-size chunks (e.g., 256–512 tokens with overlap): Predictable, easy to implement, works well for prose-heavy corpora. Overlap (typically 10–20% of chunk size) reduces the chance that a key sentence is cut at a chunk boundary with no chunk containing it whole.
- Semantic chunking (split at natural boundaries): Better quality, harder to implement consistently. Split on paragraphs, headings, or sentence boundaries rather than raw token counts.
- Document-level embedding: Works when documents are short and self-contained (product descriptions, FAQ entries). Breaks down for long documents.
Whatever strategy you choose, assign each chunk a stable, unique ID and store the source metadata (document ID, URL or file path, section heading, date) alongside it. You will need this metadata for filtering, attribution, and debugging.
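A minimal sketch of fixed-size chunking with overlap, stable IDs, and attached metadata. It assumes whitespace-separated words as a stand-in for the embedding model's real tokenizer, and the `doc_id#position` ID scheme and field names are illustrative choices, not a standard.

```python
import hashlib

def chunk_tokens(tokens: list[str], chunk_size: int = 256, overlap: int = 32) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

def build_chunks(doc_id: str, text: str, source_url: str,
                 chunk_size: int = 256, overlap: int = 32) -> list[dict]:
    """Return chunk records with a stable ID and source metadata."""
    tokens = text.split()  # assumption: whitespace split replaces the real tokenizer
    records = []
    for position, window in enumerate(chunk_tokens(tokens, chunk_size, overlap)):
        chunk_text = " ".join(window)
        records.append({
            "chunk_id": f"{doc_id}#{position}",  # same input always yields the same ID
            "doc_id": doc_id,
            "source_url": source_url,
            "position": position,
            "content_hash": hashlib.sha256(chunk_text.encode("utf-8")).hexdigest(),
            "text": chunk_text,
        })
    return records

sample_text = " ".join(f"w{i}" for i in range(10))
records = build_chunks("doc-1", sample_text, "https://example.com/doc-1", chunk_size=4, overlap=1)
for record in records:
    print(record["chunk_id"], record["text"])
```

Keeping the content hash separate from the ID means a re-run overwrites the same record, while the hash still shows whether the text changed.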
Preprocessing checklist
- Strip boilerplate (headers, footers, nav text, legal disclaimers that appear on every page)
- Normalize whitespace and encoding
- Decide whether to keep or remove markdown/HTML tags (model-dependent)
- Log chunk counts per source document—unexpected drops signal parsing errors
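The first two checklist items can be sketched with the standard library alone. The NFKC normalization form and the 80% repetition threshold are assumptions to tune for your corpus, and the function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter

def normalize(text: str) -> str:
    """Normalize Unicode forms and collapse runs of spaces and blank lines."""
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space becomes a plain space
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

def strip_repeated_lines(pages: list[str], min_fraction: float = 0.8,
                         min_pages: int = 3) -> list[str]:
    """Drop lines that appear on at least min_fraction of pages (likely boilerplate)."""
    if len(pages) < min_pages:
        return list(pages)  # too few pages to tell boilerplate from content
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    boilerplate = {line for line, seen in counts.items() if seen >= threshold}
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in boilerplate)
        for page in pages
    ]

pages = ["Acme Corp\nIntro text", "Acme Corp\nPricing details", "Acme Corp\nContact us"]
print(strip_repeated_lines(pages))
print(repr(normalize("Hello\u00a0  world\n\n\n\nNext")))
```

A frequency heuristic like this can also remove legitimate repeated lines, such as a shared section heading, so it is worth logging what gets dropped.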
Stage 3: Choose and Lock Your Embedding Model
Model choice is one of the most consequential decisions in an embeddings and vector search workflow, and it needs to be treated like infrastructure: once you commit, changing it means re-embedding everything and re-evaluating retrieval quality from scratch.
Evaluation criteria
| Criterion | Why it matters |
| --- | --- |
| Benchmark performance (MTEB) | Gives a starting point. MTEB's retrieval tasks are the most relevant subset. |
| Token limit | Must accommodate your largest chunk |
| Embedding dimension | Higher dimensions improve quality up to a point; they increase storage and compute costs |
| Cost per token | Matters at scale: embedding a large corpus once is cheap; re-embedding it monthly is not |
| Hosting model | API (OpenAI, Cohere, Voyage) vs. self-hosted (sentence-transformers) affects latency, data privacy, and vendor lock-in |
Run at least 50–100 representative query-document pairs through your top two or three candidates before committing. Score them for relevance manually or with a small annotation team. A model that ranks 10th on MTEB might outperform the top-ranked model on your specific domain.
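A head-to-head comparison can be as simple as the harness sketched below. It takes any `embed` callable, so real model clients can be swapped in. The two embedders here are deliberately crude stand-ins (word counts over a tiny invented vocabulary, and text length) that exist only to make the example runnable.

```python
import math
from typing import Callable

Embedder = Callable[[str], list[float]]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hit_rate_at_k(embed: Embedder, docs: dict[str, str],
                  pairs: list[tuple[str, str]], k: int = 3) -> float:
    """Fraction of (query, relevant_doc_id) pairs whose document lands in the top k."""
    doc_vectors = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in pairs:
        query_vector = embed(query)
        top = sorted(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]),
                     reverse=True)[:k]
        hits += relevant_id in top
    return hits / len(pairs)

# Stand-in "models"; replace with calls to your real candidates.
VOCAB = ["rate", "loan", "refund", "shipping", "interest"]

def word_count_embedder(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def length_embedder(text: str) -> list[float]:
    return [float(len(text)), 1.0]  # ignores meaning entirely; a weak baseline

docs = {"d1": "loan interest rate table", "d2": "refund and shipping policy"}
pairs = [("what is the interest rate", "d1"), ("shipping refund", "d2")]
scores = {
    name: hit_rate_at_k(embed, docs, pairs, k=1)
    for name, embed in [("word_count", word_count_embedder), ("length", length_embedder)]
}
print(scores)
```

With 50–100 real pairs, the same loop gives a per-candidate score you can record alongside the model name and version.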
For teams learning to think clearly about model behavior more broadly, How Generative AI Works: Myths vs Reality covers common misconceptions that affect model selection decisions.
Stage 4: Build and Populate the Vector Index
A vector index stores your embeddings and serves approximate nearest-neighbor (ANN) queries. The major hosted options (Pinecone, Weaviate Cloud, Qdrant Cloud) and open-source/self-hosted options (Qdrant, Chroma, pgvector in Postgres) all support the core workflow. Pick based on your operational preferences, not on which has the most aggressive developer marketing.
Index configuration decisions
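- Distance metric: Cosine similarity is the default for most text embedding models. Dot product is cheaper to compute and produces the same ranking as cosine when vectors are normalized to unit length (confirm in your model's documentation whether its output is normalized). Euclidean distance also matches cosine ranking on normalized vectors, but on unnormalized vectors it is sensitive to magnitude, so it is generally not recommended for text embeddings.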
- Index type: Most production ANN indexes (HNSW is the dominant algorithm) have parameters that trade recall for speed. Start with library defaults; tune only after profiling.
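- Metadata filtering: Decide upfront which metadata fields you will filter on (date range, category, source, language). Many vector databases support pre-filtering (applying the filter as part of the search). Post-filtering, which filters after the top-K search, can return fewer than K results when the filter is selective, and compensating by over-fetching gets expensive on large collections.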
- Namespaces or collections: If you're serving multiple clients or content domains, isolate them in separate namespaces from day one. Retrofitting isolation into a shared index is painful.
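The difference between filtering before and after the top-K search is easy to see with a brute-force stand-in for the index. This sketch does not model a real ANN index or any particular database; the records and the `lang` field are invented.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: list[float], records: list[dict], k: int) -> list[dict]:
    """Exhaustive nearest-neighbor search, standing in for an ANN index."""
    return sorted(records, key=lambda r: cosine(query, r["vector"]), reverse=True)[:k]

def pre_filter_search(query, records, k, predicate):
    # Restrict the candidate set first, then take the k nearest.
    return top_k(query, [r for r in records if predicate(r)], k)

def post_filter_search(query, records, k, predicate):
    # Take the k nearest overall, then discard the ones that fail the filter.
    return [r for r in top_k(query, records, k) if predicate(r)]

records = [
    {"id": "a", "lang": "en", "vector": [1.0, 0.0]},
    {"id": "b", "lang": "en", "vector": [0.9, 0.1]},
    {"id": "c", "lang": "de", "vector": [0.0, 1.0]},
    {"id": "d", "lang": "de", "vector": [0.1, 0.9]},
]
query = [0.0, 1.0]

def is_english(record: dict) -> bool:
    return record["lang"] == "en"

pre_ids = [r["id"] for r in pre_filter_search(query, records, 2, is_english)]
post_ids = [r["id"] for r in post_filter_search(query, records, 2, is_english)]
print(pre_ids, post_ids)
```

Here the two nearest neighbors are both German, so post-filtering for English returns nothing, while pre-filtering still returns two English results.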
Populate the index in batches (typically 100–500 documents per request) with retry logic and idempotent upsert behavior. Log embedding time, batch success rates, and total document count. These logs catch silent failures that cost you retrieval coverage.
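A sketch of batched ingestion with retries and basic counters. `upsert_batch` is a hypothetical callable that would wrap your vector database client's upsert call; here an in-memory dict plays that role, and writing by ID is what makes a retry idempotent.

```python
import time

def batched(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def upsert_all(records, upsert_batch, batch_size=100, max_attempts=3, base_delay=0.5):
    """Upsert in batches, retrying each failed batch with exponential backoff."""
    stats = {"batches_ok": 0, "batches_failed": 0, "records_ok": 0}
    for batch in batched(records, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                upsert_batch(batch)
                stats["batches_ok"] += 1
                stats["records_ok"] += len(batch)
                break
            except Exception:  # narrow this to your client's transient errors
                if attempt == max_attempts:
                    stats["batches_failed"] += 1
                else:
                    time.sleep(base_delay * 2 ** (attempt - 1))
    return stats

# In-memory stand-in for a vector database; the first call fails on purpose.
store: dict[str, dict] = {}
calls = {"count": 0}

def flaky_upsert(batch: list[dict]) -> None:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated transient failure")
    for record in batch:
        store[record["id"]] = record  # keyed write: a retry overwrites, never duplicates

records = [{"id": f"chunk-{i}", "vector": [float(i)]} for i in range(250)]
stats = upsert_all(records, flaky_upsert, batch_size=100, base_delay=0.0)
print(stats, len(store))
```

Comparing `records_ok` against the expected chunk count after each run is a cheap way to catch the silent coverage loss described above.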
Stage 5: Implement and Test the Query Pipeline
Retrieval is not just "send the query to the index." A production-ready query pipeline typically has five components, one of them optional:
- Query preprocessing: Same normalization you applied to documents—sometimes more. Decide whether to expand queries (add synonyms or context), rewrite them, or pass them raw.
- Embedding the query: Use the identical model and preprocessing as the document side. This sounds obvious; teams get it wrong by updating the document embedding model without updating the query path.
- ANN search: Retrieve top-K candidates. K is typically 5–20 for RAG use cases, higher for recommendation systems. Err toward larger K; you can always trim downstream.
- Reranking (optional but high-value): A cross-encoder reranker (e.g., Cohere Rerank, a local sentence-transformers cross-encoder) scores the top-K candidates against the query with much higher accuracy than the initial vector search. For most RAG pipelines, adding a reranker improves answer quality noticeably—often more than switching embedding models.
- Result assembly: Return chunks with their metadata, relevance scores, and source attribution. Never return raw chunk text without source context in a user-facing application.
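The five components can be sketched as one function with pluggable parts. Everything here is a stand-in: `toy_embed` counts words from a tiny invented vocabulary, the search is brute force rather than ANN, and `overlap_rerank` is a word-overlap score, not a cross-encoder.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def preprocess(query: str) -> str:
    """Lowercase and collapse whitespace, mirroring the document-side normalization."""
    return " ".join(query.lower().split())

def run_query(query, embed, index, k=10, rerank=None, final_n=3):
    cleaned = preprocess(query)                                    # 1. query preprocessing
    query_vector = embed(cleaned)                                  # 2. embed the query
    scored = sorted(                                               # 3. search (brute force here)
        ((r, cosine(query_vector, r["vector"])) for r in index),
        key=lambda pair: pair[1], reverse=True,
    )[:k]
    if rerank is not None:                                         # 4. optional reranking
        scored = sorted(
            ((r, rerank(cleaned, r["text"])) for r, _ in scored),
            key=lambda pair: pair[1], reverse=True,
        )
    return [                                                       # 5. result assembly
        {"chunk_id": r["chunk_id"], "text": r["text"], "source": r["source"], "score": s}
        for r, s in scored[:final_n]
    ]

VOCAB = ["refund", "shipping", "policy", "invoice"]  # invented vocabulary

def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def overlap_rerank(query: str, text: str) -> float:
    return float(len(set(query.split()) & set(text.lower().split())))

corpus = [
    ("c1", "refund policy for damaged items", "help/refunds"),
    ("c2", "shipping policy and delivery times", "help/shipping"),
    ("c3", "how to download an invoice", "help/billing"),
]
index = [{"chunk_id": cid, "text": text, "source": src, "vector": toy_embed(text)}
         for cid, text, src in corpus]

results = run_query("  Refund POLICY ", toy_embed, index, k=2, rerank=overlap_rerank)
for row in results:
    print(row["chunk_id"], row["source"], row["score"])
```

Because the embedder and reranker are passed in, the same function can be exercised in tests with stubs and in production with real clients.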
Document this pipeline as a diagram or a numbered spec. Every person who works on the system should be able to read it in under five minutes. For a broader view of how retrieval fits into generative AI systems, The How Generative AI Works Playbook is a useful companion.
Stage 6: Evaluate Retrieval Quality Systematically
This stage is where most teams are underdeveloped. "It seems to work" is not an evaluation methodology.
Build a golden test set
A golden test set consists of 50–200 (query, relevant document IDs) pairs that represent your actual use case. Sources for golden pairs:
- Real search logs with clicks or positive feedback
- SME annotation sessions (structured 2-hour sessions with domain experts)
- Synthetic generation using an LLM to create plausible queries for sampled documents—useful to bootstrap but should be validated by humans
Metrics to track
- Recall@K: Of the truly relevant documents, what fraction appears in the top-K results? This is the primary metric for most retrieval use cases.
- MRR (Mean Reciprocal Rank): How high does the first relevant result appear? Useful when rank order matters.
- NDCG: Weighted by rank position; appropriate when you have graded relevance judgments (not just binary).
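For reference, the three metrics fit in a few lines each. This sketch assumes each ranked list contains no duplicate IDs and uses the linear-gain form of DCG; the document IDs and relevance grades are invented.

```python
import math

def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(doc_id, 0.0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(i + 2) for i, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0

# One query's ranking, with hypothetical relevance judgments.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 3.0, "d2": 1.0}

print(recall_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(round(ndcg_at_k(ranked, gains, k=4), 4))
```

MRR is `mean` applied to `reciprocal_rank` across every query in the test set; Recall@K and NDCG are averaged the same way.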
Run these metrics every time you change the embedding model, chunking strategy, or index configuration. Treat a regression of more than 2–3 percentage points in Recall@K as a blocking issue before any deployment.
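The blocking rule is simple arithmetic, sketched below. The function name and the 2-point default are illustrative; note that the threshold is in percentage points of Recall@K, not a relative percentage.

```python
def check_regression(baseline: float, candidate: float,
                     max_drop_points: float = 2.0) -> dict:
    """Compare two Recall@K values (0.0 to 1.0) and flag drops above the threshold."""
    # Round to absorb floating-point noise before comparing to the threshold.
    drop_points = round((baseline - candidate) * 100, 6)
    return {"drop_points": drop_points, "blocking": drop_points > max_drop_points}

print(check_regression(0.84, 0.80))  # a 4-point drop
print(check_regression(0.84, 0.83))  # a 1-point drop
```

A check like this can run in CI against the golden test set so that a blocking drop fails the build instead of relying on someone noticing.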
Understanding evaluation rigor in this context connects to broader questions about AI reliability—The Hidden Risks of How Generative AI Works (and How to Manage Them) addresses the organizational side of that problem.
Stage 7: Document the System for Handoff
A workflow only becomes repeatable when the documentation exists outside the original builder's head.
Your system document should include:
- The task definition from Stage 1 (the four questions)
- Chunking strategy with the rationale and any edge cases discovered
- Embedding model name, version, and API endpoint or local path
- Index configuration (database, collection name, distance metric, metadata schema)
- Query pipeline diagram or numbered spec
- Evaluation results against the golden test set, with the date they were run
- Known limitations (query types that perform poorly, document types that are excluded, language coverage)
- Runbook for re-indexing, updating the embedding model, and scaling the index
One to three pages is enough for most systems. The goal is that a competent engineer who did not build this system can operate, debug, and improve it within a day. This principle—that AI work product should be hand-off-able—applies broadly, as covered in Building a Repeatable Workflow for How Generative AI Works.
Frequently Asked Questions
How do I know which embedding model to choose?
Start with the MTEB leaderboard's retrieval subtasks as a shortlist, then run your own evaluation on 50–100 representative query-document pairs from your actual domain. General benchmarks predict relative performance but not absolute suitability for your specific content type or query style. When in doubt, run two candidates head-to-head before committing—switching later costs the price of re-embedding your entire corpus.
What chunk size should I use?
For most prose corpora, 256–512 tokens with a 10–20% overlap is a reasonable starting point. Shorter chunks improve retrieval precision but reduce context; longer chunks improve context but reduce precision. The right answer depends on your document structure and query type—test two or three configurations against your golden test set rather than treating any default as correct.
Can I use the same embedding model for both indexing and querying?
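Yes, and you must. Document and query vectors are only comparable when they come from the same model; vectors from two different models live in unrelated spaces, often with different dimensions, so similarity scores between them are meaningless. Some models apply different prefixes or input modes to queries and documents; that is still one model, so follow its documentation for each side. If you update the embedding model for new documents, you must also re-embed your existing corpus and update the query path simultaneously.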
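One way to enforce this is to store the model identity with the index and compare it on the query path at startup. This is a sketch; the class, field names, and model names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingConfig:
    model_name: str
    model_version: str
    dimension: int

class ModelMismatchError(RuntimeError):
    """Raised when the query path and the index disagree on the embedding model."""

def assert_same_model(index_config: EmbeddingConfig, query_config: EmbeddingConfig) -> None:
    if index_config != query_config:
        raise ModelMismatchError(
            f"index built with {index_config}, query path uses {query_config}"
        )

# Hypothetical configs: the index was built before a model upgrade.
index_config = EmbeddingConfig("example-embedder", "v1", 768)
query_config = EmbeddingConfig("example-embedder", "v2", 768)

try:
    assert_same_model(index_config, query_config)
    mismatch_detected = False
except ModelMismatchError as error:
    mismatch_detected = True
    print(error)
```

Failing fast at startup is usually preferable to serving results whose similarity scores no longer mean anything.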
How often should I re-index my content?
It depends on how frequently your source content changes. For mostly-static corpora, monthly re-indexing with incremental upserts for new content is common. For high-velocity content, implement event-driven upserts triggered by content publication or update events. The more important question is whether you have logging in place to detect when indexed content is stale.
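Staleness can be detected by comparing content hashes, as in this sketch. It assumes you keep a hash per indexed document; `plan_sync` and the sample data are illustrative.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_sync(source: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Decide which document IDs need re-embedding and which should be removed."""
    to_upsert = sorted(
        doc_id for doc_id, text in source.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(set(indexed_hashes) - set(source))    # no longer in the source
    return {"upsert": to_upsert, "delete": to_delete}

# Current source content versus what the index recorded at the last sync.
source = {"a": "version one", "b": "version two, edited", "c": "brand new"}
indexed_hashes = {
    "a": content_hash("version one"),
    "b": content_hash("version two"),
    "d": content_hash("since deleted"),
}

plan = plan_sync(source, indexed_hashes)
print(plan)
```

Logging the size of each plan over time also answers the staleness question directly: a plan that is never empty means the index is lagging the source.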
When does a reranker actually help?
A reranker helps most when your first-stage retrieval is handling diverse or ambiguous queries, when the embedding model's training distribution doesn't match your domain well, or when ranking quality matters as much as coverage (e.g., when only the top 3 results are shown to the user). It adds latency (typically 50–200ms for a cross-encoder over 20 candidates), so profile the trade-off against your latency requirements before adding it unconditionally.
What should I do when retrieval quality degrades unexpectedly?
Start by checking three things in order: whether the source content changed (new documents, structural changes to existing ones), whether the query distribution shifted (are users asking different types of questions?), and whether any infrastructure component updated silently (API model versions sometimes change without notice). If all three are stable, re-run your golden test set to confirm the regression is real, then use failing examples to diagnose whether the issue is in chunking, embedding, or ranking.
Key Takeaways
- Define the retrieval task explicitly—query type, document type, relevance criteria, and constraints—before writing any code. This document drives every subsequent decision.
- Chunking strategy and metadata structure are architectural decisions, not implementation details. Get them wrong and no model upgrade fixes the problem.
- Lock your embedding model like infrastructure. Re-embedding a large corpus has real cost; evaluate candidates before committing, not after.
- A golden test set with 50–200 labeled query-document pairs is the minimum viable evaluation setup. Track Recall@K as your primary metric; treat regressions as blocking.
- A reranker is often higher-leverage than a better embedding model for improving result quality in RAG and search applications.
- The workflow is only repeatable if it is documented: task definition, chunking rationale, model version, index configuration, pipeline spec, evaluation results, and a re-indexing runbook.
- Silent failures—dropped batches, stale indexes, model version changes—are the most common production hazard. Instrument and log every stage.