Semantic search used to require either expensive custom ML teams or brittle keyword rules that broke the moment a user phrased something differently. Embeddings change that equation entirely. They let you represent meaning as numbers, store those numbers efficiently, and retrieve the most relevant content at millisecond speed — without ever writing a single keyword rule. The catch is that most introductions to this topic are either too abstract to act on or too buried in math to be useful to someone who just wants to build something.
This guide is neither. It walks through the concrete sequence: what embeddings actually are, how to generate them, where to store them, and how to run searches that return genuinely relevant results. If you've read a primer on how generative AI works and want to go deeper on the retrieval layer that powers most production AI systems, this is the next step.
By the end you'll understand not just the mechanics but the failure modes — where systems break, why, and what to do about it. That's the gap most tutorials leave open.
What Embeddings Actually Are
An embedding is a list of floating-point numbers — a vector — that encodes the meaning of a piece of text (or an image, audio clip, or structured record). Two pieces of text that mean similar things end up as vectors that point in similar directions in high-dimensional space. Two pieces that mean very different things point in different directions.
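To make "point in similar directions" concrete, here is a minimal sketch that compares toy vectors by the cosine of the angle between them. The three-dimensional vectors are hand-made for illustration and are not the output of any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (illustrative only, not model output).
car = [0.9, 0.1, 0.0]
automobile = [0.8, 0.2, 0.1]
banana = [0.0, 0.2, 0.9]

print(cosine_similarity(car, automobile))  # near 1: similar direction
print(cosine_similarity(car, banana))      # near 0: very different direction
```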
The number of dimensions in a vector depends on the model. OpenAI's text-embedding-3-small produces 1,536-dimensional vectors. Cohere's embed-english-v3.0 produces 1,024-dimensional vectors. Popular sentence-transformers models such as all-MiniLM-L6-v2 and all-mpnet-base-v2 produce 384 and 768 dimensions respectively, and larger open models go to 1,024. More dimensions typically mean finer-grained distinctions, at the cost of storage and compute.
Why This Matters for Search
Traditional search matches tokens. "Car" doesn't match "automobile" unless someone hand-coded a synonym. Embedding-based search matches meaning. A query for "vehicle maintenance tips" will surface a document titled "How to keep your car running smoothly" — not because of keyword overlap, but because the vectors are geometrically close.
This is the foundation of retrieval-augmented generation (RAG), semantic document search, recommendation systems, and duplicate detection. Understanding embeddings is understanding the retrieval layer that makes most serious AI applications actually work.
Step 1 — Choose Your Embedding Model
This choice shapes every downstream decision, so make it deliberately.
Hosted API Models
- OpenAI `text-embedding-3-small`: Low cost (~$0.02 per million tokens), solid general-purpose quality, easy to integrate. A reasonable default.
- OpenAI `text-embedding-3-large`: Higher quality, roughly 6.5× the cost (~$0.13 per million tokens at list price). Worth it when retrieval accuracy is business-critical.
- Cohere Embed v3: Strong multilingual support and a useful compression option (binary embeddings) that cuts storage dramatically.
- Google `text-embedding-004`: Competitive on benchmarks, integrates naturally if you're already in the Google Cloud ecosystem.
Open-Source / Self-Hosted Models
- `all-MiniLM-L6-v2`: 384 dimensions, fast, free, runs on CPU. Good for prototyping or high-volume applications where API costs matter.
- `bge-large-en-v1.5` (BAAI): Consistently strong on retrieval benchmarks, 1,024 dimensions.
- `nomic-embed-text-v1.5`: 768 dimensions, supports variable-length context up to 8,192 tokens, Apache 2.0 license.
The Rule: Match Model to Retrieval Task
A model trained for semantic similarity isn't automatically the best for asymmetric search (short query, long document). Check whether the model was trained with a distinction between query and passage encoding. Cohere supports this explicitly through an `input_type` parameter, and the bge family through an instruction prefix added to queries.
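As a rough sketch of what query/passage asymmetry looks like in code, the helpers below prepend the query instruction published for the English bge v1.5 models and leave passages untouched. The helper names are hypothetical, and other model families use different prefixes or parameters, so treat the model card as the source of truth.

```python
# Query-side instruction documented for the English bge v1.5 models.
# Other model families differ; check the model card before reusing this.
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def prepare_query(query: str) -> str:
    """Prefix a short search query before embedding (hypothetical helper)."""
    return BGE_QUERY_INSTRUCTION + query.strip()


def prepare_passage(passage: str) -> str:
    """Passages are embedded without a prefix for this model family."""
    return passage.strip()


print(prepare_query("refund policy"))
```

The point of the split is that queries and documents go through different preparation paths, so it is worth keeping them as two separate functions even when one of them is trivial.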
Step 2 — Prepare and Chunk Your Content
Embedding quality collapses if you feed the model poorly structured input. This step is where most teams underinvest.
Why Chunking Matters
Embedding models have token limits — typically 512 to 8,192 tokens depending on the model. More importantly, embedding an entire 10,000-word document into a single vector averages out all its meaning. You lose specificity. A search for "refund policy" shouldn't have to compete with the other 9,800 words in your terms-of-service document.
Chunking Strategies
- Fixed-size chunks: Split every N tokens (256–512 is common). Simple, predictable. Works poorly when ideas span chunk boundaries.
- Sentence-window chunking: Embed individual sentences but retrieve the surrounding 2–3 sentences for context. Better for dense prose.
- Recursive character splitting: Split on paragraphs first, then sentences, then characters as fallback. LangChain's `RecursiveCharacterTextSplitter` is a standard implementation.
- Semantic chunking: Use a fast embedding model to detect topic shifts and split there. More accurate, more compute.
Overlap
Add 10–20% token overlap between adjacent chunks. This prevents a concept that spans a boundary from being invisible to both sides of the split.
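A minimal sketch of fixed-size chunking with overlap follows. It assumes whitespace-separated words as a stand-in for model tokens; a real pipeline would count tokens with the embedding model's own tokenizer. The function name and defaults are illustrative (32 of 256 is 12.5% overlap).

```python
def chunk_with_overlap(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into word-based chunks where neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_with_overlap(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk starts with the last word of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.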
Metadata
Attach metadata to every chunk: source document ID, chunk index, section heading, creation date, author, content type. You'll use this for filtering later. Treat it like a database schema — design it before you generate a single embedding.
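One way to treat metadata like a schema is to define it as a typed record before any embedding runs. The sketch below uses a dataclass with the fields listed above; the class name, field names, and the `to_payload` helper are assumptions for illustration, not a format any particular vector store requires.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical per-chunk metadata record."""
    source_doc_id: str
    chunk_index: int
    section_heading: str
    created: date
    author: str
    content_type: str

    def to_payload(self) -> dict:
        """Convert to a plain dict with JSON-friendly values."""
        payload = asdict(self)
        payload["created"] = self.created.isoformat()
        return payload


meta = ChunkMetadata("doc-42", 0, "Refunds", date(2024, 1, 15), "Legal", "policy")
print(meta.to_payload())
```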
Step 3 — Generate and Store Embeddings
With clean chunks and a chosen model, generation is mostly an engineering problem.
Generation in Practice
Call your embedding API in batches, not one chunk at a time. Most APIs accept 100–2,048 inputs per request. Rate limits and per-request overhead make serial calls 10–50× slower than batched calls on large corpora. Log your token counts; embedding costs at scale are real.
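The batching pattern can be sketched independently of any provider. Here `embed_batch` is a hypothetical callable standing in for whatever client call you use; it takes a list of strings and returns one vector per string. The batch size of 100 is an illustrative default, not a provider limit.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_corpus(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed all chunks with one call per batch, preserving input order."""
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))  # one request per batch
    return vectors
```

In a real pipeline the loop body is also the natural place to add retry logic and token-count logging, since both are per-request concerns.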
For self-hosted models, use the sentence-transformers library with batch encoding:
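```python
from sentence_transformers import SentenceTransformer

# `chunks` is the list of text chunks (strings) produced in Step 2.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Encode in batches of 64; the result holds one vector per chunk.
embeddings = model.encode(chunks, batch_size=64, show_progress_bar=True)
```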
Normalize your vectors to unit length if you plan to use cosine similarity. Most hosted models return normalized vectors already; self-hosted models often don't.
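A minimal sketch of unit-length normalization in plain Python is shown below, along with a check that the dot product of two normalized vectors equals their cosine similarity. The helper names are illustrative. If you use sentence-transformers, `encode` also accepts `normalize_embeddings=True`, which does this step for you.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector so its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
print(dot(a, a))  # length check: approximately 1
print(dot(a, b))  # equals the cosine similarity of the original vectors
```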
Choosing a Vector Store
| Store | Best for | Notes |
| ------------ | ------------------------------- | -------------------------------------------------- |
| Pinecone | Managed, production scale | Fully hosted, strong filtering, pay-per-use |
| Weaviate | Hybrid search + structured data | Self-hosted or cloud, GraphQL interface |
| Qdrant | High performance, open-source | Rust-based, excellent filtering performance |
| pgvector | Teams already using Postgres | Lower operational overhead, good up to ~1M vectors |
| Chroma | Local prototyping | Zero infrastructure, not production-ready at scale |
| FAISS | In-memory, maximum speed | Meta's library, no built-in persistence, DIY infra |
For most agencies building their first production system, Qdrant or pgvector is the right call — capable, cost-effective, and not locked into a managed service vendor.
Step 4 — Run Your First Vector Search
Vector search finds the K nearest neighbors to a query vector using an approximate nearest neighbor (ANN) index.
The Basic Retrieval Loop
- Take the user's query string.
- Embed it with the same model used to embed your documents. Using different models is a common and silent failure mode.
- Send the query vector to your vector store with a top-K parameter (10–20 is a reasonable starting range).
- Receive ranked results with similarity scores.
- Pass results and the original query to your LLM, or display them directly.
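The loop above can be sketched with an exact (brute-force) search over an in-memory list, which is what an ANN index approximates at scale. The `embed` argument is a hypothetical callable for your embedding model, and the sketch assumes all vectors are already normalized so the dot product acts as cosine similarity.

```python
import heapq


def search(query, embed, index, k=10):
    """Return the k (doc_id, score) pairs most similar to the query.

    `embed` must be the same model that produced the vectors in `index`,
    which is a list of (doc_id, unit_vector) pairs.
    """
    query_vec = embed(query)
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in index
    ]
    best = heapq.nlargest(k, scored)  # highest similarity first
    return [(doc_id, score) for score, doc_id in best]
```

A brute-force version like this is also useful later as ground truth when you want to check how much recall an approximate index is giving up.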
Similarity Metrics
- Cosine similarity: Measures the angle between vectors. Ignores magnitude. Most common and appropriate for text.
- Dot product: Faster, equivalent to cosine if vectors are normalized.
- Euclidean distance (L2): Sensitive to vector magnitude. Rarely the right choice for text.
ANN Algorithms
Vector stores use approximate search for speed. HNSW (Hierarchical Navigable Small World) is the dominant algorithm: it builds a multi-layer graph that, when tuned well, typically reaches 95–99% recall at a fraction of exact-search compute cost. You control the trade-off with parameters like `ef_construction` (index quality, set at build time) and `ef` (query-time recall). Higher values give better recall at the cost of slower index builds and slower queries, respectively.
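To tune these parameters you need a number for how much recall the approximate index gives up. One simple approach, sketched here under the assumption that you can run an exact search on a sample of queries, is to compare the exact top-K IDs with the approximate top-K IDs and average the overlap. The function name is illustrative.

```python
def mean_ann_recall(exact_results, approx_results):
    """Average share of the exact top-K ids that the ANN index also returned.

    Both arguments are lists with one id list per query, in the same order.
    """
    if len(exact_results) != len(approx_results):
        raise ValueError("need one approximate result list per exact result list")
    recalls = []
    for exact_ids, approx_ids in zip(exact_results, approx_results):
        if exact_ids:  # skip queries with no ground truth
            overlap = set(exact_ids) & set(approx_ids)
            recalls.append(len(overlap) / len(exact_ids))
    return sum(recalls) / len(recalls) if recalls else 0.0


exact = [["a", "b", "c", "d"], ["e", "f"]]
approx = [["a", "b", "c", "x"], ["e", "f"]]
print(mean_ann_recall(exact, approx))
```

Re-running this measurement at a few query-time settings shows where additional search effort stops buying recall for your data.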
Step 5 — Add Filtering and Hybrid Search
Pure vector search is powerful but incomplete. Metadata filtering and keyword search fill the gaps.
Metadata Filtering
If a user searches "refund policy" but you only want results from documents published after a certain date, or from a specific product line, pre-filter by metadata before running ANN. Most vector stores support this natively. Design your metadata schema in Step 2 with this in mind.
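Conceptually, pre-filtering narrows the candidate set before any similarity is computed. The sketch below shows that logic over in-memory records; the record layout and the `product_line` and `published` field names are assumptions for illustration. A real vector store would apply the equivalent filter inside its index rather than in application code.

```python
from datetime import date


def prefilter(records, *, product_line=None, published_after=None):
    """Keep only records whose metadata passes every supplied condition."""
    kept = []
    for record in records:
        meta = record["metadata"]
        if product_line is not None and meta["product_line"] != product_line:
            continue
        if published_after is not None and meta["published"] <= published_after:
            continue
        kept.append(record)
    return kept


records = [
    {"id": "c1", "metadata": {"product_line": "pro", "published": date(2024, 3, 1)}},
    {"id": "c2", "metadata": {"product_line": "pro", "published": date(2022, 6, 1)}},
    {"id": "c3", "metadata": {"product_line": "basic", "published": date(2024, 5, 1)}},
]
recent_pro = prefilter(records, product_line="pro", published_after=date(2023, 1, 1))
print([r["id"] for r in recent_pro])
```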
Hybrid Search
Hybrid search combines dense (vector) retrieval with sparse (keyword) retrieval and merges the ranked lists. Sparse retrieval catches exact terminology — product codes, proper nouns, acronyms — that dense retrieval often misses because the embedding model has never seen them.
The standard approach is Reciprocal Rank Fusion (RRF): give each result a score of 1 / (rank + 60) from each retrieval method, sum the scores, re-rank. Simple, parameter-light, and consistently competitive with more complex fusion methods.
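Reciprocal Rank Fusion as described above fits in a few lines. This sketch assumes each retriever returns a list of document IDs ordered best first, with ranks starting at 1 and the constant set to 60; the function name is illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several best-first id lists into one list ordered by RRF score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["a", "b", "c"]   # vector search results
sparse = ["b", "d", "a"]  # keyword search results
print(reciprocal_rank_fusion([dense, sparse]))
```

Documents that appear in both lists accumulate two contributions, which is why "b" and "a" end up ahead of documents that only one retriever found.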
Weaviate and Qdrant support hybrid search natively. For pgvector, combine with pg_trgm or a full-text search column. This pattern integrates naturally with building repeatable LLM workflows where retrieval is one defined stage in a larger pipeline.
Step 6 — Evaluate and Iterate
An embeddings system you haven't measured is an embeddings system you don't actually understand.
What to Measure
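- Recall@K: Of all the documents that are genuinely relevant to a query, what fraction appear in the top K results? (The fraction of the K returned results that are relevant is Precision@K, a related but different metric.) Requires a labeled test set.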
- Mean Reciprocal Rank (MRR): How highly ranked is the first relevant result?
- Latency: P50 and P99 query latency under load. ANN indexes degrade under concurrent queries in ways average latency hides.
- Embedding coverage: What percentage of your corpus is actually indexed?
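The two ranking metrics are short enough to compute by hand. The sketch below uses the usual definitions: recall@K is the share of all relevant documents that appear in the top K, and MRR averages 1/rank of the first relevant result across queries. Function names and the sample IDs are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the relevant ids that appear in the first k retrieved ids."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """`runs` is a list of (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


retrieved = ["d1", "d7", "d3", "d9"]
relevant = {"d3", "d4"}
print(recall_at_k(retrieved, relevant, k=3))
print(reciprocal_rank(retrieved, relevant))
```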
Building a Test Set
Sample 50–200 real queries. Label which documents are relevant for each. Even a small labeled set reveals whether your chunking strategy is fragmenting meaning or your model is weak on domain-specific language.
Common Failure Modes
- Wrong chunk size: Too large averages out meaning; too small loses context.
- Domain mismatch: A general-purpose model struggles on legal, medical, or highly technical content. Fine-tuning or domain-specific models help.
- Query-document asymmetry: Short queries against long passages require asymmetric training or instruction-tuned models.
- Stale index: Documents update; embeddings don't automatically re-generate. Build an update pipeline from day one.
This evaluation mindset applies broadly across AI system design — it's the same discipline described in how generative AI works at a practical level.
Step 7 — Connect to a Larger System
Embeddings and vector search rarely sit alone. They're almost always the retrieval layer inside a RAG pipeline, an agent tool, or a recommendation feature.
RAG Integration
Pass your top-K retrieved chunks as context to an LLM. Constrain the LLM to answer only from the retrieved context if grounding accuracy matters. Monitor when the model ignores retrieved context and answers from parametric memory — this is a hallucination risk that vector search alone can't solve.
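One common way to constrain the model is to number the retrieved chunks and instruct it to answer only from them. The sketch below builds such a prompt as a plain string; the wording and function name are assumptions, and the right phrasing will depend on the LLM you use.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that restricts the answer to the retrieved chunks."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_grounded_prompt(
    "What is the refund window?",
    ["Refunds are available for 30 days.", "Shipping takes 5 days."],
)
print(prompt)
```

Numbering the chunks also makes it possible to ask the model to cite which chunk supported its answer, which helps when you monitor for answers that ignore the context.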
Agent Tool Use
In agentic architectures, vector search is a tool the agent can call. Design the tool interface to accept both a query string and optional metadata filters, so the agent can scope searches dynamically based on conversation state. This connects directly to patterns in the evolving landscape of large language model applications.
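A tool interface along these lines can be sketched as a small input type plus a thin wrapper. Everything here is hypothetical: the `SearchToolInput` fields, the `search_tool` wrapper, and the `search_fn` callable that stands in for your actual retrieval code. Agent frameworks each have their own way of declaring tools, so this only shows the shape of the contract.

```python
from dataclasses import dataclass, field


@dataclass
class SearchToolInput:
    """Arguments the agent supplies when it calls the search tool."""
    query: str
    filters: dict = field(default_factory=dict)  # optional metadata filters
    top_k: int = 10


def search_tool(request: SearchToolInput, search_fn):
    """Validate the request, then delegate to the retrieval function.

    `search_fn` is a hypothetical callable: (query, filters, top_k) -> list.
    """
    if not request.query.strip():
        raise ValueError("query must not be empty")
    if request.top_k < 1:
        raise ValueError("top_k must be at least 1")
    return search_fn(request.query, request.filters, request.top_k)
```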
Re-ranking
For high-stakes retrieval, add a cross-encoder re-ranker after ANN retrieval. Cross-encoders score each (query, document) pair jointly — far more accurate than cosine similarity, but too slow to run against your full corpus. Run ANN to get 20–50 candidates, then re-rank to get your final top-5. Cohere's Rerank API and the cross-encoder/ms-marco models are standard options.
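The retrieve-then-re-rank pattern can be sketched with a pluggable scoring function. Here `score_pair` is a hypothetical callable that scores one (query, document) pair; in practice it might wrap a cross-encoder, for example the `CrossEncoder.predict` method in sentence-transformers. The word-overlap scorer below is only a toy stand-in so the sketch runs without a model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every candidate jointly with the query and keep the best top_n."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_overlap_score(query: str, doc: str) -> float:
    """Toy scorer: number of shared lowercase words (not a real cross-encoder)."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


candidates = [
    "shipping times by region",
    "refund policy for annual plans",
    "how refunds work",
]
print(rerank("refund policy", candidates, toy_overlap_score, top_n=2))
```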
Frequently Asked Questions
How is vector search different from full-text search?
Full-text search matches on tokens and their statistical frequency (BM25 is the standard algorithm). Vector search matches on the geometric proximity of meaning representations. Full-text search excels at exact terminology; vector search excels at paraphrase and conceptual similarity. Production systems often combine both.
Do I need a dedicated vector database, or can I use Postgres?
Postgres with the pgvector extension handles millions of vectors adequately for most agency-scale applications. A dedicated vector database like Qdrant or Pinecone adds better ANN index tuning, built-in hybrid search, and horizontal scaling — worthwhile when you're beyond a few million vectors or need sub-10ms P99 latency.
How do I keep my vector index up to date when documents change?
Build an update pipeline alongside your ingestion pipeline. Track document version hashes; when a document changes, delete its old vectors by document ID and re-embed the new version. Most vector stores support filtered deletes. Neglecting this is the most common operational failure in production embeddings systems.
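The hash-tracking step can be sketched as a pure function that compares current document contents with the hashes stored at the last indexing run. The function names are illustrative, and the actual delete and re-embed calls depend on your vector store, so they are left out.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(documents: dict[str, str], indexed_hashes: dict[str, str]):
    """Return (to_reembed, to_delete) document id lists.

    `documents` maps doc id -> current text.
    `indexed_hashes` maps doc id -> hash recorded when it was last embedded.
    """
    to_reembed = [
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_reembed, to_delete
```

For every ID in `to_reembed` that already exists in the index, delete its old vectors by document ID before inserting the new ones, so stale chunks from the previous version do not linger.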
What embedding model should I start with?
For most English-language use cases, OpenAI's text-embedding-3-small is the pragmatic starting point: reliable, well-documented, and cheap enough that cost isn't a constraint at typical agency volumes. Move to a domain-specific or higher-dimensional model only after you've measured and confirmed it's the bottleneck.
How many chunks should I retrieve (what's the right K)?
Start with K=10 and measure. Lower K reduces noise but risks missing the right answer. Higher K increases recall but adds token cost and can dilute LLM context quality. The optimal value depends on your chunk size, document diversity, and whether you're re-ranking. Treat it as a tunable parameter, not a fixed constant.
Can embeddings handle languages other than English?
Multilingual models like Cohere Embed v3, multilingual-e5-large, and paraphrase-multilingual-mpnet-base-v2 support 50–100+ languages in a shared vector space, meaning a query in French can retrieve a document in English if the meaning matches. Quality varies by language; evaluate on your specific language pairs before committing to a multilingual model over a specialized monolingual one.
Key Takeaways
- Embeddings encode meaning as vectors; vector search finds semantically similar content by measuring geometric distance between those vectors.
- Model choice, chunking strategy, and metadata schema are design decisions that compound — getting them right early saves significant rework.
- Always embed queries with the same model used to embed documents; mismatched models silently destroy retrieval quality.
- Hybrid search (dense + sparse) consistently outperforms pure vector search when your corpus contains exact terminology, proper nouns, or specialized codes.
- ANN indexes trade a small amount of recall for large speed gains; tune `ef` and related parameters when recall metrics fall below acceptable thresholds.
- Add a re-ranker for high-stakes retrieval; cross-encoders are slower but far more accurate than cosine similarity for final result ordering.
- Measure Recall@K and MRR against a labeled test set before declaring your system production-ready — subjective impressions of quality are unreliable.
- Plan your index update pipeline on day one; stale embeddings are a silent, compounding problem in production.