Stop Hand-Waving at Embeddings and Pay Less for It

Embeddings and vector search are the plumbing behind most serious AI applications you actually care about—semantic search, retrieval-augmented generation, recommendation systems, duplicate detection, and more. Yet for many professionals, they remain a vague hand-wave: something about numbers and similarity. That vagueness is expensive. If you don't understand how embeddings work and when vector search is the right tool, you'll make poor architectural choices, overpay for infrastructure, and build systems that underperform in ways you can't diagnose.

This guide closes that gap. It covers the mechanics of embeddings, how vector search works at scale, which tools and trade-offs matter in practice, and how to connect these ideas to real production decisions. If you've read an introductory explainer on how generative AI works and want to go deeper on one of its most important substrate technologies, this is the next step.

No prior expertise in linear algebra or machine learning is required. You do need to care about building things that work.

What Embeddings Actually Are

An embedding is a list of numbers—a vector—that represents the meaning of something. That something can be a word, a sentence, a paragraph, a product description, an image, a user's click history, or almost any discrete object you can describe to a model. The list of numbers might have 384 entries, 768, 1,536, or 3,072, depending on which model produced it.

The critical property isn't the numbers themselves. It's that the geometry of the vector space encodes semantic relationships. Things that mean similar things end up near each other in that space. Things that mean different things end up far apart. "Dog" and "puppy" land close together. "Dog" and "invoice" land far apart. This isn't hard-coded—it emerges from training on vast amounts of text (or images, or audio), during which the model learns representations that compress meaning into coordinates.

How Models Learn to Embed

Embedding models are trained with objectives that force semantically related inputs to produce similar vectors. The most common approaches:

Contrastive learning: the model is trained so that vectors for similar pairs move closer together and vectors for dissimilar pairs move apart. OpenAI's text-embedding models and models like Sentence-BERT use variants of this.
Masked language modeling as a precursor: models like BERT produce contextual embeddings as a byproduct of learning to predict masked tokens. They are then fine-tuned for retrieval tasks with additional training.
Bi-encoder architecture: two separate encoder passes—one for the query, one for the document—produce vectors that can be compared efficiently at inference time. This is distinct from cross-encoders, which process both inputs jointly and are more accurate but far slower.

Dimensionality and Trade-offs

Higher-dimensional embeddings carry more information but cost more to store and search. A vector of 1,536 floats at 4 bytes each is about 6 KB per item. At 10 million documents, that's roughly 60 GB just for the vectors before any index overhead. Choosing dimensionality is a real engineering decision, not a default to accept blindly. Several modern models are trained with Matryoshka Representation Learning (MRL), which lets you truncate their embeddings to smaller sizes with graceful accuracy degradation. That is useful when you want to trade precision for speed or storage.
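
The storage arithmetic above, plus a truncation step, can be sketched in a few lines. This is a minimal illustration, not a library API: the function names are hypothetical, and truncation is only meaningful for models that were actually trained with MRL.

```python
import numpy as np

def vector_storage_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal gigabytes, excluding any index overhead."""
    return n_vectors * dims * bytes_per_value / 1e9

def truncate_and_renormalize(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components of each row and rescale to unit length.

    Assumption: the vectors come from an MRL-trained model. For other models
    the leading components are not guaranteed to carry most of the signal.
    """
    truncated = vectors[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# 10 million vectors at full size versus a truncated size (float32 in both cases).
full = vector_storage_gb(10_000_000, 1536)   # 61.44 GB
small = vector_storage_gb(10_000_000, 256)   # 10.24 GB
```

Whether the smaller size is acceptable is an empirical question: measure retrieval quality at each candidate dimensionality on your own data before committing.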

The Distance Metrics That Define Similarity

Once you have vectors, "similarity" means measuring distance between them. Three metrics dominate:

Cosine similarity: measures the angle between two vectors, ignoring magnitude. This is the default for most text search because it captures directional meaning regardless of vector length. Values range from -1 to 1, but the range that counts as "similar" varies by model, so calibrate any threshold on your own data instead of assuming a fixed cutoff.
Dot product: similar to cosine but sensitive to magnitude. Useful when the model is specifically trained to use magnitude as a relevance signal. For models that return unit-length vectors, as OpenAI's embedding models do, dot product and cosine similarity produce identical rankings.
Euclidean distance (L2): straight-line distance in the vector space. Appropriate for certain image embedding tasks but less common for text.

The choice of metric should match how the embedding model was trained. Using the wrong metric degrades search quality and produces confusing, hard-to-diagnose failures.
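
To make the three metrics concrete, here is a small NumPy sketch. The vectors are made-up toy values, not real embeddings; the point is only how each metric reacts to magnitude.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; ignores vector length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    """Inner product; grows with vector length as well as alignment."""
    return float(np.dot(a, b))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line (L2) distance between a and b."""
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])  # same direction as a, twice the length

cos_ab = cosine_similarity(a, b)   # 1.0: identical direction
dot_ab = dot_product(a, b)         # 18.0: magnitude counts
l2_ab = euclidean_distance(a, b)   # 3.0: the vectors are not the same point

# For unit-length vectors the metrics are tied together:
# ||u - w||^2 = 2 - 2 * (u . w), so all three produce the same ranking.
u = a / np.linalg.norm(a)
w = np.array([2.0, 1.0, 2.0]) / 3.0
lhs = euclidean_distance(u, w) ** 2
rhs = 2.0 - 2.0 * dot_product(u, w)
```

The last identity is why the metric choice matters most for models whose vectors are not normalized: once magnitudes differ, the three rankings can diverge.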

How Vector Search Works at Scale

Comparing a query vector against every stored vector—brute-force search—is exact and simple but doesn't scale. At a million vectors, it becomes slow; at hundreds of millions, it becomes untenable for real-time applications. Vector search solves this with Approximate Nearest Neighbor (ANN) algorithms that trade a small amount of recall for massive speed gains.
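
For reference, brute-force search is short enough to write out. This is a minimal NumPy sketch with hypothetical function names; it is exact, and its cost grows linearly with corpus size, which is the problem ANN indexes exist to solve.

```python
import numpy as np

def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale every row to unit length so a dot product equals cosine similarity."""
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

def brute_force_top_k(query: np.ndarray, corpus: np.ndarray, k: int):
    """Exact top-k by cosine similarity. `corpus` rows must already be unit length."""
    q = query / max(float(np.linalg.norm(query)), 1e-12)
    scores = corpus @ q                              # one dot product per stored vector
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]        # unordered top-k in O(n)
    top = top[np.argsort(-scores[top])]              # order only those k by score
    return top, scores[top]

# Usage with random stand-in vectors (not real embeddings).
rng = np.random.default_rng(0)
corpus = normalize_rows(rng.normal(size=(1000, 32)))
ids, scores = brute_force_top_k(corpus[42], corpus, k=5)
```

Keeping an exact search like this around is useful even after you adopt an ANN index, because it provides the ground truth for measuring recall.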

The Main ANN Approaches

HNSW (Hierarchical Navigable Small World): builds a layered graph where each vector connects to its nearest neighbors. At query time, the algorithm navigates the graph to find approximate nearest neighbors quickly. HNSW offers excellent recall-speed trade-offs and is the default index type in most modern vector databases. Memory-intensive but very fast.

IVF (Inverted File Index): clusters vectors into buckets (Voronoi cells) using k-means. A query vector identifies its nearest clusters and searches only within them. More memory-efficient than HNSW at large scale but requires tuning the number of clusters and how many to probe at query time.

PQ (Product Quantization): compresses vectors by splitting them into sub-vectors and replacing each with a centroid code. Dramatically reduces memory at the cost of recall accuracy. Often combined with IVF as IVF-PQ for billion-scale datasets.

ScaNN (Google's algorithm): uses anisotropic quantization to prioritize accuracy for high-similarity vectors. Excellent benchmark performance; available in the scann Python library.
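
The IVF idea in particular is simple enough to sketch from scratch. The class below is a toy written for illustration only: the name `ToyIVF` is hypothetical, the k-means is a bare-bones loop, and production libraries implement this far more efficiently.

```python
import numpy as np

def squared_distances(x: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Squared L2 distance from every row of x to every centroid: shape (n, c)."""
    return ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

class ToyIVF:
    """Cluster vectors with k-means, then search only the nearest clusters."""

    def __init__(self, data: np.ndarray, n_clusters: int, n_iters: int = 10, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.data = data
        # Start from randomly chosen data points, then refine with plain k-means.
        self.centroids = data[rng.choice(len(data), n_clusters, replace=False)].copy()
        for _ in range(n_iters):
            assign = squared_distances(data, self.centroids).argmin(axis=1)
            for c in range(n_clusters):
                members = data[assign == c]
                if len(members):
                    self.centroids[c] = members.mean(axis=0)
        # Final assignment: each bucket holds the ids of the vectors in one cell.
        assign = squared_distances(data, self.centroids).argmin(axis=1)
        self.buckets = [np.flatnonzero(assign == c) for c in range(n_clusters)]

    def search(self, query: np.ndarray, k: int, nprobe: int) -> np.ndarray:
        """Return ids of the k nearest vectors found in the nprobe closest cells."""
        centroid_dist = ((self.centroids - query) ** 2).sum(axis=1)
        probe = np.argsort(centroid_dist)[:nprobe]
        candidates = np.concatenate([self.buckets[c] for c in probe])
        dist = ((self.data[candidates] - query) ** 2).sum(axis=1)
        return candidates[np.argsort(dist)[:k]]

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 8))
index = ToyIVF(data, n_clusters=16)
query = rng.normal(size=8)
fast = index.search(query, k=10, nprobe=2)     # scans only 2 of 16 cells
full = index.search(query, k=10, nprobe=16)    # scans every cell, so it is exact
```

Raising `nprobe` scans more cells, which recovers neighbors that sit just across a cell boundary at the cost of more distance computations. Probing every cell degenerates into brute-force search.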

Recall vs. Latency: The Fundamental Trade-off

Every ANN index has tunable parameters (like HNSW's ef_search or IVF's nprobe) that trade recall for latency. A reasonable production target is 95–99% recall at under 20ms p95 latency for most use cases. Getting from 99% to 99.9% recall often doubles query time. Know your acceptable recall floor before you tune.
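
One way to find your operating point is to sweep the tunable parameter and record recall and latency at each setting. The harness below is a sketch under assumptions: `search_fn` is a hypothetical callable that wraps whatever index you use, and `exact_ids` comes from a brute-force search over the same queries.

```python
import time
import numpy as np

def ann_recall(approx_ids, exact_ids) -> float:
    """Mean fraction of the exact top-k neighbors that the ANN search also returned."""
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        total += len(set(approx) & set(exact)) / len(exact)
    return total / len(exact_ids)

def sweep(search_fn, queries, exact_ids, settings):
    """Run every query at every setting; return (setting, recall, p95 latency in ms)."""
    rows = []
    for setting in settings:
        latencies_ms, results = [], []
        for q in queries:
            start = time.perf_counter()
            results.append(search_fn(q, setting))   # e.g. setting = nprobe or ef_search
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
        p95 = float(np.percentile(latencies_ms, 95))
        rows.append((setting, ann_recall(results, exact_ids), p95))
    return rows
```

Pick the cheapest setting whose recall clears your floor, and rerun the sweep whenever the corpus size or the index configuration changes materially.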

Vector Databases: What They Add Beyond the Index

A vector database is not just an ANN index. It adds the operational infrastructure you need to actually run a production system:

Metadata filtering: filter by structured attributes (date range, category, user ID) before or after vector search. Without this, you'd return semantically similar results that are irrelevant for business reasons.
Persistence and durability: in-memory libraries like Faiss leave saving and reloading the index to you, whereas vector databases persist data and handle restarts gracefully.
CRUD operations: real datasets change. You need to add, update, and delete vectors without rebuilding the entire index.
Hybrid search: combining vector similarity with keyword (BM25) search. In practice, hybrid search outperforms pure vector search for most enterprise retrieval tasks, especially on named entities and short queries.
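
Hybrid search needs some way to merge a keyword ranking with a vector ranking. Reciprocal Rank Fusion (RRF) is one widely used option, shown here as a sketch; it is not something this guide prescribes, individual databases may fuse results differently, and the document ids are invented.

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked lists of ids. Each list adds 1 / (k + rank) per id.

    k = 60 is the conventional default; treat it as a tunable assumption.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))

keyword_hits = ["sku-123", "a", "b"]   # e.g. BM25 results, best first
vector_hits = ["a", "c", "sku-123"]    # e.g. ANN results, best first
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["a", "sku-123", "c", "b"]
```

RRF uses only rank positions, so it avoids having to put BM25 scores and similarity scores on a common scale. Documents that appear in both lists rise above documents that appear in only one.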

Major Options and Their Positioning

| Database | Best For | Notable Trade-off |
| -------- | -------- | ----------------- |
| Pinecone | Managed, fast start, serverless | Vendor lock-in, costs escalate at scale |
| Weaviate | Hybrid search, open-source flexibility | Operational complexity self-hosted |
| Qdrant | Rust-based, high performance, good filtering | Smaller ecosystem |
| Chroma | Local dev, RAG prototyping | Not designed for large-scale production |
| pgvector | Already on Postgres, small-medium scale | Performance ceiling at large datasets |
| Milvus | Billion-scale, cloud-native | Complex to operate |

If you're already running Postgres and have fewer than a few million vectors, pgvector removes an entire infrastructure dependency. Start there, and move to a dedicated system only when you hit a real ceiling.

Retrieval-Augmented Generation: Why Embeddings Matter for LLMs

Retrieval-Augmented Generation (RAG) is the most common production reason teams need to understand embeddings and vector search. The pattern: embed your knowledge base, store vectors, embed incoming queries, retrieve top-k similar chunks, and inject them into an LLM prompt as context. This is how you give a large language model access to private, current, or specialized knowledge without retraining.

The quality of your RAG system is largely determined by retrieval quality, not the LLM. A better model cannot compensate for retrieving the wrong chunks. The failure modes are predictable:

Chunking is too coarse: chunks contain mixed topics; the retrieved text is noisy.
Chunking is too fine: individual chunks lack enough context to be useful in isolation.
Embedding model mismatch: using a general-purpose model for a specialized domain (medical, legal, code) produces poor semantic alignment.
Missing metadata filters: retrieving semantically similar but wrong-date or wrong-user content.

A practical starting point: 512-token chunks with 10–15% overlap, a strong embedding model (current strong options include OpenAI's text-embedding-3-large and Cohere's embed-v3), and hybrid search. Treat retrieval precision and recall as metrics you actually measure with a labeled eval set, not assumptions you make during setup. If you're building a repeatable LLM workflow, retrieval evaluation should be a checkpoint, not an afterthought.
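
The chunking starting point can be sketched as a sliding window. This is an illustration under assumptions: the function name is hypothetical, and whitespace splitting stands in for the embedding model's real tokenizer, which will count tokens differently.

```python
def chunk_tokens(tokens, chunk_size: int = 512, overlap: int = 64):
    """Split a token list into windows of chunk_size that share `overlap` tokens.

    64 of 512 tokens is a 12.5% overlap, inside the 10-15% range suggested above.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Stand-in tokenization: whitespace split, NOT the model's tokenizer.
text = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_tokens(text.split())
# 1,000 tokens -> 3 chunks of 512, 512, and 104 tokens
```

Fixed windows are only a baseline. Splitting on headings or paragraphs first, then applying a window inside long sections, often keeps topics from being mixed within a chunk.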

Choosing and Evaluating Embedding Models

Not all embedding models are equal, and the leaderboard changes often. The MTEB (Massive Text Embedding Benchmark) is the most authoritative public benchmark. It tests models across retrieval, classification, clustering, semantic similarity, and other tasks. A model that ranks well overall is not necessarily the best at retrieval specifically, so look at the task category that matches your use case.

Key evaluation considerations:

Domain alignment: a general model trained on web text may underperform on legal contracts or medical notes. Fine-tuning an embedding model on domain-specific data can produce 10–25% recall improvements in specialized settings, though the gain depends on the domain and on the quality of the training pairs.
Language: multilingual models (like multilingual-e5-large) trade some English performance for cross-language capability. If your data is multilingual, this trade is usually worth it.
Context window: most embedding models have a 512-token limit. Some now support 8K tokens. Longer context isn't always better—it can dilute signal—but it matters for long-document tasks.
Cost: embedding 1 million tokens with OpenAI's text-embedding-3-small costs roughly $0.02 at current pricing. At scale, model choice and batching strategy have real financial implications.

Always evaluate on your own data. Leaderboard ranking is a starting point, not a decision.
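
The cost consideration above reduces to simple arithmetic that is worth scripting before you commit to a model. In this sketch the corpus figures (10 million documents averaging 500 tokens) are hypothetical, and the price is the $0.02 per million tokens quoted above, which may have changed.

```python
def embedding_cost_usd(total_tokens: int, price_per_million_usd: float) -> float:
    """One-time cost to embed `total_tokens` at a given price per million tokens."""
    return total_tokens / 1_000_000 * price_per_million_usd

# Hypothetical corpus: 10 million documents averaging 500 tokens each.
n_documents = 10_000_000
avg_tokens_per_document = 500
total_tokens = n_documents * avg_tokens_per_document        # 5 billion tokens

one_pass = embedding_cost_usd(total_tokens, 0.02)           # about $100
```

Remember that this cost recurs in full each time you switch embedding models, and that query embedding adds an ongoing cost on top of it.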

Beyond Text: Multimodal and Specialized Embeddings

The same machinery works for non-text modalities. CLIP-style models produce image and text embeddings in a shared space, enabling cross-modal search—find images matching a text query, or find text matching an image. This powers reverse image search, e-commerce visual similarity, and content moderation.

Code embeddings (used by tools like GitHub Copilot's retrieval components) are trained on code-text pairs to enable semantic code search. Graph embeddings represent nodes and edges as vectors, enabling similarity queries over relationship structures. User and item embeddings power collaborative filtering in recommendation systems—essentially, every major streaming or e-commerce platform runs a form of vector search to generate recommendations.

The future of large language models is deeply tied to multimodal embedding—unified representations across text, image, audio, and video that enable richer retrieval and reasoning. Understanding embeddings now gives you the conceptual foundation to evaluate those developments as they arrive.

Production Pitfalls and How to Avoid Them

Several failure patterns appear repeatedly in teams shipping their first vector search system:

Index drift: you update documents but forget to re-embed and replace their vectors. The index gradually diverges from ground truth. Build embedding updates into your document ingestion pipeline, not as a separate batch job.
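
A content hash stored next to each vector is one simple way to detect drift at ingestion time. The sketch below assumes a hypothetical layout: `documents` maps ids to current text, and `stored_hashes` maps ids to the hash recorded when each document was last embedded.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reembedding(documents: dict, stored_hashes: dict):
    """Return (ids to embed or re-embed, ids whose vectors should be deleted)."""
    to_embed = sorted(
        doc_id for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)   # new or changed
    )
    to_delete = sorted(doc_id for doc_id in stored_hashes if doc_id not in documents)
    return to_embed, to_delete

stored = {"doc-1": content_hash("old text"), "doc-2": content_hash("same text"),
          "doc-3": content_hash("removed later")}
current = {"doc-1": "new text", "doc-2": "same text", "doc-4": "brand new"}
to_embed, to_delete = plan_reembedding(current, stored)
# to_embed == ["doc-1", "doc-4"], to_delete == ["doc-3"]
```

If you chunk documents, hash at the chunk level as well, so that a one-line edit does not force re-embedding of every chunk in a long document.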

Embedding model version changes: upgrading your embedding model invalidates all existing vectors. They were encoded in one model's geometry; the new model's geometry is different. Plan for re-embedding your entire corpus when upgrading. This is expensive and often overlooked in cost estimates.

Ignoring hybrid search: pure vector search performs poorly on keyword-heavy queries (product codes, names, acronyms). Hybrid search—combining BM25 keyword matching with vector similarity—is almost always better for enterprise data. Most production vector databases support it natively.

No evaluation harness: teams ship RAG systems without a labeled test set. They have no way to know if a configuration change improved or hurt retrieval quality. Build a small (200–500 query) golden dataset early. It will pay off within the first month.

Over-indexing on benchmark performance: the MTEB top model may not outperform a smaller, faster model on your actual data. Measure what matters to your users, not what looks good on a public leaderboard.

Frequently Asked Questions

What's the difference between an embedding and a vector?

These terms are used interchangeably in most practical contexts. Technically, a vector is any ordered list of numbers; an embedding is specifically a vector produced by a model to represent the meaning of an input. Every embedding is a vector, but not every vector is an embedding in the semantic sense.

Do I need a vector database, or can I use Postgres with pgvector?

For datasets under roughly 1–5 million vectors with moderate query volume, pgvector on Postgres is often sufficient and removes an infrastructure dependency. Beyond that threshold—or when you need advanced filtering, billion-scale throughput, or built-in hybrid search—a dedicated vector database is worth the operational overhead.

How do I know if my retrieval quality is good enough?

Build a labeled evaluation set: a collection of queries paired with the correct documents. Measure Recall@k (did the right document appear in the top-k results?) and Mean Reciprocal Rank. Aim for Recall@10 above 80–85% as a baseline for most RAG applications. If you're below that, diagnose chunking strategy, embedding model choice, and whether you're using hybrid search before changing the LLM.
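
Both metrics take only a few lines to compute. This sketch follows the definition given above, where a query counts as a hit if any correct document appears in the top k; the function names and the tiny example data are made up.

```python
def recall_at_k(results, relevant, k: int) -> float:
    """Fraction of queries with at least one correct document in the top k."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if set(ranked[:k]) & rel)
    return hits / len(results)

def mean_reciprocal_rank(results, relevant) -> float:
    """Average of 1 / rank of the first correct document (0 when none is found)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

# One ranked result list per query, and the set of correct ids for each query.
results = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
relevant = [{"a"}, {"z"}, {"q"}]

r2 = recall_at_k(results, relevant, k=2)        # 1/3: only the first query hits
r3 = recall_at_k(results, relevant, k=3)        # 2/3: the second query hits at rank 3
mrr = mean_reciprocal_rank(results, relevant)   # (1 + 1/3 + 0) / 3 = 4/9
```

Run the same two functions before and after every configuration change, so that each change is judged against the same labeled queries.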

Can I use the same embedding model for both documents and queries?

Yes, and you usually should. Bi-encoder models are designed to embed both queries and documents in the same space for direct comparison. Some models are trained asymmetrically and expect a different prefix or input type for queries than for documents; that is still one model, used the way its documentation prescribes. Using different models for queries and documents will produce vectors in incompatible spaces and return garbage results.

How often do I need to re-embed my data?

Whenever the underlying text changes (re-embed the changed document) or when you upgrade your embedding model (re-embed everything). Routine re-embedding for unchanged content is unnecessary. Set up change detection in your ingestion pipeline so updates trigger targeted re-embedding automatically rather than requiring full-corpus rebuilds.

Is vector search replacing traditional keyword search?

No, it complements it. Keyword search remains superior for exact matches, named entities, and short precise queries. Vector search wins on semantic similarity, paraphrase matching, and intent-based retrieval. Hybrid search that combines both generally outperforms either alone on enterprise datasets, which is why many production retrieval systems use some form of hybrid approach.

Key Takeaways

Embeddings are vectors that represent meaning; their geometry encodes semantic relationships, enabling similarity-based search.
Choose your distance metric (cosine, dot product, L2) to match how your embedding model was trained—mismatches silently degrade quality.
Approximate Nearest Neighbor algorithms (HNSW, IVF, PQ) make vector search practical at scale by trading small amounts of recall for large reductions in latency.
Vector databases add metadata filtering, persistence, CRUD operations, and hybrid search on top of raw ANN indexing—these operational features matter as much as raw search speed.
RAG quality is determined primarily by retrieval, not the LLM; chunk size, embedding model choice, and hybrid search are the highest-leverage variables.
Always evaluate embedding models on your own data; MTEB rankings are a starting point, not a decision.
Plan for re-embedding your entire corpus when upgrading embedding model versions—the cost is real and frequently underestimated.
Hybrid search (vector + BM25) outperforms pure vector search on most enterprise datasets; adopt it by default rather than as an afterthought.
Build a labeled retrieval evaluation set early; without it, you cannot diagnose regressions or validate improvements.


Agency Script Editorial
