Embeddings and vector search sit beneath almost every impressive AI feature you've encountered lately—semantic search, document Q&A, recommendation engines, duplicate detection. Yet most explanations either stay surface-level ("it turns words into numbers") or plunge straight into linear algebra. Neither serves you if you need to make real decisions about how to build or buy AI systems.
This article takes a different approach. It answers the specific questions that professionals and technical leads actually ask when they start working with these systems: what embeddings really are, why vector search beats keyword search for certain tasks, when it doesn't, what it costs, and where things break. If you've already read a primer on how generative AI works and want to go deeper on the memory and retrieval layer, this is the piece for that.
By the end, you'll have enough conceptual grounding to evaluate tools, ask sharper questions of your engineering team, and avoid the category errors that send projects sideways.
What Exactly Is an Embedding?
An embedding is a list of numbers—a vector—that represents the meaning of something. That something can be a word, a sentence, a paragraph, an image, an audio clip, or a user's browsing history. The numbers aren't arbitrary; they're produced by a model trained to place similar things close together in that numerical space and dissimilar things far apart.
Think of it as a coordinate system for meaning. "Dog" and "puppy" end up near each other. "Dog" and "mortgage rate" end up far apart. The model has learned these proximity relationships from enormous amounts of training data.
Why vectors, not just tags or categories?
Tags and categories require someone to define the taxonomy in advance. Embeddings learn the relationships implicitly. That means they generalize—they can represent shades of meaning, multilingual equivalents, and concepts that don't have clean labels. A sentence about "fixing a bug in production at 2 a.m." and one about "resolving a critical deployment issue overnight" will land near each other in embedding space even if they share no exact words.
What does a typical embedding look like?
A standard text embedding model such as OpenAI's text-embedding-3-small (1,536 dimensions by default) or Cohere's embed-english-v3.0 (1,024 dimensions) produces a vector with hundreds to a few thousand dimensions. Across common models the range is roughly 768 to 3,072, depending on the model and settings. Each dimension is a floating-point number. You never interpret individual dimensions; what matters is the geometric relationships between whole vectors.
How Does Vector Search Actually Work?
Vector search finds items whose embeddings are geometrically close to a query embedding. The process has three steps:
- Embed the corpus. Convert every piece of content (documents, product descriptions, past support tickets) into vectors and store them in a vector database.
- Embed the query. At search time, convert the user's query into a vector using the same model.
- Find nearest neighbors. Retrieve the stored vectors most similar to the query vector, ranked by a distance metric.
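The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the embed function is a hypothetical stand-in that looks up hand-picked 3-dimensional vectors, where a real system would call an embedding model, and the "database" is a plain dictionary searched by brute force.

```python
import math

# Hypothetical, hand-picked 3-dimensional vectors for illustration only.
# A real embedding model would return hundreds or thousands of dimensions.
TOY_MODEL = {
    "how to train a puppy": [0.9, 0.1, 0.0],
    "dog obedience basics": [0.8, 0.2, 0.1],
    "current mortgage rates": [0.0, 0.1, 0.9],
    "refinancing a home loan": [0.1, 0.2, 0.8],
    "teaching my dog to sit": [0.85, 0.15, 0.05],
}


def embed(text):
    """Stand-in for an embedding model call (a dictionary lookup here)."""
    return TOY_MODEL[text]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, index, k=2):
    """Steps 2 and 3: embed the query, then rank stored vectors by similarity."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, vec), doc) for doc, vec in index.items()]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:k]]


# Step 1: embed the corpus and store the vectors.
corpus = [
    "how to train a puppy",
    "dog obedience basics",
    "current mortgage rates",
    "refinancing a home loan",
]
index = {doc: embed(doc) for doc in corpus}

results = search("teaching my dog to sit", index, k=2)
print(results)  # the two dog-related documents rank above the mortgage ones
```

Note that the query shares almost no words with "dog obedience basics"; the match comes entirely from the vectors being close together.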
What distance metrics are used?
The two most common are cosine similarity (measures the angle between vectors, ignoring magnitude) and dot product (measures both angle and magnitude). Euclidean distance is used less often for text. Most hosted vector databases default to cosine similarity, which is generally appropriate for semantic search on text.
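To make the difference concrete, here is a small sketch of all three metrics in plain Python, using made-up 2-dimensional vectors. It also shows a useful property: once vectors are scaled to unit length, dot product and cosine similarity give the same number, which is why some systems store normalized vectors and use the cheaper dot product.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot_product(a, a))


def cosine_similarity(a, b):
    # Angle only: dividing by both lengths removes the effect of magnitude.
    return dot_product(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    length = norm(a)
    return [x / length for x in a]


a = [1.0, 2.0]
b = [2.0, 4.0]   # same direction as a, twice the length
c = [2.0, -1.0]  # perpendicular to a

print(cosine_similarity(a, b))   # approximately 1.0: same direction
print(cosine_similarity(a, c))   # 0.0: perpendicular
print(dot_product(a, b))         # 10.0: grows with vector length
print(euclidean_distance(a, b))  # nonzero even though the direction is identical

# After normalization, dot product and cosine similarity agree.
print(dot_product(normalize(a), normalize(b)))
```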
What is approximate nearest neighbor (ANN) search?
Exact nearest neighbor search, which checks every stored vector against the query, becomes prohibitively slow at scale. Approximate nearest neighbor algorithms (HNSW and IVF are the dominant ones) trade a small amount of accuracy for massive speed gains. In practice, a well-tuned index might return 95–99% of the true top-10 results while cutting query latency from seconds to milliseconds. Most production systems use ANN without meaningful quality loss.
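The trade-off is easier to see in a toy version of the IVF idea: group vectors into buckets around centroids, then scan only the few buckets nearest the query. The sketch below is a simplification under stated assumptions. It samples centroids from the data rather than learning them with k-means as real IVF indexes do, it uses random 2-dimensional points, and the nprobe parameter name is borrowed from common IVF terminology.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(query, vectors, k):
    """Brute force: compare the query with every stored vector."""
    ranked = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))
    return ranked[:k]


def build_ivf(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def ivf_search(query, vectors, centroids, buckets, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k], len(candidates)


random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(2000)]
# Assumption: centroids are sampled from the data instead of trained.
centroids = random.sample(vectors, 20)
buckets = build_ivf(vectors, centroids)

query = [0.5, 0.5]
exact = exact_search(query, vectors, k=10)
approx, scanned = ivf_search(query, vectors, centroids, buckets, k=10, nprobe=4)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"scanned {scanned} of {len(vectors)} vectors, recall@10 = {recall:.2f}")
```

Raising nprobe scans more buckets, which pushes recall toward 100% at the cost of speed. That single dial is the accuracy-for-latency trade described above.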
When Should You Use Vector Search Instead of Keyword Search?
Vector search is better when the user's intent might not share words with the best result. Classic use cases:
- Document Q&A: "What's our refund policy for custom orders?" should match the clause that says "bespoke items are non-returnable" even though the query and the clause share no keywords.
- Cross-language retrieval: An English query finding relevant French documents.
- Semantic deduplication: Identifying that two support tickets describe the same underlying problem.
- Personalization: Finding content similar to what a user has previously engaged with.
When keyword search is still better
Keyword (lexical) search wins when precision on specific terms matters more than conceptual similarity:
- Legal or compliance search where exact phrasing is legally significant.
- Code search for a specific function name or error string.
- Catalog lookup by product ID or SKU.
The real-world best practice for most production systems is hybrid search: run both lexical (BM25) and vector retrieval, then merge the result sets using a reranker or reciprocal rank fusion. This outperforms either approach alone in most benchmarks. Tools like Elasticsearch, OpenSearch, Weaviate, and Pinecone all support hybrid search configurations.
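Reciprocal rank fusion is simple enough to sketch directly. Each document earns 1 / (k + rank) from every result list it appears in, and the sums decide the final order. The document IDs below are made up, and k=60 is the constant commonly used for this method rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs from a lexical (BM25) retriever and a vector retriever.
bm25_results = ["doc_sku_42", "doc_returns", "doc_shipping"]
vector_results = ["doc_returns", "doc_bespoke", "doc_sku_42"]

fused = reciprocal_rank_fusion([bm25_results, vector_results])
print(fused)
# ['doc_returns', 'doc_sku_42', 'doc_bespoke', 'doc_shipping']
```

Documents that both retrievers agree on rise to the top, and because only ranks are used, the two systems' raw scores never need to be on the same scale.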
What Are Vector Databases, and Do You Need One?
A vector database is optimized specifically for storing embeddings and running fast ANN queries at scale. Leading options include Pinecone, Weaviate, Qdrant, Milvus, and Chroma. PostgreSQL with the pgvector extension is a popular choice when you want to avoid adding infrastructure.
When do you need a dedicated vector database?
You probably don't need a dedicated vector database if you have fewer than a few hundred thousand vectors and your query load is modest. pgvector or even in-memory libraries like FAISS can handle that workload. You start needing a purpose-built solution when you're managing millions of vectors, need real-time updates, require filtered search at scale (e.g., "find similar documents but only within this client's workspace"), or need multi-tenancy with strict data isolation.
The hidden operational cost
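Every time you upgrade your embedding model, you have to re-embed your entire corpus, because vectors produced by different models are not comparable with each other. If you have 10 million documents embedded with text-embedding-ada-002 and want to migrate to a newer model, that means re-embedding all 10 million documents (batched into fewer API requests where the provider allows it) plus a storage migration. Budget for this from the start.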
How Do Embeddings Power RAG (Retrieval-Augmented Generation)?
Retrieval-Augmented Generation is the architecture behind most enterprise chatbots and document Q&A tools. The pattern:
- Embed your knowledge base and store vectors.
- When a user asks a question, embed it and retrieve the top-k most relevant chunks.
- Pass those chunks plus the original question to a language model as context.
- The LLM generates an answer grounded in retrieved content rather than just its training data.
This is why understanding embeddings and vector search is foundational to building reliable AI features, not just a theoretical nicety. If your retrieval step fails—wrong chunking strategy, mismatched embedding model, poor similarity threshold—the LLM has nothing useful to reason from, and the whole system produces confident-sounding nonsense.
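The hand-off from retrieval to generation is mostly string assembly. The sketch below covers only that step, under stated assumptions: the function name, the prompt wording, and the sample chunks are all hypothetical, and the call to an actual language model is deliberately left out.

```python
def build_rag_prompt(question, retrieved_chunks, max_chunks=3):
    """Combine the top retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(
        f"[{i}] {chunk}"
        for i, chunk in enumerate(retrieved_chunks[:max_chunks], start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up chunks standing in for the output of the retrieval step.
chunks = [
    "Bespoke items are non-returnable.",
    "Standard orders can be returned within 30 days.",
]
prompt = build_rag_prompt("What's our refund policy for custom orders?", chunks)
print(prompt)
```

Everything the model can ground its answer in is in that context block, which is why a bad retrieval step cannot be rescued by a better model.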
For teams building these systems, the skills involved overlap significantly with what's covered in how generative AI works as a career skill—retrieval engineering is becoming a distinct and valuable specialization.
What is chunking, and why does it matter?
Before embedding documents, you split them into chunks—smaller pieces the model processes at once. Chunk too small, and individual chunks lack context. Chunk too large, and you lose retrieval precision (a 2,000-word chunk might be retrieved because one sentence is relevant, burying the signal). Common strategies: fixed-size chunks with overlap (e.g., 512 tokens, 50-token overlap), sentence-level chunking, or semantic chunking that splits at topic boundaries. There's no universal right answer; it depends on your document types.
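Fixed-size chunking with overlap is the easiest strategy to sketch. The function below is a minimal version that works on any list of tokens. The function name is hypothetical, and the placeholder tokens stand in for the output of whatever tokenizer your embedding model uses.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


# Placeholder tokens; a real pipeline would tokenize the document text.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 276]
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the price of storing and embedding some tokens twice.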
What Are the Real Costs?
Embeddings are cheap per unit but add up at scale. As of mid-2024, embedding a million tokens with OpenAI's text-embedding-3-small costs roughly $0.02. A 10-million-document corpus averaging 200 words per document (roughly 270 tokens each, at about 0.75 words per token) comes to about 2.7 billion tokens, or around $54 to embed initially, which is affordable. Re-embedding that corpus with a new model every 12–18 months is the more significant budget line.
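The estimate is plain arithmetic, and it scales in direct proportion to tokens per document, so it is worth rerunning with your own numbers. A minimal sketch, with a hypothetical function name and two illustrative token counts (assumptions, not measurements):

```python
def embedding_cost_usd(num_docs, tokens_per_doc, price_per_million_tokens):
    """Estimate the one-time cost of embedding a corpus."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens


# $0.02 per million tokens is the mid-2024 price quoted above.
for tokens_per_doc in (150, 270):
    cost = embedding_cost_usd(10_000_000, tokens_per_doc, 0.02)
    print(f"{tokens_per_doc} tokens/doc -> ${cost:,.2f}")
# 150 tokens/doc -> $30.00
# 270 tokens/doc -> $54.00
```

Either way the initial bill is small. The same function applies to every re-embedding cycle, which is where the recurring cost comes from.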
Vector database costs depend heavily on whether you self-host or use a managed service:
- Self-hosted (Qdrant, Milvus, Weaviate OSS): Infrastructure costs only, but engineering overhead is real.
- Managed services (Pinecone, Weaviate Cloud, Zilliz): Pricing typically scales with the number of vectors and query volume. Expect $70–$300/month for a small production deployment; enterprise workloads can run significantly higher.
Query latency for managed vector search is typically in the 20–100ms range for sub-million vector collections, fast enough for real-time search applications.
Where Do Embedding Systems Break?
Understanding failure modes is more useful than understanding the happy path. The hidden risks of generative AI systems apply here too, with some retrieval-specific additions:
- Domain mismatch: General-purpose embedding models are trained on broad web text. If your corpus is highly specialized—legal contracts, medical records, industrial maintenance logs—a general model may embed domain-specific terms poorly. Fine-tuned or domain-specific models often perform significantly better.
- Language mismatch: Many embedding models are English-dominant. Multilingual models (like Cohere's multilingual embeddings or OpenAI's multilingual models) exist but often underperform language-specific models.
- Negation and structural sensitivity: Embeddings respond to sentence structure in ways that aren't always intuitive, and they are often weak at capturing negation. "We do not offer refunds" may sit closer in embedding space to "we offer refunds" than you would expect, because both sentences are overwhelmingly about refunds.
- Stale embeddings: If your underlying documents change but you don't re-embed them, you're searching a ghost corpus. Implement update pipelines from day one.
- Threshold sensitivity: Setting a similarity threshold to filter out low-confidence results requires calibration. Too strict and you return nothing; too loose and you return noise. Most teams underinvest in this tuning.
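Threshold calibration in particular lends itself to a quick experiment. Given a small set of query/result pairs labeled relevant or not, you can sweep candidate thresholds and watch precision and recall move in opposite directions. The scores and labels below are invented for illustration, and the function name is hypothetical.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """scored_pairs: list of (similarity, is_relevant) tuples.

    Returns one (threshold, precision, recall) row per threshold.
    Precision is reported as 0.0 when nothing passes the threshold.
    """
    total_relevant = sum(1 for _, relevant in scored_pairs if relevant)
    rows = []
    for t in thresholds:
        kept = [relevant for score, relevant in scored_pairs if score >= t]
        true_positives = sum(kept)
        precision = true_positives / len(kept) if kept else 0.0
        recall = true_positives / total_relevant if total_relevant else 0.0
        rows.append((t, precision, recall))
    return rows


# Invented similarity scores with human relevance labels.
labeled = [
    (0.92, True), (0.88, True), (0.81, False),
    (0.79, True), (0.70, False), (0.65, False),
]

rows = sweep_thresholds(labeled, [0.60, 0.75, 0.90])
for t, precision, recall in rows:
    print(f"threshold={t:.2f} precision={precision:.2f} recall={recall:.2f}")
# threshold=0.60 precision=0.50 recall=1.00
# threshold=0.75 precision=0.75 recall=1.00
# threshold=0.90 precision=1.00 recall=0.33
```

The loose threshold lets noise through, while the strict one drops two of the three relevant results. The right setting depends on which error is more costly for your application.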
Frequently Asked Questions
Do I need to understand embeddings to use AI tools effectively?
Not to use off-the-shelf tools, but yes if you're building, evaluating, or procuring AI systems. Most AI features involving search, recommendations, or document Q&A rely on embeddings under the hood. Understanding how they work helps you diagnose failures, set realistic expectations, and ask better questions when something goes wrong.
Can I use the same embedding model for different types of content?
You can use one model for multiple content types, but performance degrades when the model wasn't trained on that content. Code, structured data, images, and long-form text all benefit from models purpose-built for them. For mixed-content systems, consider separate embedding pipelines per content type with a unified retrieval layer on top.
How many vectors can a typical vector database handle?
Modern vector databases can scale to hundreds of millions or billions of vectors with the right infrastructure. For most agency-scale or mid-market enterprise applications, you'll be working with tens of thousands to low millions of vectors—well within the comfort zone of any major managed service or a well-configured pgvector instance.
What's the difference between embeddings and fine-tuning?
Embeddings represent content as vectors for retrieval; fine-tuning adjusts a model's weights so it behaves differently on a task. They solve different problems. Fine-tuning changes how a model reasons or responds. Embeddings change what information is available to the model at inference time. RAG (which uses embeddings) is often preferable to fine-tuning for knowledge-intensive applications because the knowledge stays inspectable and updatable.
Is vector search the same as semantic search?
Mostly yes, in practice. Semantic search refers to search that understands meaning rather than just matching keywords; vector search is the dominant technical implementation of that. The terms are often used interchangeably, though technically you could implement semantic search using other methods (like learned sparse retrieval).
How do I evaluate whether my vector search is working?
Standard evaluation metrics include precision@k (what fraction of your top-k results are genuinely relevant), recall@k (what fraction of all relevant results appear in your top-k), and MRR (mean reciprocal rank, which rewards systems that put the best result earlier). Build a test set of 50–200 query/expected-result pairs from real user queries before you build, and run evaluations after every model or pipeline change.
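All three metrics fit in a few lines, which makes it practical to run them on every change. The sketch below uses a made-up two-query test set. The function names are illustrative, and it assumes each test case pairs a ranked list of retrieved IDs with a set of known-relevant IDs.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(test_cases):
    """Average of 1/rank of the first relevant result (0 if none is found)."""
    total = 0.0
    for retrieved, relevant in test_cases:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(test_cases)


# Hypothetical test set: (ranked results from the system, known-relevant IDs).
test_cases = [
    (["d1", "d7", "d3"], {"d3", "d9"}),  # first relevant hit at rank 3
    (["d4", "d2", "d8"], {"d4"}),        # first relevant hit at rank 1
]

for retrieved, relevant in test_cases:
    p = precision_at_k(retrieved, relevant, 3)
    r = recall_at_k(retrieved, relevant, 3)
    print(f"precision@3={p:.2f} recall@3={r:.2f}")
# precision@3=0.33 recall@3=0.50
# precision@3=0.33 recall@3=1.00

print(f"MRR={mean_reciprocal_rank(test_cases):.2f}")  # MRR=0.67
```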
Key Takeaways
- Embeddings are vectors that represent the meaning of content; similar meanings produce geometrically close vectors.
- Vector search finds semantically related content by measuring distance between vectors, not matching keywords.
- Hybrid search—combining lexical and vector retrieval—outperforms either approach alone for most production use cases.
- RAG systems depend on retrieval quality; poor chunking or a mismatched embedding model will undermine even the best LLM.
- Re-embedding costs and stale-embedding drift are the most underestimated operational risks.
- You need a dedicated vector database only at significant scale; pgvector handles most early-stage production workloads fine.
- Domain mismatch, language mismatch, and similarity threshold miscalibration are the three most common failure modes worth testing for before launch.