If you've ever wondered how a search engine finds articles "about the same topic" even when they share no keywords, or how a chatbot retrieves the right context before answering your question, the answer almost always involves embeddings and vector search. These two technologies are the invisible plumbing behind a surprising share of modern AI applications, and once you understand them, a whole class of AI capabilities suddenly makes sense.
This guide starts from zero. No linear algebra required, no machine learning background assumed. By the end, you'll be able to explain what embeddings are, why they're useful, how vector search works in practice, and how these ideas connect to the larger AI systems you're already using or building. If you've read The Complete Guide to How Generative AI Works, think of this as the deep dive on one of its most important mechanical components.
What an Embedding Actually Is
An embedding is a list of numbers that represents the meaning of something — a word, a sentence, an image, a product, a customer profile. That list is called a vector.
Here's the key insight: similar things get similar vectors. The words "dog" and "puppy" end up with lists of numbers that are close to each other. The words "dog" and "invoice" end up with lists that are far apart. The model that creates these vectors was trained on enormous amounts of text (or images, or audio) and learned, through exposure, which concepts tend to appear in similar contexts. Proximity in the list of numbers reflects proximity in meaning.
A typical embedding vector has 768 to 3,072 numbers, depending on the model. You never look at those numbers directly. What matters is how close one vector sits to another.
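To make "close" and "far apart" concrete, here is a minimal sketch using hand-written four-number vectors. The values are invented for illustration and are not output from any real embedding model; a real model would produce hundreds or thousands of numbers per item.

```python
import math

# Hand-written toy vectors (assumption: invented values, NOT real model output).
toy_vectors = {
    "dog":     [0.90, 0.80, 0.10, 0.00],
    "puppy":   [0.85, 0.90, 0.15, 0.05],
    "invoice": [0.05, 0.10, 0.90, 0.80],
}

# math.dist returns the straight-line distance between two points.
dog_to_puppy = math.dist(toy_vectors["dog"], toy_vectors["puppy"])
dog_to_invoice = math.dist(toy_vectors["dog"], toy_vectors["invoice"])

print(f"dog to puppy:   {dog_to_puppy:.3f}")
print(f"dog to invoice: {dog_to_invoice:.3f}")
```

The first distance comes out much smaller than the second, which is all "similar meaning" amounts to once content has been turned into vectors.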
The Library Analogy
Imagine a vast library where books aren't organized by title or author but by conceptual proximity. Books about tax law cluster in one corner. Books about heartbreak cluster in another. A book about the tax implications of a divorce settlement sits somewhere between both clusters. An embedding space arranges content the same way, except that instead of the library's three dimensions it uses hundreds or thousands, which is why it can capture nuances that no filing system could.
What Can Be Embedded
Almost anything that can be represented digitally:
- Text: words, sentences, paragraphs, documents, code
- Images: photos, diagrams, screenshots
- Audio: speech, music clips
- Structured data: products, user profiles, transaction records (with the right model)
The type of embedding model you use determines what kind of input it accepts and what kind of semantic understanding it captures.
From Words to Numbers: How Embedding Models Work
You don't need to train an embedding model yourself. You call one. OpenAI's text-embedding-3-small, Cohere's embed-english-v3.0, and Google's text-embedding-004 are all available via API. You send text in; you get a vector back.
Under the hood, these models are transformers — the same architecture that powers large language models — but fine-tuned specifically to produce vectors where semantic similarity corresponds to geometric closeness. If you want the full picture of how transformers learn from text, How Generative AI Works: A Beginner's Guide covers the foundations clearly.
One Model, One Space
Every embedding model creates its own coordinate system. An embedding from OpenAI and an embedding from Cohere are not comparable — they live in different spaces with different rules. If you embed your knowledge base with one model, you must embed your queries with the same model. Mixing models breaks the geometry.
What Vector Search Is
Once you have a collection of embeddings — say, all the articles in your help center — you need a fast way to answer the question: which stored vector is most similar to this new query vector?
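That's vector search. It is often called semantic search, and because production systems usually rely on approximate nearest neighbor (ANN) algorithms to keep it fast, you will also see it called ANN search.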
The process has three steps:
- Embed the query. A user types "how do I cancel my subscription?" You run that text through your embedding model and get a query vector.
- Search the index. You compare that query vector against every stored vector (or an intelligent approximation of all of them) and find the closest ones by distance.
- Return results. You retrieve the top-k most similar items — typically 3 to 20 — ranked by closeness.
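The three steps above can be sketched in a few lines. This is an illustration under loud assumptions: `embed` is a hypothetical placeholder that returns a canned vector instead of calling a real embedding model, the article titles and vectors are invented, and the search is a plain brute-force comparison against every stored vector.

```python
import math

# Stored vectors for three help-center articles (assumption: invented toy values).
index = {
    "Ending your membership":   [0.88, 0.15, 0.10],
    "Updating billing details": [0.40, 0.80, 0.20],
    "Resetting your password":  [0.05, 0.20, 0.95],
}

def embed(text):
    """Hypothetical placeholder for a call to your embedding model."""
    canned = {"how do I cancel my subscription?": [0.90, 0.10, 0.15]}
    return canned[text]

def search(query_text, index, top_k=2):
    query_vector = embed(query_text)                  # Step 1: embed the query
    scored = [(math.dist(query_vector, vector), title)
              for title, vector in index.items()]     # Step 2: compare to every stored vector
    scored.sort()                                     # smallest distance first
    return [title for _, title in scored[:top_k]]     # Step 3: return the top-k

results = search("how do I cancel my subscription?", index)
print(results)
```

Note that the top result shares no words with the query; it ranks first only because its vector sits closest to the query vector.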
Measuring Distance
"Closeness" is measured mathematically. The two dominant methods are:
- Cosine similarity: Measures the angle between two vectors and ignores their lengths. Two vectors pointing in the same direction have a cosine similarity of 1.0. This works well for text.
- Euclidean (L2) distance: Measures the straight-line distance between two points. It is more intuitive, but cosine similarity is generally the better default for semantic text search. When vectors are normalized to unit length, the two measures rank results identically.
Most practitioners use cosine similarity for text embeddings and don't need to think about it further. Your vector database handles the math.
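If you are curious what that math looks like, here is a small sketch of both measures in plain Python. The function names and sample vectors are illustrative, not taken from any particular vector database. It compares two vectors that point in the same direction but differ in length.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b."""
    return math.dist(a, b)

def normalize(v):
    """Scale v to unit length while keeping its direction."""
    length = math.hypot(*v)
    return [x / length for x in v]

short = [1.0, 2.0]
long_ = [10.0, 20.0]  # same direction as `short`, ten times the length

same_direction = cosine_similarity(short, long_)
raw_distance = euclidean_distance(short, long_)
unit_distance = euclidean_distance(normalize(short), normalize(long_))

print(f"cosine similarity:      {same_direction:.3f}")
print(f"L2 distance (raw):      {raw_distance:.3f}")
print(f"L2 distance (unit len): {unit_distance:.3f}")
```

Cosine similarity treats the two vectors as a perfect match, while raw Euclidean distance reports them as far apart. After normalization the distance collapses to zero, which is why the choice of metric matters less once vectors are unit length.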
Vector Databases: Where Embeddings Live
A regular SQL database can technically store a list of numbers, but it has no efficient way to search across millions of them for geometric proximity. A vector database is purpose-built for exactly that.
The major options as of now:
| Database | Hosting | Best for |
| -------- | ------------------------ | -------------------------------- |
| Pinecone | Fully managed cloud | Fast setup, production workloads |
| Weaviate | Self-hosted or cloud | Hybrid search, flexibility |
| Qdrant | Self-hosted or cloud | Speed, open-source control |
| Chroma | Local / embedded | Prototyping, local development |
| pgvector | Extension for PostgreSQL | Teams already running Postgres |
For most agency or professional use cases starting out, Pinecone or pgvector covers the ground well. Pinecone minimizes operational overhead. pgvector lets you stay inside infrastructure you already manage.
How Indexes Work
ANN indexes use algorithms like HNSW (Hierarchical Navigable Small World graphs) to avoid comparing your query vector against every single stored vector — a process that would be prohibitively slow at millions of records. Instead, the index builds a graph structure during ingestion that allows the search to jump directly to likely neighborhoods in the vector space. The trade-off is a small, configurable accuracy loss for a large speed gain. In practice, well-tuned ANN retrieval reaches 95–99% recall compared to exact brute-force search, which is more than sufficient for production applications.
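The speed-for-accuracy trade can be illustrated without implementing HNSW itself. The sketch below swaps in a much simpler partition scheme: vectors are bucketed under four hand-picked centroids at ingestion, and a query searches only its own bucket. Everything here (the random 2-D data, the centroids, the function names) is an assumption made for illustration; real indexes are far more sophisticated.

```python
import math
import random

random.seed(0)
vectors = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(1000)]
centroids = [(0.5, 0.5), (-0.5, 0.5), (-0.5, -0.5), (0.5, -0.5)]  # hand-picked

def nearest_centroid(v):
    return min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))

# Ingestion: bucket every vector under its nearest centroid.
buckets = {i: [] for i in range(len(centroids))}
for v in vectors:
    buckets[nearest_centroid(v)].append(v)

def exact_search(query, k):
    """Brute force: compare the query against every stored vector."""
    return sorted(vectors, key=lambda v: math.dist(query, v))[:k]

def approximate_search(query, k):
    """Compare the query only against vectors in its own bucket."""
    candidates = buckets[nearest_centroid(query)]
    top = sorted(candidates, key=lambda v: math.dist(query, v))[:k]
    return top, len(candidates)

query, k = (0.45, 0.05), 10  # deliberately close to a bucket boundary
exact = exact_search(query, k)
approx, compared = approximate_search(query, k)
recall = len(set(exact) & set(approx)) / k

print(f"compared {compared} of {len(vectors)} vectors, recall@{k} = {recall:.2f}")
```

The approximate search looks at only a fraction of the stored vectors. Because the query sits near a bucket boundary, some true neighbors may live in the adjacent bucket and be missed, which is the kind of small recall loss the paragraph above describes.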
The Retrieval-Augmented Generation Connection
If you've encountered the term RAG — Retrieval-Augmented Generation — embeddings and vector search are its core machinery.
The pattern works like this:
- You ingest a corpus (documents, policies, product specs) and embed each chunk.
- You store those vectors in a vector database.
- When a user asks a question, you embed the query, run vector search against your corpus, and retrieve the most relevant chunks.
- You pass those chunks to an LLM as context, and the LLM synthesizes a grounded answer.
This is why RAG systems can answer questions about proprietary company information without fine-tuning an LLM — the knowledge lives in the vector store, not in the model's weights. If you're building repeatable AI workflows, Building a Repeatable Workflow for Large Language Models covers how to structure the retrieval and generation steps in a production-ready pipeline.
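The final step of the pattern, handing retrieved chunks to the LLM, is mostly string assembly. Here is one hedged way it might look. The function name `build_prompt`, the instruction wording, and the sample chunks are all invented for illustration, and the actual LLM call is left out because it depends on your provider.

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a prompt that asks the LLM to answer from the retrieved chunks."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Sample chunks (assumption: invented text standing in for vector search results).
chunks = [
    "Subscriptions can be cancelled from Settings > Billing.",
    "Refunds are issued within 5 business days of cancellation.",
]
prompt = build_prompt("How do I cancel my subscription?", chunks)
print(prompt)
```

Numbering the chunks makes it easier to ask the model to cite which chunk supported its answer, though that is a design choice rather than a requirement.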
Why Chunking Matters
You rarely embed whole documents. You split them into chunks (paragraphs, sections, or blocks of 256–512 tokens) before embedding. A 10,000-word document embedded as a single vector produces a highly averaged representation that dilutes specificity. Smaller chunks produce sharper, more retrievable vectors.
Chunking strategy is one of the most consequential, least glamorous decisions in RAG system design. Common problems:
- Too small: Chunks lack enough context to be meaningful.
- Too large: Retrieval pulls in irrelevant surrounding content.
- Split at wrong boundaries: Cutting in the middle of a logical idea destroys cohesion.
A starting point that works well in most cases: 300–400 tokens per chunk, with a 10–15% overlap between consecutive chunks to avoid losing context at the seams.
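A fixed-size chunker with overlap might look like the sketch below. It assumes the text has already been split into tokens by whatever tokenizer your embedding model uses; the placeholder tokens, the function name `chunk_tokens`, and the defaults of 350 tokens with a 50-token overlap (about 14%) are illustrative choices within the ranges above.

```python
def chunk_tokens(tokens, chunk_size=350, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

# Placeholder tokens (assumption: a real pipeline would use the model's tokenizer).
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)

for number, chunk in enumerate(chunks):
    print(number, chunk[0], chunk[-1], len(chunk))
```

This sketch splits purely by position, so it can still cut through the middle of an idea. Splitting on paragraph or section boundaries first, then applying a size cap, is a common refinement.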
Common Failure Modes (and How to Avoid Them)
Stale Embeddings
If your source documents update, your embeddings don't update automatically. A policy document that changed six months ago might still be serving answers from its old content. Build a re-ingestion pipeline that detects changes and re-embeds updated chunks. Most teams revisit this too late.
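One simple way to detect changes is to store a hash of each chunk's text alongside its vector and compare hashes on every ingestion run. The sketch below assumes chunks have stable ids; the ids, sample text, and function names are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale_chunks(current_chunks, stored_hashes):
    """Return ids of chunks that are new or whose text changed since the last embed."""
    return [
        chunk_id
        for chunk_id, text in current_chunks.items()
        if stored_hashes.get(chunk_id) != content_hash(text)
    ]

# Hashes recorded at the last ingestion run (assumption: invented sample data).
stored_hashes = {
    "refund-policy#0": content_hash("Refunds within 30 days."),
    "refund-policy#1": content_hash("Contact support to start a refund."),
}
current_chunks = {
    "refund-policy#0": "Refunds within 14 days.",             # edited
    "refund-policy#1": "Contact support to start a refund.",  # unchanged
    "refund-policy#2": "Gift cards are non-refundable.",      # new
}

stale = find_stale_chunks(current_chunks, stored_hashes)
print(stale)
```

Only the stale ids need to be re-embedded. A full pipeline would also delete vectors whose chunk ids no longer exist in the source, which this sketch does not cover.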
Embedding Drift
Switching embedding models mid-project is a breaking change. If you upgrade from text-embedding-ada-002 to text-embedding-3-large, you must re-embed your entire corpus. Track your embedding model version as carefully as you track any other dependency.
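One way to track the model version is to record the model name next to every stored vector, so a mismatch can be detected before it silently corrupts search results. This is a sketch with a hypothetical record type and invented ids and vector values, not the schema of any particular vector database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVector:
    chunk_id: str
    values: tuple
    model: str  # name of the embedding model that produced `values`

def needs_reembedding(stored, current_model):
    """Return ids of vectors produced by a model other than `current_model`."""
    return [v.chunk_id for v in stored if v.model != current_model]

# Invented sample records (the vector values are placeholders).
stored = [
    StoredVector("faq#0", (0.1, 0.2), "text-embedding-3-large"),
    StoredVector("faq#1", (0.3, 0.4), "text-embedding-ada-002"),
]

outdated = needs_reembedding(stored, current_model="text-embedding-3-large")
print(outdated)
```

Running a check like this at query time, or as part of deployment, turns a silent geometry mismatch into an explicit list of chunks to re-embed.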
Retrieval Without Evaluation
Teams often ship vector search pipelines without measuring retrieval quality. The LLM generates plausible-sounding answers, so problems in retrieval are invisible until something goes wrong. Run offline evaluations: take a sample of expected queries, check whether the right chunks actually come back in the top-k, and track recall@k as a metric.
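An offline evaluation can be as small as the sketch below. It assumes you have hand-labeled which chunk ids are relevant to each test query and have recorded what your retriever returned; the ids and queries are invented for illustration.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("each test query needs at least one relevant chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hand-labeled evaluation cases (assumption: invented sample data).
eval_set = [
    {
        "query": "how do I cancel my subscription?",
        "relevant": {"billing#2"},
        "retrieved": ["billing#2", "faq#7", "faq#1"],
    },
    {
        "query": "how do I turn on two-factor login?",
        "relevant": {"security#1", "security#4"},
        "retrieved": ["faq#3", "security#1", "billing#2"],
    },
]

scores = [recall_at_k(case["retrieved"], case["relevant"], k=3) for case in eval_set]
mean_recall = sum(scores) / len(scores)
print(scores, mean_recall)
```

Tracking the mean over time shows whether a change to chunking or the embedding model actually improved retrieval, independent of how fluent the generated answers sound.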
Ignoring Hybrid Search
Pure semantic search underperforms on exact-match queries. If a user types a product SKU, a model name, or a proper noun, vector search may return conceptually related items rather than the exact right one. Hybrid search combines vector search with keyword (BM25) search and fuses the results. Most production systems benefit from this. Weaviate and Qdrant support it natively, and Pinecone has added support as well.
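Reciprocal rank fusion (RRF) is one common way to fuse two ranked lists; the article does not say which method any given database uses, so treat this as one option rather than the standard. The sketch assumes you already have a ranked list of ids from each search. The ids are invented, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results for a query containing an exact SKU (assumption: invented sample data).
vector_hits = ["guide-widgets", "sku-1234", "faq-returns"]   # semantic ranking
keyword_hits = ["sku-1234", "spec-sheet"]                    # BM25-style ranking

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

The exact-match item rises to the top because both searches found it, even though vector search alone ranked a conceptually related guide above it.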
Practical Starting Points
You don't need to build a custom vector database on day one. A minimal first experiment:
- Choose a small corpus: 50–200 documents or FAQ entries you know well.
- Pick an embedding model: text-embedding-3-small from OpenAI is inexpensive and high quality.
- Use Chroma locally: Free, runs in-process, no account needed, ideal for learning.
- Write test queries: 10 questions the corpus should answer, and 5 it shouldn't.
- Evaluate manually: Look at what comes back in the top 3 results. Where does it fail?
This loop (embed, retrieve, evaluate, adjust chunking) is the core skill. The Large Language Models Playbook describes how to layer generation on top once retrieval is working cleanly.
The larger picture of where these systems are headed — multi-modal embeddings, real-time indexing at scale, tighter integration between retrieval and reasoning — is covered in The Future of Large Language Models.
Frequently Asked Questions
What's the difference between embeddings and tokens?
Tokens are the units a language model uses to process text: roughly word fragments, typically 3–4 characters each. Embeddings are numerical representations of meaning. Tokens are the input mechanism; embeddings are the output representation. An embedding model first splits the text into tokens, then produces a vector that represents the passage as a whole.
Do I need a GPU to generate embeddings?
Not if you're using an API. When you call OpenAI's or Cohere's embedding endpoint, the computation happens on their infrastructure. If you run open-source embedding models locally (such as sentence-transformers on Hugging Face), a GPU significantly speeds up batch processing, but many models run acceptably on CPU for modest workloads.
How many vectors can a vector database handle?
Modern vector databases scale comfortably to tens of millions of vectors on managed infrastructure. At hundreds of millions or billions of vectors, architecture choices (sharding, quantization, approximate indexing settings) become load-bearing. For most professional and agency applications, scale is not a limiting factor.
How is vector search different from traditional keyword search?
Keyword search matches on exact or stemmed terms — if the query word isn't in the document, it doesn't match. Vector search matches on meaning — a query about "canceling a subscription" retrieves documents about "ending a membership" even without shared vocabulary. The trade-off is that keyword search is more precise for exact lookups; vector search is more robust for natural language and paraphrased queries.
Can embeddings handle languages other than English?
Yes. Models like OpenAI's text-embedding-3 and Cohere's multilingual embed models are trained on dozens of languages and can retrieve cross-lingually — a query in Spanish can retrieve a relevant document in English if both are in the same embedding space. Performance varies by language depending on training data coverage.
Key Takeaways
- An embedding is a vector — a list of numbers — that encodes the meaning of content so that similar things have similar vectors.
- Embedding models (OpenAI, Cohere, Google, open-source) do the conversion; you call them via API.
- Vector search finds the closest stored vectors to a query vector, enabling semantic retrieval without keyword matching.
- Vector databases (Pinecone, Qdrant, Weaviate, pgvector, Chroma) store and index vectors for fast retrieval.
- RAG systems use embeddings and vector search to give LLMs access to external knowledge at query time.
- Chunking strategy, model consistency, and retrieval evaluation are the three highest-leverage variables in system quality.
- Hybrid search — combining vector and keyword methods — handles the failure modes that pure semantic search misses.
- The best way to learn is to build a small, evaluatable experiment before scaling anything.