Choosing the wrong embedding model or vector database architecture is one of the most expensive early mistakes an AI team can make. The fix rarely costs just an afternoon — it usually means re-embedding your entire corpus, migrating data, and retuning retrieval pipelines at exactly the moment your team has momentum to lose. Getting the decision right upfront requires understanding what you're actually choosing between, not just which product has the best landing page.
This article is built for teams standing at that decision point: you understand roughly what embeddings are (dense numerical representations of meaning), you've started building or scoping a retrieval-augmented generation (RAG) system or semantic search feature, and you need a principled way to evaluate your options. We'll walk through the real axes of trade-off — not the marketing ones — and give you a concrete decision rule you can apply to your actual situation.
The payoff is avoiding the most common failure modes: paying for retrieval quality you don't need, scaling an architecture that can't handle your data volume, or locking into an embedding model whose behavior changes under your feet.
What You're Actually Choosing Between
Embeddings and vector search involve two separate decision trees that get collapsed into one in most guides. Keep them separate.
Embedding model decisions: Which model converts your content into vectors? Options range from general-purpose API models (OpenAI's text-embedding-3-small and text-embedding-3-large, Cohere's Embed v3, Google's text-embedding-004) to open-source models you host yourself (the bge family from BAAI, e5-mistral, nomic-embed-text). Dimension counts, max token windows, cross-lingual support, and fine-tunability vary enormously.
Vector store decisions: Where do those vectors live and how does similarity search happen? Options include dedicated vector databases (Pinecone, Weaviate, Qdrant, Milvus), vector extensions on relational databases (pgvector on Postgres), integrated solutions (Supabase, Neon with pgvector), and approximate nearest neighbor (ANN) libraries you wire up yourself (FAISS, HNSWlib).
These two choices interact — a high-dimension embedding model (3072 dimensions for OpenAI's large model) changes storage costs and query latency in your vector store — but they also have independent failure modes. Treat them that way.
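To make the storage side of that interaction concrete, here is a back-of-the-envelope sketch. It assumes uncompressed float32 vectors (4 bytes per value) and a hypothetical corpus of one million chunks; it ignores index overhead, metadata, and replication, so treat the numbers as a lower bound rather than a sizing guide.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes."""
    return num_bytes / 2**30


# Hypothetical corpus size, chosen only for illustration.
corpus_size = 1_000_000
for dims in (384, 1536, 3072):
    size = to_gib(raw_vector_bytes(corpus_size, dims))
    print(f"{dims:>4} dims: {size:.2f} GiB")
```

Under these assumptions the same corpus needs roughly 1.4 GiB at 384 dimensions and roughly 11.4 GiB at 3072, before any index structure is added.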
The Five Axes That Actually Matter
Before comparing specific products, get clear on which axes drive your situation. Most teams optimize on one or two while ignoring the others, then discover the others matter most in production.
1. Retrieval Quality vs. Speed
Higher-quality embedding models tend to capture finer semantic distinctions, but they also tend to have larger dimension counts (1536–3072 vs. 384–768 for lighter models), which increases both index size and query latency. For a customer-facing search feature with sub-200ms requirements, a 384-dimension open-source model that runs locally often beats a cloud API model that adds 80–150ms of network round-trip on every query.
2. Corpus Size and Update Frequency
A static corpus of 10,000 documents behaves very differently from a live corpus of 10 million documents that changes hourly. At smaller scales, almost every solution works. At larger scales, you need to care about:
- Index type: Flat (exact) vs. HNSW vs. IVF-PQ (each trades recall for speed differently)
- Update patterns: HNSW is notoriously slow at deletes and updates compared to IVF indexes
- Sharding: Pinecone and Qdrant handle this for you; self-hosted FAISS does not
3. Operational Burden
Managed cloud services (Pinecone, Weaviate Cloud) abstract away infrastructure but add cost and a dependency. Self-hosted options (Qdrant on a VM, Milvus on Kubernetes, pgvector on your existing Postgres) require operational ownership but give you more control and often lower unit costs at scale.
4. Embedding Stability
This one gets overlooked until it causes pain. If your embedding model updates — whether you pull a new checkpoint of an open-source model or a vendor silently updates their API — your existing vectors become misaligned with new vectors. Queries against a mixed-version index degrade in ways that are hard to debug. Always version-pin your embedding models and plan for periodic re-embedding as a maintenance task.
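One way to enforce pinning is to store the model version alongside the index and refuse writes that do not match. The sketch below is a minimal in-memory illustration; `EmbeddingVersion` and `PinnedIndex` are hypothetical names, not part of any vector database's API, and a real system would persist the version as index metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class EmbeddingVersion:
    """Identifies exactly which model produced a vector."""
    model: str
    revision: str  # e.g. a checkpoint hash or a vendor snapshot name
    dimensions: int


@dataclass
class PinnedIndex:
    """Toy index that rejects vectors from any other model version."""
    version: EmbeddingVersion
    vectors: Dict[str, List[float]] = field(default_factory=dict)

    def add(self, doc_id: str, vector: List[float], version: EmbeddingVersion) -> None:
        if version != self.version:
            raise ValueError(
                f"index is pinned to {self.version}, got vector from {version}"
            )
        if len(vector) != self.version.dimensions:
            raise ValueError("vector length does not match pinned dimensions")
        self.vectors[doc_id] = vector
```

With a guard like this, a model upgrade becomes an explicit event: build a new index under the new version, re-embed into it, and switch traffic once it is complete.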
5. Domain Specificity
General-purpose embedding models trained on web-scale text perform well on general queries. They underperform on highly specialized corpora: clinical notes, legal contracts, proprietary technical documentation, code. Fine-tuning a smaller open-source model on domain-specific pairs often outperforms a larger general-purpose API model in these cases — sometimes significantly — at a fraction of the cost per query.
Embedding Model Options and When to Choose Each
General-Purpose API Models
OpenAI text-embedding-3-small / 3-large: The safe default for most teams starting out. Strong benchmark performance across English and multilingual tasks. The small model (1536 dimensions, reducible to 256 via Matryoshka) is cost-effective for most use cases. The large model (3072 dimensions) earns its price only if you're seeing measurable retrieval quality gaps in evaluation.
Cohere Embed v3: Differentiated primarily by its input_type parameter — you specify whether text is a search query or a document at embedding time, which meaningfully improves retrieval quality. Worth evaluating if you're building query-document retrieval systems.
Google text-embedding-004: Strong multilingual performance. Reasonable choice if you're already operating in the Google Cloud ecosystem.
Open-Source / Self-Hosted Models
BGE (BAAI General Embedding) family: Strong performers on the MTEB leaderboard across multiple tasks. bge-m3 is a standout for multilingual use cases. License is permissive.
Nomic Embed Text: 8192-token context window, which matters for long-document use cases where chunking is lossy. Open weights, commercially usable.
E5-mistral-7b: High retrieval quality but at significant inference cost. Only sensible for pipelines large enough that cumulative API costs would exceed the cost of hosting a 7B-parameter model yourself.
Vector Store Options and When to Choose Each
Managed Cloud
Pinecone: The easiest path to production. Strong performance, good developer experience, generous free tier. Cost scales with index size and query volume in ways that can surprise teams — run unit economics at your expected scale before committing.
Weaviate Cloud: Adds hybrid search (BM25 + vector) out of the box, which is often the right retrieval strategy anyway (see below). More opinionated than Pinecone; the trade-off is more built-in functionality vs. more complexity to understand.
Self-Hosted Dedicated
Qdrant: Strong performance benchmarks, actively maintained, good Rust-based efficiency. Reasonable to run as a Docker container for small-to-medium workloads. Supports payload filtering natively, which matters if you need metadata-gated retrieval.
Milvus: Enterprise-grade, designed for very large-scale deployments. Overkill for most teams below tens of millions of vectors; worth considering above that threshold.
Existing Infrastructure Extensions
pgvector on Postgres: The pragmatic choice if you already operate Postgres. Performance is adequate for corpora under roughly 1–5 million vectors with HNSW indexing. The massive advantage is zero new infrastructure: no additional service to operate, monitor, or pay for. A Framework for How Generative AI Works covers this integration pattern in context.
Exact Search vs. Approximate Nearest Neighbor
Exact (flat/brute-force) search guarantees perfect recall but costs O(n) per query, because it compares the query against every vector. Practical only up to a few hundred thousand vectors unless you have significant compute.
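For reference, exact search is simple enough to write in a few lines. The sketch below uses only the standard library and cosine similarity; the function names and the toy two-dimensional vectors are illustrative, and a real implementation would use a vectorized library rather than Python loops.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_search(
    query: List[float], vectors: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the top k."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data for illustration only.
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
print(exact_search([1.0, 0.0], store, k=2))
```

Every query touches every vector, which is why the cost grows linearly with corpus size and why ANN indexes exist.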
ANN algorithms (HNSW, IVF-PQ, ScaNN) trade a small, tunable amount of recall for orders-of-magnitude faster search. The two dominant approaches:
HNSW (Hierarchical Navigable Small World): Excellent query speed and high recall at reasonable values of the search-time ef parameter (the size of the candidate list explored per query). Poor at updates: if you delete or update vectors frequently, graph quality degrades, typically until the index is rebuilt. Good for read-heavy, relatively static indexes.
IVF-PQ (Inverted File with Product Quantization): Better at handling large, dynamic indexes. More parameters to tune. Lower memory footprint than HNSW at large scale due to quantization.
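The memory difference can be estimated with two rules of thumb, taken here as assumptions from FAISS's index-selection guidance: an HNSW index over float32 vectors needs about `d * 4 + M * 2 * 4` bytes per vector (the raw vector plus graph links), while a product-quantized code with `m` sub-quantizers at 8 bits each needs about `m` bytes per vector. The sketch ignores vector IDs, codebooks, and inverted-list overhead, and the parameter values are illustrative.

```python
def hnsw_bytes_per_vector(dimensions: int, m_links: int) -> int:
    """Approximate HNSW footprint: float32 vector plus graph links."""
    return dimensions * 4 + m_links * 2 * 4


def pq_bytes_per_vector(num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Approximate PQ footprint: one code per sub-quantizer."""
    return num_subquantizers * bits_per_code // 8


# Illustrative settings: 768-dim vectors, HNSW with M=32, PQ with 64 sub-quantizers.
num_vectors = 100_000_000
hnsw_total = num_vectors * hnsw_bytes_per_vector(768, 32)
pq_total = num_vectors * pq_bytes_per_vector(64)
print(f"HNSW:   {hnsw_total / 2**30:.0f} GiB")
print(f"IVF-PQ: {pq_total / 2**30:.0f} GiB")
```

The gap of well over an order of magnitude is the reason quantized indexes become attractive at very large scale, at the price of some recall.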
Most managed vector databases handle these choices internally; the reason to understand them is knowing when to push back on defaults and how to interpret performance benchmarks.
Hybrid Search: When Pure Vector Isn't Enough
A common failure mode: teams build pure vector search expecting it to outperform keyword search in all cases. It doesn't. Vector search excels at semantic similarity — finding conceptually related content even without lexical overlap. Keyword search (BM25) excels at exact-match queries, proper nouns, product codes, and rare terms that embedding models may have seen infrequently.
Hybrid search, which combines BM25 and vector similarity results (typically via Reciprocal Rank Fusion or a learned reranker), consistently outperforms either method alone across a wide range of retrieval tasks. It is well supported in Weaviate and Elasticsearch, and can be implemented on top of pgvector with some custom logic.
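Reciprocal Rank Fusion is easy to sketch because it only needs the rank positions from each retriever, not their raw scores. The version below assumes the commonly used smoothing constant k = 60 and takes each ranking as a list of document IDs, best first; the function name and sample IDs are illustrative.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy rankings: the keyword retriever finds the exact product code first,
# the vector retriever prefers a semantically related document.
bm25_ranking = ["sku-123", "doc-a", "doc-b"]
vector_ranking = ["doc-a", "doc-c", "sku-123"]
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Documents that appear near the top of both lists rise to the top of the fused list, while a document found by only one retriever still survives with a lower score.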
The Best Tools for How Generative AI Works catalogs several platforms that provide this hybrid retrieval layer natively.
If you're building retrieval for a RAG pipeline specifically, also evaluate adding a cross-encoder reranker (Cohere Rerank, bge-reranker, cross-encoder/ms-marco) as a second pass over your top-k retrieved candidates. Rerankers are slower than embedding-based retrieval but significantly more accurate; applied to a small candidate set (top 20–50), the latency cost is usually acceptable.
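The two-stage shape can be sketched independently of any particular reranker. In the example below, `score_pair` stands in for a cross-encoder call, and `toy_overlap_score` is a deliberately crude placeholder (word overlap) so the sketch runs without a model; both names are hypothetical.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
    rerank_depth: int = 50,
) -> List[str]:
    """Rescore only the first rerank_depth candidates, then keep the best top_n."""
    shortlist = candidates[:rerank_depth]
    rescored = sorted(shortlist, key=lambda doc: score_pair(query, doc), reverse=True)
    return rescored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer: fraction of query words present in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Candidates as they might come back from first-stage retrieval, in retrieval order.
first_stage = [
    "pricing page for enterprise plans",
    "how to reset your password",
    "email not arriving after reset",
]
print(retrieve_then_rerank("reset password", first_stage, toy_overlap_score, top_n=2))
```

Capping `rerank_depth` is what keeps latency bounded: the expensive scorer runs a fixed number of times per query regardless of corpus size.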
The Decision Rule
Reduce your situation to three questions:
1. What is your corpus size and update pattern?
- Under 1M vectors, static or slow-moving → pgvector with HNSW, or any managed service
- 1M–10M vectors, moderate updates → Qdrant or Pinecone
- 10M+ vectors or high-frequency updates → Milvus, Weaviate, or a purpose-designed pipeline
2. Do you need domain-specific performance?
- General content, English-primary → Start with text-embedding-3-small or bge-base-en-v1.5
- Specialized domain or multilingual → Evaluate fine-tuned open-source models; run MTEB-style evals on your own data
- Long documents (>512 tokens meaningfully) → nomic-embed-text or chunking strategy review
3. What is your operational capacity?
- No dedicated ML/infra eng → Managed cloud (Pinecone, Weaviate Cloud) + API embedding model
- Existing Postgres, small corpus → pgvector; don't add complexity you don't need
- Engineering capacity, cost sensitivity at scale → Self-hosted Qdrant or Milvus + open-source embedding
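The first question of the rule is mechanical enough to write down as a function. This is only a sketch of the thresholds listed above: the function name is hypothetical, and the choice to check the large-scale and high-update case first is an assumption about how to resolve overlaps between the three bands.

```python
def suggest_vector_store(num_vectors: int, high_frequency_updates: bool = False) -> str:
    """Map corpus size and update pattern to the shortlist from question 1."""
    if num_vectors >= 10_000_000 or high_frequency_updates:
        return "Milvus, Weaviate, or a purpose-designed pipeline"
    if num_vectors >= 1_000_000:
        return "Qdrant or Pinecone"
    return "pgvector with HNSW, or any managed service"


print(suggest_vector_store(250_000))
print(suggest_vector_store(4_000_000))
print(suggest_vector_store(500_000, high_frequency_updates=True))
```

Treat the output as a starting shortlist; questions 2 and 3 can still override it, for example when an existing Postgres deployment makes pgvector the pragmatic choice.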
Case Study: How Generative AI Works in Practice shows how this decision tree plays out in a real deployment, including the places teams typically have to revise their initial choices.
Frequently Asked Questions
How many vectors can pgvector handle before it struggles?
With HNSW indexing, pgvector performs adequately up to roughly 1–5 million vectors on modern hardware, depending on dimension count and query throughput requirements. Beyond that range — or if you need sub-10ms P95 latency at high concurrency — a dedicated vector database typically pulls ahead. The threshold isn't hard; benchmark against your actual workload before migrating.
Does it matter which embedding model I use as long as I'm consistent?
Consistency within an index is essential — all your vectors must come from the same model version. But the choice of model materially affects retrieval quality, especially on domain-specific content. Don't assume a larger or more expensive model automatically wins; run retrieval evals on a sample of your actual queries before committing.
What is Matryoshka embedding and should I use it?
Matryoshka Representation Learning trains embeddings so that the first N dimensions are themselves a meaningful lower-dimensional embedding. OpenAI's text-embedding-3 models support this, letting you reduce from 1536 to 256 dimensions with modest quality loss. Useful when storage or latency costs are constraining; run quality benchmarks at your target dimension before deploying at scale.
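If you truncate vectors yourself rather than asking the API for a smaller size, the operation is a prefix slice followed by re-normalization. The sketch below assumes the model was trained with Matryoshka-style objectives; applying it to an ordinary embedding model will usually hurt quality badly. The function name is illustrative.

```python
import math
from typing import List


def truncate_embedding(vector: List[float], target_dims: int) -> List[float]:
    """Keep the first target_dims values and rescale to unit length."""
    if not 0 < target_dims <= len(vector):
        raise ValueError("target_dims must be between 1 and len(vector)")
    head = vector[:target_dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Toy 3-dimensional vector truncated to 2 dimensions.
print(truncate_embedding([3.0, 4.0, 12.0], 2))
```

Re-normalizing matters because cosine and dot-product comparisons assume vectors of comparable length, and a truncated prefix is no longer unit length.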
When should I add a reranker to my retrieval pipeline?
Add a reranker when retrieval quality matters more than latency and you can afford a second-pass computation over your top-k candidates. Typical triggers: your RAG pipeline produces good retrieved documents but the LLM still generates poor answers (often a relevance ordering problem), or evaluations show your top-1 precision is low. The How Generative AI Works Checklist for 2026 includes a retrieval evaluation framework that helps identify when reranking is the right fix.
Is it better to use one large chunk or many small chunks when embedding?
Neither extreme is optimal. Large chunks (>512 tokens) lose precision because the embedding averages over too much content. Small chunks (<50 tokens) lose context and produce noisy embeddings. A practical default: 256–512 token chunks with 10–20% overlap. Then test retrieval quality on your actual query distribution — chunk size is often the highest-leverage variable to tune.
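A sliding-window chunker with overlap is a few lines of code. The sketch below works on any pre-tokenized sequence; the defaults (256 tokens with 32 of overlap, about 12.5%) sit inside the range suggested above, and counting tokens with your embedding model's own tokenizer rather than whitespace splitting is left as an assumption for the caller.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int = 256, overlap: int = 32) -> List[Sequence]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Small numbers so the overlap is easy to see.
print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
```

Because chunk size and overlap are plain parameters, it is cheap to sweep a few settings and compare retrieval quality on your own queries.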
What is the real cost difference between API embeddings and self-hosting?
At low volume (under a few million embeddings per month), API models are almost always cheaper when you factor in engineering and infrastructure time. At high volume or real-time embedding of user-generated content, self-hosting a model like bge-base on a GPU instance typically reaches break-even in the range of tens of millions of tokens per day. Run the math for your specific throughput; don't assume either direction without numbers.
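The break-even arithmetic itself is one division. The sketch below compares only marginal API spend against a fixed daily self-hosting cost; the two prices are placeholders invented for illustration, not quotes from any vendor, and the calculation deliberately leaves out engineering time, which often dominates at low volume.

```python
def breakeven_tokens_per_day(
    api_price_per_million_tokens: float, self_host_cost_per_day: float
) -> float:
    """Daily token volume at which API spend equals the fixed self-hosting cost."""
    if api_price_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    return self_host_cost_per_day / api_price_per_million_tokens * 1_000_000


# Placeholder figures for illustration only; substitute your real prices.
hypothetical_api_price = 0.10   # currency units per million tokens
hypothetical_gpu_cost = 5.0     # currency units per day, all-in
tokens = breakeven_tokens_per_day(hypothetical_api_price, hypothetical_gpu_cost)
print(f"Break-even at about {tokens:,.0f} tokens per day")
```

Below the break-even volume the API is cheaper on this narrow measure; above it, self-hosting starts to pay for itself, provided the instance can actually sustain that throughput.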
Key Takeaways
- Embeddings and vector search are two separate decision trees with independent trade-offs; keep them distinct.
- The five axes that matter: retrieval quality vs. speed, corpus size and update frequency, operational burden, embedding stability, and domain specificity.
- pgvector is the right default for small corpora on existing Postgres infrastructure; dedicated vector databases earn their place above 1–5M vectors.
- Hybrid search (vector + BM25) consistently outperforms pure vector retrieval; default to it unless you have a specific reason not to.
- Version-pin your embedding models — silent updates cause index drift that degrades retrieval quality in hard-to-diagnose ways.
- Run evals on your actual query distribution, not synthetic benchmarks; the winning model and architecture depend on your data, not the leaderboard.
- Rerankers are the highest-leverage retrieval improvement for RAG pipelines when baseline retrieval is functional but ordering quality is poor.