Semantic search used to require custom ML teams, months of infrastructure work, and a tolerance for expensive failure. That's no longer true. The ecosystem of embeddings and vector search tools has matured fast enough that a small agency can now stand up production-grade semantic retrieval in a week, not a quarter. The hard part has shifted from "can we build this?" to "which tools should we actually use?"
That shift creates its own confusion. Dozens of databases, embedding APIs, and hybrid search engines now compete for the same budget line. They overlap in capability, differ wildly in operational complexity, and rarely fail in ways their marketing pages mention. Choosing wrong costs you either in migration pain or in performance ceilings you'll hit six months after launch.
This article surveys the current tooling landscape for embeddings and vector search, lays out the criteria that actually determine fit, and gives you the trade-offs you need to make a defensible choice. Whether you're building a document retrieval system, a recommendation engine, or a retrieval-augmented generation (RAG) pipeline, the decision framework here applies. For grounding on why retrieval matters to generative AI at all, How Generative AI Works: Real-World Examples and Use Cases is worth reading first.
What Embeddings and Vector Search Actually Do
An embedding is a numerical representation of meaning. Feed a sentence, image, or product description into an embedding model, and you get back a vector — a list of floating-point numbers, typically 384 to 3,072 dimensions depending on the model — that encodes semantic content in a form math can operate on.
Vector search exploits the geometry of that space. Vectors that sit close together represent content with similar meaning. "Quarterly revenue report" and "Q3 financial summary" land near each other even though they share almost no keywords. That's the capability keyword search can't replicate.
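To make "close together" concrete, here is a minimal sketch of cosine similarity, one common closeness measure for embeddings. The 4-dimensional vectors are hand-picked for illustration and are an assumption, not output from any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
revenue_report = [0.9, 0.1, 0.8, 0.0]      # "Quarterly revenue report"
financial_summary = [0.8, 0.2, 0.9, 0.1]   # "Q3 financial summary"
lunch_menu = [0.0, 0.9, 0.1, 0.8]          # "Cafeteria lunch menu"

docs = {
    "Q3 financial summary": financial_summary,
    "Cafeteria lunch menu": lunch_menu,
}

# Rank documents by similarity to the query vector, most similar first.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(revenue_report, docs[name]),
    reverse=True,
)
print(ranked)
```

A vector database performs the same comparison, but against millions of stored vectors and with an approximate index instead of a full scan.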
The Two-Layer Architecture
Every serious implementation has two layers:
- Embedding layer: The model that converts raw content to vectors. This is where semantic quality is determined.
- Search layer: The database or index that stores vectors and retrieves the nearest neighbors at query time. This is where speed, scale, and cost are determined.
These layers are independently swappable, which adds flexibility but also complexity. Optimizing one without understanding the other produces a system that's either slow to query, expensive to run, or semantically shallow.
Embedding Models: Where Semantic Quality Lives
The embedding model you choose sets the ceiling on retrieval quality. No index can recover meaning a weak embedding model failed to encode.
Hosted API Models
OpenAI `text-embedding-3-large` (3,072 dimensions) and `text-embedding-3-small` (1,536 dimensions) are the de facto defaults for teams already in the OpenAI ecosystem. Pricing runs from roughly $0.00002 to $0.00013 per 1,000 tokens depending on model tier. Quality is strong across English; multilingual performance is decent but not best-in-class.
Cohere Embed v3 is meaningfully better at multilingual retrieval and offers a "compressed" embedding option that reduces storage cost by roughly 90% with modest quality loss — a compelling trade-off for large corpora. Cohere also exposes input type parameters (search document vs. search query vs. classification) that improve precision when used correctly.
Google's `text-embedding-004` via Vertex AI integrates cleanly into GCP-native stacks and performs well on BEIR benchmarks, though developer ergonomics outside GCP are rougher than OpenAI's.
Open-Source and Self-Hosted Models
`bge-large-en-v1.5` and the newer `bge-m3` from BAAI frequently match or beat much larger commercial models on public retrieval benchmarks while running on a single A100 or, for smaller datasets, even a beefy CPU. Self-hosting eliminates per-token costs but adds infrastructure overhead and forces you to own model versioning.
`nomic-embed-text-v1.5` is notable for being fully open-source (Apache 2.0), performing at or near OpenAI small-model quality, and supporting 8,192-token context windows — crucial for long-document retrieval.
Model selection usually comes down to three competing pressures: API convenience and reliability, cost at scale, and data privacy requirements. If your embedding volume stays under ~50 million tokens per month, hosted APIs are almost always cheaper than the engineering overhead of self-hosting.
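A quick back-of-the-envelope calculation shows why that threshold is plausible. The sketch below uses only the per-1,000-token price range quoted earlier for OpenAI's models and the ~50 million token figure above; the function name is illustrative.

```python
def monthly_embedding_cost(tokens_per_month, price_per_1k_tokens):
    """Hosted-API cost in dollars for one month of embedding traffic."""
    return tokens_per_month / 1_000 * price_per_1k_tokens

TOKENS_PER_MONTH = 50_000_000  # the ~50M-token threshold from the text
LOW_PRICE = 0.00002            # $ per 1,000 tokens, low end of the quoted range
HIGH_PRICE = 0.00013           # $ per 1,000 tokens, high end of the quoted range

low = monthly_embedding_cost(TOKENS_PER_MONTH, LOW_PRICE)
high = monthly_embedding_cost(TOKENS_PER_MONTH, HIGH_PRICE)
print(f"${low:.2f} to ${high:.2f} per month")
```

At that volume the API bill comes to single-digit dollars per month, so the cost of running and maintaining your own inference server dominates the comparison.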
Vector Databases: The Core of the Search Layer
Pinecone
Pinecone is the managed-only option that removes operational complexity entirely. You get a serverless index, a clean SDK, and SLAs without running a single container. Query latency is typically 20–80ms at the p95 level under moderate load.
Trade-offs: no self-hosting option, vendor lock-in is real (export is possible but awkward), and costs escalate faster than open-source alternatives as dataset size grows past a few million vectors. Best fit for teams that want to ship fast and aren't at a scale where infrastructure cost is the dominant concern.
Weaviate
Weaviate is a full vector database with hybrid search (BM25 + vector) built in, a GraphQL and REST API, and a modular architecture that lets you attach embedding models directly to the database rather than pre-computing vectors externally. It runs on-premise, on managed cloud, or via Weaviate Cloud Services.
The hybrid search capability is underappreciated. Pure vector search misses exact matches and rare proper nouns. Combining dense vector similarity with sparse BM25 retrieval — what Weaviate calls "hybrid search" — typically improves recall by 10–25% on real-world enterprise document sets compared to vector-only retrieval.
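One common way to merge a dense result list with a sparse one is reciprocal rank fusion (RRF). The sketch below is a generic illustration of that technique, not a description of Weaviate's internal fusion logic; the document IDs are hypothetical and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for one query from each retriever.
dense_hits = ["doc_summary", "doc_report", "doc_memo"]       # vector search
sparse_hits = ["doc_sku_4471", "doc_report", "doc_summary"]  # BM25

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that both retrievers agree on rise to the top, while the exact-match hit that only BM25 found (`doc_sku_4471`) still makes the fused list instead of being lost.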
Qdrant
Qdrant is written in Rust, which helps keep it memory-efficient and fast. It supports named vectors (storing multiple embedding spaces per object), payload filtering applied during the vector search rather than after it, and quantization to reduce memory footprint by 4–16× with manageable quality loss. The open-source version is genuinely production-grade, and Qdrant Cloud offers managed hosting with a generous free tier.
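To see where the low end of that memory saving comes from, here is a minimal sketch of scalar quantization: storing each component as an 8-bit integer instead of a 32-bit float. It is a generic illustration under the assumption of float32 source vectors, not Qdrant's implementation, and it does not cover the more aggressive schemes needed for higher compression ratios.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(x / scale) for x in vector]
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

# Hypothetical 5-dimensional vector for illustration.
vector = [0.12, -0.53, 0.98, 0.0, -0.07]
codes, scale = quantize_int8(vector)
restored = dequantize(codes, scale)

# Worst-case rounding error is half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
bytes_float32 = 4 * len(vector)  # 4 bytes per float32 component
bytes_int8 = 1 * len(vector)     # 1 byte per int8 component
print(codes, f"max error {max_error:.4f}", f"{bytes_float32 // bytes_int8}x smaller")
```

The reconstruction error is bounded by half the scale step, which is why quality loss stays modest for well-behaved embeddings.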
If you're building a system where filtering (e.g., "only search documents from this client, with this status, after this date") is central to the retrieval logic, Qdrant's filtered vector search architecture handles this more efficiently than most competitors.
pgvector
pgvector adds vector similarity search to PostgreSQL via an extension. If your application data already lives in Postgres, this is often the right answer — not because it's the fastest vector search engine, but because keeping your data in one system eliminates synchronization complexity that will otherwise haunt you.
pgvector supports both exact and approximate nearest neighbor search (via HNSW and IVFFlat indexes). At datasets below roughly 1–2 million vectors, performance is competitive. Above that, you'll start feeling the limitations compared to purpose-built databases. Supabase, Neon, and several managed Postgres providers support pgvector without self-hosting the extension.
Chroma
Chroma is the lightweight option. It's designed for RAG prototyping and runs embedded in your application process with zero infrastructure setup. This makes it the right tool for local development, demos, and early-stage products. It's the wrong tool for anything that needs to survive a traffic spike or store more than a few hundred thousand vectors reliably.
Elasticsearch with kNN Search
Elasticsearch added approximate kNN vector search and has been expanding that capability through recent releases. If your organization already runs Elasticsearch for full-text search, the vector capability is worth evaluating because it offers true hybrid search with mature BM25 underneath — and your ops team already knows how to run the infrastructure. The trade-off is that Elasticsearch is operationally heavy for teams who'd otherwise run nothing.
Hybrid Search and Reranking: What Most Tutorials Skip
Vector search alone almost never outperforms a well-tuned hybrid system. The reason: embedding models struggle with rare terms, product codes, names, and domain-specific jargon that wasn't well-represented in training data.
The Hybrid Stack
A production retrieval pipeline typically looks like this:
- Retrieve a candidate set of 50–200 documents using hybrid search (dense vector + sparse BM25)
- Rerank the candidates using a cross-encoder model that scores each (query, document) pair jointly
- Return the top 5–20 results to the LLM or end user
Cohere Rerank, the open-source `cross-encoder/ms-marco-MiniLM-L-6-v2`, and newer options like `bge-reranker-large` all serve the reranking step. Reranking adds 50–200ms of latency but reliably improves precision at the top of the result set, which is the only part the downstream LLM actually uses.
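The rerank step can be sketched as follows. The `toy_pair_scorer` function is a hypothetical placeholder that stands in for a real cross-encoder so the example runs without model downloads; in practice you would swap in one of the rerankers named above.

```python
def rerank(query, candidates, pair_scorer, top_n=5):
    """Score every (query, document) pair jointly and keep the best top_n."""
    scored = [(pair_scorer(query, doc), doc) for doc in candidates]
    # Highest score first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def toy_pair_scorer(query, doc):
    """Placeholder for a cross-encoder: fraction of query terms found in doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

# Hypothetical candidate set, as if returned by the hybrid retrieval stage.
candidates = [
    "office holiday schedule",
    "q3 revenue summary for the board",
    "quarterly revenue report q3",
]

top = rerank("q3 revenue report", candidates, toy_pair_scorer, top_n=2)
print(top)
```

Keeping the scorer as a parameter means the first-stage retriever and the reranker can be tuned or replaced independently.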
This two-stage architecture is central to building reliable RAG pipelines. For a structured look at how retrieval fits into the broader generative AI workflow, see A Framework for How Generative AI Works.
Selection Criteria That Actually Matter
Ignore vague criteria like "scalability" and "ease of use." Here's what to actually evaluate:
- Dataset size now and in 12 months: Under 1M vectors, almost any tool works. Over 10M, you need to understand index build time, memory requirements, and cost.
- Filtering requirements: If you need to filter by metadata before or during vector search, not all tools handle this with equal efficiency. Qdrant and Weaviate handle this best; pgvector struggles at scale.
- Hybrid search need: If your corpus has proper nouns, codes, or technical terms, pure vector search will disappoint. Pick a tool with native hybrid support or plan to build it yourself.
- Operational ownership: Managed services cost 3–5× more per vector-hour than self-hosted equivalents at scale, but they eliminate an entire category of operational risk for smaller teams.
- Data residency and privacy: If your content is sensitive, self-hosted open-source (Qdrant, Weaviate local, pgvector) keeps data off third-party servers. This matters for legal, healthcare, and financial clients.
- Ecosystem and SDK quality: Bad SDKs create hidden tax. Qdrant and Pinecone have the best Python SDKs. Weaviate's v4 client improved dramatically over v3.
Common Failure Modes
Understanding where teams go wrong is as useful as knowing what to pick.
- Chunking strategy mismatch: Embedding 10-page documents as single vectors means the model averages across too much content. Chunks of 200–500 tokens with meaningful overlap (50–100 tokens) work better for most retrieval tasks.
- Ignoring embedding model updates: Embedding models improve. When you upgrade the model, you must re-embed your entire corpus with the new model while the old index stays live and keeps serving queries, then cut over once the new index is complete. Teams that don't plan for this end up with a partially-migrated index that returns inconsistent results.
- Over-relying on cosine similarity alone: Similarity scores are not calibrated probabilities. A score of 0.87 doesn't mean "87% relevant." Set similarity thresholds carefully and test them against real queries with real users.
- Skipping evaluation: Retrieval quality should be measured with a test set of queries and labeled relevant documents before you ship. Tools like RAGAS and TruLens make this tractable. Shipping without evaluation means you won't know you have a recall problem until a client complains.
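The chunking guidance in the first item above can be sketched as a sliding window over tokens. The tokens here are synthetic placeholders; a real pipeline would use the embedding model's own tokenizer, and the default sizes are just one point inside the ranges suggested above.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Synthetic 700-token document for illustration.
tokens = [f"tok{i}" for i in range(700)]
chunks = chunk_tokens(tokens, chunk_size=300, overlap=50)
print([len(c) for c in chunks])
```

Each chunk repeats the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.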
As the tooling for generative AI matures, retrieval quality is increasingly the differentiator between AI products that feel reliable and ones that hallucinate. Case Study: How Generative AI Works in Practice shows how retrieval fits into end-to-end implementations in real organizations.
Frequently Asked Questions
What's the difference between a vector database and adding vector search to an existing database?
A purpose-built vector database (Pinecone, Qdrant, Weaviate) is architected around approximate nearest neighbor search as the primary access pattern — its index structures, memory management, and query planner are all optimized for this. Extensions like pgvector add vector search to a general-purpose database that wasn't designed for it, which works well at moderate scale but hits performance ceilings earlier and with fewer tuning options.
How many dimensions should my embeddings be?
More dimensions don't always mean better retrieval. Newer models like `text-embedding-3-small` at 1,536 dimensions often match or outperform older 768-dimension models, but quality does not scale linearly with dimension count, while storage and query cost grow with every added dimension. Many practitioners find that 768–1,536 dimensions is the practical sweet spot for English-language retrieval tasks, with larger dimensions reserved for multilingual or cross-modal applications.
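The storage side of that trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats and counts raw vector bytes only; index structures and metadata add overhead on top, and the one-million-vector corpus is a hypothetical size.

```python
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed float32 components

def raw_vector_storage_gb(num_vectors, dimensions):
    """Raw float32 storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32 / 1e9

# Compare common embedding widths for a hypothetical 1M-vector corpus.
for dims in (384, 768, 1536, 3072):
    print(dims, f"{raw_vector_storage_gb(1_000_000, dims):.2f} GB")
```

Doubling the dimension count doubles the raw footprint, so moving from 768 to 3,072 dimensions quadruples storage before any quality gain is measured.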
Can I switch embedding models after building my index?
Yes, but it requires re-embedding your entire corpus from scratch with the new model. Because different models produce vectors in different geometric spaces, you cannot mix vectors from different models in the same index. Plan embedding model selection carefully, and treat any model upgrade as a migration event requiring a re-index.
Is vector search replacing keyword search?
Not replacing — complementing. Pure vector search struggles with exact matches, rare terms, and highly specific queries. Pure keyword search misses paraphrase and semantic variation. Production systems that serve real users almost always benefit from hybrid retrieval that combines both, with a reranker to reconcile the two result sets.
What's the minimum viable setup for a RAG prototype?
For a local prototype: Chroma as the vector store, OpenAI `text-embedding-3-small` for embeddings, 300–400 token chunks with 50-token overlap, and a straightforward similarity search call before passing results to an LLM. This gets you a working demo in an afternoon. For a production system, replace Chroma with Qdrant or pgvector, add hybrid search, and implement reranking before launch.
How do I evaluate retrieval quality objectively?
Build a test set of 50–200 representative queries with labeled relevant documents from your corpus. Measure recall@k (what fraction of relevant documents appear in the top-k results) and MRR (mean reciprocal rank). Tools like RAGAS automate much of this for RAG-specific evaluation. Without a test set, you're flying blind — impressive demo queries often don't predict performance on the long tail of real user queries.
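Both metrics are short enough to compute by hand before adopting a framework. The sketch below uses a hypothetical three-query test set with made-up document IDs; it shows the definitions only and is not the RAGAS API.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical labeled test set: system output plus ground-truth labels.
test_set = [
    {"retrieved": ["d1", "d7", "d3"], "relevant": {"d1", "d3"}},
    {"retrieved": ["d9", "d2", "d4"], "relevant": {"d2"}},
    {"retrieved": ["d5", "d6", "d8"], "relevant": {"d0"}},
]

mean_recall_at_2 = sum(
    recall_at_k(q["retrieved"], q["relevant"], k=2) for q in test_set
) / len(test_set)
mrr = sum(
    reciprocal_rank(q["retrieved"], q["relevant"]) for q in test_set
) / len(test_set)
print(f"recall@2={mean_recall_at_2:.2f}  MRR={mrr:.2f}")
```

Tracking these two numbers across chunking, model, and index changes turns "it feels better" into a comparison you can defend.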
Key Takeaways
- Embeddings and vector search are two distinct layers with different quality levers; optimize each separately.
- For most teams under 1M vectors, pgvector or Chroma is sufficient for prototyping; choose Qdrant or Weaviate for production workloads that need filtering and hybrid search.
- Pinecone wins on operational simplicity; open-source alternatives win on cost at scale and data privacy.
- Hybrid search (dense + sparse + reranker) consistently outperforms pure vector search on real-world enterprise document sets.
- Chunking strategy and embedding model selection have more impact on retrieval quality than database choice.
- Plan for embedding model migration before you need it — re-indexing is non-trivial and requires careful cutover.
- Measure retrieval quality with a labeled test set before shipping; similarity scores alone are not a quality signal.