A mid-size professional services firm with 140 employees and a sprawling knowledge base of client deliverables, internal SOPs, and industry research was losing an average of 23 minutes per employee per day to searches for information it already possessed. The search tool was keyword-based. If someone typed "change management framework," they missed the document filed as "organizational transition methodology." The words didn't match. The meaning did. That gap, multiplied across 140 employees and thousands of documents, was costing the firm roughly 900 person-hours a month in wasted retrieval time.
This is an embeddings and vector search case study. It walks through exactly what the firm did, what broke, what worked, and what any operator or professional team can lift directly from the experience. The underlying technology — representing meaning as numbers, then finding proximity between those numbers — is the same technology powering retrieval-augmented generation systems, semantic recommendation engines, and intelligent document Q&A tools across industries. Understanding it at the implementation level, not just the conceptual level, is what separates teams that get value from it from teams that run expensive pilots that go nowhere.
The narrative follows a deliberate arc: situation, decision, execution, measurable outcome, lessons learned. If you want the mechanical foundation first, A Step-by-Step Approach to How Generative AI Works covers the generative model layer that often sits on top of vector retrieval. This article focuses on the retrieval layer itself.
The Situation: When Keyword Search Fails Meaning
The firm's knowledge base held approximately 14,000 documents at the time of the project. Content ranged from two-page meeting summaries to 90-page client strategy reports. The search tool was a basic full-text index — fast, reliable, and fundamentally limited to surface-level string matching.
The Real Cost Was Invisible
Because the failure mode was friction rather than outright failure, nobody had quantified it. Employees worked around the problem: they bookmarked files they knew they'd need again, maintained personal folder systems, or simply asked colleagues. The workarounds obscured the underlying cost. A time-study exercise — employees logging search attempts and outcomes across two work weeks — surfaced the 23-minute average and revealed that roughly 35% of searches ended with no result, even when a relevant document existed.
Why This Problem Is Structural
Keyword search retrieves documents that contain the query terms. It cannot retrieve documents that address the query concept without sharing its vocabulary. Legal teams searching "force majeure applicability" miss documents discussing "act of God clauses" or "contract suspension due to external events." Strategy teams searching "market entry" miss documents tagged "geographic expansion" or "new territory assessment." This isn't a tuning problem. It's an architectural one. No amount of synonym mapping or stemming fully closes the gap, because language is fundamentally unbounded in how it expresses the same idea.
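The vocabulary gap can be shown with a minimal sketch. It assumes a naive matcher that requires every query term to appear in the document text; `keyword_match` and the sample titles are illustrative and are not the firm's actual search tool.

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True only if every query term appears in the document text."""
    doc_terms = set(document.lower().split())
    return all(term in doc_terms for term in query.lower().split())


# Illustrative titles: both cover the same concept, only one shares the words.
titles = [
    "organizational transition methodology",
    "change management framework for technology rollouts",
]

query = "change management framework"
hits = [title for title in titles if keyword_match(query, title)]
print(hits)
```

Only the second title is returned. The first is about the same concept but shares no terms with the query, so a term-based matcher cannot see it.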
The Decision: Choosing Embeddings and Vector Search
The firm's technology lead evaluated three options: an upgraded full-text search engine with better synonym handling, a hybrid BM25 plus vector approach, and a pure vector search implementation. The evaluation criteria were implementation timeline, cost, accuracy on a test query set, and ability to integrate with an eventual Q&A layer.
What Embeddings Actually Do
An embedding model takes a piece of text and outputs a vector: an ordered list of numbers, commonly 768 to 1,536 dimensions depending on the model, that encodes semantic meaning. Two texts with similar meaning produce vectors that are close together in that high-dimensional space, regardless of whether they share any words. The closeness of two vectors, usually measured by cosine similarity or dot product, becomes a proxy for conceptual relatedness.
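Cosine similarity itself is a short calculation. The sketch below uses made-up three-dimensional vectors as stand-ins for real embeddings, which have hundreds of dimensions; the variable names and values are assumptions for illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors with invented values, standing in for real embeddings.
change_management = [0.9, 0.1, 0.2]
org_transition = [0.8, 0.2, 0.3]
expense_policy = [0.1, 0.9, 0.1]

related = cosine_similarity(change_management, org_transition)
unrelated = cosine_similarity(change_management, expense_policy)
print(f"related={related:.2f} unrelated={unrelated:.2f}")
```

The two vectors that point in nearly the same direction score close to 1, while the third scores much lower. With real embeddings the same comparison is what ranks documents against a query.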
This is the core mechanism. Understanding it at this level matters because it predicts where the system will work well (meaning-based retrieval) and where it will struggle (exact string matching, proper nouns, very short or very ambiguous queries).
The Model Selection Decision
The team evaluated three embedding models: OpenAI's text-embedding-ada-002 (OpenAI's offering at the time), Cohere's embed-english-v3.0, and a sentence-transformers model (all-mpnet-base-v2) they could host locally. They built a test set of 150 query-document pairs, manually labeled for relevance, and measured recall@10: the share of test queries for which the labeled document appeared in the top 10 results.
Results on their test set: the Cohere model performed best at 91% recall@10, OpenAI landed at 87%, and the local sentence-transformers model at 79%. The local model's lower performance was partly attributable to the domain-specific vocabulary in the firm's documents — models trained on broader corpora and fine-tuned more recently had an edge. They chose Cohere for accuracy, with OpenAI as a fallback given broader API ecosystem support.
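The recall@10 metric used for this comparison is simple to compute once a labeled set exists. The sketch below assumes one labeled document per query, as in the query-document pairs described above; the query IDs, document IDs, and ranked lists are hypothetical.

```python
def recall_at_k(results: dict[str, list[str]], relevant: dict[str, str], k: int = 10) -> float:
    """Fraction of queries whose labeled document appears in the top-k results."""
    hits = sum(
        1 for query, doc_id in relevant.items()
        if doc_id in results.get(query, [])[:k]
    )
    return hits / len(relevant)


# Hypothetical labeled pairs: query -> the one document judged relevant.
relevant = {"q1": "doc_a", "q2": "doc_b", "q3": "doc_c", "q4": "doc_d"}

# Hypothetical ranked output from one candidate model.
results = {
    "q1": ["doc_a", "doc_x"],
    "q2": ["doc_y", "doc_b"],
    "q3": ["doc_z", "doc_x"],  # labeled document never retrieved
    "q4": ["doc_d"],
}

score = recall_at_k(results, relevant, k=10)
print(score)
```

Running the same function over each candidate model's output on the same labeled set gives directly comparable numbers.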
Choosing a Vector Store
Vector search requires a database purpose-built (or adapted) for nearest-neighbor retrieval. The team evaluated Pinecone, Weaviate, and pgvector (a PostgreSQL extension). Given that the firm already ran PostgreSQL and the document volume was manageable — 14,000 documents chunked to roughly 60,000 vectors — pgvector was sufficient and avoided adding a new infrastructure dependency. For larger-scale deployments, dedicated vector databases offer better performance on tens of millions of vectors and more sophisticated filtering options.
Execution: The Implementation in Five Stages
Stage 1: Chunking Strategy
Before embedding anything, the team had to decide how to split documents. Embedding an entire 90-page report as a single vector produces a vector that averages too many topics; it loses specificity. Embedding individual sentences produces vectors that lack context.
The team settled on a sliding window chunking approach: 400-token chunks with 50-token overlaps between adjacent chunks. The overlap prevents a relevant passage from being split across chunk boundaries in a way that degrades retrieval. Every chunk retained metadata: document title, creation date, author, and document type. Metadata filtering later proved essential — users often wanted to restrict results to documents from a specific time period or practice area.
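A sliding window of this kind can be sketched in a few lines. The version below assumes the document has already been tokenized into a list; the function name and the stand-in tokens are illustrative, and a real pipeline would use the embedding model's own tokenizer and attach the metadata to each chunk.

```python
def sliding_window_chunks(tokens: list[str], size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in for tokenizer output on a 1,000-token document.
tokens = [f"t{i}" for i in range(1000)]
chunks = sliding_window_chunks(tokens)
print([len(chunk) for chunk in chunks])
```

Each window starts 350 tokens after the previous one, so the last 50 tokens of one chunk are repeated as the first 50 of the next.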
Stage 2: Embedding and Indexing
Chunking the 14,000 documents produced 58,000 chunks, each embedded as one vector. Embedding them via the Cohere API at batch sizes of 96 took approximately 4 hours and cost roughly $40 at prevailing API rates. The vectors were stored in pgvector with an HNSW (Hierarchical Navigable Small World) index, which trades a small amount of recall for dramatically faster approximate nearest-neighbor search at query time.
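As a rough sketch of what the storage layer could look like, the SQL below is held in Python strings so that a driver such as psycopg could execute it. The table name, column names, and index name are hypothetical, the 1024 dimension is an assumption about the chosen model's output size, and the HNSW index type requires a pgvector release that supports it. The firm's actual schema is not described in this case study.

```python
# Hypothetical schema for a pgvector-backed chunk table.
EMBEDDING_DIM = 1024  # assumed output size of the embedding model

CREATE_EXTENSION = "CREATE EXTENSION IF NOT EXISTS vector;"

CREATE_TABLE = f"""
CREATE TABLE IF NOT EXISTS chunks (
    id          bigserial PRIMARY KEY,
    doc_title   text NOT NULL,
    doc_type    text NOT NULL,
    author      text,
    created_on  date,
    content     text NOT NULL,
    embedding   vector({EMBEDDING_DIM})
);
"""

# HNSW index on cosine distance, matching the <=> operator used at query time.
CREATE_INDEX = (
    "CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw "
    "ON chunks USING hnsw (embedding vector_cosine_ops);"
)

# <=> is pgvector's cosine distance operator; similarity is 1 - distance.
# The %(name)s placeholders assume a psycopg-style driver.
TOP_K_QUERY = """
SELECT doc_title, content, 1 - (embedding <=> %(query_vec)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(k)s;
"""
```

Keeping the metadata columns in the same table as the vectors is what makes date and document-type filters a plain SQL concern rather than a separate system.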
A critical operational detail: the team built an incremental embedding pipeline from the start. New documents added to the knowledge base were chunked and embedded automatically by a nightly batch job. Without this, the index would have drifted from the actual document library within weeks.
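One common way to decide what a nightly job needs to process is to compare content hashes. The sketch below is an assumption about how such a job could work, not a description of the firm's pipeline; the function names and document IDs are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def documents_to_embed(library: dict[str, str], indexed: dict[str, str]) -> list[str]:
    """IDs of documents that are new or whose text changed since indexing."""
    return sorted(
        doc_id for doc_id, text in library.items()
        if indexed.get(doc_id) != content_hash(text)
    )


def documents_to_delete(library: dict[str, str], indexed: dict[str, str]) -> list[str]:
    """IDs still in the index whose source document no longer exists."""
    return sorted(set(indexed) - set(library))


# Hypothetical state: current library text vs. hashes recorded at last indexing.
library = {"sop-1": "v1", "sop-2": "v2 edited", "memo-9": "new memo"}
indexed = {
    "sop-1": content_hash("v1"),
    "sop-2": content_hash("v2"),
    "old-3": content_hash("retired"),
}

print(documents_to_embed(library, indexed), documents_to_delete(library, indexed))
```

Only the edited and new documents go back through chunking and embedding, and retired documents are flagged for removal, so the index tracks the library without a full rebuild.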
Stage 3: Query Handling
At query time, the user's search string is embedded using the same model. This is non-negotiable: embedding model consistency between indexing and querying is a hard requirement. The resulting query vector is matched against the stored vectors through the HNSW index, which performs an approximate rather than exhaustive comparison, and the top-k most similar chunks are returned with their similarity scores and metadata.
The team set k=20 at the retrieval stage, then applied metadata filters (date range, document type) and a similarity score threshold of 0.72 to remove low-confidence results before surfacing the top 5 to the user. Tuning these thresholds required iteration against the labeled test set. Starting with k too low meant missing relevant results; starting with the threshold too low meant surfacing noise.
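The retrieve, filter, threshold, and truncate sequence can be sketched over an in-memory list. The `Hit` class, field names, and sample results below are hypothetical; the defaults mirror the values quoted above (k=20, threshold 0.72, top 5).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    title: str
    doc_type: str
    year: int
    score: float  # cosine similarity reported by the vector store


def select_results(hits: list[Hit], doc_type: Optional[str] = None,
                   min_year: Optional[int] = None, k: int = 20,
                   threshold: float = 0.72, top_n: int = 5) -> list[Hit]:
    """Take the k nearest hits, apply metadata filters and a score floor, keep top_n."""
    nearest = sorted(hits, key=lambda h: h.score, reverse=True)[:k]
    kept = [
        h for h in nearest
        if (doc_type is None or h.doc_type == doc_type)
        and (min_year is None or h.year >= min_year)
        and h.score >= threshold
    ]
    return kept[:top_n]


# Hypothetical raw results for one query.
hits = [
    Hit("Organizational transition methodology", "strategy", 2022, 0.88),
    Hit("Change playbook", "strategy", 2019, 0.81),
    Hit("Expense policy", "finance", 2023, 0.75),
    Hit("Holiday schedule", "hr", 2023, 0.58),
]

print([h.title for h in select_results(hits, doc_type="strategy", min_year=2020)])
```

Because the filters run after retrieval in this sketch, a small k can leave fewer than five candidates once filters are applied, which is one reason to retrieve more than will be shown.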
Stage 4: The Interface Layer
The search results surfaced document titles, the specific chunk that matched, a similarity score indicator, and a direct link. The team explicitly chose not to implement a generative Q&A layer in the initial release — they wanted to validate retrieval quality first before adding the complexity of an LLM response layer. This sequencing decision proved correct. Several retrieval issues surfaced in the first month that, if hidden inside a generated answer, would have been much harder to diagnose. For the eventual Q&A layer design, they later referenced How Generative AI Works: Best Practices That Actually Work to structure the retrieval-augmented generation pipeline.
Stage 5: Feedback and Iteration
The interface included thumbs-up/thumbs-down feedback on individual results. Over the first 60 days, the team collected 2,200 feedback signals. This data drove two adjustments: re-chunking of document types that performed poorly (dense financial tables, which embed poorly as raw text and were better handled with structured extraction first) and a re-weighting of metadata filters based on actual user behavior patterns.
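Feedback of this kind becomes actionable once it is grouped. The sketch below assumes each signal is recorded as a document type plus a thumbs-up flag, which is a simplification; the sample votes are invented and the function name is hypothetical.

```python
from collections import defaultdict


def approval_by_type(feedback: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of thumbs-up votes per document type."""
    ups: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for doc_type, thumbs_up in feedback:
        totals[doc_type] += 1
        ups[doc_type] += int(thumbs_up)
    return {doc_type: ups[doc_type] / totals[doc_type] for doc_type in totals}


# Invented feedback signals: (document type, thumbs up?).
feedback = [
    ("strategy", True), ("strategy", True), ("strategy", False),
    ("finance_table", False), ("finance_table", False),
    ("finance_table", True), ("finance_table", False),
]

rates = approval_by_type(feedback)
worst = min(rates, key=rates.get)
print(worst, rates[worst])
```

A document type with a persistently low approval rate is a candidate for a different chunking or extraction approach.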
Measurable Outcomes
Eight weeks after launch, the firm re-ran the time-study exercise.
- Average search time dropped from 23 minutes to 8 minutes per employee per day — a 65% reduction.
- Zero-result searches dropped from 35% to 6%.
- Employee-reported task completion confidence (a five-point survey item) rose from 2.9 to 4.1.
- Estimated monthly hours recovered: approximately 625 person-hours.
The outcomes weren't uniformly distributed. The practice areas with the most varied and voluminous document sets — strategy and legal — saw the largest gains. The finance team, whose documents used more structured tabular data, saw smaller initial improvements until the re-chunking adjustment in month two.
These numbers are representative of what similarly scoped implementations produce, not an outlier. Firms running keyword-only search over large, terminology-varied document sets routinely see recall improvements in the 40–70% range when moving to semantic retrieval.
Where It Almost Broke: Failure Modes Worth Naming
Several issues nearly derailed the project. Naming them explicitly is more useful than a sanitized success narrative.
Embedding drift. The team initially planned to re-embed the entire corpus quarterly to accommodate model updates. They didn't account for the fact that switching embedding model versions mid-deployment invalidates all existing vectors — you can't mix vectors from different models. They had to re-embed the full corpus when Cohere updated their model, which took a full weekend batch job and temporarily degraded performance.
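One way to guard against mixing vectors is to record the model name with every stored vector and check it before searching. This is a sketch of that idea under stated assumptions, not the firm's fix; `StoredVector`, `stale_models`, and the model names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    chunk_id: str
    model: str  # name and version of the embedding model that produced it
    values: tuple[float, ...]


def stale_models(index: list[StoredVector], query_model: str) -> list[str]:
    """Models present in the index that differ from the model used for queries."""
    return sorted({v.model for v in index} - {query_model})


def check_index(index: list[StoredVector], query_model: str) -> None:
    """Raise if any stored vector came from a different embedding model."""
    stale = stale_models(index, query_model)
    if stale:
        raise ValueError(f"re-embed required: index contains vectors from {stale}")


# Hypothetical index state after a partial model upgrade.
index = [
    StoredVector("c1", "embed-model-v1", (0.1, 0.2)),
    StoredVector("c2", "embed-model-v2", (0.3, 0.4)),
]

print(stale_models(index, "embed-model-v2"))
```

Running a check like this at deploy time turns a silent quality drop into an explicit error that names the vectors needing a re-embed.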
Short query failure. Two- and three-word queries performed significantly worse than full-sentence queries. The embedding of "change management" is less informative than "what change management frameworks do we use for technology implementations." The team addressed this by prompting users to phrase queries as questions and by experimenting with HyDE (Hypothetical Document Embeddings) — generating a hypothetical document snippet from the short query before embedding it — which improved short-query performance by roughly 15%.
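The HyDE flow can be sketched with two pluggable functions. Here `generate` stands in for an LLM call and `embed` for an embedding call; both are assumptions, and the stubs below exist only so the example runs without any external service. The prompt wording is also illustrative.

```python
from typing import Callable, Sequence


def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed a generated hypothetical passage instead of the raw short query."""
    prompt = (
        "Write a short passage from an internal document that would answer "
        f"this search: {query}"
    )
    hypothetical_passage = generate(prompt)
    return embed(hypothetical_passage)


# Stubs, not real models: a fixed passage and a toy two-number "embedding".
def fake_generate(prompt: str) -> str:
    return "Our change management framework for technology rollouts covers ..."


def fake_embed(text: str) -> list[float]:
    return [float(len(text)), float(text.count(" "))]


vector = hyde_query_vector("change management", fake_generate, fake_embed)
print(vector)
```

The vector sent to the store is that of the longer generated passage, which carries more context than the two-word query. The trade-off is an extra generation call on every search.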
Over-reliance on similarity scores. Early versions surfaced results with scores as low as 0.58, which users found irrelevant. The 0.72 threshold emerged from empirical testing, not theory. Teams that skip this calibration step and use a generic threshold typically surface too much noise.
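Calibrating a threshold amounts to sweeping candidate values over labeled results and watching precision and recall move. The sketch below uses invented score and relevance pairs; the function name and the candidate thresholds are assumptions for illustration.

```python
def precision_recall(scored: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Precision and recall when only results scoring >= threshold are shown."""
    shown = [relevant for score, relevant in scored if score >= threshold]
    total_relevant = sum(relevant for _, relevant in scored)
    true_positives = sum(shown)
    precision = true_positives / len(shown) if shown else 1.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


# Invented (similarity score, judged relevant?) pairs from a labeled test set.
scored = [
    (0.91, True), (0.84, True), (0.78, True), (0.74, False),
    (0.70, True), (0.66, False), (0.61, False), (0.58, False),
]

sweep = {t: precision_recall(scored, t) for t in (0.60, 0.72, 0.80)}
for threshold, (precision, recall) in sweep.items():
    print(f"{threshold:.2f}: precision={precision:.2f} recall={recall:.2f}")
```

A low threshold shows everything relevant along with noise, and a high one shows only strong matches but drops some relevant results. The right cut-off depends on the corpus and the model, which is why it has to be measured.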
This kind of failure pattern echoes broader mistakes covered in 7 Common Mistakes with How Generative AI Works (and How to Avoid Them) — specifically the tendency to skip calibration in favor of faster deployment.
Generalizing the Lessons
This case doesn't generalize to every organization in every detail, but the structural lessons do.
Chunking strategy is the highest-leverage early decision. Poor chunking degrades retrieval more than model choice in many real-world tests. Invest time here before tuning anything else.
Measure before you deploy. Building a labeled test set of even 100–200 query-document pairs before selecting a model is a half-day investment that prevents weeks of post-deployment confusion.
Retrieval before generation. If the end goal is a Q&A assistant, validate the retrieval layer independently first. A bad retrieval layer fed into an LLM produces confidently wrong answers — a failure mode described in detail in Case Study: How Generative AI Works in Practice.
Feedback loops are not optional. Thumbs-up/down feedback at the result level is low friction for users and high signal for operators. Build it in from day one.
Infrastructure simplicity wins at moderate scale. pgvector handled 60,000 vectors with no performance issues. Teams with fewer than 500,000 vectors generally don't need a dedicated vector database. Start simple, and migrate when scale demands it.
The broader capability — understanding that language can be represented as meaning-dense numbers and that proximity in that space equals conceptual similarity — extends well beyond document search. Recommendation systems, customer support routing, competitive intelligence clustering, and contract similarity analysis all run on the same underlying mechanism. The real-world examples of generative AI at work in adjacent domains show how this retrieval layer gets composed into larger, more complex systems.
Frequently Asked Questions
What is the difference between keyword search and vector search?
Keyword search retrieves documents that contain the exact terms (or close variants) in the query. Vector search retrieves documents that are semantically similar to the query — meaning it can surface relevant results even when no words overlap. Vector search uses embedding models to convert text into numerical representations, then finds the closest representations in a high-dimensional space.
How expensive is it to build an embeddings and vector search system?
For a corpus of 10,000–20,000 documents, total embedding costs via commercial APIs typically fall between $20 and $100 for the initial indexing, depending on the model and document length. Infrastructure costs vary: pgvector on an existing PostgreSQL instance adds near-zero cost, while managed vector databases like Pinecone start around $70/month for production tiers. The dominant cost at this scale is usually engineering time, not API or infrastructure spend.
Do you need a large language model to use vector search?
No. Vector search is a standalone retrieval mechanism. You need an embedding model (to convert text to vectors) and a vector store (to index and query those vectors), but you do not need a generative LLM. LLMs become relevant when you want to generate natural language answers from retrieved documents — a pattern called retrieval-augmented generation — but the retrieval layer functions independently.
What embedding model should a team start with?
For English-language professional documents, OpenAI's text-embedding-3-small and Cohere's embed-english-v3.0 are both strong starting points with good API support and documentation. If data privacy requires on-premises processing, sentence-transformers models (particularly all-mpnet-base-v2 or bge-large-en) are capable open-source alternatives. The right choice depends on domain vocabulary, privacy requirements, and whether API latency is a constraint.
How do you handle documents that change frequently?
Build incremental embedding pipelines from the beginning. When a document is updated, re-embed only that document (or its changed chunks) and update the corresponding vectors in the index. Rebuilding the entire index on every change is unnecessary and expensive. Most vector stores support upsert operations that replace vectors by a document ID, making incremental updates straightforward.
What chunk size works best for document retrieval?
There is no universal answer, but 300–600 tokens with a 10–15% overlap between adjacent chunks works well for most professional document types. Shorter chunks increase specificity but lose surrounding context; longer chunks capture more context but produce averaged-out vectors that can miss specific details. The best approach is to test two or three chunking strategies against a labeled test set before committing to one.
Key Takeaways
- Keyword search fails structurally when vocabulary varies; vector search solves this by encoding meaning, not terms.
- Chunking strategy is the highest-leverage early decision in any embeddings implementation — get it wrong and no amount of model tuning recovers it.
- Build a labeled test set of 100–200 query-document pairs before choosing an embedding model; benchmarks on generic datasets often don't predict domain-specific performance.
- Calibrate similarity score thresholds empirically; generic defaults typically surface too much noise or miss relevant results.
- Validate retrieval quality independently before adding a generative layer; retrieval failures hidden inside LLM-generated answers are significantly harder to diagnose.
- At document volumes under 500,000 vectors, pgvector on an existing PostgreSQL instance is usually sufficient; avoid infrastructure complexity until scale demands it.
- Incremental embedding pipelines and user feedback mechanisms are not optional add-ons; they are the operational backbone that keeps the system accurate over time.
- The same embedding and vector search infrastructure scales across use cases: document retrieval, recommendation, support routing, and contract analysis all share the same foundational mechanism.