The Plumbing Behind Smart Search, Recommendations, and RAG

Embeddings and vector search sound like infrastructure plumbing — the kind of thing only machine learning engineers care about. But once you see what they actually do, you realize they're the mechanism behind some of the most useful AI features being shipped right now: smart document search, product recommendation, fraud detection, customer support routing, and the retrieval layer inside most serious RAG (retrieval-augmented generation) pipelines. If you're building with AI or advising clients who are, you need a working mental model of this technology and — more importantly — a clear picture of where it succeeds and where it breaks down.

This article walks through specific, concrete scenarios: what was built, how the embedding and vector search layer was set up, and what made it work or fail. The goal isn't to make you a machine learning researcher. It's to make you a sharper decision-maker when this technology is on the table.

Before diving into examples, one quick frame: an embedding is a list of numbers (a vector) that represents the meaning of a piece of content — a sentence, product description, image, or user behavior pattern. Similar things get similar vectors. Vector search is the process of finding stored vectors that are closest to a query vector. Together, they let you search by meaning rather than by keyword. That distinction drives everything in the examples below.
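
To make the frame concrete, here is a minimal sketch of vector search in plain Python. The three-number vectors and their labels are invented for illustration (real embeddings come from a model and typically have hundreds or thousands of dimensions), and `cosine_similarity` and `nearest` are hypothetical helper names, not a library API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def nearest(query, index, k=2):
    """Return the k (label, score) pairs whose vectors are most similar to the query."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, made up for this example only.
index = {
    "heart attack": [0.9, 0.1, 0.0],
    "cardiac event": [0.8, 0.2, 0.1],
    "tax filing": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # stands in for the embedded query text
top = nearest(query, index, k=2)
```

The two medical phrases share no keywords, yet their vectors point in nearly the same direction, so both rank above the unrelated entry. A production system replaces the brute-force loop with an approximate nearest-neighbor index, but the comparison it performs is the same.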

Semantic Document Search at a Law Firm

A mid-size litigation firm had 15 years of case memos, deposition summaries, and internal research stored in a shared drive. Associates were spending 4–6 hours per research task because keyword search missed synonyms and context.

What they built

Every document was chunked into ~500-word segments and run through an embedding model (OpenAI's text-embedding-3-small in this case, though similar results come from open-source alternatives like all-MiniLM-L6-v2). The resulting vectors were stored in a vector database — Pinecone in this deployment. When an associate typed a research query, it was embedded using the same model, and the top 10–20 nearest chunks were retrieved by cosine similarity.

What made it work

Chunking strategy mattered more than model choice. Chunks that cut mid-argument performed poorly. Chunking by logical section (headings, paragraph breaks) dramatically improved result quality.
Metadata filtering was essential. Without filtering by case type or date range, a query about employment discrimination might surface 10-year-old, jurisdiction-irrelevant results. Adding structured metadata filters reduced noise significantly.
Hybrid search helped. A pure semantic search occasionally missed exact case citations. Combining vector search with BM25 keyword search (a technique called hybrid retrieval) caught both.
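
The last two points can be sketched together. The example below merges a vector ranking and a keyword ranking with reciprocal rank fusion, one common way to combine ranked lists, and then applies a metadata filter. The source does not say which fusion method the firm used, so treat this as an assumption; the document IDs, metadata fields, and function names are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by either retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

def filter_by_metadata(doc_ids, metadata, case_type, min_year):
    """Keep only documents matching the case type and recency cutoff."""
    return [
        doc_id for doc_id in doc_ids
        if metadata[doc_id]["case_type"] == case_type
        and metadata[doc_id]["year"] >= min_year
    ]

# Invented sample data for illustration.
metadata = {
    "memo_a": {"case_type": "employment", "year": 2021},
    "memo_b": {"case_type": "employment", "year": 2009},
    "memo_c": {"case_type": "employment", "year": 2019},
    "memo_d": {"case_type": "contract", "year": 2022},
}
vector_ranking = ["memo_a", "memo_b", "memo_c", "memo_d"]   # from semantic search
keyword_ranking = ["memo_c", "memo_a", "memo_d"]            # from BM25 keyword search

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
results = filter_by_metadata(fused, metadata, case_type="employment", min_year=2015)
```

Filtering after fusion keeps the sketch short. Many vector databases can apply the filter during the search itself, which avoids retrieving candidates that will be discarded anyway.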

Where it failed first

The initial build skipped the metadata layer entirely. Associates didn't trust results because they couldn't tell if a retrieved chunk was current or applicable to their jurisdiction. Trust is a hard thing to rebuild. The fix was adding a visible source summary (case name, date, jurisdiction) to every result card — a UX fix as much as a technical one.

E-Commerce Product Recommendation

A mid-market apparel retailer with roughly 8,000 SKUs wanted to improve their "you might also like" recommendations, which were previously driven by category rules.

The embedding approach

Product descriptions, material details, and style tags were concatenated and embedded. Separately, user click and purchase sequences were embedded using a session-based model to create behavioral vectors. At query time — when a user viewed a product — both the product vector and the user session vector were used to retrieve nearest neighbors.

What worked

Product-to-product similarity worked almost immediately. Searching for a linen blazer reliably surfaced other relaxed-fit summer pieces rather than just "other blazers." The semantic layer understood material, occasion, and silhouette in ways category tags never did.

What failed

Behavioral vectors degraded quickly for low-activity users. New visitors had no behavioral signal, so recommendations defaulted to popularity, undermining the whole point. The solution was a cold-start fallback: for sessions under three interactions, use only the product vector; blend in behavioral vectors as session length grew. This kind of graceful degradation needs to be designed in from the start, not bolted on after complaints.
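
A cold-start fallback of this kind might look like the sketch below. Only the three-interaction cutoff comes from the description above; the ramp length (`full_weight_at`), the cap on the behavioral weight (`max_weight`), and the function name are assumptions chosen for illustration.

```python
def blend_vectors(product_vec, session_vec, n_interactions,
                  cold_start=3, full_weight_at=10, max_weight=0.5):
    """Blend a product vector with a session (behavioral) vector.

    Below `cold_start` interactions only the product vector is used.
    From there the behavioral weight ramps linearly up to `max_weight`,
    which it reaches at `full_weight_at` interactions.
    """
    if session_vec is None or n_interactions < cold_start:
        return list(product_vec)
    ramp_steps = full_weight_at - cold_start + 1
    progress = min(1.0, (n_interactions - cold_start + 1) / ramp_steps)
    weight = max_weight * progress
    return [(1 - weight) * p + weight * s
            for p, s in zip(product_vec, session_vec)]

product = [1.0, 0.0]   # toy vectors, invented for the example
session = [0.0, 1.0]
new_visitor = blend_vectors(product, session, n_interactions=2)
engaged_user = blend_vectors(product, session, n_interactions=10)
```

The ramp avoids a sudden jump in recommendations at the moment a session crosses the cutoff. Where to set the cap is a tuning decision that depends on how reliable the behavioral signal turns out to be.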

A subtler failure: embeddings were generated once during a product catalog refresh. When inventory changed — items discontinued, new seasonal drops added — the vector index went stale. Real-time or nightly re-indexing for active SKUs became mandatory.
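
One way to keep nightly re-indexing cheap is to re-embed only what changed. The sketch below compares a hash of each active SKU's text against the hash stored at the last embed. This is an illustrative approach, not a description of what the retailer ran; `plan_reindex` and the sample catalog are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of the text that gets embedded for a SKU."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(catalog, indexed_hashes):
    """Decide which SKUs to re-embed and which vectors to delete.

    catalog: {sku: text} for currently active SKUs.
    indexed_hashes: {sku: hash} recorded when each vector was last built.
    """
    to_embed = sorted(
        sku for sku, text in catalog.items()
        if indexed_hashes.get(sku) != content_hash(text)  # new or changed
    )
    to_delete = sorted(sku for sku in indexed_hashes if sku not in catalog)
    return to_embed, to_delete

# Invented sample data: B changed, C was discontinued, D is a new drop.
indexed_hashes = {
    "A": content_hash("linen blazer, relaxed fit"),
    "B": content_hash("wool coat"),
    "C": content_hash("corduroy trousers"),
}
catalog = {
    "A": "linen blazer, relaxed fit",
    "B": "wool coat, recycled lining",
    "D": "cotton overshirt",
}
to_embed, to_delete = plan_reindex(catalog, indexed_hashes)
```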

Customer Support Ticket Routing

A SaaS company with four product lines and a support team spread across three regions was manually triaging ~1,200 tickets per week. Routing accuracy was around 72%, meaning more than 1 in 4 tickets went to the wrong queue first.

How embeddings were applied

Historical resolved tickets (about 40,000 of them) were embedded and stored with labels for product area, issue type, and priority level. Incoming tickets were embedded on receipt and matched against this labeled corpus using nearest-neighbor search. The top-match label became a routing suggestion, which a human agent could accept or override.

Accuracy gains and limits

Routing accuracy improved to approximately 89% within three months of deployment — a meaningful gain, but not perfect. The remaining errors clustered around:

Novel issue types with no historical precedent (new feature bugs, policy changes)
Ambiguous tickets that legitimately belonged in two queues
Non-English submissions when the embedding model was trained predominantly on English text

The ambiguous-ticket problem was addressed by flagging any ticket where the top two nearest neighbors came from different product areas — these went to a senior triage agent automatically rather than being routed confidently in the wrong direction.
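
The routing rule, including the ambiguity flag, is small enough to sketch directly. The queue names and the `route_ticket` function are hypothetical, and the input is assumed to be the labeled nearest neighbors already sorted by similarity.

```python
def route_ticket(neighbors):
    """Suggest a queue from labeled nearest neighbors.

    neighbors: list of (product_area, similarity), most similar first.
    Returns (queue, needs_senior_triage).
    """
    if not neighbors:
        # Nothing similar in the historical corpus: do not guess.
        return ("senior_triage", True)
    top_area = neighbors[0][0]
    if len(neighbors) > 1 and neighbors[1][0] != top_area:
        # The top two matches disagree on product area, so flag as ambiguous.
        return ("senior_triage", True)
    return (top_area, False)

# Invented examples.
clear = route_ticket([("billing", 0.91), ("billing", 0.88), ("analytics", 0.61)])
ambiguous = route_ticket([("billing", 0.84), ("integrations", 0.83)])
```

In the deployment described above the suggestion still went to a human agent, who could accept or override it.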

Why this works connects back to how the underlying generative models process language. If that layer is new to you, How Generative AI Works: A Beginner's Guide provides a useful foundation before going deeper into retrieval pipelines.

RAG (Retrieval-Augmented Generation) for Internal Knowledge Bases

RAG is probably where most professionals encounter embeddings and vector search in practice right now. The pattern: rather than fine-tuning a model on proprietary data (expensive, slow, requires retraining on updates), you store your data as embeddings and retrieve relevant chunks at query time to give the LLM real context. A Step-by-Step Approach to How Generative AI Works covers the generative side; this section focuses on what makes the retrieval layer succeed or break.

A consulting firm's internal knowledge bot

A 200-person consulting firm built a chatbot over their methodology documents, client-facing templates, and past engagement summaries. Associates could ask questions like "what's our standard approach to supply chain resilience assessments?" and get grounded, cited answers.

The retrieval failures that caused real problems

Chunk size vs. context tension. Small chunks (under 200 tokens) retrieved precise sentences but stripped context. The LLM received a fragment it couldn't interpret correctly. Larger chunks (700–900 tokens) gave more context but diluted relevance. A sliding window approach — chunks with ~20% overlap — reduced context loss without ballooning retrieval noise.
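
A sliding window with overlap can be sketched in a few lines. The example treats list items as tokens; a real pipeline would use the embedding model's tokenizer. The function name and the default sizes are assumptions, apart from the roughly 20% overlap mentioned above.

```python
def sliding_window_chunks(tokens, chunk_size=500, overlap_ratio=0.2):
    """Split a token list into fixed-size chunks that overlap their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = round(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# 25 stand-in tokens, windows of 10 with 2 tokens shared between neighbors.
tokens = list(range(25))
chunks = sliding_window_chunks(tokens, chunk_size=10, overlap_ratio=0.2)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk. The cost is a somewhat larger index.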

Embedding model mismatch. The firm initially used a general-purpose embedding model for documents written in dense consulting jargon and acronyms. Retrieval quality was mediocre because the model's concept of "PMO governance" and the firm's internal usage didn't align. Switching to a model that could be fine-tuned on their vocabulary, or using a more capable base model, improved precision noticeably.

The "lost in the middle" problem. When 10+ chunks were passed to the LLM, the most relevant ones sometimes landed in positions 4–7 of the context window, and research on long-context LLMs suggests models underweight information in the middle of the context. Limiting retrieved chunks to 3–5 and ordering them with a re-ranking model (like Cohere Rerank) improved answer quality more than retrieving more chunks did.
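
The retrieve-wide, re-rank, keep-few pattern can be sketched without committing to a vendor. In the example below, `word_overlap` is a deliberately crude stand-in for a real re-ranking model, and neither function reflects Cohere's API; both names and the sample chunks are hypothetical.

```python
def rerank_and_trim(query, chunks, score_fn, keep=3):
    """Order candidate chunks by a relevance score and keep only the best few."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:keep]

def word_overlap(query, chunk):
    """Toy relevance score: number of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    return len(query_words & set(chunk.lower().split()))

# Invented candidates, as if returned by a first-pass vector search.
candidates = [
    "our holiday party schedule",
    "supply chain resilience assessment template",
    "supply chain vendor list",
    "expense policy",
]
context = rerank_and_trim(
    "supply chain resilience assessment", candidates, word_overlap, keep=2
)
```

Passing the scorer in as a function keeps the trimming logic independent of whichever re-ranking model ends up being used.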

One of the common mistakes with generative AI implementations is treating retrieval quality as a model problem when it's often a data preparation problem. Garbage chunking and stale indexes cause more failures than model capability limits.

Fraud Detection at a Fintech Startup

A payments company used transaction embeddings to identify anomalous behavior patterns — not to replace rule-based fraud detection, but to surface cases that fell through rules.

The setup

Transaction sequences (merchant category, amount, location delta, time of day, device fingerprint) were encoded as vectors representing a user's "normal behavior pattern." New transactions were embedded in real time and compared against the user's historical cluster. Large deviations triggered a secondary review flag.
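
A simplified version of the deviation check is sketched below. It assumes the user's "normal" pattern is summarized as the centroid of their past transaction vectors and that cosine distance is the comparison; the source does not specify either choice. The two-number vectors, the 0.3 threshold, and the function names are illustrative.

```python
import math

def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction, larger means less alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # no direction to compare, so treat as dissimilar
    return 1.0 - dot / norm

def needs_review(history, new_vec, threshold=0.3):
    """Flag a transaction whose vector sits far from the user's usual pattern."""
    return cosine_distance(centroid(history), new_vec) > threshold

# Toy history: three similar past transactions, invented for the example.
history = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]]
typical = needs_review(history, [0.92, 0.08])
unusual = needs_review(history, [0.0, 1.0])
```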

What worked and what didn't

This approach caught "soft" fraud: account takeovers where the attacker followed the rules (no large sudden transfers) but transacted in categories outside the victim's normal pattern. Rule-based systems missed these; vector similarity flagged them.

The failure mode was drift. A user who moved cities or changed spending habits would generate false positives for 3–4 weeks until their behavioral cluster updated. Tuning the sensitivity threshold required balancing fraud catch rate against customer friction, which is a business decision, not a technical one. One of the best practices for generative AI deployment applies directly here: treat threshold-setting as a product decision rather than an engineering default.

Image and Multimodal Search

Embeddings aren't limited to text. A home furnishings retailer implemented visual search: customers could upload a photo of a room or a piece of furniture they liked, and the system would retrieve visually similar products.

What worked

CLIP (a multimodal model from OpenAI) embeds both images and text into the same vector space. This meant a text query ("mid-century modern side table, walnut finish") and an image query (a photo of a similar table) could both retrieve relevant products from the same index. The unified index was a significant operational simplification.

What failed initially

Product photography was inconsistent — some items shot on white backgrounds, others styled in room settings. The model's similarity calculations were partly influenced by photographic style rather than product features. Standardizing product photography improved retrieval quality more than any model upgrade did.

This is a pattern that recurs across every domain: data quality and consistency affect embedding quality directly. No amount of model sophistication compensates for inconsistent, noisy source data. The real-world generative AI examples that hold up over time share one common feature: someone invested heavily in data preparation.

Frequently Asked Questions

What's the difference between keyword search and vector search?

Keyword search matches exact terms or close variants; vector search matches by semantic meaning. A keyword search for "cardiac event" won't reliably return documents about "heart attack," but a vector search usually will. In practice, the best production systems combine both approaches (hybrid search) because each catches failures the other misses.

Do you need a dedicated vector database, or can you use what you already have?

For small datasets (under a few hundred thousand vectors), extensions like pgvector for PostgreSQL often work fine and eliminate operational overhead. As scale and query-per-second requirements grow, dedicated vector databases like Pinecone, Weaviate, or Qdrant offer better performance tuning, filtering, and index management. Choose based on actual scale requirements, not theoretical future scale.

How do you know if your embedding model is good enough for your use case?

Benchmark it on your actual data. Embed 100–200 representative queries and their known-good matches, then measure retrieval precision at various k values (top-1, top-5, top-10). If a general model performs poorly on your domain-specific vocabulary, consider a fine-tuned model or a larger general model before building more infrastructure.
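
A benchmark like this needs very little code. The sketch below assumes each test query has exactly one known-good document and measures how often it appears in the top k results, often called hit rate at k. That is a simplification of the precision measurement described above, chosen because it is easy to read. The query IDs, document IDs, and function name are invented.

```python
def hit_rate_at_k(results, expected, k):
    """Fraction of queries whose known-good document appears in the top k.

    results: {query_id: ranked list of retrieved document IDs}
    expected: {query_id: the one document ID a good system should return}
    """
    if not results:
        return 0.0
    hits = sum(
        1 for query_id, ranked in results.items()
        if expected[query_id] in ranked[:k]
    )
    return hits / len(results)

# Invented retrieval output for four test queries.
results = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d9", "d4", "d5"],
    "q3": ["d7", "d8", "d6"],
    "q4": ["d10", "d11", "d12"],
}
expected = {"q1": "d1", "q2": "d4", "q3": "d6", "q4": "d99"}

scores = {k: hit_rate_at_k(results, expected, k) for k in (1, 2, 3)}
```

Running the same queries through two candidate models and comparing the scores at each k gives a like-for-like comparison on your own data.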

What causes embedding search to return irrelevant results?

The most common causes are poorly sized chunks that strip context, stale indexes that haven't been updated as source data changed, an embedding model that doesn't match the domain vocabulary, and missing metadata filters, which let outdated or wrong-jurisdiction content surface. In most production failures, the problem is data preparation, not model capability.

Is vector search the same as AI search?

Not exactly. Vector search is one technique within AI-powered search. AI search systems typically combine vector search, keyword/BM25 search, re-ranking models, and sometimes generative summarization. "AI search" is a product category; vector search is one component of how it works under the hood.

Key Takeaways

Embedding quality is determined more by chunking strategy, data consistency, and metadata design than by model choice alone.
Hybrid search — combining vector similarity with keyword retrieval — outperforms pure vector search in most production scenarios.
Cold-start failures (new users, new items, novel issue types) need explicit fallback logic built in from the start.
Behavioral and product embeddings degrade when indexes go stale; scheduled re-indexing is operational hygiene, not optional.
The "lost in the middle" problem means that retrieving more chunks doesn't always improve LLM answer quality — re-ranking a smaller set usually outperforms passing large context windows.
Threshold-setting in embedding-based classification (fraud, routing, recommendations) is a business and product decision, not a default engineering parameter.
Data preparation problems cause more production failures in vector search than model limitations do.

Agency Script Editorial
