Building Semantic Search Engines — Delivering Search Systems That Understand Meaning, Not Just Keywords

A legal technology agency in Washington, D.C. was hired by a large law firm to modernize its research workflow. The firm's 280 attorneys spent an average of 47 minutes per research query searching through 3.8 million case documents, briefs, and memos using a keyword-based search system. Attorneys complained that finding relevant precedent required guessing the exact terminology used in older documents: searching for "breach of fiduciary duty" missed documents that discussed "violation of trust obligations" or "failure to act in the client's best interest." The agency built a semantic search engine that understood the meaning behind queries, not just the specific words used. Research time dropped to 8 minutes per query. Attorney satisfaction with the search tool jumped from 23% to 91%. The firm estimated that the productivity improvement saved $2.7 million annually in billable hours that had been going to research instead of client work.

Semantic search uses AI to understand the meaning and intent behind search queries and documents, matching them based on conceptual similarity rather than keyword overlap. For AI agencies, semantic search is one of the most impactful deliverables because it directly improves knowledge worker productivity — every organization has people spending significant time searching for information, and semantic search makes that process dramatically faster and more accurate.

How Semantic Search Differs From Keyword Search

The Vocabulary Mismatch Problem

Keyword search requires the query to contain the same words as the relevant document. This works when users know the exact terminology but fails when:

Different documents use different words for the same concept ("automobile," "vehicle," "car")
Users search with natural language that differs from document language ("how to fix a slow computer" vs. a document titled "Performance Optimization Guide")
Domain terminology has evolved over time (older documents use different terms than current practice)
Users describe a concept rather than naming it ("the rule about not hiring family members" vs. "anti-nepotism policy")

Semantic search bridges this vocabulary gap by representing both queries and documents as vectors in a shared meaning space, where conceptually similar texts have similar vectors regardless of the specific words used.
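To make the "shared meaning space" concrete, here is a minimal sketch of the similarity computation. The three-dimensional vectors are hypothetical values picked by hand for illustration; a real embedding model would produce them, typically with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (assumed values, not output of any real model).
query = [0.9, 0.1, 0.0]          # "breach of fiduciary duty"
same_concept = [0.8, 0.2, 0.1]   # "violation of trust obligations"
other_topic = [0.0, 0.1, 0.9]    # an unrelated document

print(cosine_similarity(query, same_concept))  # high: vectors point the same way
print(cosine_similarity(query, other_topic))   # low: vectors are nearly orthogonal
```

The two phrases share no keywords, yet their vectors sit close together, which is the property the retrieval step relies on.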

When Semantic Search Adds Value

High value:

Large document corpuses with diverse vocabulary (legal libraries, research databases, knowledge bases)
Users with varying levels of domain expertise (not everyone knows the "correct" search terms)
Conceptual queries ("documents about environmental impact of lithium mining" vs. keyword search "lithium mining environmental")
Cross-language or cross-terminology search needs

Limited value:

Exact match requirements (searching for a specific document ID, case number, or proper name)
Highly structured data with consistent terminology
Very small corpuses where browsing is practical
Users who consistently use precise domain terminology

Best approach: Hybrid search that combines semantic understanding with keyword matching, giving users the benefits of both approaches.

System Architecture

Indexing Pipeline

The indexing pipeline converts documents into searchable representations.

Document ingestion:

Accept documents from multiple sources (file systems, databases, APIs, content management systems)
Extract text from diverse formats (PDF, Word, HTML, Markdown, plain text)
Preserve document metadata (title, author, date, source, category) for filtering and display

Text preprocessing:

Clean extracted text (remove boilerplate, headers, footers, page numbers)
Segment long documents into searchable chunks (paragraphs, sections, or fixed-size windows with overlap)
Each chunk becomes an independently searchable unit while maintaining a reference to its parent document
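The fixed-size window option can be sketched as follows. This is an illustrative sketch only: it splits on whitespace as a stand-in for the embedding model's real tokenizer, and the function name, field names, and default sizes are assumptions.

```python
def chunk_document(doc_id, text, size=256, overlap=32):
    """Split text into overlapping fixed-size windows of whitespace tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append({
            "chunk_id": f"{doc_id}#{len(chunks)}",
            "parent_doc": doc_id,   # reference back to the source document
            "start_token": start,
            "text": " ".join(window),
        })
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

Each chunk carries its `parent_doc`, so a matching chunk can always be traced back to the full document at result time.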

Embedding generation:

Pass each text chunk through an embedding model to generate a dense vector representation
Choose the embedding model based on domain, quality requirements, and infrastructure constraints
Store embeddings in a vector database alongside the chunk text and metadata

Index maintenance:

Monitor for new, updated, and deleted documents
Update the index incrementally — add new chunks, update changed chunks, remove deleted chunks
Schedule full re-indexing when the embedding model changes or when significant index quality degradation is detected
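One simple way to detect new, changed, and deleted documents is to compare content hashes between runs. The sketch below assumes you persist a `doc_id -> hash` mapping from the previous indexing run; the function names are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_index_update(indexed_hashes, current_docs):
    """Compare stored hashes with current documents.

    indexed_hashes: doc_id -> hash saved at the last indexing run
    current_docs:   doc_id -> current document text
    Returns the update plan and the new hash mapping to persist.
    """
    current_hashes = {d: content_hash(t) for d, t in current_docs.items()}
    plan = {
        "add": sorted(current_hashes.keys() - indexed_hashes.keys()),
        "update": sorted(
            d for d in current_hashes.keys() & indexed_hashes.keys()
            if current_hashes[d] != indexed_hashes[d]
        ),
        "delete": sorted(indexed_hashes.keys() - current_hashes.keys()),
    }
    return plan, current_hashes
```

Only documents in `add` and `update` need to be re-chunked and re-embedded, which keeps incremental runs cheap relative to a full re-index.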

Query Pipeline

The query pipeline processes user queries and returns relevant results.

Query understanding:

Parse the raw query to identify search intent
Extract filter criteria from the query ("contracts from 2025" contains a date filter)
Expand the query if it is very short or ambiguous (use an LLM to rephrase the query into a more detailed search statement)
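Filter extraction can start very simply. The sketch below handles only the "from/in YEAR" pattern from the example above using a regular expression; the pattern and function name are assumptions, and a production system would likely need a richer parser or an LLM for other filter types.

```python
import re

# Matches "from 2025" or "in 2019" (years 1900-2099 only).
YEAR_PATTERN = re.compile(r"\b(?:from|in)\s+((?:19|20)\d{2})\b", re.IGNORECASE)

def extract_year_filter(query):
    """Return (remaining query text, filters) for a raw query string."""
    match = YEAR_PATTERN.search(query)
    if match is None:
        return " ".join(query.split()), {}
    remaining = query[:match.start()] + query[match.end():]
    return " ".join(remaining.split()), {"year": int(match.group(1))}

print(extract_year_filter("contracts from 2025"))
```

The remaining text goes to the embedding model, while the extracted filter is applied as a metadata constraint during retrieval.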

Retrieval:

Embed the query using the same embedding model used for documents
Execute a vector similarity search against the document index
Apply metadata filters (date range, document type, author, category) alongside vector search
Retrieve the top 50-100 candidates for re-ranking

Re-ranking:

Use a cross-encoder re-ranking model to re-score the retrieved candidates
The cross-encoder takes the query-document pair as input and produces a relevance score
This is more accurate than embedding similarity but too slow to run against the full corpus, so it is applied only to the retrieved candidates
Return the top 10-20 re-ranked results to the user
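The retrieve-then-re-rank flow can be sketched in a few lines. Everything here is a simplified assumption: the index is an in-memory list, first-stage scoring is a brute-force dot product rather than a vector database query, and `overlap_score` is only a placeholder for a real cross-encoder.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query_text, query_vec, index, rerank_score,
                     filters=None, n_candidates=50, n_results=10):
    """Filter by metadata, retrieve by vector score, then re-rank candidates."""
    filters = filters or {}
    eligible = [c for c in index
                if all(c["meta"].get(k) == v for k, v in filters.items())]
    # Stage 1: cheap vector similarity over every eligible chunk.
    candidates = sorted(eligible, key=lambda c: dot(query_vec, c["vector"]),
                        reverse=True)[:n_candidates]
    # Stage 2: expensive scorer, applied only to the shortlisted candidates.
    reranked = sorted(candidates,
                      key=lambda c: rerank_score(query_text, c["text"]),
                      reverse=True)
    return reranked[:n_results]

def overlap_score(query_text, chunk_text):
    """Placeholder scorer (token overlap); swap in a cross-encoder in practice."""
    q = set(query_text.lower().split())
    d = set(chunk_text.lower().split())
    return len(q & d) / len(q) if q else 0.0
```

The design point is the shape of the pipeline: the slow scorer sees at most `n_candidates` items, no matter how large the corpus grows.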

Result presentation:

Show relevant text snippets with query-relevant passages highlighted
Include metadata (document title, date, source, page number) for context
Provide relevance scores or confidence indicators
Link to the full document for deeper reading

Hybrid Search Architecture

Combining semantic search with keyword search produces the best results for most enterprise applications.

Implementation approaches:

Parallel retrieval with fusion:

Run semantic search and keyword search in parallel on the same query
Each returns a ranked list of results
Combine the lists using reciprocal rank fusion (RRF) or a learned combination model
Re-rank the fused results with a cross-encoder
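Reciprocal rank fusion is short enough to show in full. Each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is a commonly used default, assumed here rather than tuned; the document IDs are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

semantic_hits = ["d1", "d2", "d3"]
keyword_hits = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```

Because RRF uses only ranks, it avoids having to reconcile cosine similarities with keyword scores such as BM25, which live on different scales.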

Single-index hybrid:

Store both dense vectors (for semantic search) and sparse vectors (for keyword search) in the same index
Most modern vector databases (Weaviate, Qdrant, Pinecone) support hybrid queries
Query with both dense and sparse representations simultaneously
The database combines scores using a configurable weighting

When to weight keyword search higher:

The corpus contains many specific identifiers (product codes, case citations, technical specifications)
Users frequently search for exact phrases or proper nouns
The domain has precise terminology where exact matching is important

When to weight semantic search higher:

Users search with natural language questions
The corpus has diverse vocabulary for similar concepts
Users are exploring topics rather than finding specific documents
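The weighting trade-off can be illustrated with a single parameter. This sketch min-max normalizes each score set and blends them with a weight `alpha` (1.0 is pure semantic, 0.0 is pure keyword). It is one plausible scheme, not a description of how any particular vector database combines scores; the names and example scores are assumptions.

```python
def min_max(scores):
    """Rescale a doc_id -> score mapping into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_hybrid(dense_scores, sparse_scores, alpha=0.5):
    """Blend normalized semantic (dense) and keyword (sparse) scores."""
    dense, sparse = min_max(dense_scores), min_max(sparse_scores)
    combined = {
        doc_id: alpha * dense.get(doc_id, 0.0) + (1 - alpha) * sparse.get(doc_id, 0.0)
        for doc_id in dense.keys() | sparse.keys()
    }
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))

dense = {"a": 0.9, "b": 0.5, "c": 0.1}   # e.g. cosine similarities
sparse = {"b": 12.0, "c": 3.0}           # e.g. keyword scores
print(weighted_hybrid(dense, sparse, alpha=0.8)[0][0])  # semantic-heavy
print(weighted_hybrid(dense, sparse, alpha=0.2)[0][0])  # keyword-heavy
```

A citation-heavy legal corpus would lean toward a lower `alpha`, while an exploratory knowledge base would lean higher. Treat the value as something to tune against your evaluation set.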

Embedding Model Selection and Optimization

Evaluation on Domain Data

Never select an embedding model based solely on benchmark performance. Evaluate on the client's actual documents and queries.

Evaluation process:

Collect 100-200 representative queries from the client's users (from search logs, user interviews, or commonly asked questions)
For each query, have domain experts identify the 5-10 most relevant documents from the corpus
Embed the queries and the full corpus using each candidate model
Compute retrieval metrics: Recall@10, NDCG@10, MRR
Select the model with the best metrics on the client's data
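The three metrics can be computed with a few lines of standard-library Python. This sketch assumes binary relevance judgments (a document is either relevant or not) and ranked lists without duplicate IDs; the function names are illustrative.

```python
import math

def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k=10):
    """Binary-relevance NDCG: discounted gain relative to an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def evaluate(run, judgments, k=10):
    """Average each metric over all judged queries.

    run: query -> ranked doc IDs; judgments: query -> set of relevant IDs.
    """
    n = len(judgments)
    return {
        "recall": sum(recall_at_k(run.get(q, []), rel, k) for q, rel in judgments.items()) / n,
        "mrr": sum(reciprocal_rank(run.get(q, []), rel) for q, rel in judgments.items()) / n,
        "ndcg": sum(ndcg_at_k(run.get(q, []), rel, k) for q, rel in judgments.items()) / n,
    }
```

Run `evaluate` once per candidate model on the same judgments so the comparison is like for like.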

Models to evaluate:

At least one commercial embedding model (OpenAI, Cohere)
At least one open-source model (BGE-Large, E5-Large, GTE-Large)
A domain-specific model if available (LegalBERT embeddings for legal, BioBERT embeddings for medical)

Fine-Tuning Embeddings for Domain Performance

Fine-tuning an embedding model on the client's domain data typically improves retrieval quality by 10-25%.

Training data for fine-tuning:

Positive pairs: Query + relevant document pairs from search logs, expert annotations, or LLM-generated relevance labels
Hard negatives: Documents that are somewhat similar to the query but not truly relevant. Mine these from the current search system's results — documents that appear in the top 20 results but are not relevant according to expert judgment
Minimum dataset size: 1,000 query-document pairs, with 5-10 hard negatives per query
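Hard-negative mining from existing search results can be sketched as below. The data shapes (a mapping from query to ranked IDs, and from query to expert-judged relevant IDs) and the function names are assumptions for illustration.

```python
def mine_hard_negatives(ranked_ids, relevant_ids, top_n=20, max_negatives=10):
    """Top-ranked documents that experts did NOT judge relevant."""
    relevant = set(relevant_ids)
    negatives = [d for d in ranked_ids[:top_n] if d not in relevant]
    return negatives[:max_negatives]

def build_training_examples(search_results, judgments):
    """Pair every relevant document with the query's mined hard negatives.

    search_results: query -> ranked doc IDs from the current search system
    judgments:      query -> set of doc IDs judged relevant by experts
    """
    examples = []
    for query, ranked_ids in search_results.items():
        relevant = judgments.get(query, set())
        negatives = mine_hard_negatives(ranked_ids, relevant)
        for positive in sorted(relevant):
            examples.append({"query": query, "positive": positive,
                             "hard_negatives": negatives})
    return examples
```

Documents that rank highly but are not relevant are exactly the cases the current system gets wrong, which is why they make more useful negatives than randomly sampled documents.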

Fine-tuning approach:

Start from a strong general-purpose embedding model
Use contrastive learning with in-batch negatives plus mined hard negatives
Fine-tune for 3-10 epochs with a low learning rate (1e-5 to 5e-5)
Evaluate on a held-out query set (queries excluded from training) after each epoch
Select the checkpoint with the best Recall@10 on the evaluation set
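One widely used contrastive objective is the InfoNCE loss, shown here for a single query so the mechanics are visible. This is a sketch under assumptions: the similarity values are made up, the temperature of 0.05 is an illustrative choice, and a real training loop would compute this over batches with an ML framework rather than plain Python.

```python
import math

def info_nce_loss(sim_positive, sim_negatives, temperature=0.05):
    """Negative log-probability of the positive among all candidates.

    sim_positive:  similarity between the query and its relevant document
    sim_negatives: similarities to in-batch and mined hard negatives
    """
    logits = [sim_positive / temperature] + [s / temperature for s in sim_negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

easy = info_nce_loss(0.8, [0.1, 0.2])    # negatives far from the query
hard = info_nce_loss(0.8, [0.7, 0.75])   # negatives nearly as close as the positive
print(easy, hard)
```

Hard negatives produce a larger loss than easy ones, so they contribute more training signal, which is the reason for mining them in the first place.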

Chunking Optimization

The chunking strategy has a significant impact on search quality — often as much as the embedding model choice.

Experiment with multiple strategies:

Fixed-size chunks (256 tokens, 512 tokens)
Sentence-based chunks (groups of 3-5 sentences)
Paragraph-based chunks
Section-based chunks (using document headers to identify natural boundaries)

Evaluate each strategy on the same query set and choose the one with the best retrieval metrics. There is no universal best chunking strategy — it depends on the document structure, query types, and embedding model.

Parent-child chunking:

Index fine-grained chunks (paragraphs or sentences) for precise retrieval
Maintain references from chunks to their parent sections and documents
When a fine-grained chunk matches, return the parent section or surrounding context for more complete answers
This gives you the precision of small chunks with the context of larger documents
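A minimal sketch of the parent lookup follows. It assumes the retrieval step returns child chunk IDs in ranked order and that you keep two mappings, child to parent and parent to text; all names are hypothetical.

```python
def expand_to_parents(child_hits, child_to_parent, parents, limit=5):
    """Map ranked child-chunk hits to unique parent sections, keeping order."""
    seen = set()
    results = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue  # a better-ranked child already surfaced this parent
        seen.add(parent_id)
        results.append({
            "parent_id": parent_id,
            "matched_child": child_id,   # useful for highlighting the passage
            "text": parents[parent_id],
        })
        if len(results) == limit:
            break
    return results
```

De-duplicating by parent matters because several small chunks from the same section often match the same query and would otherwise crowd out other sections.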

Search Quality Optimization

Query Expansion

Short or ambiguous queries benefit from expansion — adding related terms to improve recall.

LLM-based query expansion:

Pass the user's query to an LLM with a prompt like: "Given this search query, generate 3 alternative phrasings that express the same information need in different ways."
Embed all query variations, then either average the embeddings into a single query vector or retrieve with each variation separately and merge the results
This can substantially improve recall for short or ambiguous queries, at the cost of an extra LLM call per query
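The averaging option is an element-wise mean over the variation embeddings. The vectors below are made-up placeholders; in practice they come from the same embedding model used for the documents.

```python
def average_embedding(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Hypothetical embeddings of the original query and two LLM rephrasings.
variations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(average_embedding(variations))
```

Averaging costs one vector search instead of several, but it can blur distinct meanings together. Retrieving per variation and merging the ranked lists keeps them separate at the price of more searches.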

Pseudo-relevance feedback:

Execute the initial query
Use the top 3-5 results as additional context
Generate an expanded query using terms from the top results
Re-execute the expanded query
This iterative approach improves recall by 10-20%
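The pseudo-relevance feedback steps can be sketched with simple term counting. This is a deliberately naive version: term selection by raw frequency, a tiny hand-written stopword list, and whitespace tokenization are all simplifying assumptions.

```python
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def expand_query(query, top_results, n_terms=3):
    """Append the most frequent new terms from the top results to the query."""
    query_terms = set(query.lower().split())
    counts = Counter(
        term
        for text in top_results
        for term in text.lower().split()
        if term not in query_terms and term not in STOPWORDS
    )
    # Most frequent first; ties broken alphabetically for a stable result.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    if not ranked:
        return query
    return query + " " + " ".join(ranked[:n_terms])

top = [
    "performance optimization guide for a slow laptop",
    "laptop performance tuning",
    "optimization of startup performance",
]
print(expand_query("slow computer", top))
```

The expanded query is then re-executed. Because the expansion trusts the first round of results, it can drift off topic when those results were poor, so it is worth checking against the evaluation set.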

Relevance Feedback

Incorporate user behavior to improve search quality over time.

Implicit feedback signals:

Click-through data: Which results users click on indicates relevance
Dwell time: How long users spend reading a document indicates relevance quality
Search refinement: When users modify their query, the results shown for the original query were probably not what they needed
Download or save actions: Strong signals of high relevance

Using feedback for improvement:

Use click data to mine positive and negative pairs for embedding fine-tuning
Use feedback to calibrate the re-ranking model
Use aggregated feedback to identify queries with consistently poor results (candidates for index or model improvement)
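Mining training pairs from click logs might look like the sketch below. It uses a common heuristic, assumed here rather than taken from any specific system: clicked results are positives, and results shown above the lowest click but skipped are negatives. The log format is hypothetical.

```python
def pairs_from_clicks(log):
    """Turn click-log entries into (query, doc_id) positive and negative pairs."""
    positives, negatives = [], []
    for entry in log:
        clicked = set(entry["clicked"])
        positions = [i for i, d in enumerate(entry["shown"]) if d in clicked]
        if not positions:
            continue  # no usable click for this query
        last_click = max(positions)
        for doc_id in entry["shown"][:last_click + 1]:
            pair = (entry["query"], doc_id)
            if doc_id in clicked:
                positives.append(pair)
            else:
                negatives.append(pair)  # seen above a click, but skipped
    return positives, negatives
```

Results below the last click are ignored because the user may never have looked at them. Clicks are noisy, so it helps to aggregate across many sessions before treating a pair as a training label.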

Search Analytics

Track search quality metrics continuously to identify degradation and improvement opportunities.

Metrics to track:

Zero-result rate: Percentage of queries that return no results. Target: below 5%.
Click-through rate: Percentage of queries where the user clicks on at least one result. Higher is better.
Mean reciprocal rank of clicks: The average of 1/rank for the first clicked result in each query. Higher (closer to 1) is better.
Search abandonment rate: Percentage of queries after which the user does not click any result and does not refine the query. This indicates the search failed to find relevant content.
Query refinement rate: Percentage of queries that the user reformulates. Some refinement is normal, but high rates indicate initial results were poor.
Time to click: Average time from search results display to first click. Shorter times indicate results are clearly relevant.
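Most of these metrics fall out of a simple pass over search logs. The event format below is an assumption (one record per query, with `first_click_rank` set to `None` when nothing was clicked), and the reciprocal rank is averaged over queries that received a click.

```python
def search_metrics(events):
    """Aggregate search-quality metrics from per-query log events."""
    total = len(events)
    if total == 0:
        return {}
    clicked = [e for e in events if e["first_click_rank"] is not None]
    return {
        "zero_result_rate": sum(e["num_results"] == 0 for e in events) / total,
        "click_through_rate": len(clicked) / total,
        # 1.0 would mean every first click landed on the top result.
        "click_mrr": (sum(1.0 / e["first_click_rank"] for e in clicked) / len(clicked)
                      if clicked else 0.0),
        "abandonment_rate": sum(
            e["first_click_rank"] is None and not e["refined"] for e in events
        ) / total,
        "refinement_rate": sum(e["refined"] for e in events) / total,
    }

events = [
    {"num_results": 10, "first_click_rank": 1, "refined": False},
    {"num_results": 10, "first_click_rank": 2, "refined": False},
    {"num_results": 5, "first_click_rank": None, "refined": True},
    {"num_results": 0, "first_click_rank": None, "refined": False},
]
print(search_metrics(events))
```

Computing these on a schedule and tracking them over time makes regressions visible after index or model changes.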

Scaling and Performance

Latency Optimization

Enterprise search users expect results in under 1 second.

Latency breakdown for a typical semantic search query:

Query embedding: 20-50ms (GPU) or 50-150ms (CPU)
Vector search: 10-50ms (depends on index size and configuration)
Re-ranking: 100-300ms (for 50-100 candidates with a cross-encoder)
Metadata filtering and result formatting: 10-30ms
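Total: roughly 140-530ms (the sum of the stages above, with GPU embedding at the low end and CPU embedding at the high end)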

Optimization strategies:

Cache query embeddings for frequent queries
Use approximate nearest neighbor search (HNSW) instead of exact search
Limit re-ranking to the top 20-50 candidates instead of 100
Use a lightweight re-ranking model (MiniLM-based cross-encoder instead of large models)
Pre-compute and cache results for the most common queries

Index Management at Scale

For corpuses with millions of documents and hundreds of millions of chunks:

Shard the index across multiple nodes for parallel search
Replicate shards for read throughput and availability
Use product quantization to reduce memory requirements (4-8x memory savings with 2-5% recall reduction)
Implement tiered storage — keep hot data (recent documents, frequently accessed documents) on fast storage and cold data on cheaper storage
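A back-of-the-envelope memory estimate shows why compression matters at this scale. The corpus size and the 768-dimension float32 vectors are assumed example values, and the estimate covers raw vectors only, not index structures or metadata.

```python
def vector_memory_gb(num_vectors, dim, bytes_per_value=4, compression=1.0):
    """Approximate memory for raw vectors, in gigabytes (10**9 bytes)."""
    return num_vectors * dim * bytes_per_value / compression / 1e9

num_chunks = 100_000_000   # assumed: 100 million chunks
dim = 768                  # assumed embedding dimension

print(vector_memory_gb(num_chunks, dim))                 # uncompressed float32
print(vector_memory_gb(num_chunks, dim, compression=4))  # low end of 4-8x savings
print(vector_memory_gb(num_chunks, dim, compression=8))  # high end of 4-8x savings
```

Under these assumptions the uncompressed vectors need about 307 GB, which drops to roughly 77 GB at 4x and 38 GB at 8x. That difference often decides whether a shard fits in memory on a single node.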

Your Next Step

Collect the 50 most common search queries from your client's current search system (from search logs or user interviews). For each query, have a domain expert identify the 5 most relevant documents in the corpus. This evaluation set is your ground truth for measuring any search improvement. Run these queries against the current keyword search system and compute Recall@10. Then run them against a basic semantic search prototype (off-the-shelf embeddings, no fine-tuning, no re-ranking) and compare. This side-by-side comparison quantifies the potential improvement from semantic search and identifies the query types where semantic search adds the most value. Use these numbers to scope the project, set accuracy targets, and build the business case.

Agency Script Editorial
