A legal technology agency in Washington, D.C. was hired by a large law firm to modernize their research workflow. The firm's 280 attorneys spent an average of 47 minutes per research query searching through 3.8 million case documents, briefs, and memos using a keyword-based search system. Attorneys complained that finding relevant precedent required guessing the exact terminology used in older documents: searching for "breach of fiduciary duty" missed documents that discussed "violation of trust obligations" or "failure to act in the client's best interest." The agency built a semantic search engine that understood the meaning behind queries, not just the specific words used. Research time dropped to 8 minutes per query. Attorney satisfaction with the search tool jumped from 23% to 91%. The firm estimated the productivity improvement saved $2.7 million annually in billable hours spent on research instead of client work.
Semantic search uses AI to understand the meaning and intent behind search queries and documents, matching them based on conceptual similarity rather than keyword overlap. For AI agencies, semantic search is one of the most impactful deliverables because it directly improves knowledge worker productivity: every organization has people spending significant time searching for information, and semantic search makes that process dramatically faster and more accurate.
How Semantic Search Differs From Keyword Search
The Vocabulary Mismatch Problem
Keyword search requires the query to contain the same words as the relevant document. This works when users know the exact terminology but fails when:
- Different documents use different words for the same concept ("automobile," "vehicle," "car")
- Users search with natural language that differs from document language ("how to fix a slow computer" vs. a document titled "Performance Optimization Guide")
- Domain terminology has evolved over time (older documents use different terms than current practice)
- Users describe a concept rather than naming it ("the rule about not hiring family members" vs. "anti-nepotism policy")
Semantic search bridges this vocabulary gap by representing both queries and documents as vectors in a shared meaning space, where conceptually similar texts have similar vectors regardless of the specific words used.
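To make the "similar vectors" idea concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors and the "office parking policy" document are made up for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example only.
query = [0.9, 0.1, 0.2]             # "breach of fiduciary duty"
doc_same_concept = [0.8, 0.2, 0.3]  # "violation of trust obligations"
doc_unrelated = [0.1, 0.9, 0.1]     # "office parking policy"

sim_related = cosine_similarity(query, doc_same_concept)
sim_unrelated = cosine_similarity(query, doc_unrelated)
```

With these toy values, the document that shares the concept scores far higher than the unrelated one even though it shares no words with the query.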
When Semantic Search Adds Value
High value:
- Large document corpuses with diverse vocabulary (legal libraries, research databases, knowledge bases)
- Users with varying levels of domain expertise (not everyone knows the "correct" search terms)
- Conceptual queries ("documents about environmental impact of lithium mining" vs. keyword search "lithium mining environmental")
- Cross-language or cross-terminology search needs
Limited value:
- Exact match requirements (searching for a specific document ID, case number, or proper name)
- Highly structured data with consistent terminology
- Very small corpuses where browsing is practical
- Users who consistently use precise domain terminology
Best approach: Hybrid search that combines semantic understanding with keyword matching, giving users the benefits of both approaches.
System Architecture
Indexing Pipeline
The indexing pipeline converts documents into searchable representations.
Document ingestion:
- Accept documents from multiple sources (file systems, databases, APIs, content management systems)
- Extract text from diverse formats (PDF, Word, HTML, Markdown, plain text)
- Preserve document metadata (title, author, date, source, category) for filtering and display
Text preprocessing:
- Clean extracted text (remove boilerplate, headers, footers, page numbers)
- Segment long documents into searchable chunks (paragraphs, sections, or fixed-size windows with overlap)
- Each chunk becomes an independently searchable unit while maintaining a reference to its parent document
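The fixed-size-window option can be sketched as follows. This is an illustration under simplifying assumptions: tokens are whitespace-separated words (a real pipeline would use the embedding model's tokenizer), and the function names and default sizes are hypothetical.

```python
def chunk_with_overlap(tokens, size=256, overlap=32):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append({"start": start, "tokens": window})
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

def chunk_document(doc_id, text, size=256, overlap=32):
    """Whitespace-tokenize and attach the parent document id to every chunk."""
    tokens = text.split()
    return [
        {"doc_id": doc_id, "chunk_index": i, "text": " ".join(c["tokens"])}
        for i, c in enumerate(chunk_with_overlap(tokens, size, overlap))
    ]
```

Each chunk carries its doc_id, so a search hit can always be traced back to the parent document.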
Embedding generation:
- Pass each text chunk through an embedding model to generate a dense vector representation
- Choose the embedding model based on domain, quality requirements, and infrastructure constraints
- Store embeddings in a vector database alongside the chunk text and metadata
Index maintenance:
- Monitor for new, updated, and deleted documents
- Update the index incrementally: add new chunks, update changed chunks, remove deleted chunks
- Schedule full re-indexing when the embedding model changes or when significant index quality degradation is detected
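One way to detect new, updated, and deleted documents is to store a content hash per document and compare it on each run. The sketch below assumes that approach; the function names and dictionary shapes are hypothetical, and a production system might rely on modification timestamps or change feeds instead.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_index_update(indexed_hashes, current_docs):
    """Compare stored hashes with the current corpus.

    indexed_hashes: {doc_id: hash} recorded at the last indexing run
    current_docs:   {doc_id: text} as the source system has them now
    """
    current_hashes = {d: content_hash(t) for d, t in current_docs.items()}
    to_add = sorted(d for d in current_hashes if d not in indexed_hashes)
    to_delete = sorted(d for d in indexed_hashes if d not in current_hashes)
    to_update = sorted(
        d for d, h in current_hashes.items()
        if d in indexed_hashes and indexed_hashes[d] != h
    )
    return {"add": to_add, "update": to_update, "delete": to_delete}
```

Only the documents in the returned plan need to be re-chunked and re-embedded, which keeps incremental updates cheap compared with a full re-index.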
Query Pipeline
The query pipeline processes user queries and returns relevant results.
Query understanding:
- Parse the raw query to identify search intent
- Extract filter criteria from the query ("contracts from 2025" contains a date filter)
- Expand the query if it is very short or ambiguous (use an LLM to rephrase the query into a more detailed search statement)
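As a small illustration of filter extraction, the sketch below pulls a year out of queries such as "contracts from 2025" with a regular expression. The pattern and function name are assumptions made for this example; a real system would likely handle date ranges, document types, and other filters, possibly with an LLM.

```python
import re

# Matches phrases like "from 2025", "in 2019", "during 1998".
YEAR_PATTERN = re.compile(r"\b(?:from|in|during)\s+((?:19|20)\d{2})\b", re.IGNORECASE)

def extract_year_filter(raw_query):
    """Return (remaining_query_text, filters) for a raw query string."""
    match = YEAR_PATTERN.search(raw_query)
    if not match:
        return raw_query.strip(), {}
    remaining = raw_query[:match.start()] + raw_query[match.end():]
    remaining = " ".join(remaining.split())  # collapse leftover whitespace
    return remaining, {"year": int(match.group(1))}
```

The remaining text is what gets embedded, while the extracted filter is applied as a metadata constraint during retrieval.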
Retrieval:
- Embed the query using the same embedding model used for documents
- Execute a vector similarity search against the document index
- Apply metadata filters (date range, document type, author, category) alongside vector search
- Retrieve the top 50-100 candidates for re-ranking
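The retrieval step can be sketched as an exact (brute-force) search with metadata filtering. This is a toy stand-in for what a vector database does with an approximate index; the chunk dictionary layout and the function names are hypothetical.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(query_vec, chunks, filters=None, top_k=50):
    """Exact search over chunks whose metadata matches every filter.

    Returns a list of (score, chunk_id), best first.
    """
    filters = filters or {}
    eligible = (
        c for c in chunks
        if all(c["meta"].get(k) == v for k, v in filters.items())
    )
    scored = ((cosine(query_vec, c["vector"]), c["id"]) for c in eligible)
    return heapq.nlargest(top_k, scored)
```

Here the filter is applied before scoring, so chunks that fail the metadata constraints never compete for a place in the candidate list.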
Re-ranking:
- Use a cross-encoder re-ranking model to re-score the retrieved candidates
- The cross-encoder takes the query-document pair as input and produces a relevance score
- This is more accurate than embedding similarity but too slow to run against the full corpus, so it is applied only to the retrieved candidates
- Return the top 10-20 re-ranked results to the user
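The two-stage pattern can be sketched with the scorer injected as a function. Note the assumption: toy_overlap_score below is a word-overlap stand-in so the example runs without model weights. It is not a cross-encoder, and in practice score_pair would wrap a real re-ranking model.

```python
def rerank(query, candidates, score_pair, top_n=10):
    """Re-score (query, text) pairs and keep the best top_n.

    score_pair is any callable (query, text) -> float. In production it
    would call a cross-encoder model on the pair.
    """
    scored = [(score_pair(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [dict(c, rerank_score=s) for s, c in scored[:top_n]]

def toy_overlap_score(query, text):
    """Stand-in scorer (NOT a cross-encoder): fraction of query words in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

Because the scorer only ever sees the retrieved candidates, its cost grows with the candidate count rather than with the corpus size.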
Result presentation:
- Show relevant text snippets with query-relevant passages highlighted
- Include metadata (document title, date, source, page number) for context
- Provide relevance scores or confidence indicators
- Link to the full document for deeper reading
Hybrid Search Architecture
Combining semantic search with keyword search produces the best results for most enterprise applications.
Implementation approaches:
Parallel retrieval with fusion:
- Run semantic search and keyword search in parallel on the same query
- Each returns a ranked list of results
- Combine the lists using reciprocal rank fusion (RRF) or a learned combination model
- Re-rank the fused results with a cross-encoder
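Reciprocal rank fusion scores each document by summing 1 / (k + rank) over every list it appears in. A minimal sketch follows; k = 60 is a commonly used constant, and the document ids are invented for the example.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

semantic_hits = ["d1", "d2", "d3"]
keyword_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

RRF uses only rank positions, so it needs no calibration between the very different score scales of vector similarity and keyword scoring.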
Single-index hybrid:
- Store both dense vectors (for semantic search) and sparse vectors (for keyword search) in the same index
- Most modern vector databases (Weaviate, Qdrant, Pinecone) support hybrid queries
- Query with both dense and sparse representations simultaneously
- The database combines scores using a configurable weighting
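The effect of a configurable weighting can be illustrated with a simple score blend. This sketch assumes min-max normalization and a single alpha parameter; each vector database implements its own fusion, so treat the function names and the formula as illustrative rather than as any product's behavior.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_hybrid(dense, sparse, alpha=0.5):
    """Blend scores: alpha=1.0 is pure semantic, alpha=0.0 is pure keyword."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    combined = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in set(dense_n) | set(sparse_n)
    }
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

Raising alpha favors conceptual matches, while lowering it favors exact-term matches such as case citations or product codes.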
When to weight keyword search higher:
- The corpus contains many specific identifiers (product codes, case citations, technical specifications)
- Users frequently search for exact phrases or proper nouns
- The domain has precise terminology where exact matching is important
When to weight semantic search higher:
- Users search with natural language questions
- The corpus has diverse vocabulary for similar concepts
- Users are exploring topics rather than finding specific documents
Embedding Model Selection and Optimization
Evaluation on Domain Data
Never select an embedding model based solely on benchmark performance. Evaluate on the client's actual documents and queries.
Evaluation process:
- Collect 100-200 representative queries from the client's users (from search logs, user interviews, or commonly asked questions)
- For each query, have domain experts identify the 5-10 most relevant documents from the corpus
- Embed the queries and the full corpus using each candidate model
- Compute retrieval metrics: Recall@10, NDCG@10, MRR
- Select the model with the best metrics on the client's data
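The three retrieval metrics can be computed with a few lines of code. The sketch below assumes binary relevance (a document is either in the expert's relevant set or not); the function names and data shapes are hypothetical.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """NDCG with binary relevance judgments."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def evaluate(results_by_query, judgments):
    """Average each metric over all judged queries.

    results_by_query: {query: ranked list of doc ids} from one candidate model
    judgments:        {query: set of relevant doc ids} from domain experts
    """
    totals = {"recall@10": 0.0, "ndcg@10": 0.0, "mrr": 0.0}
    if not judgments:
        return totals
    for query, relevant in judgments.items():
        ranked = results_by_query.get(query, [])
        totals["recall@10"] += recall_at_k(ranked, relevant)
        totals["ndcg@10"] += ndcg_at_k(ranked, relevant)
        totals["mrr"] += reciprocal_rank(ranked, relevant)
    return {name: total / len(judgments) for name, total in totals.items()}
```

Running evaluate once per candidate model on the same judgments gives directly comparable numbers.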
Models to evaluate:
- At least one commercial embedding model (OpenAI, Cohere)
- At least one open-source model (BGE-Large, E5-Large, GTE-Large)
- A domain-specific model if available (LegalBERT embeddings for legal, BioBERT embeddings for medical)
Fine-Tuning Embeddings for Domain Performance
Fine-tuning an embedding model on the client's domain data typically improves retrieval quality by 10-25%.
Training data for fine-tuning:
- Positive pairs: Query + relevant document pairs from search logs, expert annotations, or LLM-generated relevance labels
- Hard negatives: Documents that are somewhat similar to the query but not truly relevant. Mine these from the current search system's results: documents that appear in the top 20 results but are not relevant according to expert judgment
- Minimum dataset size: 1,000 query-document pairs, with 5-10 hard negatives per query
Fine-tuning approach:
- Start from a strong general-purpose embedding model
- Use contrastive learning with in-batch negatives plus mined hard negatives
- Fine-tune for 3-10 epochs with a low learning rate (1e-5 to 5e-5)
- Evaluate on the held-out query set after each epoch
- Select the checkpoint with the best Recall@10 on the evaluation set
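To show what "contrastive learning with in-batch negatives plus mined hard negatives" optimizes, here is a sketch of an InfoNCE-style loss in plain Python. It computes the loss only and performs no training; the temperature value and function names are assumptions, and a real fine-tuning run would use a deep learning framework.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(query_vecs, positive_vecs, hard_negative_vecs=(), temperature=0.05):
    """Mean cross-entropy where query i should select positive i.

    Candidates for each query are every positive in the batch (the positives
    of the other queries act as in-batch negatives) plus mined hard negatives.
    Vectors are assumed to be L2-normalized already.
    """
    candidates = list(positive_vecs) + list(hard_negative_vecs)
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, c) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)
```

The loss is near zero when each query is closest to its own positive, and it rises when a hard negative sits as close to the query as the true positive does, which is why hard negatives give a stronger training signal.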
Chunking Optimization
The chunking strategy has a significant impact on search quality, often as much as the embedding model choice.
Experiment with multiple strategies:
- Fixed-size chunks (256 tokens, 512 tokens)
- Sentence-based chunks (groups of 3-5 sentences)
- Paragraph-based chunks
- Section-based chunks (using document headers to identify natural boundaries)
Evaluate each strategy on the same query set and choose the one with the best retrieval metrics. There is no universal best chunking strategy; it depends on the document structure, query types, and embedding model.
Parent-child chunking:
- Index fine-grained chunks (paragraphs or sentences) for precise retrieval
- Maintain references from chunks to their parent sections and documents
- When a fine-grained chunk matches, return the parent section or surrounding context for more complete answers
- This gives you the precision of small chunks with the context of larger documents
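A minimal sketch of parent-child chunking follows. It assumes sections are keyed by id and that sentences can be split on periods, which is a deliberate simplification; the function names and id scheme are hypothetical.

```python
def build_parent_child_index(sections):
    """sections: {section_id: text}. Children are the section's sentences."""
    children = []
    for section_id, text in sections.items():
        # Naive sentence split on "."; real pipelines need a proper splitter.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        for i, sentence in enumerate(sentences):
            children.append({"child_id": f"{section_id}#{i}",
                             "parent_id": section_id,
                             "text": sentence})
    return children

def expand_to_parents(matched_child_ids, children, sections):
    """Map matched children to their parent sections, de-duplicated in order."""
    parent_of = {c["child_id"]: c["parent_id"] for c in children}
    seen, parents = set(), []
    for child_id in matched_child_ids:
        parent_id = parent_of[child_id]
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append({"parent_id": parent_id, "text": sections[parent_id]})
    return parents
```

Only the child sentences are embedded and searched; the parent text is looked up after retrieval, so the returned context is complete without diluting the embeddings.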
Search Quality Optimization
Query Expansion
Short or ambiguous queries benefit from expansion: adding related terms to improve recall.
LLM-based query expansion:
- Pass the user's query to an LLM with a prompt like: "Given this search query, generate 3 alternative phrasings that express the same information need in different ways."
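- Embed all query variations, then either search once with the average embedding or retrieve with each variation separately and merge the results
- This can substantially improve recall for short or ambiguous queries, because each rephrasing has a chance of matching vocabulary the original query missed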
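Both ways of combining query variations can be sketched briefly. The function names are hypothetical, and keeping each document's best score is only one of several reasonable merge rules.

```python
def average_embedding(vectors):
    """Element-wise mean of equal-length vectors (one search, one vector)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def merge_by_best_score(result_lists):
    """Union of per-variation results, keeping each doc's highest score.

    result_lists: one list of (doc_id, score) per query variation.
    """
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))
```

Averaging costs a single vector search, while merging costs one search per variation but preserves results that only one phrasing would find.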
Pseudo-relevance feedback:
- Execute the initial query
- Use the top 3-5 results as additional context
- Generate an expanded query using terms from the top results
- Re-execute the expanded query
- This iterative approach can improve recall by 10-20%, provided the initial top results are actually relevant
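The feedback loop above can be sketched with simple term counting. This is a toy version under stated assumptions: the stopword list is tiny and made up for the example, terms are whitespace-separated words, and production systems typically weight terms more carefully or use an LLM to write the expanded query.

```python
from collections import Counter

# Tiny illustrative stopword list; real systems use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "for", "is", "on"}

def expand_with_feedback(query, top_result_texts, num_terms=3):
    """Append the most frequent new terms from the top results to the query."""
    query_terms = set(query.lower().split())
    counts = Counter(
        word
        for text in top_result_texts
        for word in text.lower().split()
        if word not in STOPWORDS and word not in query_terms
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return " ".join([query] + ranked[:num_terms])
```

The expanded query is then re-executed in place of the original.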
Relevance Feedback
Incorporate user behavior to improve search quality over time.
Implicit feedback signals:
- Click-through data: Which results users click on indicates relevance
- Dwell time: How long users spend reading a document indicates relevance quality
- Search refinement: When users modify their query, the original query-document pair may not be relevant
- Download or save actions: Explicit signals of high relevance
Using feedback for improvement:
- Use click data to mine positive and negative pairs for embedding fine-tuning
- Use feedback to calibrate the re-ranking model
- Use aggregated feedback to identify queries with consistently poor results (candidates for index or model improvement)
Search Analytics
Track search quality metrics continuously to identify degradation and improvement opportunities.
Metrics to track:
- Zero-result rate: Percentage of queries that return no results. Target: below 5%.
- Click-through rate: Percentage of queries where the user clicks on at least one result. Higher is better.
- Mean reciprocal rank of clicks: The average of 1/position of the first clicked result (a first click at position 1 scores 1.0, at position 4 scores 0.25). Higher (closer to 1) is better.
- Search abandonment rate: Percentage of queries after which the user does not click any result and does not refine the query. This indicates the search failed to find relevant content.
- Query refinement rate: Percentage of queries that the user reformulates. Some refinement is normal, but high rates indicate initial results were poor.
- Time to click: Average time from search results display to first click. Shorter times indicate results are clearly relevant.
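Several of these metrics can be computed from a simple query log. The sketch below assumes one record per query with hypothetical field names, and it counts queries without a click as 0 in the mean reciprocal rank; other conventions average over clicked queries only.

```python
def search_metrics(sessions):
    """Compute rate metrics from query log records.

    Each record is a dict with keys:
      num_results (int), first_click_rank (int or None), refined (bool)
    """
    n = len(sessions)
    if n == 0:
        return {}
    clicked = [s for s in sessions if s["first_click_rank"] is not None]
    abandoned = [s for s in sessions
                 if s["first_click_rank"] is None and not s["refined"]]
    return {
        "zero_result_rate": sum(s["num_results"] == 0 for s in sessions) / n,
        "click_through_rate": len(clicked) / n,
        # Queries with no click contribute 0 to the mean.
        "mrr_of_clicks": sum(1.0 / s["first_click_rank"] for s in clicked) / n,
        "abandonment_rate": len(abandoned) / n,
        "refinement_rate": sum(s["refined"] for s in sessions) / n,
    }
```

Tracking these values per day or per week makes it easier to spot degradation after an index or model change.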
Scaling and Performance
Latency Optimization
Enterprise search users expect results in under 1 second.
Latency breakdown for a typical semantic search query:
- Query embedding: 20-50ms (GPU) or 50-150ms (CPU)
- Vector search: 10-50ms (depends on index size and configuration)
- Re-ranking: 100-300ms (for 50-100 candidates with a cross-encoder)
- Metadata filtering and result formatting: 10-30ms
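- Total: roughly 150-500ms, depending mainly on whether the query is embedded on GPU or CPU and on how many candidates are re-ranked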
Optimization strategies:
- Cache query embeddings for frequent queries
- Use approximate nearest neighbor search (HNSW) instead of exact search
- Limit re-ranking to the top 20-50 candidates instead of 100
- Use a lightweight re-ranking model (MiniLM-based cross-encoder instead of large models)
- Pre-compute and cache results for the most common queries
Index Management at Scale
For corpuses with millions of documents and hundreds of millions of chunks:
- Shard the index across multiple nodes for parallel search
- Replicate shards for read throughput and availability
- Use product quantization to reduce memory requirements (4-8x memory savings with 2-5% recall reduction)
- Implement tiered storage: keep hot data (recent documents, frequently accessed documents) on fast storage and cold data on cheaper storage
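A quick back-of-the-envelope calculation shows why compression matters at this scale. The corpus size (200 million chunks) and the vector width (768 float32 dimensions) are assumptions chosen for illustration; the 4x and 8x ratios are the savings range cited above.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Memory for uncompressed float32 vectors (index overhead not included)."""
    return num_vectors * dims * bytes_per_value

def compressed_bytes(raw_bytes, compression_ratio):
    """Footprint after compression at the given ratio."""
    return raw_bytes / compression_ratio

GB = 10 ** 9
raw = raw_vector_bytes(num_vectors=200_000_000, dims=768)
raw_gb = raw / GB                           # uncompressed footprint in GB
at_4x_gb = compressed_bytes(raw, 4) / GB    # footprint at 4x savings
at_8x_gb = compressed_bytes(raw, 8) / GB    # footprint at 8x savings
```

Under these assumptions the raw vectors alone need about 614 GB of memory, which drops to roughly 154 GB at 4x and 77 GB at 8x, often the difference between a multi-node cluster and a handful of machines.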
Your Next Step
Collect the 50 most common search queries from your client's current search system (from search logs or user interviews). For each query, have a domain expert identify the 5 most relevant documents in the corpus. This evaluation set is your ground truth for measuring any search improvement. Run these queries against the current keyword search system and compute Recall@10. Then run them against a basic semantic search prototype (off-the-shelf embeddings, no fine-tuning, no re-ranking) and compare. This side-by-side comparison quantifies the potential improvement from semantic search and identifies the query types where semantic search adds the most value. Use these numbers to scope the project, set accuracy targets, and build the business case.