Building Semantic Search Engines: Delivering Search Systems That Understand Meaning, Not Just Keywords

Agency Script Editorial

Editorial Team

March 20, 2026 · 11 min read
semantic search, information retrieval, embeddings, search engines

A legal technology agency in Washington, D.C. was hired by a large law firm to modernize its research workflow. The firm's 280 attorneys spent an average of 47 minutes per research query searching through 3.8 million case documents, briefs, and memos using a keyword-based search system. Attorneys complained that finding relevant precedent required guessing the exact terminology used in older documents: searching for "breach of fiduciary duty" missed documents that discussed "violation of trust obligations" or "failure to act in the client's best interest." The agency built a semantic search engine that understood the meaning behind queries, not just the specific words used. Research time dropped to 8 minutes per query. Attorney satisfaction with the search tool jumped from 23% to 91%. The firm estimated the productivity improvement saved $2.7 million annually in billable hours spent on research instead of client work.

Semantic search uses AI to understand the meaning and intent behind search queries and documents, matching them based on conceptual similarity rather than keyword overlap. For AI agencies, semantic search is one of the most impactful deliverables because it directly improves knowledge worker productivity: every organization has people spending significant time searching for information, and semantic search makes that process dramatically faster and more accurate.

How Semantic Search Differs From Keyword Search

The Vocabulary Mismatch Problem

Keyword search requires the query to contain the same words as the relevant document. This works when users know the exact terminology but fails when:

  • Different documents use different words for the same concept ("automobile," "vehicle," "car")
  • Users search with natural language that differs from document language ("how to fix a slow computer" vs. a document titled "Performance Optimization Guide")
  • Domain terminology has evolved over time (older documents use different terms than current practice)
  • Users describe a concept rather than naming it ("the rule about not hiring family members" vs. "anti-nepotism policy")

Semantic search bridges this vocabulary gap by representing both queries and documents as vectors in a shared meaning space, where conceptually similar texts have similar vectors regardless of the specific words used.
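To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are invented for illustration; real embedding models produce hundreds or thousands of dimensions, and the values below do not come from any actual model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example.
query_vec = [0.9, 0.1, 0.0]        # "breach of fiduciary duty"
same_concept_vec = [0.8, 0.2, 0.1]  # "violation of trust obligations"
other_topic_vec = [0.0, 0.1, 0.9]   # an unrelated document

print(cosine_similarity(query_vec, same_concept_vec))
print(cosine_similarity(query_vec, other_topic_vec))
```

The first score is much higher than the second even though the two texts share no keywords, which is the behavior semantic search relies on.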

When Semantic Search Adds Value

High value:

  • Large document corpuses with diverse vocabulary (legal libraries, research databases, knowledge bases)
  • Users with varying levels of domain expertise (not everyone knows the "correct" search terms)
  • Conceptual queries ("documents about environmental impact of lithium mining" vs. keyword search "lithium mining environmental")
  • Cross-language or cross-terminology search needs

Limited value:

  • Exact match requirements (searching for a specific document ID, case number, or proper name)
  • Highly structured data with consistent terminology
  • Very small corpuses where browsing is practical
  • Users who consistently use precise domain terminology

Best approach: Hybrid search that combines semantic understanding with keyword matching, giving users the benefits of both approaches.

System Architecture

Indexing Pipeline

The indexing pipeline converts documents into searchable representations.

Document ingestion:

  • Accept documents from multiple sources (file systems, databases, APIs, content management systems)
  • Extract text from diverse formats (PDF, Word, HTML, Markdown, plain text)
  • Preserve document metadata (title, author, date, source, category) for filtering and display

Text preprocessing:

  • Clean extracted text (remove boilerplate, headers, footers, page numbers)
  • Segment long documents into searchable chunks (paragraphs, sections, or fixed-size windows with overlap)
  • Each chunk becomes an independently searchable unit while maintaining a reference to its parent document
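The fixed-size option above can be sketched as follows. This is an illustrative helper, not a prescribed implementation: it assumes whitespace tokens stand in for model tokens, and the names `chunk_document`, `doc_id`, and `chunk_index` are hypothetical.

```python
def chunk_document(doc_id, text, size=256, overlap=32):
    """Split text into fixed-size token windows that overlap by `overlap` tokens.

    Each chunk keeps `doc_id` so a search hit can be traced to its parent document.
    Whitespace splitting is an assumption; a real pipeline would use the
    embedding model's tokenizer.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append({
            "doc_id": doc_id,
            "chunk_index": len(chunks),
            "tokens": window,
        })
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

sample = " ".join(f"t{i}" for i in range(10))
for chunk in chunk_document("memo-1", sample, size=4, overlap=1):
    print(chunk["chunk_index"], chunk["tokens"])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of some duplicated storage.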

Embedding generation:

  • Pass each text chunk through an embedding model to generate a dense vector representation
  • Choose the embedding model based on domain, quality requirements, and infrastructure constraints
  • Store embeddings in a vector database alongside the chunk text and metadata

Index maintenance:

  • Monitor for new, updated, and deleted documents
  • Update the index incrementally: add new chunks, update changed chunks, remove deleted chunks
  • Schedule full re-indexing when the embedding model changes or when significant index quality degradation is detected
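One way to decide what an incremental update needs to touch is to compare content hashes. The sketch below assumes the index stores a hash per document ID; `plan_index_update` and its argument shapes are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_index_update(indexed_hashes, current_docs):
    """Compare stored hashes with the current corpus.

    indexed_hashes: {doc_id: hash} as recorded at last indexing (assumed layout)
    current_docs:   {doc_id: text} as found in the source systems now
    """
    to_add, to_update = [], []
    for doc_id, text in current_docs.items():
        if doc_id not in indexed_hashes:
            to_add.append(doc_id)
        elif indexed_hashes[doc_id] != content_hash(text):
            to_update.append(doc_id)
    to_delete = [d for d in indexed_hashes if d not in current_docs]
    return {"add": sorted(to_add), "update": sorted(to_update), "delete": sorted(to_delete)}

indexed = {"a": content_hash("old"), "b": content_hash("same"), "c": content_hash("gone")}
current = {"a": "new", "b": "same", "d": "fresh"}
print(plan_index_update(indexed, current))
# {'add': ['d'], 'update': ['a'], 'delete': ['c']}
```

Only documents in the "add" and "update" lists need to be re-chunked and re-embedded, which keeps routine updates cheap.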

Query Pipeline

The query pipeline processes user queries and returns relevant results.

Query understanding:

  • Parse the raw query to identify search intent
  • Extract filter criteria from the query ("contracts from 2025" contains a date filter)
  • Expand the query if it is very short or ambiguous (use an LLM to rephrase the query into a more detailed search statement)
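As a small illustration of filter extraction, the sketch below pulls a year out of queries like "contracts from 2025" with a regular expression. The pattern and the returned filter shape are assumptions; a production system might use an LLM or a proper date parser instead.

```python
import re

# Matches phrases such as "from 2025", "in 1998", "during 2010" (assumed pattern).
YEAR_PATTERN = re.compile(r"\b(?:from|in|during)\s+((?:19|20)\d{2})\b", re.IGNORECASE)

def extract_year_filter(query):
    """Return (remaining_query_text, filter_dict_or_None)."""
    match = YEAR_PATTERN.search(query)
    if not match:
        return query.strip(), None
    remaining = query[:match.start()] + query[match.end():]
    remaining = " ".join(remaining.split())  # collapse leftover whitespace
    return remaining, {"year": int(match.group(1))}

print(extract_year_filter("contracts from 2025"))
# ('contracts', {'year': 2025})
print(extract_year_filter("breach of fiduciary duty"))
# ('breach of fiduciary duty', None)
```

The remaining text is what gets embedded, while the filter is passed to the metadata layer of the retrieval step.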

Retrieval:

  • Embed the query using the same embedding model used for documents
  • Execute a vector similarity search against the document index
  • Apply metadata filters (date range, document type, author, category) alongside vector search
  • Retrieve the top 50-100 candidates for re-ranking

Re-ranking:

  • Use a cross-encoder re-ranking model to re-score the retrieved candidates
  • The cross-encoder takes the query-document pair as input and produces a relevance score
  • This is more accurate than embedding similarity but too slow to run against the full corpus, so it is applied only to the retrieved candidates
  • Return the top 10-20 re-ranked results to the user
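The retrieve-then-re-rank flow described above can be sketched with a brute-force search and a pluggable scoring function. Everything here is a stand-in: the in-memory `index` layout is hypothetical, exact search replaces a vector database, and `word_overlap` is a toy substitute for a real cross-encoder.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, filters=None, top_k=50):
    """Stage 1: metadata filter plus vector similarity over every item."""
    scored = []
    for item in index:
        if filters and any(item["metadata"].get(k) != v for k, v in filters.items()):
            continue
        scored.append((cosine(query_vec, item["vector"]), item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]

def rerank(query_text, candidates, score_pair, top_n=10):
    """Stage 2: re-score each (query, document) pair with a slower, more accurate scorer."""
    scored = [(score_pair(query_text, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def word_overlap(query_text, doc_text):
    """Toy scorer standing in for a cross-encoder model."""
    return len(set(query_text.lower().split()) & set(doc_text.lower().split()))

index = [
    {"id": "m1", "vector": [1.0, 0.0], "text": "breach of fiduciary duty by trustee", "metadata": {"year": 2025}},
    {"id": "m2", "vector": [0.9, 0.1], "text": "violation of trust obligations", "metadata": {"year": 2024}},
    {"id": "m3", "vector": [0.0, 1.0], "text": "parking policy", "metadata": {"year": 2025}},
]
candidates = retrieve([1.0, 0.0], index, filters={"year": 2025}, top_k=2)
print([c["id"] for c in rerank("fiduciary duty", candidates, word_overlap, top_n=1)])
```

Keeping the scorer as a parameter makes it straightforward to swap the toy function for an actual re-ranking model later without touching the pipeline.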

Result presentation:

  • Show relevant text snippets with query-relevant passages highlighted
  • Include metadata (document title, date, source, page number) for context
  • Provide relevance scores or confidence indicators
  • Link to the full document for deeper reading

Hybrid Search Architecture

Combining semantic search with keyword search produces the best results for most enterprise applications.

Implementation approaches:

Parallel retrieval with fusion:

  • Run semantic search and keyword search in parallel on the same query
  • Each returns a ranked list of results
  • Combine the lists using reciprocal rank fusion (RRF) or a learned combination model
  • Re-rank the fused results with a cross-encoder
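Reciprocal rank fusion is simple enough to sketch in a few lines: each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is a commonly used default, not something the source specifies, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is a widely used default; treat it as tunable).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))

semantic_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
# ['b', 'a', 'd', 'c']
```

Because RRF uses only ranks, it avoids having to reconcile the very different score scales of vector similarity and keyword scoring.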

Single-index hybrid:

  • Store both dense vectors (for semantic search) and sparse vectors (for keyword search) in the same index
  • Most modern vector databases (Weaviate, Qdrant, Pinecone) support hybrid queries
  • Query with both dense and sparse representations simultaneously
  • The database combines scores using a configurable weighting

When to weight keyword search higher:

  • The corpus contains many specific identifiers (product codes, case citations, technical specifications)
  • Users frequently search for exact phrases or proper nouns
  • The domain has precise terminology where exact matching is important

When to weight semantic search higher:

  • Users search with natural language questions
  • The corpus has diverse vocabulary for similar concepts
  • Users are exploring topics rather than finding specific documents
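The weighting decision can be expressed as a single parameter. The sketch below min-max normalizes each score set and blends them with a weight `alpha` (1.0 = all semantic, 0.0 = all keyword). This is one illustrative scheme; individual vector databases implement their own fusion formulas, and the scores shown are invented.

```python
def min_max_normalize(scores):
    """Rescale {doc_id: score} to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_scores(dense, sparse, alpha=0.5):
    """Blend semantic (dense) and keyword (sparse) scores.

    alpha near 1 favors semantic search; alpha near 0 favors keyword search.
    Documents missing from one list contribute 0 for that side.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    d, s = min_max_normalize(dense), min_max_normalize(sparse)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in set(d) | set(s)}

dense = {"a": 0.9, "b": 0.5}    # hypothetical cosine similarities
sparse = {"b": 12.0, "c": 3.0}  # hypothetical keyword scores
blended = hybrid_scores(dense, sparse, alpha=0.3)  # keyword-heavy weighting
print(sorted(blended, key=blended.get, reverse=True))
```

A corpus full of case citations might start with a low alpha like the one shown, while a natural-language Q&A workload would start higher; either way the value is worth tuning against the evaluation set.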

Embedding Model Selection and Optimization

Evaluation on Domain Data

Never select an embedding model based solely on benchmark performance. Evaluate on the client's actual documents and queries.

Evaluation process:

  1. Collect 100-200 representative queries from the client's users (from search logs, user interviews, or commonly asked questions)
  2. For each query, have domain experts identify the 5-10 most relevant documents from the corpus
  3. Embed the queries and the full corpus using each candidate model
  4. Compute retrieval metrics: Recall@10, NDCG@10, MRR
  5. Select the model with the best metrics on the client's data
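The metrics in step 4 can be computed with a few short functions. This sketch assumes binary relevance (a document is either in the expert-labeled set or not); graded relevance would need a different gain term in NDCG. The document IDs are placeholders.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is found."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the best achievable."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def evaluate(results_by_query, relevant_by_query, k=10):
    """Average each metric over all queries in the evaluation set."""
    n = len(results_by_query)
    totals = {"recall": 0.0, "ndcg": 0.0, "mrr": 0.0}
    for q, ranked in results_by_query.items():
        rel = relevant_by_query[q]
        totals["recall"] += recall_at_k(ranked, rel, k)
        totals["ndcg"] += ndcg_at_k(ranked, rel, k)
        totals["mrr"] += reciprocal_rank(ranked, rel)
    return {name: value / n for name, value in totals.items()} if n else totals

results = {"q1": ["d3", "d1", "d7", "d2"]}
labels = {"q1": {"d1", "d2"}}
print(evaluate(results, labels))
```

Running `evaluate` once per candidate model on the same labeled queries gives directly comparable numbers for step 5.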

Models to evaluate:

  • At least one commercial embedding model (OpenAI, Cohere)
  • At least one open-source model (BGE-Large, E5-Large, GTE-Large)
  • A domain-specific model if available (LegalBERT embeddings for legal, BioBERT embeddings for medical)

Fine-Tuning Embeddings for Domain Performance

Fine-tuning an embedding model on the client's domain data typically improves retrieval quality by 10-25%.

Training data for fine-tuning:

  • Positive pairs: Query + relevant document pairs from search logs, expert annotations, or LLM-generated relevance labels
  • Hard negatives: Documents that are somewhat similar to the query but not truly relevant. Mine these from the current search system's results: documents that appear in the top 20 results but are not relevant according to expert judgment
  • Minimum dataset size: 1,000 query-document pairs, with 5-10 hard negatives per query
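Mining hard negatives as described above amounts to a set difference over the current system's top results. In this sketch the function name, the cutoff defaults, and the document IDs are illustrative.

```python
def mine_hard_negatives(top_results, relevant_ids, top_n=20, max_negatives=10):
    """Pick documents the current system ranks highly but experts judged not relevant.

    top_results:  doc IDs returned for a query, best-first
    relevant_ids: set of doc IDs that experts marked relevant for that query
    """
    negatives = [d for d in top_results[:top_n] if d not in relevant_ids]
    return negatives[:max_negatives]

current_top = ["d4", "d1", "d9", "d2", "d7"]
expert_relevant = {"d1", "d2"}
print(mine_hard_negatives(current_top, expert_relevant, max_negatives=2))
# ['d4', 'd9']
```

Taking negatives in rank order keeps the ones the current system is most confused by, which tend to be the most informative training examples.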

Fine-tuning approach:

  • Start from a strong general-purpose embedding model
  • Use contrastive learning with in-batch negatives plus mined hard negatives
  • Fine-tune for 3-10 epochs with a low learning rate (1e-5 to 5e-5)
  • Evaluate on the held-out query set after each epoch
  • Select the checkpoint with the best Recall@10 on the evaluation set
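For intuition about what contrastive learning optimizes, here is a pure-Python sketch of an InfoNCE-style loss for a single query. It is not a training loop: real fine-tuning would run on batches inside a deep learning framework, and the temperature value and toy vectors are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Cross-entropy of picking the positive among positive + negatives.

    The loss is small when the query is much closer to the positive
    than to any negative, and large otherwise.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]

query = [1.0, 0.0]
good = info_nce_loss(query, [0.9, 0.1], [[0.0, 1.0], [0.1, 0.9]])
bad = info_nce_loss(query, [0.0, 1.0], [[0.9, 0.1]])
print(good, bad)
```

The second case, where a negative sits closer to the query than the positive, produces a far larger loss; gradient descent on this quantity is what pulls relevant pairs together and pushes hard negatives away.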

Chunking Optimization

The chunking strategy has a significant impact on search quality, often as much as the embedding model choice.

Experiment with multiple strategies:

  • Fixed-size chunks (256 tokens, 512 tokens)
  • Sentence-based chunks (groups of 3-5 sentences)
  • Paragraph-based chunks
  • Section-based chunks (using document headers to identify natural boundaries)

Evaluate each strategy on the same query set and choose the one with the best retrieval metrics. There is no universal best chunking strategy; it depends on the document structure, query types, and embedding model.

Parent-child chunking:

  • Index fine-grained chunks (paragraphs or sentences) for precise retrieval
  • Maintain references from chunks to their parent sections and documents
  • When a fine-grained chunk matches, return the parent section or surrounding context for more complete answers
  • This gives you the precision of small chunks with the context of larger documents
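A minimal sketch of the parent-child idea: index paragraphs, but hand back the enclosing section when one matches. The ID scheme (`section#pN`) and the blank-line paragraph split are assumptions made for the example.

```python
def build_child_chunks(sections):
    """Split each section into paragraph chunks that remember their parent.

    sections: {section_id: section_text}, paragraphs separated by blank lines.
    """
    children = []
    for section_id, text in sections.items():
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for i, paragraph in enumerate(paragraphs):
            children.append({
                "chunk_id": f"{section_id}#p{i}",
                "parent_id": section_id,
                "text": paragraph,
            })
    return children

def expand_to_parents(matched_chunks, sections):
    """Replace matched child chunks with their parent sections, de-duplicated in match order."""
    seen, results = set(), []
    for chunk in matched_chunks:
        parent_id = chunk["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            results.append({"parent_id": parent_id, "text": sections[parent_id]})
    return results

sections = {
    "policy-4": "Hiring relatives is restricted.\n\nExceptions need written approval.",
    "policy-9": "Parking permits renew yearly.",
}
children = build_child_chunks(sections)
matches = [children[1], children[0]]  # pretend both paragraphs of policy-4 matched
print(expand_to_parents(matches, sections))
```

De-duplicating by parent avoids showing the same section twice when several of its paragraphs match the query.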

Search Quality Optimization

Query Expansion

Short or ambiguous queries benefit from expansion: adding related terms to improve recall.

LLM-based query expansion:

  • Pass the user's query to an LLM with a prompt like: "Given this search query, generate 3 alternative phrasings that express the same information need in different ways."
  • Embed all query variations and use the average embedding or retrieve with each variation and merge results
  • This dramatically improves recall for short or ambiguous queries
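The two ways of combining query variations mentioned above can be sketched as follows. The LLM call itself is omitted; the functions assume the variations have already been embedded or searched, and the vectors and scores are invented.

```python
def average_embedding(vectors):
    """Element-wise mean of several query embeddings of equal length."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def merge_results(result_lists):
    """Union of per-variation results, keeping each document's best score.

    result_lists: list of [(doc_id, score), ...], one list per query variation.
    """
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))

print(average_embedding([[1.0, 0.0], [0.0, 1.0]]))
# [0.5, 0.5]
print(merge_results([[("a", 0.8), ("b", 0.6)], [("b", 0.9), ("c", 0.4)]]))
# [('b', 0.9), ('a', 0.8), ('c', 0.4)]
```

Averaging costs one vector search but can blur distinct phrasings together; retrieving per variation and merging costs more searches but preserves each phrasing's best matches.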

Pseudo-relevance feedback:

  • Execute the initial query
  • Use the top 3-5 results as additional context
  • Generate an expanded query using terms from the top results
  • Re-execute the expanded query
  • This iterative approach improves recall by 10-20%
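A term-frequency version of pseudo-relevance feedback can be sketched in a few lines. The stopword list, tokenization, and the choice of raw counts are simplifications; production systems often weight candidate terms with TF-IDF or a similar measure.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with", "that"}

def expand_query(query, top_docs, num_terms=3):
    """Append the most frequent new terms from the top results to the query."""
    query_terms = set(re.findall(r"[a-z]+", query.lower()))
    counts = Counter()
    for doc in top_docs:
        for term in re.findall(r"[a-z]+", doc.lower()):
            if term not in STOPWORDS and term not in query_terms:
                counts[term] += 1
    # Most frequent first; alphabetical order breaks ties deterministically.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    extra_terms = [term for term, _ in ranked[:num_terms]]
    return " ".join([query] + extra_terms)

top_docs = [
    "The anti-nepotism policy covers hiring of relatives",
    "Hiring policy for relatives and family",
]
print(expand_query("nepotism rule", top_docs))
# nepotism rule hiring policy relatives
```

The expanded query is then re-executed; because the added terms come from the first-pass results, the technique only helps when those results are at least partly on topic.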

Relevance Feedback

Incorporate user behavior to improve search quality over time.

Implicit feedback signals:

  • Click-through data: Which results users click on indicates relevance
  • Dwell time: How long users spend reading a document indicates relevance quality
  • Search refinement: When users modify their query, the results shown for the original query were likely not relevant
  • Download or save actions: Strong signals of high relevance

Using feedback for improvement:

  • Use click data to mine positive and negative pairs for embedding fine-tuning
  • Use feedback to calibrate the re-ranking model
  • Use aggregated feedback to identify queries with consistently poor results (candidates for index or model improvement)

Search Analytics

Track search quality metrics continuously to identify degradation and improvement opportunities.

Metrics to track:

  • Zero-result rate: Percentage of queries that return no results. Target: below 5%.
  • Click-through rate: Percentage of queries where the user clicks on at least one result. Higher is better.
  • Mean reciprocal rank of clicks: The average of 1/rank for the first clicked result on each query. Higher (closer to 1) is better, meaning users tend to click results near the top.
  • Search abandonment rate: Percentage of queries after which the user does not click any result and does not refine the query. This indicates the search failed to find relevant content.
  • Query refinement rate: Percentage of queries that the user reformulates. Some refinement is normal, but high rates indicate initial results were poor.
  • Time to click: Average time from search results display to first click. Shorter times indicate results are clearly relevant.
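Several of these metrics can be derived from a simple per-query log. The event fields below (`num_results`, `first_click_rank`, `refined`) are a hypothetical schema, and computing mean reciprocal rank only over queries that received a click is one of several reasonable conventions.

```python
def search_metrics(events):
    """Summarize search quality from per-query log events.

    Each event is a dict with (assumed) keys:
      num_results      - how many results were returned
      first_click_rank - 1-based rank of the first click, or None if no click
      refined          - True if the user reformulated the query afterwards
    """
    total = len(events)
    if total == 0:
        return {}
    clicked = [e for e in events if e["first_click_rank"] is not None]
    abandoned = [e for e in events if e["first_click_rank"] is None and not e["refined"]]
    return {
        "zero_result_rate": sum(1 for e in events if e["num_results"] == 0) / total,
        "click_through_rate": len(clicked) / total,
        "mrr_of_clicks": (sum(1.0 / e["first_click_rank"] for e in clicked) / len(clicked)
                          if clicked else 0.0),
        "abandonment_rate": len(abandoned) / total,
        "refinement_rate": sum(1 for e in events if e["refined"]) / total,
    }

log = [
    {"num_results": 10, "first_click_rank": 1, "refined": False},
    {"num_results": 10, "first_click_rank": 2, "refined": False},
    {"num_results": 0, "first_click_rank": None, "refined": True},
    {"num_results": 5, "first_click_rank": None, "refined": False},
]
print(search_metrics(log))
```

Tracking these values per week, and per query category where possible, makes it easier to spot whether a model or index change helped or hurt.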

Scaling and Performance

Latency Optimization

Enterprise search users expect results in under 1 second.

Latency breakdown for a typical semantic search query:

  • Query embedding: 20-50ms (GPU) or 50-150ms (CPU)
  • Vector search: 10-50ms (depends on index size and configuration)
  • Re-ranking: 100-300ms (for 50-100 candidates with a cross-encoder)
  • Metadata filtering and result formatting: 10-30ms
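  • Total: roughly 150-500ms (summing the ranges above gives 140-430ms with GPU embedding and 170-530ms with CPU embedding)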

Optimization strategies:

  • Cache query embeddings for frequent queries
  • Use approximate nearest neighbor search (HNSW) instead of exact search
  • Limit re-ranking to the top 20-50 candidates instead of 100
  • Use a lightweight re-ranking model (MiniLM-based cross-encoder instead of large models)
  • Pre-compute and cache results for the most common queries
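Caching query embeddings can be as simple as wrapping the embedding call with an LRU cache. In this sketch `embed_fn` is a placeholder for whatever model call the system uses, and normalizing case and whitespace before lookup is an assumption about what counts as "the same query".

```python
from functools import lru_cache

def normalize_query(query):
    """Treat queries that differ only in case or spacing as identical."""
    return " ".join(query.lower().split())

def make_cached_embedder(embed_fn, maxsize=10_000):
    """Wrap an embedding function so repeated queries skip the model call."""
    @lru_cache(maxsize=maxsize)
    def _cached(normalized_query):
        # Tuples are immutable, so cached vectors cannot be modified by callers.
        return tuple(embed_fn(normalized_query))

    def embed(query):
        return _cached(normalize_query(query))

    embed.cache_info = _cached.cache_info  # expose hit/miss counters
    return embed

model_calls = []

def fake_embed(text):
    """Stand-in for a real embedding model; records each call."""
    model_calls.append(text)
    return [float(len(text)), 0.0]

embed = make_cached_embedder(fake_embed)
embed("Fiduciary  Duty")
embed("fiduciary duty")
print(len(model_calls), embed.cache_info().hits)
# 1 1
```

An in-process cache like this is lost on restart and not shared between servers; a shared cache service would be the next step if hit rates justify it.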

Index Management at Scale

For corpuses with millions of documents and hundreds of millions of chunks:

  • Shard the index across multiple nodes for parallel search
  • Replicate shards for read throughput and availability
  • Use product quantization to reduce memory requirements (4-8x memory savings with 2-5% recall reduction)
  • Implement tiered storage: keep hot data (recent documents, frequently accessed documents) on fast storage and cold data on cheaper storage
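A quick back-of-the-envelope calculation shows why compression matters at this scale. The chunk count and the 768-dimension float32 vectors are assumptions chosen for illustration; the 4-8x range is the savings figure quoted above, and index overhead is ignored.

```python
def raw_vector_memory_gb(num_vectors, dim, bytes_per_value=4):
    """Memory for uncompressed vectors, in gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9

# Assumed workload: 200 million chunks, 768-dimensional float32 embeddings.
raw_gb = raw_vector_memory_gb(200_000_000, 768)
low_gb, high_gb = raw_gb / 8, raw_gb / 4  # applying the 8x and 4x savings

print(f"uncompressed: {raw_gb:.1f} GB")
print(f"with 4-8x compression: {low_gb:.1f} to {high_gb:.1f} GB")
# uncompressed: 614.4 GB
# with 4-8x compression: 76.8 to 153.6 GB
```

Under these assumptions the uncompressed vectors would not fit in the memory of a single typical server, which is why sharding and quantization are usually planned together.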

Your Next Step

Collect the 50 most common search queries from your client's current search system (from search logs or user interviews). For each query, have a domain expert identify the 5 most relevant documents in the corpus. This evaluation set is your ground truth for measuring any search improvement. Run these queries against the current keyword search system and compute Recall@10. Then run them against a basic semantic search prototype (off-the-shelf embeddings, no fine-tuning, no re-ranking) and compare. This side-by-side comparison quantifies the potential improvement from semantic search and identifies the query types where semantic search adds the most value. Use these numbers to scope the project, set accuracy targets, and build the business case.
