A legal tech agency in New York was hired by a multinational corporation to build a system that could find relevant precedent clauses across their library of 4.2 million contracts. The existing keyword search was slow, missed semantic matches (searching for "termination clause" would not find "exit provisions"), and returned hundreds of irrelevant results. The agency built a vector search system that embedded every clause into a 768-dimensional vector space, indexed them with HNSW, and served similarity queries in under 200 milliseconds. Contract review time dropped by 73%. Lawyers who previously spent 45 minutes searching for relevant clauses could now find the top 10 most semantically similar clauses in seconds. The system handled 12,000 queries per day with sub-second latency and became the backbone of the corporation's contract intelligence platform.
Vector search is the practice of converting data into numerical vectors (embeddings) and finding similar items by measuring the distance between vectors in high-dimensional space. Unlike keyword search, vector search captures semantic meaning: it finds items that are conceptually similar even when they use different words. For AI agencies, vector search is a foundational capability that powers semantic search, recommendation systems, RAG (retrieval-augmented generation), deduplication, and anomaly detection.
Scoping Enterprise Vector Search Projects
Understanding the Search Problem
The first step in any vector search project is understanding exactly what the client needs to find, in what corpus, and with what constraints.
Key scoping questions:
- What is being searched? Text documents, images, product descriptions, customer records, code repositories, audio transcripts. The data type determines the embedding model.
- What is the corpus size? Thousands, millions, or billions of items. Corpus size determines the vector database and indexing strategy.
- What does a good result look like? Show the client examples of good and bad search results to calibrate expectations.
- What is the latency requirement? Sub-100ms for interactive search, sub-second for batch retrieval, seconds for background processing.
- What is the freshness requirement? How quickly must new items become searchable after ingestion? Real-time (seconds), near-real-time (minutes), or batch (hours)?
- What metadata filtering is needed? Can results be filtered by date, category, author, or other attributes alongside vector similarity?
- What is the query volume? Queries per second at peak load determines infrastructure sizing.
Defining Success Metrics
Vector search quality is subjective without defined metrics. Agree on metrics before building the system.
Retrieval metrics:
- Recall@K: Of the true relevant documents, what proportion appears in the top K results? This is the most important metric for most applications.
- Precision@K: Of the top K results, what proportion is truly relevant?
- Mean Reciprocal Rank (MRR): How high does the first relevant result appear in the ranking? MRR of 1.0 means the first result is always relevant.
- Normalized Discounted Cumulative Gain (NDCG): Measures ranking quality: are the most relevant results ranked highest?
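As a rough illustration, the four retrieval metrics above can be computed from a ranked result list and a set of relevance judgments. The function names and sample document IDs are hypothetical, and the NDCG variant shown (linear gain, log2 discount) is one common formulation rather than the only one.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, grades, k):
    """grades maps doc_id -> graded relevance (0 = not relevant, higher = better)."""
    def dcg(gains):
        return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))
    actual = dcg([grades.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical query: system ranking plus human relevance grades.
ranked = ["d3", "d1", "d7", "d2"]
grades = {"d1": 2, "d2": 1, "d9": 2}
relevant = set(grades)

print(recall_at_k(ranked, relevant, 3))
print(precision_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, grades, 3))
```

MRR is the mean of `reciprocal_rank` over all queries in the evaluation set; the other metrics are usually averaged the same way.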
Operational metrics:
- Query latency (p50, p95, p99)
- Throughput (queries per second)
- Index size (storage cost)
- Ingestion latency (time from new document to searchable)
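Latency percentiles are easy to compute from logged query timings. The sketch below uses the nearest-rank method; the function name and the sample timings are hypothetical, and monitoring tools may use a slightly different interpolation.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[int(rank) - 1]

# Hypothetical query latencies in milliseconds.
latencies_ms = [42, 38, 51, 47, 180, 44, 39, 61, 45, 950]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

Reporting p95 and p99 alongside p50 matters because a handful of slow queries can dominate the user experience while leaving the median unchanged.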
Business metrics:
- User click-through rate on search results
- Time to find relevant information (measured via user studies)
- User satisfaction scores
- Task completion rate (for search-dependent workflows)
Embedding Pipeline Design
Embedding Model Selection
The embedding model is the most important component of a vector search system. It determines the quality of the vector representations and, ultimately, the relevance of search results.
Text embedding models (a rough quality ordering for enterprise use; validate it on the client's data):
- Cohere Embed v3: Excellent multilingual performance, 1024 dimensions, strong for enterprise document search. API-based.
- OpenAI text-embedding-3-large: High quality, 3072 dimensions (can be shortened through the API's dimensions parameter, which relies on Matryoshka-style training), good general-purpose choice. API-based.
- BGE-Large (BAAI): Open-source, 1024 dimensions, competitive with commercial models. Can be self-hosted for data privacy.
- E5-Large-v2 (Microsoft): Open-source, 1024 dimensions, strong performance on retrieval benchmarks. Self-hostable.
- GTE-Large (Alibaba): Open-source, 1024 dimensions, excellent for enterprise search tasks. Self-hostable.
- Sentence-Transformers (all-MiniLM-L6-v2): Lightweight open-source model, 384 dimensions, good for cost-sensitive applications with moderate quality requirements.
Model selection criteria:
- Quality: Evaluate on a representative sample of the client's data. The best model on benchmarks may not be the best model for the client's domain.
- Dimensionality: Higher dimensions capture more information but increase storage and search costs. 768-1024 dimensions is the sweet spot for most applications.
- Self-hosted vs. API: Self-hosted models provide data privacy and lower per-query costs at scale. API models provide simplicity and avoid infrastructure management. For regulated industries, self-hosted is often required.
- Domain specificity: If a domain-specific embedding model exists (legal, medical, scientific), evaluate it against general-purpose models on the client's data. Domain-specific models often win by 5-15%.
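The storage side of the dimensionality tradeoff is simple arithmetic. The sketch below assumes uncompressed float32 vectors (4 bytes per value) and ignores index overhead and metadata; the 10-million-vector corpus is a hypothetical example.

```python
def raw_vector_storage_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage in GB (10^9 bytes); float32 is assumed by default."""
    return num_vectors * dimensions * bytes_per_value / 1e9

# Hypothetical corpus of 10 million chunks at several common dimensionalities.
for dims in (384, 768, 1024, 3072):
    print(f"{dims:>5} dims: {raw_vector_storage_gb(10_000_000, dims):.2f} GB")
```

At 10 million vectors this works out to roughly 15 GB at 384 dimensions, 31 GB at 768, 41 GB at 1024, and 123 GB at 3072, before any index structures are added.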
Fine-Tuning Embeddings
Off-the-shelf embedding models provide a strong baseline, but fine-tuning on the client's data typically improves retrieval quality by 10-25%.
Fine-tuning approaches:
- Contrastive learning: Create positive pairs (query + relevant document) and negative pairs (query + irrelevant document). Fine-tune the embedding model to bring positive pairs closer together and push negative pairs apart in vector space.
- Hard negative mining: The most impactful technique for improving fine-tuned embedding quality. Instead of random negatives, use the current model to find documents that are similar to the query but not relevant. These hard negatives force the model to learn more discriminative representations.
- In-domain pre-training: Before fine-tuning on relevance pairs, continue pre-training the language model on the client's document corpus. This teaches the model the domain vocabulary and writing style.
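Hard negative mining can be sketched in a few lines: score every document with the current model and keep the highest-scoring ones that are not labeled relevant. The function names and toy 2-dimensional vectors below are hypothetical; in practice the vectors come from the embedding model and the search runs against the vector index. Unlabeled near neighbors can be false negatives, so a manual spot check is usually worthwhile.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mine_hard_negatives(query_vec, doc_vecs, positive_ids, num_negatives=2):
    """Return IDs of the most query-similar documents that are NOT labeled relevant."""
    scored = sorted(
        (
            (cosine(query_vec, vec), doc_id)
            for doc_id, vec in doc_vecs.items()
            if doc_id not in positive_ids
        ),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:num_negatives]]

# Toy embeddings (assumed to come from the current model).
query = [1.0, 0.0]
docs = {
    "a": [0.9, 0.1],  # labeled relevant
    "b": [0.8, 0.3],  # similar to the query but not relevant: a hard negative
    "c": [0.0, 1.0],  # dissimilar: an easy negative
    "d": [0.6, 0.6],
}
print(mine_hard_negatives(query, docs, positive_ids={"a"}))
```

Each resulting (query, positive, hard negative) triple then feeds the contrastive fine-tuning step.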
Training data for fine-tuning:
- Minimum 1,000 query-document relevance pairs for meaningful fine-tuning
- Ideal: 5,000-20,000 pairs for robust quality improvement
- Sources: historical search logs with click data, human relevance judgments, LLM-generated relevance labels
Chunking Strategy
Most enterprise documents are too long to embed as a single vector. Chunking, or splitting documents into smaller segments, is necessary for granular retrieval.
Chunking approaches:
- Fixed-size chunks: Split documents into chunks of a fixed number of tokens (256-512 tokens). Simple but may split sentences or ideas mid-thought.
- Sentence-based chunks: Split at sentence boundaries, grouping sentences into chunks of target size. Preserves semantic coherence within chunks.
- Paragraph-based chunks: Use paragraph boundaries as natural chunk boundaries. Works well for well-structured documents.
- Semantic chunking: Use a model to identify topic boundaries within the document and split at those boundaries. Produces the most semantically coherent chunks but is more expensive to compute.
- Hierarchical chunking: Create chunks at multiple granularities (paragraph-level, section-level, and document-level) and index all of them. Search at the finest granularity and use parent-child relationships to provide context.
Chunking configuration parameters:
- Chunk size: 256-512 tokens for most applications. Smaller chunks improve precision (the retrieved chunk is more focused); larger chunks improve recall (the chunk contains more context).
- Chunk overlap: 50-100 tokens of overlap between consecutive chunks to prevent splitting entities or concepts that span chunk boundaries.
- Metadata preservation: Attach the source document ID, section header, page number, and surrounding context to each chunk for retrieval context.
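A minimal sketch of fixed-size chunking with overlap and per-chunk metadata follows. It assumes whitespace-separated words as a stand-in for real tokens (a production pipeline would use the embedding model's tokenizer), and the function and field names are hypothetical.

```python
def chunk_document(doc_id, text, chunk_size=256, overlap=50):
    """Split text into overlapping fixed-size chunks and attach source metadata."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "doc_id": doc_id,            # source document for retrieval context
            "chunk_index": len(chunks),  # position of the chunk in the document
            "start_token": start,        # offset into the token stream
            "text": " ".join(window),
        })
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

# Tiny example: 10 "tokens", chunks of 4 with an overlap of 1.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_document("contract-001", sample, chunk_size=4, overlap=1):
    print(chunk["chunk_index"], chunk["start_token"], chunk["text"])
```

Section headers and page numbers can be added to the same dictionary when the upstream parser provides them.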
The chunking strategy has a significant impact on retrieval quality, often as much as the embedding model choice. Test multiple chunking approaches on a representative evaluation set before committing to a strategy.
Vector Database Selection
Database Options
Pinecone: Managed vector database with excellent operational simplicity. Supports metadata filtering, namespaces for multi-tenant isolation, and hybrid search (vector + keyword). Best for teams that want to minimize infrastructure management.
Weaviate: Open-source vector database with built-in vectorization modules, hybrid search, and a GraphQL API. Strong for applications that need flexible schema and querying. Can be self-hosted or used as a managed service.
Qdrant: Open-source vector database with a focus on performance and filtering. Excellent metadata filtering capabilities and payload storage. Written in Rust, known for low latency and high throughput. Can be self-hosted or used as a managed service.
Milvus/Zilliz: Open-source (Milvus) or managed (Zilliz) vector database designed for scale. Handles billions of vectors. Best for the largest enterprise deployments.
pgvector (PostgreSQL extension): Vector search as a PostgreSQL extension. Best for applications where the vector index is small (under 10 million vectors) and the team already uses PostgreSQL. Simplifies architecture by avoiding a separate database.
Elasticsearch with vector search: If the client already uses Elasticsearch for keyword search, adding vector search via dense vector fields avoids introducing a new database. Good for hybrid search applications.
Indexing Algorithms
The indexing algorithm determines the tradeoff between search accuracy (recall) and search speed.
HNSW (Hierarchical Navigable Small World): The default choice for most applications. Provides high recall (95-99%) at low latency. Memory-resident, so it requires enough RAM to hold the entire index. Good for datasets up to 100 million vectors per node.
IVF (Inverted File Index): Partitions the vector space into clusters and searches only the most relevant clusters. Lower memory requirements than HNSW but lower recall at the same latency. Good for very large datasets where HNSW memory requirements are prohibitive.
Product Quantization (PQ): Compresses vectors to reduce memory usage by 4-8x with moderate accuracy loss (2-5% recall reduction). Use in combination with HNSW or IVF when memory is constrained.
Flat (brute-force): Exhaustive search with perfect recall. Only practical for small datasets (under 100,000 vectors) or as a benchmark for evaluating approximate methods.
Recommended defaults:
- Under 1 million vectors: HNSW with full-precision vectors
- 1-100 million vectors: HNSW with scalar quantization
- Over 100 million vectors: IVF + PQ, or HNSW with product quantization
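The defaults above, plus the memory effect of scalar quantization, can be expressed as a small helper. This is a sketch under stated assumptions: scalar quantization is taken to mean one byte per dimension (int8), the estimate covers vector data only (HNSW graph links add more), and the function names and the 50-million-vector example are hypothetical.

```python
def recommend_index(num_vectors):
    """Map corpus size to the default index configuration listed above."""
    if num_vectors < 1_000_000:
        return "HNSW, full-precision vectors"
    if num_vectors <= 100_000_000:
        return "HNSW, scalar quantization"
    return "IVF + PQ, or HNSW with product quantization"

def vector_memory_gb(num_vectors, dimensions, bytes_per_value):
    """Memory for vector data only, in GB (10^9 bytes)."""
    return num_vectors * dimensions * bytes_per_value / 1e9

# Hypothetical deployment: 50 million 768-dimensional vectors.
n, dims = 50_000_000, 768
print(recommend_index(n))
print(f"float32: {vector_memory_gb(n, dims, 4):.1f} GB")
print(f"int8:    {vector_memory_gb(n, dims, 1):.1f} GB")
```

In this example the int8 representation needs about a quarter of the float32 memory, which is often the difference between one node and several.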
Hybrid Search
Combining vector search with keyword search often produces better results than either approach alone. Vector search excels at semantic matching, while keyword search excels at exact term matching (product names, part numbers, legal citations).
Hybrid search implementation:
- Run vector search and keyword search in parallel
- Combine results using reciprocal rank fusion (RRF), a simple, effective method that gives each result a score of 1 / (k + rank) in every ranked list where it appears, with k a small smoothing constant, and sorts results by the sum of those scores
- Allow the client to tune the weight between vector and keyword components based on their search quality preferences
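Reciprocal rank fusion is short enough to sketch in full. The function name and document IDs below are hypothetical; k = 60 is a conventional default rather than a value from this article, and the optional per-list weights are one simple way to expose the vector/keyword balance as a tunable setting.

```python
def reciprocal_rank_fusion(ranked_lists, weights=None, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores weight / (k + rank) for every list it appears in;
    documents are returned by descending total score (ties broken by ID).
    """
    weights = weights or [1.0] * len(ranked_lists)
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

vector_hits = ["a", "b", "c"]   # hypothetical vector search results
keyword_hits = ["c", "a", "d"]  # hypothetical keyword search results

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# Weighting the keyword list more heavily, e.g. for part-number lookups:
print(reciprocal_rank_fusion([vector_hits, keyword_hits], weights=[1.0, 2.0]))
```

Because RRF works on ranks rather than raw scores, it needs no normalization between cosine similarities and keyword relevance scores.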
When to use hybrid search:
- The corpus contains both natural language and structured identifiers (codes, numbers, names)
- Users sometimes search with exact terms and sometimes with conceptual queries
- The domain has specific terminology where exact matching is important (medical codes, legal citations, product SKUs)
Production Architecture
Ingestion Pipeline
Real-time ingestion:
- New document arrives (via API, file upload, or message queue)
- Document is preprocessed (OCR if needed, text extraction, cleaning)
- Document is chunked according to the configured strategy
- Each chunk is embedded using the embedding model
- Vectors and metadata are inserted into the vector database
- The document is marked as searchable
Batch ingestion for initial load:
- Extract all documents from the source system
- Preprocess and chunk in parallel using distributed computing (Spark, Ray, or multi-process Python)
- Embed chunks in batches using GPU-accelerated inference
- Bulk-load vectors into the vector database
- Build or optimize the index after bulk loading
Ingestion performance:
- A single GPU (A10G) running a 768-dimension embedding model can embed approximately 500-2,000 chunks per second depending on chunk length
- Bulk loading 10 million chunks takes approximately 2-6 hours including embedding and indexing
- Plan for the initial corpus load plus ongoing ingestion of new documents
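The throughput figures above translate directly into a planning estimate. The sketch below covers embedding time only and assumes throughput scales linearly with GPU count, which is an idealization; the function name is hypothetical.

```python
def embedding_hours(num_chunks, chunks_per_second, num_gpus=1):
    """Hours of embedding time, assuming linear scaling across GPUs."""
    return num_chunks / (chunks_per_second * num_gpus) / 3600

# 10 million chunks at the low and high ends of the single-GPU range.
for rate in (500, 2000):
    print(f"{rate} chunks/s: {embedding_hours(10_000_000, rate):.1f} h")
```

For 10 million chunks this gives about 5.6 hours at 500 chunks per second and about 1.4 hours at 2,000, which is consistent with the 2-6 hour range once bulk loading and index building are added.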
Query Pipeline
Query processing steps:
- Receive the search query from the user or application
- Preprocess the query (expand abbreviations, correct spelling, identify filters)
- Embed the query using the same embedding model used for documents
- Execute vector search against the database (with metadata filters if applicable)
- Re-rank results using a cross-encoder model for improved relevance (optional but recommended)
- Format and return results with snippets, metadata, and relevance scores
Re-ranking with cross-encoders:
Cross-encoder re-ranking is the single highest-leverage technique for improving search relevance after the initial vector retrieval. Retrieve the top 50-100 results with vector search (fast but approximate relevance), then re-rank with a cross-encoder (slow but precise relevance) and return the top 10.
Cross-encoders to consider:
- Cohere Rerank: API-based, high quality, simple integration
- BGE-Reranker (BAAI): Open-source, self-hostable, strong performance
- ms-marco-MiniLM cross-encoders: Lightweight, open-source, good for cost-sensitive deployments
Re-ranking typically improves NDCG@10 by 10-30% compared to vector search alone.
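The retrieve-then-rerank pattern can be sketched independently of any particular model. In the example below, `toy_first_stage` and `toy_relevance` are hypothetical stand-ins: a real system would call the vector database for the first stage and a cross-encoder (such as the ones listed above) for scoring.

```python
CORPUS = [
    "payment terms and invoicing schedule",
    "termination for convenience clause",
    "governing law and jurisdiction",
]

def toy_first_stage(query, k):
    """Stand-in for vector search: returns the first k corpus entries."""
    return CORPUS[:k]

def toy_relevance(query, doc):
    """Stand-in for a cross-encoder: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0

def retrieve_then_rerank(query, first_stage, scorer, retrieve_k=50, final_k=10):
    """Fetch a wide candidate set cheaply, then re-order it with a precise scorer."""
    candidates = first_stage(query, retrieve_k)
    reranked = sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)
    return reranked[:final_k]

top = retrieve_then_rerank("termination clause", toy_first_stage, toy_relevance, final_k=1)
print(top)
```

The cost of the second stage grows with `retrieve_k`, so the candidate count is the main knob for trading latency against relevance.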
Scaling for Enterprise
Horizontal scaling:
- Shard the vector index across multiple nodes for datasets that exceed single-node memory capacity
- Replicate shards for read throughput and availability
- Most vector databases handle sharding and replication natively; configure the replica count and shard count based on your query volume and availability requirements
Caching:
- Cache frequent queries and their results (many search applications have a Zipf distribution where a small number of queries account for a large proportion of search volume)
- Cache embedding vectors for frequent queries to avoid redundant embedding computation
- Invalidate cache entries when the underlying index changes
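A small in-process cache illustrates the idea. This is a sketch only: the class name is hypothetical, queries are normalized by lowercasing and collapsing whitespace, eviction is least-recently-used, and invalidation simply clears everything whenever the index changes. A production deployment would more likely use a shared cache such as Redis with finer-grained invalidation.

```python
from collections import OrderedDict

class QueryCache:
    """LRU cache for search results, cleared whenever the index changes."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    @staticmethod
    def _key(query):
        # Normalize so trivially different spellings share one entry.
        return " ".join(query.lower().split())

    def get(self, query):
        key = self._key(query)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, query, results):
        key = self._key(query)
        self._entries[key] = results
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def invalidate(self):
        """Call after any index update so stale results are never served."""
        self._entries.clear()

cache = QueryCache(max_entries=2)
cache.put("Exit  Provisions", ["clause-17", "clause-42"])
print(cache.get("exit provisions"))
cache.invalidate()
print(cache.get("exit provisions"))
```

The same structure works for caching query embeddings: store the vector instead of the result list, and skip invalidation unless the embedding model changes.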
Multi-tenancy:
- For agencies serving multiple clients from a shared infrastructure, isolate client data using namespaces, collections, or separate databases depending on the vector database's multi-tenancy features
- Apply query-time filtering to ensure clients can only search their own data
- Monitor per-client query volume and latency to ensure fair resource allocation
Evaluation and Testing
Building an Evaluation Set
A high-quality evaluation set is essential for measuring search quality and validating improvements.
Evaluation set composition:
- 200-500 queries representative of real user queries
- For each query, 10-20 documents annotated with relevance grades (not relevant, partially relevant, highly relevant)
- Include a mix of easy queries (where relevant documents use the same terminology as the query) and hard queries (where relevant documents use different terminology)
- Include edge cases: very short queries, very long queries, queries with typos, queries with domain jargon
Evaluation set creation:
- Mine queries from search logs, support tickets, or user interviews
- For each query, retrieve the top 20 results from your search system and from a keyword baseline
- Have domain experts annotate relevance for the pooled results
- Version the evaluation set and update it quarterly as the corpus evolves
Continuous Evaluation
Run automated evaluation against the evaluation set whenever you change the embedding model, chunking strategy, indexing parameters, or re-ranking configuration.
Evaluation pipeline:
- Load the evaluation set (queries and relevance judgments)
- Execute each query against the current search system
- Compute retrieval metrics (Recall@10, Precision@10, MRR, NDCG@10)
- Compare to the previous evaluation results
- Flag any metric that degraded by more than 2%
- Generate a report showing per-query performance
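The comparison and flagging steps can be sketched as follows. The function name and metric values are hypothetical, and the 2% threshold is interpreted here as a relative drop against the previous run; an absolute threshold is an equally reasonable reading.

```python
def flag_regressions(previous, current, threshold=0.02):
    """Return {metric: relative_change} for metrics that dropped by more than threshold."""
    flagged = {}
    for name, old_value in previous.items():
        new_value = current.get(name)
        if new_value is None or old_value == 0:
            continue  # metric missing from this run, or no meaningful baseline
        relative_change = (new_value - old_value) / old_value
        if relative_change < -threshold:
            flagged[name] = relative_change
    return flagged

# Hypothetical results from two evaluation runs.
previous_run = {"recall@10": 0.80, "precision@10": 0.35, "mrr": 0.60, "ndcg@10": 0.70}
current_run = {"recall@10": 0.77, "precision@10": 0.36, "mrr": 0.595, "ndcg@10": 0.71}

for metric, change in flag_regressions(previous_run, current_run).items():
    print(f"REGRESSION {metric}: {change:+.1%}")
```

Running this check in CI on every configuration change turns the evaluation set into a regression test for search quality.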
Your Next Step
Pick one search problem your client has today: a corpus they need to search semantically rather than by keyword. Embed 1,000 representative documents using an off-the-shelf embedding model, load them into a vector database (Pinecone's free tier or a local Qdrant instance), and run 20 representative queries. Show the results to the client alongside their current keyword search results. The side-by-side comparison is the most powerful sales tool for vector search projects: when the client sees that a semantic query finds relevant documents that keyword search completely missed, the value proposition sells itself. Build the prototype in a day, then use it to scope the production project in a week.