Treat Vector Search as a Pipeline, Not One Step

Agency Script Editorial

Editorial Team

May 13, 2026 · 10 min read

Semantic search used to be a luxury reserved for teams with ML engineers on staff. That changed when embedding APIs and managed vector databases became commodity infrastructure. Now any agency or professional can build systems that find information by meaning rather than keyword match — but most people who try it get stuck because they're treating it as a single step rather than a pipeline with distinct components, each with its own decisions and failure modes.

This article introduces the RESIDE framework — Represent, Embed, Store, Index, Decompose, Evaluate — a reusable six-stage model for designing, building, and iterating on any embeddings and vector search system. Whether you're building a RAG pipeline for a client knowledge base, a semantic product catalog, or an internal document retrieval tool, this framework gives you a shared vocabulary and a structured sequence of decisions. Work through each stage in order the first time; revisit individual stages when something breaks.

The difference between a vector search system that impresses in a demo and one that holds up in production is almost always traceable to one of two things: poor choices made in the early stages (how content is represented and which model generates the embeddings) or skipped evaluation at the end. Both are fixable. Understanding the full pipeline makes both problems obvious.

Stage 1 — Represent: Prepare Your Content for Meaning

Before any AI model touches your data, you have to make deliberate choices about what unit of information you want to retrieve. This is the most underrated stage of the entire pipeline.

Chunking strategy

Most source documents are too long to embed as a single unit. Embedding models have context windows — commonly 512 to 8,192 tokens — and stuffing a 30-page PDF into a single embedding averages out the meaning until it represents nothing well.

Common chunking strategies:

  • Fixed-size chunks (e.g., 256 or 512 tokens with overlap): fast to implement, good baseline
  • Sentence or paragraph splits: preserves natural semantic boundaries; works well for prose
  • Recursive character splitting: splits on paragraph → sentence → word until chunks hit the target size; the default in most frameworks for good reason
  • Document-structure-aware splitting: split on headers, sections, or code blocks when your content has explicit structure

The overlap parameter matters. A 10–20% token overlap between adjacent chunks reduces the chance that retrieval returns a chunk cut off mid-idea, because text near a boundary is carried into both neighboring chunks.
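As a rough illustration of fixed-size chunking with overlap, here is a minimal sketch. The function name chunk_tokens and the toy token list are assumptions for this example; a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=32):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Toy tokens stand in for real tokenizer output (an assumption of this sketch).
words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With chunk_size=4 and overlap=1, each chunk starts on the last token of the previous one, so a sentence that straddles a boundary is not lost entirely to either side.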

Metadata attachment

Every chunk should carry metadata: source document, section title, date, author, document type, and any domain-specific tags. Metadata is not used in the embedding but is critical for filtering later. A chunk without metadata is a fact with no provenance — usable in demos, liability in production.
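One way to make the metadata requirement concrete is a small record type. This is only a sketch: the Chunk class, its field names, and the sample values are hypothetical, chosen to mirror the fields listed above.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Chunk:
    text: str        # the only field that gets embedded
    source: str      # source document
    section: str     # section title
    date: str        # ISO 8601 string, e.g. "2026-05-13"
    author: str
    doc_type: str
    tags: list = field(default_factory=list)  # domain-specific tags

    def payload(self):
        """Return the metadata to store next to the vector (everything but the text)."""
        data = asdict(self)
        data.pop("text")
        return data


chunk = Chunk(
    text="Refunds are processed within 14 days.",
    source="policies/refunds.pdf",   # hypothetical path
    section="Refund timeline",
    date="2026-05-13",
    author="Operations",
    doc_type="policy",
    tags=["billing"],
)
print(chunk.payload())
```

Many vector stores also keep the raw text in the payload so results can be displayed without a second lookup; that choice is left open here.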

Stage 2 — Embed: Choose and Calibrate Your Embedding Model

An embedding model converts text into a dense vector (a list of floating-point numbers, typically 384 to 3,072 dimensions) that positions the text in a high-dimensional space where proximity reflects semantic similarity. This is the core transformation that makes semantic search possible, and it's explained in more depth in A Framework for How Generative AI Works.

Model selection criteria

Not all embedding models are interchangeable. Evaluate on:

  • Domain fit: general-purpose models (OpenAI text-embedding-3-small, Cohere embed-english-v3, bge-large-en) work well for business text. Code, legal, biomedical, and multilingual content benefit from domain-specific or multilingual models.
  • Dimensionality vs. cost: higher dimensions capture more nuance but cost more to store and query. text-embedding-3-small at 1,536 dimensions outperforms many older 768-dimension models while being cheaper per token.
  • Asymmetric vs. symmetric search: if your queries are short questions and your documents are long answers, use a model trained for asymmetric retrieval (e.g., msmarco fine-tunes, or Cohere's search_document / search_query input types).

Consistency discipline

The same model must generate embeddings for both your stored documents and your incoming queries. Switching models after you've indexed content means re-embedding everything. Version-lock your embedding model in production the same way you'd pin a dependency.
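A simple guard can enforce this at query time. The sketch below assumes you record which model built the index; EmbeddingSpec, ModelMismatchError, and check_query_vector are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str
    dimensions: int


class ModelMismatchError(RuntimeError):
    """Raised when a query was embedded differently from the indexed documents."""


def check_query_vector(index_spec, query_spec, query_vector):
    """Refuse to search if the query embedding does not match the index."""
    if query_spec != index_spec:
        raise ModelMismatchError(
            f"index built with {index_spec}, query embedded with {query_spec}"
        )
    if len(query_vector) != index_spec.dimensions:
        raise ModelMismatchError(
            f"expected {index_spec.dimensions} dimensions, got {len(query_vector)}"
        )
    return query_vector


# Stored once when the index is built, then treated like a pinned dependency.
INDEX_SPEC = EmbeddingSpec(model="text-embedding-3-small", dimensions=1536)
```

Failing loudly here is a deliberate choice: a mismatched model usually still returns results, just worse ones, which is much harder to notice.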

Stage 3 — Store: Choose Your Vector Database

A vector database stores embeddings alongside their metadata and exposes APIs for approximate nearest-neighbor (ANN) search. Choosing one is less important than understanding what each class of option trades off.

Categories of vector stores

  • Managed cloud services (Pinecone, Weaviate Cloud, Zilliz): least operational overhead, per-query pricing, fast to start
  • Open-source self-hosted (Qdrant, Weaviate, Chroma, Milvus): full control, no per-query cost at scale, requires ops investment
  • Vector extensions on existing databases (pgvector on PostgreSQL, Redis Vector, Elasticsearch dense_vector): ideal when you already have a relational or document store and want to avoid a new system

The right choice depends on your existing infrastructure, expected query volume, and whether your team has bandwidth to manage another service. For most agencies building client solutions, a managed service for prototyping and a self-hosted option when usage justifies it is the standard progression.

Stage 4 — Index: Configure Your Search Index

Indexing is where you configure how the vector database organizes embeddings to make ANN search fast. Getting this wrong doesn't break search — it makes it slow or less accurate at scale.

ANN algorithms and parameters

The three algorithms you'll encounter most:

  • HNSW (Hierarchical Navigable Small World): the default in most systems. High recall, fast queries, higher memory usage. Tune ef_construction (build quality; higher values give better recall but slower builds) and M (connections per node; 16–64 is a typical range).
  • IVF (Inverted File Index): partitions vectors into clusters. More memory-efficient than HNSW for very large datasets. Tune nlist (number of clusters) and nprobe (clusters searched per query).
  • Flat index: exact search, no approximation. Use only for datasets under ~100K vectors where the latency budget allows.

Most practitioners start with HNSW defaults and only tune when they have recall or latency measurements that justify it.
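To show why nprobe trades recall for speed, here is a toy IVF index in plain Python. It is a sketch only: the centroids are hand-picked rather than learned with k-means, the data is two-dimensional, and the function names build_ivf and ivf_search are made up for this example.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector id to its nearest centroid: one inverted list per cluster."""
    lists = [[] for _ in centroids]
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe=1, k=3):
    """Scan only the `nprobe` clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vec_id for c in order[:nprobe] for vec_id in lists[c]]
    candidates.sort(key=lambda vec_id: l2(query, vectors[vec_id]))
    return candidates[:k]


vectors = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 1.0), (0.45, 0.5)]
centroids = [(0.0, 0.0), (1.0, 1.0)]  # nlist = 2; hand-picked for the example
lists = build_ivf(vectors, centroids)

query = (0.6, 0.6)
print(ivf_search(query, vectors, centroids, lists, nprobe=1, k=2))  # one cluster scanned
print(ivf_search(query, vectors, centroids, lists, nprobe=2, k=2))  # both clusters scanned
```

In this data the true nearest neighbor of the query (vector 4) sits just inside the other cluster, so nprobe=1 misses it and nprobe=2 finds it. That is the kind of recall loss you would measure before tuning.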

Distance metrics

Choose the distance metric that matches how your embedding model was trained:

  • Cosine similarity: most common; measures angle between vectors, invariant to magnitude
  • Dot product: produces the same ranking as cosine similarity when vectors are normalized to unit length; also used for models that encode relevance scores in magnitude
  • Euclidean (L2): less common for text embeddings but used in some image and multimodal systems

Using the wrong metric degrades recall in ways that look like an embedding quality problem. Check your model's documentation.
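The three metrics can disagree about which vector is "closest". The following sketch uses tiny hand-made vectors (assumptions, not real embeddings) to show the disagreement and the equivalence of dot product and cosine after normalization.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the magnitude
c = [4.0, 3.0]  # similar magnitude to a, different direction

# Cosine ignores magnitude, so b looks identical in direction to a.
print(cosine_similarity(a, b), cosine_similarity(a, c))
# Euclidean distance is sensitive to magnitude, so c looks closer to a than b does.
print(euclidean_distance(a, b), euclidean_distance(a, c))
# After unit normalization, dot product and cosine similarity agree.
print(dot(normalize(a), normalize(c)), cosine_similarity(a, c))
```

The same pair of candidates is ranked differently depending on the metric, which is why a mismatch can look like a model quality problem.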

Stage 5 — Decompose: Design Your Query Pipeline

A single vector similarity search is rarely enough for production. The Decompose stage is about building the query logic that sits between a user's input and the vector database.

Query transformation

Raw user queries are often ambiguous, conversational, or too short to embed well. Techniques that improve retrieval:

  • HyDE (Hypothetical Document Embeddings): use a language model to generate a hypothetical ideal answer to the query, embed that, then search with it. Works surprisingly well for knowledge-base retrieval.
  • Query expansion: generate 2–4 rephrasings of the query, embed each, and merge result sets. Increases recall at the cost of latency.
  • Step-back prompting: transform a specific question into a more general one to find background context that helps answer it.
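Query expansion, for instance, can be sketched as a loop over query variants that keeps each document's best score. The rephrase and search callables are placeholders for a language model call and a vector store client; the stubs and scores below are invented so the example runs on its own.

```python
def expanded_search(query, rephrase, search, k=5):
    """Search with the original query plus its rephrasings; keep each doc's best score."""
    best = {}
    for variant in [query, *rephrase(query)]:
        for doc_id, score in search(variant):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    # Highest score first; ties broken by doc id for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical stand-ins for an LLM rephraser and a vector search call.
def fake_rephrase(query):
    return ["refund timeline", "how long do refunds take"]


FAKE_RESULTS = {
    "refund policy": [("doc_1", 0.71), ("doc_2", 0.64)],
    "refund timeline": [("doc_3", 0.80), ("doc_1", 0.69)],
    "how long do refunds take": [("doc_3", 0.77), ("doc_4", 0.58)],
}


def fake_search(query):
    return FAKE_RESULTS.get(query, [])


results = expanded_search("refund policy", fake_rephrase, fake_search, k=3)
print(results)
```

Each variant costs one more embedding call and one more search, which is where the added latency comes from.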

Hybrid search and re-ranking

Vector similarity alone has weaknesses: it struggles with exact matches (product codes, proper nouns, legal citations), and it can surface topically adjacent but irrelevant content. Combining vector search with BM25 keyword search — known as hybrid search — addresses both. Most production RAG pipelines use hybrid search by default.
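Hybrid search needs a way to merge the two ranked lists. The text above does not prescribe one; reciprocal rank fusion (RRF) is a common choice and is assumed here. The constant k=60 is a conventional default, and the document ids are invented for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc_a", "doc_b", "doc_c"]    # from vector similarity search
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # from BM25 keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF only uses rank positions, so it sidesteps the problem that BM25 scores and vector similarity scores live on different scales.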

After retrieval, a cross-encoder re-ranker (Cohere Rerank, bge-reranker, or a fine-tuned model) reads the query and each candidate chunk together and scores true relevance. Re-ranking the top 20–50 vector results down to the top 5–10 is one of the highest-leverage improvements you can make to retrieval quality. This composable approach is the same principle described in the How Generative AI Works best practices article, applied here to retrieval.

Metadata filtering

Use metadata to filter candidates either before vector search (pre-filtering) or after it (post-filtering). Restricting a search to documents published after a certain date, belonging to a specific client, or tagged with a specific category dramatically improves precision without touching the embedding logic.
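A pre-filter can be sketched as a predicate applied to metadata before any similarity is computed. The record layout, client names, and the filtered_search function are assumptions for this example; real vector stores expose their own filter syntax.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filtered_search(query_vec, records, where, k=3):
    """Pre-filter: keep records whose metadata passes `where`, then rank by cosine."""
    candidates = [r for r in records if where(r["meta"])]
    candidates.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return [r["id"] for r in candidates[:k]]


records = [
    {"id": "a", "vector": [1.0, 0.0], "meta": {"client": "acme", "date": "2026-01-10"}},
    {"id": "b", "vector": [0.9, 0.1], "meta": {"client": "globex", "date": "2026-03-02"}},
    {"id": "c", "vector": [0.6, 0.8], "meta": {"client": "acme", "date": "2026-04-20"}},
]

# ISO 8601 date strings compare correctly as plain strings.
hits = filtered_search(
    [1.0, 0.0],
    records,
    where=lambda m: m["client"] == "acme" and m["date"] >= "2026-02-01",
)
print(hits)
```

Without the filter, records a and b would outrank c on similarity alone, even though neither belongs in the result for this client and date range.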

Stage 6 — Evaluate: Measure What Actually Matters

Most failed vector search projects fail here — not because evaluation is hard, but because it's skipped. Embedding search is not a set-and-forget operation.

Retrieval metrics

  • Recall@K: of the truly relevant documents, what fraction appear in the top K results? This is your primary metric.
  • Precision@K: of the K returned results, what fraction are actually relevant?
  • MRR (Mean Reciprocal Rank): how high does the first relevant result appear? Important when users act on the top result only.
  • NDCG (Normalized Discounted Cumulative Gain): weighted by position; the standard metric for search ranking systems.
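The four metrics above are short enough to implement directly. This is a minimal sketch assuming binary relevance (a document is either relevant or not); the function names and sample ids are illustrative.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over a list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(retrieved[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


retrieved = ["d3", "d1", "d7", "d2", "d9"]  # ranked results for one query
relevant = {"d1", "d2", "d4"}               # labeled relevant documents

print(recall_at_k(retrieved, relevant, 5))
print(precision_at_k(retrieved, relevant, 5))
print(reciprocal_rank(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant, 5))
```

Averaging each metric over every query in the labeled test set gives the numbers to track between changes.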

Building a test set

You need labeled query-document pairs to evaluate against. Sources: subject matter experts annotating 50–200 query-answer pairs, synthetic generation using a language model to generate questions from your own documents, or mining user interaction logs if you have them. Anything above 50 labeled examples gives you meaningful signal. The generative AI real-world examples article covers how similar evaluation loops work in generative contexts.

Iteration triggers

Run evaluation before shipping and after any change to chunking strategy, embedding model, index configuration, or query pipeline. Track metrics in a simple spreadsheet or experiment tracker. A 5-percentage-point change in Recall@5 is almost always meaningful and worth understanding before deploying.
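If metrics are tracked per run, flagging notable changes takes only a few lines. The sketch below assumes metrics are stored as fractions between 0 and 1; flag_changes, the 0.05 threshold (five percentage points), and the sample numbers are illustrative, not measured results.

```python
def flag_changes(baseline, candidate, threshold=0.05):
    """Return metrics whose value moved by at least `threshold` between two runs."""
    flagged = {}
    for name, old_value in baseline.items():
        if name in candidate:
            delta = candidate[name] - old_value
            if abs(delta) >= threshold:
                flagged[name] = round(delta, 4)
    return flagged


# Invented numbers for illustration only.
baseline = {"recall@5": 0.82, "mrr": 0.61}
candidate = {"recall@5": 0.74, "mrr": 0.63}

print(flag_changes(baseline, candidate))
```

Here the eight-point drop in Recall@5 is flagged while the two-point change in MRR is not, so the change would be investigated before deploying.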

Frequently Asked Questions

What is the difference between embeddings and vector search?

Embeddings are numerical representations of content — text, images, or other data — that encode semantic meaning as a list of numbers called a vector. Vector search is the process of querying a database of those vectors to find the ones most similar to a query vector. Embeddings make the representation; vector search makes the retrieval.

How many dimensions should my embedding vectors have?

Most current text embedding models output between 384 and 3,072 dimensions. Higher dimensionality generally captures more nuance but increases storage and query cost. For most business text applications, 768 to 1,536 dimensions offers a good balance; only go higher if retrieval benchmarks on your specific domain show a meaningful improvement.

Can I use embeddings and vector search without building a full RAG pipeline?

Yes. Semantic search over a document library, duplicate detection, clustering similar support tickets, or finding visually similar products are all standalone uses of embeddings and vector search that don't require connecting a generative model at the end.

What goes wrong most often in production vector search systems?

The two most common failure modes are poorly sized chunks (too large to embed a specific meaning, or too small to carry sufficient context) and missing evaluation (no labeled test set, so quality degradation goes undetected). A close third is embedding model drift: updating the model without re-indexing existing content.

How does this framework connect to RAG?

Retrieval-Augmented Generation (RAG) uses vector search as its retrieval layer: the pipeline fetches relevant chunks and passes them to a language model as context. The RESIDE framework covers the retrieval half of RAG in full. The generation half — prompting, output structuring, and grounding — is covered in the generative AI case study.

When should I fine-tune an embedding model vs. using a general-purpose one?

Start with a general-purpose model and evaluate on your domain. Fine-tuning becomes worth the investment when off-the-shelf models score below roughly 0.70 Recall@5 on your labeled test set, when you have highly specialized vocabulary (clinical, legal, proprietary technical terms), or when you're operating at a scale where retrieval quality has significant revenue impact.

Key Takeaways

  • The RESIDE framework gives you six distinct stages — Represent, Embed, Store, Index, Decompose, Evaluate — each with its own decisions and failure modes.
  • Chunking strategy and overlap settings in Stage 1 often have more impact on retrieval quality than switching embedding models.
  • Version-lock your embedding model; mixing model versions across indexed content and live queries silently degrades recall.
  • Hybrid search (vector + BM25) and cross-encoder re-ranking are the two highest-leverage improvements for moving from prototype to production quality.
  • You cannot manage what you don't measure — a 50-query labeled test set is the minimum viable evaluation setup, and it prevents invisible quality regression.
  • Metadata design is infrastructure, not an afterthought; it enables filtering, provenance, and auditing that pure vector similarity cannot provide.
