Re-Embedding Your Whole Corpus Is the Cost of Guessing Wrong

Agency Script Editorial

Editorial Team

May 11, 2026 · 11 min read

Choosing the wrong embedding model or vector database architecture is one of the most expensive early mistakes an AI team can make. The fix rarely costs just an afternoon — it usually means re-embedding your entire corpus, migrating data, and retuning retrieval pipelines at exactly the moment your team has momentum to lose. Getting the decision right upfront requires understanding what you're actually choosing between, not just which product has the best landing page.

This article is built for teams standing at that decision point: you understand roughly what embeddings are (dense numerical representations of meaning), you've started building or scoping a retrieval-augmented generation (RAG) system or semantic search feature, and you need a principled way to evaluate your options. We'll walk through the real axes of trade-off — not the marketing ones — and give you a concrete decision rule you can apply to your actual situation.

The payoff is avoiding the most common failure modes: paying for retrieval quality you don't need, scaling an architecture that can't handle your data volume, or locking into an embedding model whose behavior changes under your feet.


What You're Actually Choosing Between

Embeddings and vector search involve two separate decision trees that get collapsed into one in most guides. Keep them separate.

Embedding model decisions: Which model converts your content into vectors? Options range from general-purpose API models (OpenAI's text-embedding-3-small and text-embedding-3-large, Cohere's Embed v3, Google's text-embedding-004) to open-source models you host yourself (the bge family from BAAI, e5-mistral, nomic-embed-text). Dimension counts, max token windows, cross-lingual support, and fine-tunability vary enormously.

Vector store decisions: Where do those vectors live and how does similarity search happen? Options include dedicated vector databases (Pinecone, Weaviate, Qdrant, Milvus), vector extensions on relational databases (pgvector on Postgres), integrated solutions (Supabase, Neon with pgvector), and approximate nearest neighbor (ANN) libraries you wire up yourself (FAISS, HNSWlib).

These two choices interact — a high-dimension embedding model (3072 dimensions for OpenAI's large model) changes storage costs and query latency in your vector store — but they also have independent failure modes. Treat them that way.
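
To make the storage side of that interaction concrete, here is a minimal back-of-the-envelope sketch. It assumes 4-byte float32 values and counts raw vector storage only; index overhead, metadata, and replicas are ignored. The function names are illustrative, not from any library.

```python
def raw_vector_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (assumes float32 by default).

    Ignores index structures, metadata, and replication, all of which add more.
    """
    return n_vectors * dims * bytes_per_value


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


# Compare three common dimension counts at one million vectors.
for dims in (384, 1536, 3072):
    size = raw_vector_bytes(1_000_000, dims)
    print(f"{dims:>5} dims x 1M vectors: {to_gib(size):.2f} GiB raw")
```

Raw storage grows linearly with dimension count, so a 3072-dimension model needs eight times the raw space of a 384-dimension one for the same corpus.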


The Five Axes That Actually Matter

Before comparing specific products, get clear on which axes drive your situation. Most teams optimize on one or two while ignoring the others, then discover the others matter most in production.

1. Retrieval Quality vs. Speed

Higher-quality embedding models tend to produce more semantically nuanced representations. But they also tend to have larger dimension counts (1536–3072 vs. 384–768 for lighter models), which increases both index size and query latency. For a customer-facing search feature with sub-200ms requirements, a 384-dimension open-source model that runs locally often beats a cloud API model that adds 80–150ms of network round-trip on every query.

2. Corpus Size and Update Frequency

A static corpus of 10,000 documents behaves very differently from a live corpus of 10 million documents that changes hourly. At smaller scales, almost every solution works. At larger scales, you need to care about:

  • Index type: Flat (exact) vs. HNSW vs. IVF-PQ (each trades recall for speed differently)
  • Update patterns: HNSW is notoriously slow at deletes and updates compared to IVF indexes
  • Sharding: Pinecone and Qdrant handle this for you; self-hosted FAISS does not

3. Operational Burden

Managed cloud services (Pinecone, Weaviate Cloud) abstract away infrastructure but add cost and a dependency. Self-hosted options (Qdrant on a VM, Milvus on Kubernetes, pgvector on your existing Postgres) require operational ownership but give you more control and often lower unit costs at scale.

4. Embedding Stability

This one gets overlooked until it causes pain. If your embedding model updates — whether you pull a new checkpoint of an open-source model or a vendor silently updates their API — your existing vectors become misaligned with new vectors. Queries against a mixed-version index degrade in ways that are hard to debug. Always version-pin your embedding models and plan for periodic re-embedding as a maintenance task.
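
One way to enforce version pinning is to store the model identifier next to every vector and refuse to serve a mixed index. The sketch below is a minimal illustration of that idea; `EmbeddingRecord`, `assert_single_model`, and the identifier format are hypothetical names, not part of any vector database API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingRecord:
    doc_id: str
    vector: tuple[float, ...]
    model_id: str  # hypothetical pinned identifier, e.g. "my-embedder@v1"


class MixedModelError(ValueError):
    """Raised when an index contains vectors from more than one model version."""


def assert_single_model(records: list[EmbeddingRecord], expected_model_id: str) -> None:
    """Raise if any stored vector was produced by a different model version."""
    stale = [r.doc_id for r in records if r.model_id != expected_model_id]
    if stale:
        raise MixedModelError(
            f"{len(stale)} vectors need re-embedding with {expected_model_id}; "
            f"first few: {stale[:5]}"
        )
```

Running a check like this before switching query traffic to a new model turns silent drift into an explicit re-embedding task.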

5. Domain Specificity

General-purpose embedding models trained on web-scale text perform well on general queries. They underperform on highly specialized corpora: clinical notes, legal contracts, proprietary technical documentation, code. Fine-tuning a smaller open-source model on domain-specific pairs often outperforms a larger general-purpose API model in these cases — sometimes significantly — at a fraction of the cost per query.


Embedding Model Options and When to Choose Each

General-Purpose API Models

OpenAI text-embedding-3-small / 3-large: The safe default for most teams starting out. Strong benchmark performance across English and multilingual tasks. The small model (1536 dimensions, reducible to 256 via Matryoshka) is cost-effective for most use cases. The large model (3072 dimensions) earns its price only if you're seeing measurable retrieval quality gaps in evaluation.

Cohere Embed v3: Differentiated primarily by its input_type parameter — you specify whether text is a search query or a document at embedding time, which meaningfully improves retrieval quality. Worth evaluating if you're building query-document retrieval systems.

Google text-embedding-004: Strong multilingual performance. Reasonable choice if you're already operating in the Google Cloud ecosystem.

Open-Source / Self-Hosted Models

BGE (BAAI General Embedding) family: Strong performers on the MTEB leaderboard across multiple tasks. bge-m3 is a standout for multilingual use cases. License is permissive.

Nomic Embed Text: 8192-token context window, which matters for long-document use cases where chunking is lossy. Open weights, commercially usable.

E5-mistral-7b: High retrieval quality but at significant inference cost. Only sensible if you're running large-scale pipelines where per-query API costs outweigh self-hosting at scale.


Vector Store Options and When to Choose Each

Managed Cloud

Pinecone: The easiest path to production. Strong performance, good developer experience, generous free tier. Cost scales with index size and query volume in ways that can surprise teams — run unit economics at your expected scale before committing.

Weaviate Cloud: Adds hybrid search (BM25 + vector) out of the box, which is often the right retrieval strategy anyway (see below). More opinionated than Pinecone; the trade-off is more built-in functionality vs. more complexity to understand.

Self-Hosted Dedicated

Qdrant: Strong performance benchmarks, actively maintained, good Rust-based efficiency. Reasonable to run as a Docker container for small-to-medium workloads. Supports payload filtering natively, which matters if you need metadata-gated retrieval.

Milvus: Enterprise-grade, designed for very large-scale deployments. Overkill for most teams below tens of millions of vectors; worth considering above that threshold.

Existing Infrastructure Extensions

pgvector on Postgres: The pragmatic choice if you already operate Postgres. Performance is adequate for corpora under roughly 1–5 million vectors with HNSW indexing. The massive advantage is zero new infrastructure: no additional service to operate, monitor, or pay for. The article "A Framework for How Generative AI Works" covers this integration pattern in context.


Exact Search vs. Approximate Nearest Neighbor

Exact (flat/brute-force) search guarantees perfect recall but scales at O(n) per query, because it compares the query against every vector. It is practical only up to a few hundred thousand vectors unless you have significant compute.
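
For reference, exact search is short enough to write out in full. This is a minimal pure-Python sketch using cosine similarity; the function names are illustrative, and a real system would use vectorized math rather than Python loops.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(
    query: list[float], corpus: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Score every vector in the corpus (O(n) per query) and keep the best k."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in corpus.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]
```

Every query touches every vector, which is exactly the cost that ANN indexes are designed to avoid.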

ANN algorithms (HNSW, IVF-PQ, ScaNN) trade a small, tunable amount of recall for orders-of-magnitude faster search. The two dominant approaches:

HNSW (Hierarchical Navigable Small World): Excellent query speed, high recall at reasonable ef parameters. Poor at updates — if you delete or update vectors frequently, graph integrity degrades. Good for read-heavy, relatively static indexes.

IVF-PQ (Inverted File with Product Quantization): Better at handling large, dynamic indexes. More parameters to tune. Lower memory footprint than HNSW at large scale due to quantization.

Most managed vector databases handle these choices internally; the reason to understand them is knowing when to push back on defaults and how to interpret performance benchmarks.
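
When you interpret a benchmark or tune an ANN parameter, the quantity being traded away is usually recall@k: the fraction of the true top-k neighbors (from exact search) that the approximate index also returned. A minimal sketch, with an illustrative function name:

```python
def recall_at_k(exact_ids: list[str], approx_ids: list[str], k: int) -> float:
    """Fraction of the exact top-k results that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 1.0  # nothing to find, so nothing was missed
    found = truth & set(approx_ids[:k])
    return len(found) / len(truth)
```

Averaging this over a sample of real queries gives a recall figure you can compare across index types and parameter settings.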


Hybrid Search: When Pure Vector Isn't Enough

A common failure mode: teams build pure vector search expecting it to outperform keyword search in all cases. It doesn't. Vector search excels at semantic similarity — finding conceptually related content even without lexical overlap. Keyword search (BM25) excels at exact-match queries, proper nouns, product codes, and rare terms that embedding models may have seen infrequently.

Hybrid search, which combines BM25 and vector results (typically via Reciprocal Rank Fusion or a learned reranker), consistently outperforms either method alone across a wide range of retrieval tasks. It is well supported in Weaviate and Elasticsearch, and can be implemented on pgvector with some custom logic.

The article "The Best Tools for How Generative AI Works" catalogs several platforms that provide this hybrid retrieval layer natively.
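
Reciprocal Rank Fusion is simple enough to sketch in a few lines. It fuses ranked lists by position rather than by raw score, which sidesteps the problem that BM25 scores and cosine similarities live on different scales. The constant k=60 is the value commonly used in the literature; treat it as a tunable assumption. The function name and document ids are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc id for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["sku-42", "doc-a", "doc-b"]    # keyword ranking
vector_hits = ["doc-a", "doc-c", "sku-42"]  # semantic ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still stays in the candidate set.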

If you're building retrieval for a RAG pipeline specifically, also evaluate adding a cross-encoder reranker (Cohere Rerank, bge-reranker, cross-encoder/ms-marco) as a second pass over your top-k retrieved candidates. Rerankers are slower than embedding-based retrieval but significantly more accurate; used on a small candidate set (top 20–50), the latency cost is usually acceptable.
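
The two-stage shape can be sketched without committing to a particular reranker. In the sketch below, `cheap_score` stands in for embedding similarity and `rerank_score` for a cross-encoder; both are hypothetical callables you would supply, and no real library API is implied.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    query: str,
    candidates: Sequence[str],
    cheap_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    first_pass_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Take the top first_pass_k by the cheap scorer, then reorder only those
    with the expensive scorer and return the best final_k."""
    first_pass = sorted(
        candidates, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:first_pass_k]
    reranked = sorted(
        first_pass, key=lambda doc: rerank_score(query, doc), reverse=True
    )
    return reranked[:final_k]
```

The expensive scorer runs at most `first_pass_k` times per query, which is what keeps the added latency bounded.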


The Decision Rule

Reduce your situation to three questions:

1. What is your corpus size and update pattern?

  • Under 1M vectors, static or slow-moving → pgvector with HNSW, or any managed service
  • 1M–10M vectors, moderate updates → Qdrant or Pinecone
  • 10M+ vectors or high-frequency updates → Milvus, Weaviate, or a purpose-designed pipeline

2. Do you need domain-specific performance?

  • General content, English-primary → Start with text-embedding-3-small or bge-base-en-v1.5
  • Specialized domain or multilingual → Evaluate fine-tuned open-source models; run MTEB-style evals on your own data
  • Long documents (meaningfully longer than 512 tokens) → nomic-embed-text, or review your chunking strategy

3. What is your operational capacity?

  • No dedicated ML/infra engineers → Managed cloud (Pinecone, Weaviate Cloud) + API embedding model
  • Existing Postgres, small corpus → pgvector; don't add complexity you don't need
  • Engineering capacity, cost sensitivity at scale → Self-hosted Qdrant or Milvus + open-source embedding
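
Questions 1 and 3 above can be written down as a small lookup, which is handy for putting the rule in a design doc or a test. This is a sketch of the rule as stated, not a substitute for benchmarking. The function names are illustrative, and the order in which conditions are checked, plus the use of 1M vectors as the "small corpus" cutoff in the second function, are assumptions.

```python
def suggest_vector_store(n_vectors: int, high_frequency_updates: bool = False) -> str:
    """Question 1: corpus size and update pattern."""
    if n_vectors >= 10_000_000 or high_frequency_updates:
        return "Milvus, Weaviate, or a purpose-designed pipeline"
    if n_vectors >= 1_000_000:
        return "Qdrant or Pinecone"
    return "pgvector with HNSW, or any managed service"


def suggest_operating_model(
    has_infra_engineers: bool, runs_postgres: bool, n_vectors: int
) -> str:
    """Question 3: operational capacity.

    Assumption: lack of infra engineers is checked first, and "small corpus"
    means under 1M vectors.
    """
    if not has_infra_engineers:
        return "Managed cloud (Pinecone, Weaviate Cloud) + API embedding model"
    if runs_postgres and n_vectors < 1_000_000:
        return "pgvector; don't add complexity you don't need"
    return "Self-hosted Qdrant or Milvus + open-source embedding"
```

Question 2 is left out deliberately: domain fit is something to measure with evals on your own data, not something to look up.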

The article "Case Study: How Generative AI Works in Practice" shows how this decision tree plays out in a real deployment, including the places teams typically have to revise their initial choices.


Frequently Asked Questions

How many vectors can pgvector handle before it struggles?

With HNSW indexing, pgvector performs adequately up to roughly 1–5 million vectors on modern hardware, depending on dimension count and query throughput requirements. Beyond that range — or if you need sub-10ms P95 latency at high concurrency — a dedicated vector database typically pulls ahead. The threshold isn't hard; benchmark against your actual workload before migrating.

Does it matter which embedding model I use as long as I'm consistent?

Consistency within an index is essential — all your vectors must come from the same model version. But the choice of model materially affects retrieval quality, especially on domain-specific content. Don't assume a larger or more expensive model automatically wins; run retrieval evals on a sample of your actual queries before committing.

What is Matryoshka embedding and should I use it?

Matryoshka Representation Learning trains embeddings so that the first N dimensions are themselves a meaningful lower-dimensional embedding. OpenAI's text-embedding-3 models support this, letting you shorten text-embedding-3-small, for example, from 1536 to 256 dimensions with modest quality loss. Useful when storage or latency costs are constraining; run quality benchmarks at your target dimension before deploying at scale.
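
If you shorten vectors yourself rather than asking the API for a smaller size (OpenAI exposes a `dimensions` request parameter for this), the operation is to keep the first N values and re-normalize to unit length. A minimal sketch with an illustrative function name; it is only meaningful for models that were trained with Matryoshka-style objectives.

```python
import math


def truncate_and_renormalize(vector: list[float], target_dims: int) -> list[float]:
    """Keep the first target_dims values and rescale to unit L2 norm.

    Only valid for embeddings trained so that prefixes are meaningful
    (Matryoshka-style); truncating other embeddings discards information arbitrarily.
    """
    if not 0 < target_dims <= len(vector):
        raise ValueError("target_dims must be between 1 and len(vector)")
    head = vector[:target_dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return list(head)
    return [x / norm for x in head]
```

Re-normalizing matters because cosine and dot-product comparisons assume unit-length vectors, and a truncated prefix is no longer unit length.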

When should I add a reranker to my retrieval pipeline?

Add a reranker when retrieval quality matters more than latency and you can afford a second-pass computation over your top-k candidates. Typical triggers: your RAG pipeline produces good retrieved documents but the LLM still generates poor answers (often a relevance ordering problem), or evaluations show your top-1 precision is low. The "How Generative AI Works Checklist for 2026" includes a retrieval evaluation framework that helps identify when reranking is the right fix.

Is it better to use one large chunk or many small chunks when embedding?

Neither extreme is optimal. Large chunks (>512 tokens) lose precision because the embedding averages over too much content. Small chunks (<50 tokens) lose context and produce noisy embeddings. A practical default: 256–512 token chunks with 10–20% overlap. Then test retrieval quality on your actual query distribution — chunk size is often the highest-leverage variable to tune.
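
A sliding-window chunker with overlap is a reasonable starting point for that default. The sketch below works on a pre-tokenized list; how you tokenize is an assumption left to you (whitespace splitting is a rough stand-in, and a model's own tokenizer will count differently). The function name and defaults are illustrative.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 384, overlap_ratio: float = 0.15
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # Each window starts this many tokens after the previous one.
    step = max(1, chunk_size - int(chunk_size * overlap_ratio))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks
```

Keeping `chunk_size` and `overlap_ratio` as parameters makes it cheap to sweep a few settings and compare retrieval quality on your own queries.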

What is the real cost difference between API embeddings and self-hosting?

At low volume (under a few million embeddings per month), API models are almost always cheaper when you factor in engineering and infrastructure time. At high volume or real-time embedding of user-generated content, self-hosting a model like bge-base on a GPU instance typically reaches break-even in the range of tens of millions of tokens per day. Run the math for your specific throughput; don't assume either direction without numbers.
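
The math itself is a one-liner once you have your own numbers. The sketch below takes every price as an input; nothing in it is a quoted vendor price, and the function names are illustrative. It also assumes the self-hosted setup can keep up with your throughput, which you would need to verify separately.

```python
def api_cost_per_day(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Daily spend on an embedding API at a given per-token price."""
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


def self_host_cost_per_day(
    gpu_usd_per_hour: float, gpus: int = 1, eng_usd_per_day: float = 0.0
) -> float:
    """Daily cost of always-on GPU instances plus amortized engineering time."""
    return gpu_usd_per_hour * 24 * gpus + eng_usd_per_day


def break_even_tokens_per_day(
    usd_per_million_tokens: float,
    gpu_usd_per_hour: float,
    gpus: int = 1,
    eng_usd_per_day: float = 0.0,
) -> float:
    """Daily token volume above which self-hosting is cheaper than the API."""
    if usd_per_million_tokens <= 0:
        raise ValueError("usd_per_million_tokens must be positive")
    daily = self_host_cost_per_day(gpu_usd_per_hour, gpus, eng_usd_per_day)
    return daily / usd_per_million_tokens * 1_000_000
```

Including `eng_usd_per_day` matters: at low volume the engineering time, not the GPU, is usually what makes self-hosting the more expensive option.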


Key Takeaways

  • Embeddings and vector search are two separate decision trees with independent trade-offs; keep them distinct.
  • The five axes that matter: retrieval quality vs. speed, corpus size and update frequency, operational burden, embedding stability, and domain specificity.
  • pgvector is the right default for small corpora on existing Postgres infrastructure; dedicated vector databases earn their place above 1–5M vectors.
  • Hybrid search (vector + BM25) consistently outperforms pure vector retrieval; default to it unless you have a specific reason not to.
  • Version-pin your embedding models — silent updates cause index drift that degrades retrieval quality in hard-to-diagnose ways.
  • Run evals on your actual query distribution, not synthetic benchmarks; the winning model and architecture depend on your data, not the leaderboard.
  • Rerankers are often the highest-leverage retrieval improvement for RAG pipelines when baseline retrieval is functional but ordering quality is poor.
