
Answering Questions About Your Own Internal Documents

Agency Script Editorial

Editorial Team

May 7, 2026 · 11 min read

If you've ever wondered how a chatbot can answer questions about your company's internal documents, or how a search bar can surface a relevant result even when the user's words don't match the document's words, you're looking at embeddings and vector search. These two technologies are the quiet infrastructure behind most of the "smart" AI features that agencies and operators are building right now.

This article gives you the fastest credible path from zero to a working mental model and a first real result. You don't need a machine learning background. You do need to understand what these tools actually do, why they work better than keyword search for certain problems, and what a minimal implementation looks like so you can evaluate whether it belongs in your stack. By the end, you'll be able to scope a real use case, pick the right tools, and avoid the three or four mistakes that trip up most first-timers.

The context matters: embeddings and vector search don't live in isolation. They're usually one component in a larger AI system — most commonly a retrieval-augmented generation (RAG) pipeline, where you retrieve relevant context and feed it to a language model. If you're still forming your mental model of how those language models work, Getting Started with How Generative AI Works is a useful complement to this article. Read both together and the architecture will click faster.


What an Embedding Actually Is

An embedding is a list of numbers — a vector — that represents the meaning of a piece of text (or an image, or audio, depending on the model). The vector for "dog" and the vector for "puppy" will be close together in that numerical space. The vector for "dog" and the vector for "invoice" will be far apart.

The length of that list is the embedding's dimensionality. Common embedding models produce vectors with 384, 768, 1536, or 3072 dimensions. Higher isn't always better — it depends on the model's training quality and your storage constraints.
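To put the storage side of that trade-off in numbers, here is a back-of-the-envelope sketch. It assumes each value is stored as a 4-byte float32 and counts only the raw vectors, not index overhead or metadata; the function name is made up for illustration.

```python
def raw_vector_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone (no index overhead, no metadata)."""
    return num_vectors * dimensions * bytes_per_value


# One million chunks at the smallest and largest dimensionalities mentioned above.
for dims in (384, 3072):
    size_gb = raw_vector_storage_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dimensions: {size_gb:.2f} GB")
```

Under these assumptions, going from 384 to 3072 dimensions multiplies raw storage by eight, which is worth knowing before you default to the largest model.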

Why Meaning, Not Keywords

Traditional keyword search works by matching tokens. If a user searches "vehicle accident report" and your document says "car crash incident," a keyword index misses it. An embedding model trained on large text corpora understands that these phrases are semantically close and places their vectors near each other. That's the core value proposition.

The Model Does the Heavy Lifting

You don't write the logic that maps meaning to numbers. A pre-trained embedding model does it. You send text in; you get a vector out. The quality of that mapping depends entirely on the model you choose and how well its training data matches your domain. A general-purpose model works well for most English-language content. A model fine-tuned on medical or legal text will outperform it in those domains.


What Vector Search Does

Once you have vectors, you need to search them efficiently. Vector search, which in production usually means approximate nearest neighbor (ANN) search, answers the question: given a query vector, which stored vectors are most similar to it?

Similarity is usually measured with cosine similarity or dot product. Cosine similarity ranges from -1 to 1, where 1 means the two vectors point in the same direction. As a rough guide, a score above 0.80 typically indicates a strong semantic match, 0.60–0.80 is a reasonable candidate, and below 0.60 is usually noise. These thresholds vary by model and domain, so calibrate them against your own test queries rather than treating them as fixed.
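A minimal sketch of both measures in plain Python. The three-dimensional vectors are hand-made stand-ins chosen for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (illustrative values, not output from any real model).
dog = [0.9, 0.8, 0.1]
puppy = [0.8, 0.9, 0.2]
invoice = [0.1, 0.2, 0.9]

print("dog vs puppy:  ", round(cosine_similarity(dog, puppy), 3))
print("dog vs invoice:", round(cosine_similarity(dog, invoice), 3))
```

When vectors are normalized to unit length, the dot product and cosine similarity give the same number, which is why many vector stores let you pick either.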

Why Not Just Use a Regular Database?

A relational database with no vector index can't efficiently compute similarity across millions of high-dimensional vectors: every query becomes a brute-force comparison against every row. Vector databases, vector search libraries, and extensions such as pgvector use index structures, most commonly HNSW (Hierarchical Navigable Small World) graphs and IVF (inverted file) indexes, that dramatically reduce the number of comparisons needed, so a search across millions of vectors returns results in milliseconds.

Exact vs. Approximate

Most production vector search is approximate: it finds vectors that are very likely, but not guaranteed, to be the nearest neighbors. The trade-off between recall (how often the true nearest neighbors are found) and speed is configurable. For most agency use cases, such as document search, product recommendations, and FAQ matching, approximate search is more than sufficient.
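The trade-off can be seen in a toy IVF-style index. This is a sketch under simplifying assumptions, not how any particular library implements it: the vectors are 2-D, the two centroids are hard-coded (a real index would learn them from the data, typically with k-means), and distance is Euclidean. `nprobe`, the number of buckets searched, is the recall-versus-speed dial.

```python
import math

# Toy 2-D "embeddings" keyed by id (illustrative values only).
vectors = {
    "a": (0.10, 0.20), "b": (0.20, 0.10), "c": (0.45, 0.50),
    "d": (0.90, 0.80), "e": (0.80, 0.90), "f": (0.55, 0.50),
}

# Hard-coded centroids; a real IVF index would learn these from the data.
centroids = [(0.2, 0.2), (0.8, 0.8)]


def exact_search(query):
    """Brute force: compare the query against every stored vector."""
    best = min(vectors, key=lambda vid: math.dist(query, vectors[vid]))
    return best, len(vectors)


def build_buckets():
    """Assign each vector to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        buckets[nearest].append(vid)
    return buckets


buckets = build_buckets()


def ivf_search(query, nprobe=1):
    """Approximate: scan only the nprobe buckets whose centroids are closest."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [vid for i in order[:nprobe] for vid in buckets[i]]
    best = min(candidates, key=lambda vid: math.dist(query, vectors[vid]))
    return best, len(candidates)


query = (0.48, 0.56)
print(exact_search(query))          # scans all 6 vectors
print(ivf_search(query, nprobe=1))  # scans one bucket, may miss the true nearest
print(ivf_search(query, nprobe=2))  # scans both buckets, matches exact search
```

Here the query sits just across a bucket boundary from its true nearest neighbor, so `nprobe=1` returns a close-but-not-closest vector after scanning half as many vectors, while `nprobe=2` recovers the exact answer.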


Prerequisites Before You Write a Line of Code

Rushing into implementation without these in place is how you waste a week.

  • A clear retrieval task. You need a specific question: "Find the three most relevant support articles for this customer query." Vague goals produce architectures that don't work.
  • A representative sample of your data. At minimum 50–100 documents or records that reflect real-world variety. You'll use these to test whether your retrieval is actually returning useful results.
  • An evaluation method. Before indexing anything, decide how you'll judge quality. The simplest method: write 10–20 test queries where you already know what the correct result should be. This is your ground truth. Without it, you're flying blind.
  • A rough sense of data volume. Under 10,000 documents, you can use in-memory libraries and free tiers of vector databases. Over 100,000, you'll want a hosted vector database with proper indexing. Over 1 million, storage and query costs matter.

Choosing an Embedding Model

Three realistic options for most practitioners:

OpenAI's text-embedding-3-small and text-embedding-3-large. The small model produces 1536-dimensional vectors and costs roughly $0.02 per million tokens, cheap enough that cost rarely matters for small to mid-size datasets. The large model produces 3072-dimensional vectors and adds precision at roughly 6.5x the cost (about $0.13 per million tokens); check current pricing before you budget. Good default starting point if you're already using OpenAI.

Sentence Transformers (open-source). The all-MiniLM-L6-v2 model is fast, lightweight (384 dimensions), and runs locally at zero API cost. Quality is lower than the OpenAI large model but often sufficient for internal tools. Good choice if you have data privacy constraints or want to control your stack.

Cohere Embed. Strong multilingual support. If your content isn't primarily English, Cohere Embed v3 is worth evaluating before defaulting to OpenAI.
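For the API-priced options, a quick estimate shows why cost rarely decides the choice at small scale. This sketch uses the $0.02 per million tokens figure quoted above for text-embedding-3-small; the corpus size and average chunk length are assumed numbers for illustration, and prices change, so treat the rate as a parameter.

```python
def embedding_cost_usd(num_chunks, avg_tokens_per_chunk, usd_per_million_tokens):
    """One-time cost to embed a corpus at a flat per-token price."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed corpus: 50,000 chunks averaging 400 tokens each.
cost = embedding_cost_usd(50_000, 400, usd_per_million_tokens=0.02)
print(f"${cost:.2f}")
```

Remember that every re-embedding pass over changed content and every incoming query adds to this, so the one-time figure is a floor rather than a total.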

The single most important thing to remember: use the same model to embed your documents and your queries. Mixing models — even different versions of the same model family — breaks the comparison and produces garbage results.
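One way to make that rule hard to break is to record the model name with the index and reject anything embedded with a different one. The `VectorIndex` class below is a hypothetical in-memory sketch, not the API of any vector store; with a real store you could keep the model name in your own configuration and run the same check before every insert and query.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Hypothetical in-memory index that remembers which model produced its vectors."""
    model_name: str
    dimensions: int
    vectors: dict = field(default_factory=dict)

    def _check(self, vector, model_name):
        if model_name != self.model_name:
            raise ValueError(f"index uses {self.model_name!r}, got vector from {model_name!r}")
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")

    def add(self, item_id, vector, model_name):
        self._check(vector, model_name)
        self.vectors[item_id] = vector

    def check_query(self, vector, model_name):
        """Call before searching so a mismatched query fails loudly."""
        self._check(vector, model_name)


# Three dimensions only to keep the toy readable.
index = VectorIndex(model_name="text-embedding-3-small", dimensions=3)
index.add("doc-1", [0.1, 0.2, 0.3], model_name="text-embedding-3-small")

try:
    index.check_query([0.1, 0.2, 0.3], model_name="all-MiniLM-L6-v2")
except ValueError as err:
    print(err)
```

Checking the name matters because a dimension check alone is not enough: two different models can produce vectors of the same length that are still not comparable.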


Choosing a Vector Store

Your choice depends on scale and infrastructure preferences:

  • Chroma — Local-first, open-source, minimal setup. Best for prototyping and internal tools under ~100K documents. Python-native.
  • Pinecone — Hosted, fully managed. Good default for production with a generous free tier. No infrastructure to manage.
  • Weaviate — Open-source with a managed cloud option. Supports hybrid search (keyword + vector) natively, which matters when you want to combine semantic and exact matching.
  • pgvector — A PostgreSQL extension. If you're already running Postgres, pgvector adds vector similarity search without adding a new database to your stack. Excellent for teams that want to minimize infrastructure complexity.
  • Qdrant — Open-source, Rust-based, fast. Good for self-hosted production deployments that need performance and filtering together.

For a first real result, use Chroma locally or Pinecone's free tier. Either gets you to a working prototype in an afternoon.


Building Your First Retrieval Pipeline: The Five Steps

This is the minimal working pattern.

Step 1: Chunk Your Documents

Embedding models have token limits, typically 512–8192 tokens, and some are lower: all-MiniLM-L6-v2, for example, truncates input beyond 256 word pieces by default, so check your model's limit before you settle on a chunk size. More importantly, smaller chunks improve retrieval precision: a 300-token chunk about a specific policy is more useful to retrieve than a 3,000-token document that mentions the policy twice. A reasonable starting point is 300–500 tokens per chunk (less if your model's limit requires it), with a 50-token overlap between consecutive chunks to preserve context across boundaries.
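A minimal sketch of fixed-size chunking with overlap. It works on any list of tokens; the example uses placeholder words as a rough stand-in for model tokens, which is an assumption, since real tokenizers split text differently and you would count tokens with your model's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder "document" of 1,000 tokens.
tokens = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, chunk_size=400, overlap=50)
print([len(c) for c in chunks])  # [400, 400, 300]
```

Splitting on token counts alone can cut sentences in half; splitting on paragraph or heading boundaries first and then applying a size cap is a common refinement.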

Step 2: Embed Each Chunk

Pass each chunk through your chosen model. Store the resulting vector alongside metadata: source document name, chunk index, original text, any relevant filters (date, category, client ID). The metadata is what lets you show users a useful result, not just a similarity score.
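The shape of a stored record might look like the sketch below. `toy_embed` is a placeholder so the example runs without a model or API key: it is a hashed bag-of-words and carries no semantic meaning, so swap in a call to your chosen embedding model. The `ChunkRecord` fields and the `source#index` id scheme are illustrative choices, not a required schema.

```python
import zlib
from dataclasses import dataclass


def toy_embed(text, dimensions=8):
    """Placeholder for a real embedding model: hashed word counts, NOT semantic."""
    vector = [0.0] * dimensions
    for word in text.lower().split():
        vector[zlib.crc32(word.encode("utf-8")) % dimensions] += 1.0
    return vector


@dataclass
class ChunkRecord:
    chunk_id: str
    vector: list
    text: str         # original text, so results can be shown to the user
    source: str       # source document name
    chunk_index: int  # position of the chunk within the source
    category: str     # example of a filterable field


def embed_chunks(source, chunks, category):
    """Embed each chunk and attach the metadata needed to display and filter it."""
    return [
        ChunkRecord(f"{source}#{i}", toy_embed(chunk), chunk, source, i, category)
        for i, chunk in enumerate(chunks)
    ]


records = embed_chunks(
    "refund-policy.md",
    ["Refunds are issued within 14 days.", "Contact support to start a refund."],
    category="policy",
)
print(records[1].chunk_id, records[1].text)
```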

Step 3: Index and Store

Insert vectors and metadata into your vector store. Most libraries make this a single function call. At this stage, your index is static — you'll need to re-embed and re-index when source documents change. Build a re-indexing workflow from the start; retrofitting it later is painful.
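A re-indexing workflow can start as a simple diff: store a content hash next to each chunk, and on the next run re-embed only what changed. This is a sketch of one possible approach; the function names and the chunk-id format are assumptions for illustration.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed_hashes, current_chunks):
    """Compare stored hashes (chunk_id -> hash) with current text (chunk_id -> text).

    Returns (ids to embed or re-embed, ids to delete from the index).
    """
    to_embed = [
        chunk_id for chunk_id, text in current_chunks.items()
        if indexed_hashes.get(chunk_id) != content_hash(text)
    ]
    to_delete = [chunk_id for chunk_id in indexed_hashes if chunk_id not in current_chunks]
    return to_embed, to_delete


indexed = {
    "policy#0": content_hash("Refunds within 14 days."),
    "policy#1": content_hash("Contact support."),
    "old-faq#0": content_hash("Outdated answer."),
}
current = {
    "policy#0": "Refunds within 30 days.",  # edited
    "policy#1": "Contact support.",         # unchanged
    "pricing#0": "Plans start at $10.",     # new
}
print(plan_reindex(indexed, current))
```

Unchanged chunks are skipped entirely, which also keeps embedding costs down when only a few documents change between runs.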

Step 4: Embed the Query and Search

When a query arrives, embed it with the same model. Run a similarity search against your index, requesting the top-k results (k = 3 to 5 is a reasonable starting point). Return the chunks with the highest similarity scores.
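Stripped of any database, the search step is "score every stored vector against the query and keep the best k". The sketch below does exactly that by brute force over hand-made toy vectors; a vector store does the same job through an index instead of a full scan. The chunk ids and vector values are illustrative.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, index, k=3):
    """Return the k highest-scoring (score, chunk_id) pairs, best first."""
    scored = ((cosine_similarity(query_vector, vector), chunk_id) for chunk_id, vector in index)
    return heapq.nlargest(k, scored)


# Toy index of (chunk_id, vector) pairs.
index = [
    ("refunds", [0.9, 0.1, 0.0]),
    ("shipping", [0.1, 0.9, 0.1]),
    ("careers", [0.0, 0.1, 0.9]),
    ("returns", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # would come from embedding the user's query

for score, chunk_id in top_k(query_vector, index, k=2):
    print(f"{chunk_id}: {score:.3f}")
```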

Step 5: Evaluate Against Your Ground Truth

Run your 10–20 test queries. Check whether the right chunks appear in the top-3 results. Common failure modes: wrong chunk size, mismatch between the way users phrase queries and the way documents are written, or an embedding model that doesn't fit your domain. Each failure mode has a specific fix, which is why having ground truth before you start matters.
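That check is straightforward to automate. The sketch below computes the share of test queries whose expected chunk shows up in the top k, and lists the misses so you can inspect them. `fake_search` and its canned results are hypothetical stand-ins for your real retrieval function.

```python
def recall_at_k(ground_truth, search, k=3):
    """Return (share of queries whose expected chunk is in the top k, missed queries)."""
    misses = [
        query for query, expected in ground_truth.items()
        if expected not in search(query, k)
    ]
    return 1 - len(misses) / len(ground_truth), misses


# Ground truth: query -> id of the chunk that should be retrieved.
ground_truth = {
    "how do I get a refund": "refund-policy#0",
    "reset my password": "account-help#2",
    "shipping times to canada": "shipping#1",
    "cancel subscription": "billing#3",
}

# Canned results standing in for a real search call.
canned = {
    "how do I get a refund": ["refund-policy#0", "billing#1", "shipping#4"],
    "reset my password": ["account-help#0", "account-help#2", "security#1"],
    "shipping times to canada": ["shipping#0", "shipping#2", "shipping#1"],
    "cancel subscription": ["account-help#1", "billing#0", "billing#1", "billing#3"],
}


def fake_search(query, k):
    return canned[query][:k]


score, misses = recall_at_k(ground_truth, fake_search, k=3)
print(f"recall@3 = {score:.2f}, misses = {misses}")
```

The list of misses is usually more useful than the score itself, because each missed query points at a specific chunking, phrasing, or model problem.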


Common Failure Modes and How to Fix Them

Understanding where this goes wrong before you're in production saves significant time. For a fuller treatment of how to measure AI system quality, How to Measure How Generative AI Works: Metrics That Matter covers evaluation frameworks that apply here too.

Retrieval returns plausible but wrong results. Often a chunking problem. Try smaller chunks, or add more descriptive metadata to each chunk (document title, section heading) and include it in the embedded text.

Good semantic results, but the answer isn't in your index. Coverage problem. Your dataset doesn't contain the answer. Retrieval can only surface what's there; it can't synthesize information that was never indexed.

Results are slow at scale. Check your index type. HNSW is fast but memory-intensive. IVF is more memory-efficient but slightly less accurate. Most managed vector databases handle this automatically — if you're managing your own, read the configuration documentation carefully.

Everything looks fine in testing, but quality degrades in production. Your test queries weren't representative. Collect real user queries from your first 100 users and use those to build a better evaluation set.

The cost and performance considerations here are similar to those you'd weigh in any AI infrastructure decision. The ROI of How Generative AI Works: Building the Business Case is a useful frame for thinking about when the build cost is justified.


What Comes After Basic Retrieval

Once basic retrieval works, there are three natural extensions:

Hybrid search combines vector similarity with keyword/BM25 scoring. It helps when users search for exact product names, codes, or jargon that semantic search can dilute. Weaviate and Elasticsearch both support this natively.
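One common way to combine the two result lists is reciprocal rank fusion (RRF), which merges rankings without needing the keyword and vector scores to be on the same scale. Note that RRF is an assumption swapped in here for illustration: the article does not prescribe a fusion method, and each engine has its own. The document ids are made up, and `k=60` is a commonly used default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["faq-12", "faq-07", "faq-31"]     # from semantic search
keyword_ranking = ["faq-07", "sku-A100", "faq-12"]  # from keyword/BM25 search

print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

Documents that appear in both lists rise to the top, while an exact-match hit like a product code still makes the merged list even though semantic search never surfaced it.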

Reranking adds a second model pass after initial retrieval to re-score the top-k candidates for relevance. Cross-encoder models (Cohere Rerank, open-source cross-encoders from Sentence Transformers) can improve precision meaningfully; gains of 10–25% are a reasonable expectation on real-world datasets, though results vary with the data and the first-stage retriever. The cost is added latency on every query.

Metadata filtering lets you restrict search to a subset of your index — for example, only documents tagged for a specific client, or only content published after a certain date. This is essential for multi-tenant applications and is supported by every major vector database.
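Conceptually, filtering means narrowing the candidate set before ranking by similarity. The sketch below does this over an in-memory list; the record layout, the `client_id` field, and the function name are illustrative assumptions, and real vector databases apply filters inside the index rather than with a Python loop. Ranking uses the dot product, which assumes comparable vector lengths.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query_vector, records, k=3, **filters):
    """Keep records whose metadata matches every filter, then rank by dot product."""
    candidates = [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda r: dot(query_vector, r["vector"]), reverse=True)
    return [record["id"] for record in ranked[:k]]


records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"client_id": "acme"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"client_id": "globex"}},
    {"id": "c", "vector": [0.2, 0.8], "metadata": {"client_id": "acme"}},
]

print(filtered_search([1.0, 0.0], records, client_id="acme"))
print(filtered_search([1.0, 0.0], records))  # no filter: every record is a candidate
```

In a multi-tenant setup, the tenant filter should be applied on the server side for every query, so one client's documents can never appear in another client's results.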

How you extend this system depends heavily on the use case. The landscape of AI capabilities is shifting quickly — How Generative AI Works: Trends and What to Expect in 2026 covers where retrieval-augmented systems are heading and which bets are worth making now.


Frequently Asked Questions

Do I need a vector database, or can I use a regular database?

For small datasets (under ~50,000 vectors), you can run an in-process library like FAISS or Chroma and skip a dedicated vector database entirely. Once you need persistence, filtering, and scale, either add a proper vector store or keep your regular database and add an extension like pgvector. The right choice depends on your existing infrastructure and operational comfort.

How long does it take to embed a large document corpus?

For typical document sets in the range of 10,000–50,000 chunks, embedding via API takes anywhere from minutes to a couple of hours, depending on rate limits and document size. A local Sentence Transformers model on a modern laptop handles roughly 500–2,000 chunks per minute. Batch your API calls and cache results, because re-embedding the same content repeatedly is wasteful and expensive.

What's the difference between semantic search and "embeddings and vector search"?

"Semantic search" is the use case — finding results by meaning rather than keyword match. "Embeddings and vector search" describes the technical mechanism that powers it. Embeddings encode meaning as vectors; vector search finds similar vectors efficiently. Semantic search is what you're building; embeddings and vector search are how you build it.

Can I use embeddings for things other than text search?

Yes. Embeddings are used for recommendation systems (find products similar to what a user has purchased), duplicate detection (find near-identical support tickets), classification (assign documents to categories based on vector proximity to labeled examples), and anomaly detection. The same core pattern — embed, index, search — applies across all of these.

How do I know if my retrieval quality is good enough?

Define "good enough" against your actual use case before you start. A FAQ search tool might need the correct answer in the top-3 results 90% of the time. An internal document assistant might tolerate the top-5 containing one irrelevant result. Measure recall@k (how often the correct result appears in your top-k) using your ground truth query set. If you're above your threshold, ship it. If you're not, iterate on chunking, model choice, or reranking.

Is vector search the same as what's used in large language models?

Related but not identical. Transformers use attention mechanisms that involve learned weight matrices — not the same as searching a vector index. However, the embeddings produced by transformer models are what you store in a vector database for retrieval. Think of it as: transformers create the representations; vector search lets you find them efficiently at query time.


Key Takeaways

  • An embedding is a numerical vector representing the meaning of content; vector search finds vectors most similar to a query vector.
  • Use the same embedding model for indexing and querying — mixing models breaks the comparison entirely.
  • Build a ground-truth evaluation set before indexing anything; without it you have no reliable signal on quality.
  • Start with Chroma or Pinecone's free tier and OpenAI's text-embedding-3-small model — this stack gets a prototype running in hours, not days.
  • Chunk size significantly affects retrieval quality; 300–500 tokens with 50-token overlap is a solid starting point.
  • Hybrid search and reranking are the two most reliable quality improvements once basic retrieval is working.
  • Embeddings and vector search are infrastructure, not magic — they surface what's in your data; they can't manufacture answers that aren't there.
