Predictable RAG Questions, Because Failures Are Predictable

Every team that starts evaluating retrieval-augmented generation (RAG) arrives with roughly the same list of questions. Does it actually stop the model from making things up? How is it different from fine-tuning? Why is my prototype worse than the demo? What does it cost to run at scale? The questions are predictable because the failure modes are predictable.

This article answers them directly, in the order people usually ask them. No vendor framing, no hand-waving. Where a question has a real trade-off, we name the trade-off instead of pretending there's a clean win. If you want the broader walkthrough, The Complete Guide to Retrieval Augmented Generation covers the full architecture, and Retrieval Augmented Generation: A Beginner's Guide covers the basics. This is the FAQ you keep open during the project.

What is RAG, in one sentence?

RAG is a pattern where, before the language model answers, you search a knowledge source for relevant passages and paste them into the prompt as context. The model then answers using that retrieved text rather than its training data alone.

That's the whole idea. Everything else — embeddings, vector databases, rerankers, chunking — is machinery in service of two steps: find the right passages, then put them in front of the model. If retrieval surfaces the wrong passages, no amount of prompt engineering downstream will save the answer.
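The two steps can be sketched in a few lines. This is a toy illustration, not a production retriever: it scores passages with bag-of-words cosine similarity instead of learned embeddings, and the function names (`retrieve`, `build_prompt`), the sample passages, and the prompt wording are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    # Step 1: find the passages most similar to the question.
    q = Counter(tokenize(question))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    # Step 2: put the retrieved passages in front of the model.
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(context, start=1))
    return (
        "Answer using only the context below. Cite passage numbers.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}"
    )


# Hypothetical knowledge source for the example.
passages = [
    "The Pro plan costs 40 dollars per user per month.",
    "Support hours are 9am to 5pm on weekdays.",
    "Refunds are available within 30 days of purchase.",
]
question = "How much does the Pro plan cost?"
prompt = build_prompt(question, retrieve(question, passages, k=1))
print(prompt)
```

The resulting string is what you would send to whichever model you use. Swapping the scoring function for real embeddings changes `retrieve`, not the overall shape.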

Does RAG stop hallucinations?

It reduces them. It does not eliminate them. The claim that RAG eliminates hallucinations is the single most over-sold one in the space.

RAG helps in three specific ways:

The model has the actual answer in context, so it doesn't have to invent one.
You can require citations, which makes fabrication easier to catch.
You can detect "no relevant passages found" and refuse instead of guessing.

But the model can still misread a retrieved passage, blend two sources incorrectly, or ignore the context and answer from memory anyway. And if retrieval returns an outdated or incorrect document, RAG will faithfully ground a wrong answer. Grounding is only as good as what you retrieve. Treat RAG as a large reduction in hallucination risk, not a guarantee.
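The "refuse instead of guessing" behavior is simple to sketch. The example below assumes your retriever returns `(passage_text, similarity_score)` pairs; the function name `answer_or_refuse` and the `min_score` cutoff of 0.3 are hypothetical, and a real threshold would need tuning against your own data.

```python
def answer_or_refuse(scored_passages, min_score=0.3):
    """Return context to ground on, or a refusal when nothing clears the bar.

    scored_passages: list of (passage_text, similarity_score) pairs.
    min_score: hypothetical cutoff; tune it against your own evaluation set.
    """
    relevant = [text for text, score in scored_passages if score >= min_score]
    if not relevant:
        # Nothing relevant was retrieved, so do not call the model at all.
        return {
            "status": "refused",
            "context": [],
            "message": "I could not find this in the knowledge base.",
        }
    return {"status": "grounded", "context": relevant, "message": ""}


# Hypothetical retrieval results for two different questions.
weak = [("Office dog policy", 0.08), ("Parking map", 0.11)]
strong = [("Pro plan pricing", 0.74), ("Parking map", 0.11)]

print(answer_or_refuse(weak)["status"])     # refused
print(answer_or_refuse(strong)["context"])  # ['Pro plan pricing']
```

A score threshold only catches the "nothing relevant" case. It does nothing for a passage that scores well but is wrong.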

When should I use RAG instead of fine-tuning?

These solve different problems, and the confusion costs teams real money.

Use RAG when the answer depends on facts that change, are private to your organization, or need a citation — documentation, policies, product catalogs, support tickets.
Use fine-tuning when you need to change the model's behavior, format, or tone: making it always respond in structured JSON, adopt a house style, or handle a narrow classification task.

You can use both. A common pattern is fine-tuning for format and RAG for facts. But if your real problem is "the model doesn't know our current pricing," fine-tuning is the wrong tool — you'd have to retrain every time pricing changes, while RAG just reads the latest document.

Why is my prototype worse than the demos?

Almost always retrieval, almost never the model. The demo used clean, well-structured content and easy questions. Your real corpus has tables, PDFs, duplicated pages, and ambiguous queries.

The usual culprits, in order:

Bad chunking — passages split mid-sentence or mid-table so no chunk contains a complete answer.
No reranking — vector similarity returns topically related but not actually-answering passages.
Stale index — the document was updated but the index wasn't re-embedded.
Over-stuffed context — you retrieved 20 chunks and buried the relevant one in noise.

7 Common Mistakes with Retrieval Augmented Generation walks through each of these with fixes. The good news: retrieval problems are diagnosable. Log what got retrieved for each failed answer and the pattern shows up fast.

How much does RAG cost to run?

Three cost centers, and they scale differently:

Embedding — a one-time cost per document plus re-embedding on updates. Cheap per document, but a 500,000-document corpus adds up, and re-embedding everything when you change models hurts.
Vector storage and search — ongoing, roughly proportional to corpus size and query volume. Managed vector databases bill on stored vectors and reads.
Generation — usually the largest line item, because every retrieved chunk you pass costs input tokens on every query.

The lever most teams miss: retrieving fewer, better chunks cuts generation cost and improves answer quality at the same time. Reranking down to the top 3-5 passages instead of dumping 15 is often the highest-ROI change you can make.
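To see why chunk count matters so much, run the arithmetic. Every number below is an assumption chosen for illustration (400 tokens per chunk, 100,000 queries per month, 3 US dollars per million input tokens); substitute your own chunk size, traffic, and your provider's actual price.

```python
def monthly_context_cost(chunks_per_query, tokens_per_chunk, queries_per_month, usd_per_million_input_tokens):
    # Input tokens spent on retrieved context alone, excluding the question and the answer.
    tokens = chunks_per_query * tokens_per_chunk * queries_per_month
    return tokens / 1_000_000 * usd_per_million_input_tokens


# Hypothetical workload and price.
TOKENS_PER_CHUNK = 400
QUERIES_PER_MONTH = 100_000
USD_PER_MILLION = 3.0

dumping_15 = monthly_context_cost(15, TOKENS_PER_CHUNK, QUERIES_PER_MONTH, USD_PER_MILLION)
reranked_4 = monthly_context_cost(4, TOKENS_PER_CHUNK, QUERIES_PER_MONTH, USD_PER_MILLION)

print(dumping_15)  # 1800.0
print(reranked_4)  # 480.0
```

Under these assumptions, going from 15 chunks to 4 cuts the context portion of the bill by roughly 73 percent, before counting any quality gain.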

A second lever is caching. If your traffic has repeated questions, and most support and internal-knowledge traffic does, caching answers for common queries removes the generation cost entirely for cache hits, provided you expire cached answers when the underlying documents change. Embedding caches help too: you never need to re-embed a query string you have already embedded. Neither lever requires a model upgrade or a bigger budget. They just require paying attention to where the tokens actually go.
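A minimal answer cache might look like the sketch below. It assumes exact-match caching on a normalized question string; the class name `AnswerCache` and the `generate` callable (standing in for your full retrieve-and-generate pipeline) are hypothetical. It deliberately omits expiry, which a real deployment would need so that answers do not outlive the documents they came from.

```python
import re
from typing import Callable


class AnswerCache:
    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate          # the expensive RAG pipeline call
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivially different phrasings share a key.
        return re.sub(r"\s+", " ", question.strip().lower())

    def answer(self, question: str) -> str:
        key = self._key(question)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._generate(question)
        self._store[key] = result
        return result


# Stand-in for the real pipeline; records each call so cache hits are visible.
calls = []

def fake_pipeline(question: str) -> str:
    calls.append(question)
    return f"answer to: {question}"


cache = AnswerCache(fake_pipeline)
cache.answer("What are your support hours?")
cache.answer("  what are your   SUPPORT hours?")
print(cache.hits, cache.misses, len(calls))  # 1 1 1
```

Exact-match caching only helps when users phrase things identically after normalization. Matching paraphrases requires a similarity-based cache, which brings its own risk of returning an answer to a subtly different question.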

How do I know if it's working?

Measure retrieval and generation separately, or you'll chase the wrong problem.

Retrieval metrics

Build a set of 50-100 real questions with known correct source passages. Then measure whether the right passage appears in the retrieved set (recall) and how high it ranks. If recall is low, the generator never had a chance.
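Both measurements fit in a few lines once you have the evaluation set. The sketch below assumes each evaluation item is a pair of (retrieved passage IDs in ranked order, the one known-correct passage ID); the function names and sample IDs are made up for the example. Rank quality is summarized here as mean reciprocal rank, which is one common choice among several.

```python
def recall_at_k(results, k):
    # Fraction of questions whose correct passage appears in the top k retrieved.
    hits = sum(1 for retrieved, expected in results if expected in retrieved[:k])
    return hits / len(results)


def mean_reciprocal_rank(results):
    # Average of 1/rank of the correct passage; a miss contributes 0.
    total = 0.0
    for retrieved, expected in results:
        if expected in retrieved:
            total += 1.0 / (retrieved.index(expected) + 1)
    return total / len(results)


# Hypothetical evaluation results: (ranked retrieved IDs, correct ID).
results = [
    (["d1", "d7", "d3"], "d1"),  # correct passage ranked first
    (["d4", "d2", "d9"], "d2"),  # correct passage ranked second
    (["d5", "d6", "d8"], "d0"),  # correct passage never retrieved
]

print(recall_at_k(results, k=3))      # 2 of 3 questions
print(mean_reciprocal_rank(results))  # 0.5
```

If recall at your chosen k is low, fix retrieval before touching prompts or models.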

Answer metrics

For the answers themselves, track faithfulness (does the answer match the retrieved context?) and relevance (does it address the question?). Human spot-checks plus an LLM-as-judge on a sample are usually enough to catch regressions. The step-by-step approach covers building this evaluation set before you ship.

Do I need a vector database?

For a few thousand documents, no — you can hold embeddings in memory or use a Postgres extension. Vector databases earn their keep at scale, with high query volume, or when you need metadata filtering and hybrid search out of the box.

Start with the simplest thing that works and graduate when you have a real reason. Picking a heavyweight vector database for a 2,000-document internal wiki is a common over-engineering trap. See The Best Tools for Retrieval Augmented Generation for how to match tooling to scale.
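For a sense of how little "the simplest thing" can be, here is a brute-force in-memory search. It assumes embeddings are already computed and stored as plain lists of floats; the three-dimensional vectors and document IDs are toy values invented for the example, and real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=3):
    # Compare the query against every stored vector and keep the best k.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: document ID -> precomputed embedding (hypothetical values).
index = {
    "pricing": [0.9, 0.1, 0.0],
    "support": [0.1, 0.9, 0.1],
    "refunds": [0.2, 0.2, 0.9],
}

for doc_id, score in top_k([1.0, 0.0, 0.0], index, k=2):
    print(doc_id, round(score, 3))
```

A linear scan like this costs one comparison per stored vector per query. That is acceptable for a small wiki and is exactly the cost that approximate indexes in dedicated vector databases exist to avoid at scale.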

Should I build it myself or use a managed RAG service?

The honest answer depends on how much control you need over retrieval quality. Managed RAG-as-a-service products get you to a working demo fast — they handle chunking, embedding, and retrieval behind one API. That's a real advantage when you're validating whether RAG solves your problem at all.

The trade-off shows up later. When answers are mediocre and you need to tune chunking or swap the reranker, a black-box service may not let you. The teams that hit a quality ceiling are usually the ones who outsourced the parts that most affect quality. A reasonable middle path: prototype on a managed service to prove the use case, then bring retrieval in-house if and when you need finer control. Don't build your own vector infrastructure on day one to answer questions over a few thousand documents — that's effort spent in the wrong place.

Frequently Asked Questions

Can RAG work with my private or sensitive data?

Yes, and it's one of the main reasons to use it. Your documents stay in your own store and are only passed to the model at query time as context. Just make sure your retrieval layer respects access controls — a user should only retrieve documents they're permitted to see, or RAG becomes a data-leak vector.
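One way to enforce this is to tag every passage with the groups allowed to read it and drop anything the current user cannot see before it reaches the prompt. The `Passage` type, the group names, and the `filter_by_access` helper below are hypothetical, and the sketch assumes group-based permissions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str
    score: float
    allowed_groups: frozenset[str]  # groups permitted to read the source document


def filter_by_access(passages: list[Passage], user_groups: set[str]) -> list[Passage]:
    # Keep only passages that share at least one group with the user.
    return [p for p in passages if p.allowed_groups & user_groups]


# Hypothetical retrieval results with per-document permissions.
retrieved = [
    Passage("Q3 salary bands by level", 0.91, frozenset({"hr"})),
    Passage("How to request time off", 0.84, frozenset({"hr", "all-staff"})),
    Passage("Office wifi setup", 0.40, frozenset({"all-staff"})),
]

visible = filter_by_access(retrieved, user_groups={"all-staff"})
print([p.text for p in visible])
```

Filtering after retrieval, as shown here, can leave you with fewer than k usable passages. Where your search layer supports metadata filters, applying the permission check inside the search query avoids that.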

How current can RAG answers be?

As current as your index. If you re-embed a document the moment it changes, RAG can answer using information that's seconds old. The lag isn't in the model — it's in how often you sync your source content into the index.
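A cheap way to find what needs syncing is to store a content hash alongside each indexed document and compare it against the current source. The sketch below assumes you keep a mapping of document ID to the hash recorded at embedding time; the function names and sample documents are illustrative.

```python
import hashlib


def content_hash(text: str) -> str:
    # Stable fingerprint of the document text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    # A document is stale if it is new or its text changed since it was embedded.
    return sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


# Hashes recorded the last time each document was embedded (hypothetical).
indexed_hashes = {
    "pricing": content_hash("Pro plan: 40 dollars per month."),
    "support": content_hash("Support hours: 9am to 5pm."),
}

# Current source content: pricing changed, onboarding is new.
source_docs = {
    "pricing": "Pro plan: 45 dollars per month.",
    "support": "Support hours: 9am to 5pm.",
    "onboarding": "Welcome guide for new customers.",
}

print(find_stale(source_docs, indexed_hashes))  # ['onboarding', 'pricing']
```

This only detects additions and edits. Documents deleted from the source also need removing from the index, which is the reverse comparison.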

Does RAG work for non-text data like images or tables?

Yes, but tables and images need extra handling. Tables should be chunked so a full row stays together, often converted to a text representation. Images require multimodal embeddings or a captioning step. Naive text chunking destroys tables, which is a frequent and silent source of wrong answers.
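Keeping a row together can be as simple as emitting one chunk per row with the column headers repeated inline. The helper name `rows_to_chunks`, the output format, and the sample table below are assumptions for illustration; other text representations work too.

```python
def rows_to_chunks(headers: list[str], rows: list[list[str]], table_name: str) -> list[str]:
    # One chunk per row, with each value labeled by its column header
    # so the chunk is meaningful on its own.
    chunks = []
    for row in rows:
        pairs = ", ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append(f"{table_name}. {pairs}")
    return chunks


# Hypothetical pricing table.
headers = ["Plan", "Price", "Seats"]
rows = [
    ["Starter", "$10", "5"],
    ["Pro", "$40", "25"],
]

for chunk in rows_to_chunks(headers, rows, "Pricing table"):
    print(chunk)
# Pricing table. Plan: Starter, Price: $10, Seats: 5
# Pricing table. Plan: Pro, Price: $40, Seats: 25
```

Repeating the headers costs a few extra tokens per chunk, but a retrieved row no longer depends on a header line that landed in a different chunk.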

How big should my chunks be?

There's no universal answer, but a common starting range is 200-500 tokens with some overlap between adjacent chunks. Smaller chunks improve retrieval precision; larger chunks preserve context. Test both against your evaluation set rather than guessing — chunk size has outsized impact.
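Fixed-size chunking with overlap is a sliding window. The sketch below approximates tokens with whitespace-separated words, which is an assumption: real token counts depend on your model's tokenizer. The function name and default sizes are illustrative, not recommendations.

```python
def chunk_words(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    # Slide a window of `size` words, stepping by `size - overlap`
    # so adjacent chunks share `overlap` words.
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny example so the overlap is easy to see.
text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(text, size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

This splitter ignores sentence and table boundaries, so treat it as a baseline to compare against structure-aware chunking on your evaluation set.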

Is RAG hard to maintain?

The model isn't the maintenance burden — the content pipeline is. You need a process to keep the index in sync as documents change, plus monitoring on retrieval quality. Teams that treat RAG as "build once" watch quality quietly decay as their corpus drifts.

Key Takeaways

RAG retrieves relevant passages and puts them in the prompt; quality lives or dies on retrieval, not the model.
It reduces hallucinations substantially but does not eliminate them — bad retrieval grounds wrong answers.
Use RAG for changing or private facts; use fine-tuning for behavior and format. They combine well.
Most prototype failures are retrieval problems: chunking, reranking, or stale indexes.
Generation tokens are usually the biggest cost; retrieving fewer, better chunks cuts cost and improves quality at once.
Measure retrieval and answer quality separately with a real evaluation set before you trust the system.

Agency Script Editorial
