Operating Plays for the Day RAG Meets Real Users

A tutorial tells you how to build something once. A playbook tells you what to do when reality hits: when retrieval quality drops, when a new content source appears, when costs spike, when a stakeholder asks why the bot gave a wrong answer. This is the operating playbook for running retrieval-augmented generation (RAG) as a system you maintain, not a demo you shipped.

Each play below has a trigger (what tells you to run it), an owner (who runs it), and a sequence (the actual steps). The plays are ordered roughly by lifecycle: setup, launch, operate, recover. Use this alongside The Complete Guide to Retrieval Augmented Generation for the underlying mechanics — here we focus on the operational moves.

Play 1: Scope the corpus

Trigger: You're starting a RAG project. Owner: Product lead plus a subject-matter expert.

Before any engineering, decide exactly which documents are in scope and which are explicitly out. The most common early failure is indexing everything indiscriminately — old drafts, contradictory policies, internal scratch notes — and then wondering why answers conflict.

The sequence:

List every candidate source and tag each as authoritative, supporting, or excluded.
Resolve contradictions at the source. If two documents disagree on policy, fix the documents; don't ask retrieval to arbitrate.
Define a freshness requirement per source: real-time, daily, or static.

A tight authoritative corpus of 1,000 documents will usually beat a sloppy one of 50,000.
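One way to make the scoping decisions above explicit is a small manifest that records each source's role and freshness requirement. This is a minimal sketch, not a prescribed schema: the `Source`, `Role`, and `Freshness` names and the example source labels are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    AUTHORITATIVE = "authoritative"
    SUPPORTING = "supporting"
    EXCLUDED = "excluded"


class Freshness(Enum):
    REAL_TIME = "real-time"
    DAILY = "daily"
    STATIC = "static"


@dataclass(frozen=True)
class Source:
    name: str  # hypothetical label for a content source
    role: Role
    freshness: Freshness


def indexable(sources: list[Source]) -> list[Source]:
    """Return only the sources that should ever reach the index."""
    return [s for s in sources if s.role is not Role.EXCLUDED]


# Illustrative manifest; the source names are made up.
manifest = [
    Source("hr-policy-wiki", Role.AUTHORITATIVE, Freshness.DAILY),
    Source("onboarding-slides", Role.SUPPORTING, Freshness.STATIC),
    Source("team-scratch-notes", Role.EXCLUDED, Freshness.STATIC),
]
```

Keeping excluded sources in the manifest, rather than simply leaving them out, records that someone made the decision deliberately.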

One concrete trap to watch for in this play: near-duplicate documents. Many organizations have the same policy living in three slightly different versions across a wiki, a shared drive, and an email archive. Index all three and retrieval returns conflicting passages for the same question, and the model picks one essentially at random. Deduplicate during scoping, not after launch when users are already getting inconsistent answers.
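A rough way to surface near-duplicates during scoping is to compare documents by overlapping word shingles. The sketch below uses Jaccard similarity over three-word shingles; the technique, the `0.6` threshold, and the sample documents are illustrative assumptions rather than anything this playbook mandates, and a large corpus would need something more scalable than pairwise comparison.

```python
import re
from itertools import combinations


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Split text into overlapping word n-grams (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of shingles the two sets have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs: dict[str, str], threshold: float = 0.6) -> list[tuple[str, str, float]]:
    """Return (id, id, score) for every document pair at or above the threshold."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    pairs = []
    for left, right in combinations(sorted(sets), 2):
        score = jaccard(sets[left], sets[right])
        if score >= threshold:
            pairs.append((left, right, round(score, 2)))
    return pairs


# Made-up documents standing in for a wiki page, a shared-drive copy, and an email.
docs = {
    "wiki": "Employees may carry over up to five unused vacation days into the next calendar year.",
    "drive": "Employees may carry over up to five unused vacation days into the next year.",
    "email": "Expense reports must be submitted within thirty days of purchase.",
}
flagged = near_duplicates(docs)
```

Here `flagged` contains only the `drive` and `wiki` pair, which a person can then reconcile into one authoritative version.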

Play 2: Build the evaluation set first

Trigger: Corpus is scoped, before building the pipeline. Owner: Whoever will own quality long-term.

Write 50-100 real questions users will ask, each with the known correct source passage. This set is your ground truth for every decision that follows — chunk size, embedding model, reranker, top-k. Without it you're tuning blind.

The step-by-step approach to RAG covers how to construct this set so it actually represents production traffic. Skipping this play is the single most expensive shortcut in the whole lifecycle, because every later decision becomes a guess.
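The evaluation set can be as simple as a list of question and passage pairs, plus a check that every expected passage actually exists in the scoped corpus. This is a sketch under assumptions: the `EvalCase` shape and the passage ID scheme are hypothetical, and your own set may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_passage_id: str  # hypothetical ID scheme for source passages


def validate(cases: list[EvalCase], known_passage_ids: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the set is usable."""
    problems = []
    seen = set()
    for case in cases:
        key = case.question.strip().lower()
        if key in seen:
            problems.append(f"duplicate question: {case.question!r}")
        seen.add(key)
        if case.expected_passage_id not in known_passage_ids:
            problems.append(
                f"unknown passage {case.expected_passage_id!r} for {case.question!r}"
            )
    return problems
```

Running `validate` whenever the corpus changes catches evaluation cases that point at passages which have since been removed or renamed.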

Play 3: Tune retrieval before touching the prompt

Trigger: Pipeline is wired and answers are mediocre. Owner: ML or backend engineer.

The instinct is to rewrite the prompt. Resist it. Run your evaluation set and measure retrieval recall first.

The sequence:

Check recall: is the correct passage in the retrieved set at all?
If no, fix retrieval — adjust chunking, add a reranker, try hybrid keyword-plus-vector search.
Only once recall is solid do you tune the generation prompt.

This ordering matters because prompt tweaks can't fix missing context. Retrieval Augmented Generation: Best Practices That Actually Work breaks down the retrieval levers in detail.
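The recall check in this play can be sketched as a recall-at-k measurement over the evaluation set. The `Retriever` signature below is an assumption standing in for whatever your pipeline exposes; the function only asks whether the expected passage ID appears among the top `k` results.

```python
from typing import Callable

# Assumed interface: (question, k) -> ranked list of passage IDs.
Retriever = Callable[[str, int], list[str]]


def recall_at_k(cases: list[tuple[str, str]], retrieve: Retriever, k: int = 5) -> float:
    """Fraction of (question, expected_id) cases whose expected passage is retrieved."""
    if not cases:
        raise ValueError("evaluation set is empty")
    hits = sum(1 for question, expected in cases if expected in retrieve(question, k))
    return hits / len(cases)


def misses(cases: list[tuple[str, str]], retrieve: Retriever, k: int = 5) -> list[str]:
    """Questions whose expected passage was not retrieved; start debugging here."""
    return [q for q, expected in cases if expected not in retrieve(q, k)]
```

The list returned by `misses` is usually more useful than the score itself, because it tells you which questions to inspect when adjusting chunking or ranking.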

Play 4: Ship behind a confidence gate

Trigger: Retrieval and answers pass your evaluation bar. Owner: Engineering lead.

Don't launch a system that always answers. Launch one that knows when it can't.

If retrieval returns nothing above a similarity threshold, refuse and say so.
If the answer can't be grounded in the retrieved passages, escalate to a human or return "I don't have that information."
Log every refusal — refusals are your highest-signal source of corpus gaps.

A system that says "I don't know" 10% of the time and is right the other 90% beats one that's confidently wrong 20% of the time.
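The gate described in this play could look roughly like the following. Everything here is a sketch under assumptions: `generate` and `is_grounded` are hypothetical callables you would supply, the `0.75` threshold is a placeholder to tune against your evaluation set, and the similarity score scale depends on your retriever.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "I don't have that information."


@dataclass(frozen=True)
class Passage:
    text: str
    score: float  # similarity score from the retriever; scale is an assumption


def gated_answer(
    question: str,
    passages: list[Passage],
    generate: Callable[[str, list[Passage]], str],
    is_grounded: Callable[[str, list[Passage]], bool],
    min_score: float = 0.75,
    refusal_log: Optional[list[dict]] = None,
) -> str:
    """Answer only when retrieval is strong and the draft is grounded; log refusals."""
    log = refusal_log if refusal_log is not None else []
    strong = [p for p in passages if p.score >= min_score]
    if not strong:
        log.append({"question": question, "reason": "no_passage_above_threshold"})
        return REFUSAL
    draft = generate(question, strong)
    if not is_grounded(draft, strong):
        log.append({"question": question, "reason": "answer_not_grounded"})
        return REFUSAL
    return draft
```

Recording a reason with each refusal lets you separate corpus gaps (nothing relevant retrieved) from generation problems (a draft that strays from its passages) when you review the log.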

Play 5: Run the freshness loop

Trigger: Ongoing, on a schedule. Owner: Data or platform engineer.

The corpus drifts the moment you launch. Documents get updated; new ones appear; old ones get deprecated. Without a sync loop, quality decays silently.

The sequence:

Detect changed source documents (webhook, polling, or scheduled crawl).
Re-chunk and re-embed only what changed.
Remove deprecated documents from the index, not just the source.

Stale indexes are one of the most common production failures, and they're invisible until a user asks about something that changed last week.
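One common way to re-embed only what changed is to store a content hash per document and compare it on each sync. The sketch below assumes you keep a mapping of document ID to hash alongside the index; `plan_sync` and its output keys are hypothetical names, and it only plans the work rather than touching a real index.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect edits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current source content with what the index last saw."""
    current = {doc_id: fingerprint(text) for doc_id, text in source_docs.items()}
    return {
        # New documents that have never been indexed.
        "add": sorted(d for d in current if d not in indexed_hashes),
        # Documents whose content changed since the last sync.
        "reembed": sorted(
            d for d in current if d in indexed_hashes and current[d] != indexed_hashes[d]
        ),
        # Documents gone from the source that must also leave the index.
        "delete": sorted(d for d in indexed_hashes if d not in current),
    }
```

The `delete` list covers the step that is easiest to forget: a document removed at the source keeps answering questions until it is removed from the index too.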

Play 6: Watch the cost and latency budget

Trigger: Monthly, or when bills jump. Owner: Engineering lead.

RAG costs creep because every retrieved chunk is input tokens on every query.

Track tokens per query and cost per query as first-class metrics.
If both rise, you're probably retrieving too many chunks — rerank down to the top 3-5.
Cache embeddings for repeated queries and consider caching full answers for common questions.

Retrieving fewer, better chunks is the rare lever that cuts cost and improves quality simultaneously. Run it first.
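Tracking tokens and cost per query can start with a few lines over your request logs. In this sketch the `QueryRecord` fields and the per-1,000-token prices are placeholders, not real pricing for any provider; substitute the rates you are actually billed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class QueryRecord:
    input_tokens: int   # prompt plus retrieved chunks
    output_tokens: int  # generated answer


def cost_per_query(record: QueryRecord, input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one query given placeholder per-1,000-token prices."""
    return (
        record.input_tokens / 1000 * input_price_per_1k
        + record.output_tokens / 1000 * output_price_per_1k
    )


def summarize(records: list[QueryRecord], input_price_per_1k: float, output_price_per_1k: float) -> dict[str, float]:
    """Average tokens and cost per query across a batch of logged queries."""
    if not records:
        raise ValueError("no query records to summarize")
    return {
        "avg_input_tokens": mean(r.input_tokens for r in records),
        "avg_cost": mean(
            cost_per_query(r, input_price_per_1k, output_price_per_1k) for r in records
        ),
    }
```

Comparing `summarize` output before and after trimming the number of retrieved chunks gives a direct read on how much of the bill is retrieval context.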

Play 7: Run the incident postmortem

Trigger: A wrong or harmful answer reaches a user. Owner: Quality owner.

When RAG fails in production, resist the urge to patch the prompt and move on. Diagnose where in the pipeline it broke.

Pull the logged retrieval results for that query. Was the right passage retrieved?
If not retrieved → retrieval problem (chunking, index, ranking).
If retrieved but ignored → generation problem (prompt, context order).
If retrieved and the passage itself was wrong → corpus problem (Play 1).
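The decision tree above is small enough to encode directly, which helps keep postmortems consistent across people. This is a sketch: the `triage` function and its inputs are hypothetical, and it assumes a human has already confirmed the answer was wrong and judged whether the source passage itself is accurate.

```python
from enum import Enum


class Stage(Enum):
    RETRIEVAL = "retrieval problem (chunking, index, ranking)"
    GENERATION = "generation problem (prompt, context order)"
    CORPUS = "corpus problem (Play 1)"


def triage(retrieved_ids: list[str], correct_passage_id: str, passage_is_accurate: bool) -> Stage:
    """Classify a known-bad answer by the pipeline stage that failed."""
    if correct_passage_id not in retrieved_ids:
        return Stage.RETRIEVAL
    if not passage_is_accurate:
        return Stage.CORPUS
    # The right, accurate passage was retrieved, yet the answer was still wrong.
    return Stage.GENERATION
```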

Add the failed query to your evaluation set so the same failure can never silently return. 7 Common Mistakes with Retrieval Augmented Generation maps the most frequent failure signatures to fixes.

The discipline that separates mature teams here is resisting the quick patch. It's tempting to add one line to the prompt that says "don't make that specific mistake," ship it, and move on. That accumulates into an unmaintainable prompt full of special cases, and it never addresses the root cause. Diagnose the stage, fix the stage, and let the evaluation set prove the fix holds.

Play 8: Communicate confidence to stakeholders

Trigger: A stakeholder asks how reliable the system is, or proposes expanding its scope. Owner: Product lead.

RAG projects often fail politically, not technically — a leader expects 100% accuracy, gets 92%, and declares the project a failure. Manage the expectation before it manages you.

Share the evaluation results honestly, including where the system refuses or struggles.
Frame the confidence gate as a feature: the system knows its limits.
Tie any scope expansion back to Play 1 — new scope means new corpus work, not a free add-on.

A stakeholder who understands that the system answers 90% of questions well and declines the rest is an ally. One who was promised perfection is a liability waiting to happen.

Frequently Asked Questions

How is a playbook different from a tutorial?

A tutorial is linear — do step one, then step two, build the thing. A playbook is conditional — when X happens, the owner runs play Y. Playbooks are for the operating phase, when you're reacting to triggers like quality drops, cost spikes, and incidents rather than building from scratch.

Who should own a RAG system day to day?

Split it. A quality owner watches retrieval and answer metrics and runs the evaluation set; a platform engineer owns the freshness loop and infrastructure. The failure mode is treating RAG as a one-time build with no owner, which guarantees silent decay.

How often should I re-run the evaluation set?

On every meaningful change — new embedding model, chunking change, corpus expansion — and on a regular cadence regardless, weekly or monthly. The evaluation set is your regression test. If you only run it at launch, you won't notice when an index update quietly breaks retrieval.

What's the right team size to start?

A RAG pilot can run with two or three people: one product or domain lead to scope the corpus and write the evaluation set, and one or two engineers to build the pipeline and freshness loop. Scope creep on the corpus, not headcount, is what usually sinks early projects.

When do I escalate from a pilot to a real platform?

When you have a working evaluation set passing your quality bar, a confidence gate, and a freshness loop running. Those three plays in place mean you have a system, not a demo. Scaling infrastructure before those exist just makes failures more expensive.

Key Takeaways

Run RAG as a set of triggered plays with named owners, not a one-time build.
Scope a tight authoritative corpus and build the evaluation set before any pipeline work.
Tune retrieval before the prompt — prompt tweaks can't fix missing context.
Ship behind a confidence gate that refuses when it can't ground an answer; log every refusal.
Run a freshness loop so the index never drifts from the source content.
On every incident, diagnose which pipeline stage failed and add the case to your evaluation set.

Agency Script Editorial
