The Pre-Ship Checklist for Keeping AI Answers Grounded

A checklist is only useful if you understand why each item is on it. A list you follow blindly becomes ritual; a list you understand becomes judgment. This is a working checklist for reducing hallucinations through prompting, and every item comes with a one-line reason so you can decide when it applies and when it does not.

Run it against any prompt before you ship. Not every item fits every task—a low-stakes draft generator needs less than a billing assistant—but the reasoning tells you which to keep. Treat the checklist as a forcing function that surfaces the question "have I actually addressed fabrication here, or just hoped it away?"

For the concepts underneath these checks, see Stop Your Model From Inventing Facts at the Prompt Layer.

Before You Write the Prompt

The groundwork that determines whether the prompt can succeed at all.

Have I scoped the task to one clear job?

A narrow, well-defined task has less surface area for invention than a sprawling one. If you cannot state the job in a sentence, the model cannot stay inside it.

Do I have source material for the facts involved?

If the answer should come from specific facts, you need a source to ground in. No source means the model works from memory, and memory fabricates. Decide this before writing anything.

Have I identified the questions the source cannot answer?

Knowing where the gaps are tells you where fabrication will strike. These unanswerable cases become your most important test material.

Inside the Prompt

The checks that govern the prompt's actual content.

Did I supply the source material, clearly delimited?

Grounding converts a guess into a lookup, and delimiting keeps the model from confusing your data with your instructions. Both are required; one without the other leaks.

Did I instruct the model to answer only from that source?

Supplying context is not enough if the model is still free to blend in memory. The instruction to use only the source is what closes the book.

Did I include an explicit abstention clause?

Without permission to say "I do not have that," the model answers everything, including what it cannot support. State concretely what to say when the answer is missing. This pairs with the work in Build a Fabrication-Resistant Prompt in Eight Moves.

Did I require evidence for each claim?

A citation or quote requirement forces a self-check and exposes unsupported claims as visible gaps. The absence of evidence becomes a signal to abstain.

Did I constrain the output shape?

Structured output—fields, bounded lists, defined slots—gives the model fewer places to embellish. Pair it with permission to leave slots empty so structure does not force a guess.

Before You Ship

The validation that turns a careful-looking prompt into a verified one.

Did I test on unanswerable questions?

This is non-negotiable. Fabrication lives in the questions the source cannot answer, and a prompt that abstains on all of them is doing its job. Skipping this means shipping blind.

Did I measure over-correction?

Track how often the model abstains on questions the source actually answers. Maximum caution that refuses answerable questions is its own failure. The target is calibration.

Did I add a verification pass for high-stakes output?

For answers where a confident error causes real harm, a separate verification pass catches what generation missed. Skip it only when errors are cheap.

Did I compare against the previous version?

Every prompt edit can fix one failure and create another. Rerunning your test set on the old and new prompt shows net movement instead of guesswork. For the mistakes this catches, see 7 Prompting Habits That Make AI Fabricate More, Not Less.

After You Ship

The ongoing checks that keep the prompt honest over time.

Am I sampling real production answers?

Live questions differ from your test set, and new failure modes appear in the wild. Periodic sampling catches drift the test set never anticipated.

Am I watching retrieval quality, not just the prompt?

When grounding fails, the cause is often upstream—the wrong passages were retrieved. A grounded-but-wrong answer points to retrieval, not the prompt, and the checklist should remind you to look there. This connection is explored in Grounding Prompts in Action: Five Scenarios That Tell.

Using the Checklist Without It Becoming Ritual

A checklist can decay into box-ticking, where items get marked done without thought. A few habits keep it a living tool rather than a formality.

Tie each check to a test, not a feeling

Wherever possible, attach a checklist item to evidence rather than a judgment. "Did I include an abstention clause?" is weak if answered from memory; it is strong if answered by pointing at the clause and the unanswerable-question test results that show it works. The checklist earns trust when its items map to observable outcomes.

Adapt the depth to the task

The same checklist serves a throwaway draft tool and a regulated advice system, but they do not run the same items. Decide up front how far down the list a given task warrants, based on what a wrong answer costs. Recording that decision keeps the checklist honest and prevents both over-engineering low stakes and under-protecting high ones.

Revisit after every incident

When a fabrication slips through to production, return to the checklist and ask which item failed or was missing. Often the incident reveals a gap the list did not cover, and the right response is to add an item. Treated this way, the checklist grows sharper over time instead of going stale, absorbing the lessons of Grounding Prompts in Action: Five Scenarios That Tell and your own production history.

Frequently Asked Questions

Do I need to run every item for every prompt?

No. The items before and inside the prompt apply broadly, but the verification pass and intensive testing scale with stakes. A throwaway draft generator needs the core grounding and abstention checks; a billing assistant needs the whole list. The justifications tell you which to keep.

What is the one item I should never skip?

Testing on unanswerable questions. Fabrication concentrates in cases the source cannot answer, and if you never test those, you have no idea whether your prompt abstains or invents. Every other check is undermined if this one is missing.

How often should I re-run the checklist after shipping?

Whenever you change the prompt, change the retrieval, or notice complaints, and on a periodic cadence regardless. Production questions drift from your test set, and new failure modes surface over time, so a launched prompt is never permanently done.

Why include a check on retrieval quality in a prompting checklist?

Because grounded-but-wrong answers look like prompt failures but are usually retrieval failures—the model faithfully grounded in the wrong passages. The checklist reminds you to look upstream before blaming the prompt, which saves you from tuning the wrong thing.

Can I automate this checklist?

The testing items—running answerable and unanswerable cases, comparing versions, sampling production—lend themselves well to automation. The judgment items, like whether the task is scoped tightly, still need a human read. Automate the measurement, keep a person on the design questions.

Key Takeaways

A checklist works only when you understand the reason behind each item, so you can judge when it applies.
Before writing, scope the task, secure source material, and identify the questions the source cannot answer.
Inside the prompt, supply and delimit the source, instruct answering only from it, add abstention, require evidence, and constrain output.
Before shipping, test on unanswerable questions, measure over-correction, add verification for high-stakes output, and compare against the prior version.
After shipping, sample real answers and watch retrieval quality, since grounded-but-wrong answers usually point upstream rather than to the prompt.

For the concepts underneath these checks, see Stop Your Model From Inventing Facts at the Prompt Layer.

Before You Write the Prompt

The groundwork that determines whether the prompt can succeed at all.

Have I scoped the task to one clear job?

A narrow, well-defined task has less surface area for invention than a sprawling one. If you cannot state the job in a sentence, the model cannot stay inside it.

Do I have source material for the facts involved?

If the answer should come from specific facts, you need a source to ground in. No source means the model works from memory, and memory fabricates. Decide this before writing anything.

Have I identified the questions the source cannot answer?

Knowing where the gaps are tells you where fabrication will strike. These unanswerable cases become your most important test material.

Inside the Prompt

The checks that govern the prompt's actual content.

Did I supply the source material, clearly delimited?

Grounding converts a guess into a lookup, and delimiting keeps the model from confusing your data with your instructions. Both are required; one without the other leaks.

Did I instruct the model to answer only from that source?

Supplying context is not enough if the model is still free to blend in memory. The instruction to use only the source is what closes the book.

Did I include an explicit abstention clause?

Did I require evidence for each claim?

A citation or quote requirement forces a self-check and exposes unsupported claims as visible gaps. The absence of evidence becomes a signal to abstain.

Did I constrain the output shape?

Structured output—fields, bounded lists, defined slots—gives the model fewer places to embellish. Pair it with permission to leave slots empty so structure does not force a guess.

Before You Ship

The validation that turns a careful-looking prompt into a verified one.

Did I test on unanswerable questions?

This is non-negotiable. Fabrication lives in the questions the source cannot answer, and a prompt that abstains on all of them is doing its job. Skipping this means shipping blind.

Did I measure over-correction?

Track how often the model abstains on questions the source actually answers. Maximum caution that refuses answerable questions is its own failure. The target is calibration.

Did I add a verification pass for high-stakes output?

For answers where a confident error causes real harm, a separate verification pass catches what generation missed. Skip it only when errors are cheap.

Did I compare against the previous version?

After You Ship

The ongoing checks that keep the prompt honest over time.

Am I sampling real production answers?

Live questions differ from your test set, and new failure modes appear in the wild. Periodic sampling catches drift the test set never anticipated.

Am I watching retrieval quality, not just the prompt?

Using the Checklist Without It Becoming Ritual

A checklist can decay into box-ticking, where items get marked done without thought. A few habits keep it a living tool rather than a formality.

Tie each check to a test, not a feeling

Adapt the depth to the task

Revisit after every incident

Frequently Asked Questions

Do I need to run every item for every prompt?

What is the one item I should never skip?

How often should I re-run the checklist after shipping?

Why include a check on retrieval quality in a prompting checklist?

Can I automate this checklist?

Key Takeaways

A checklist works only when you understand the reason behind each item, so you can judge when it applies.
Before writing, scope the task, secure source material, and identify the questions the source cannot answer.
Inside the prompt, supply and delimit the source, instruct answering only from it, add abstention, require evidence, and constrain output.
Before shipping, test on unanswerable questions, measure over-correction, add verification for high-stakes output, and compare against the prior version.
After shipping, sample real answers and watch retrieval quality, since grounded-but-wrong answers usually point upstream rather than to the prompt.

The Pre-Ship Checklist for Keeping AI Answers Grounded

Before You Write the Prompt

Have I scoped the task to one clear job?

Do I have source material for the facts involved?

Have I identified the questions the source cannot answer?

Inside the Prompt

Did I supply the source material, clearly delimited?

Did I instruct the model to answer only from that source?

Did I include an explicit abstention clause?

Did I require evidence for each claim?

Did I constrain the output shape?

Before You Ship

Did I test on unanswerable questions?

Did I measure over-correction?

Did I add a verification pass for high-stakes output?

Did I compare against the previous version?

After You Ship

Am I sampling real production answers?

Am I watching retrieval quality, not just the prompt?

Using the Checklist Without It Becoming Ritual

Tie each check to a test, not a feeling

Adapt the depth to the task

Revisit after every incident

Frequently Asked Questions

Do I need to run every item for every prompt?

What is the one item I should never skip?

How often should I re-run the checklist after shipping?

Why include a check on retrieval quality in a prompting checklist?

Can I automate this checklist?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

The Pre-Ship Checklist for Keeping AI Answers Grounded

Before You Write the Prompt

Have I scoped the task to one clear job?

Do I have source material for the facts involved?

Have I identified the questions the source cannot answer?

Inside the Prompt

Did I supply the source material, clearly delimited?

Did I instruct the model to answer only from that source?

Did I include an explicit abstention clause?

Did I require evidence for each claim?

Did I constrain the output shape?

Before You Ship

Did I test on unanswerable questions?

Did I measure over-correction?

Did I add a verification pass for high-stakes output?

Did I compare against the previous version?

After You Ship

Am I sampling real production answers?

Am I watching retrieval quality, not just the prompt?

Using the Checklist Without It Becoming Ritual

Tie each check to a test, not a feeling

Adapt the depth to the task

Revisit after every incident

Frequently Asked Questions

Do I need to run every item for every prompt?

What is the one item I should never skip?

How often should I re-run the checklist after shipping?

Why include a check on retrieval quality in a prompting checklist?

Can I automate this checklist?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?