Vetting a Model's Numbers Before You Rely on Them

A checklist is only useful if you actually run it, and you will only run it if it is short, ordered, and each item earns its place. This is a working tool, not a reference document. Before you trust a number a language model gave you, run down these checks. Each one comes with a one-line reason so you know why it is there and can drop it deliberately when it does not apply.

The list is organized by phase: how you set up the problem, how you prompt for the answer, and how you verify the result. Run it top to bottom for high-stakes numbers. For casual work, the setup and prompting items alone catch most errors. The point is to make good numerical practice a routine you execute rather than a set of principles you half-remember.

Keep this somewhere you can glance at it while working. Over time the items become habit and you will stop needing the list, which is exactly the goal. Until then, the checklist is what stands between you and a confidently wrong figure.

Before You Ask: Setting Up the Problem

Most numerical errors are decided before any calculation happens, in how the problem is framed.

The Setup Checks

Every quantity has an explicit unit. Ambiguous units are a leading cause of wrong-but-confident answers.
Each number's meaning is stated. "The growth figure" means different things to you and the model unless you define it.
The expected answer format is specified. Telling the model whether you want dollars, a percentage, or a count keeps it from solving a different problem.
The problem reads unambiguously to a stranger. If someone with no context could interpret it two ways, so can the model.

These setup checks build on the framing discipline in Build a Repeatable Workflow for Math You Can Rely On.
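The setup checks can be enforced mechanically before a prompt is ever sent. The following is a minimal sketch, assuming a hypothetical `Quantity` record and `build_prompt` helper (neither comes from a particular library): it refuses to render a problem statement if any number lacks a stated meaning or unit, and it always appends the expected answer format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """One input number. All three fields are required on purpose."""
    name: str     # what the number means, in plain words
    value: float
    unit: str     # never left implicit


def build_prompt(question: str, quantities: list[Quantity], answer_format: str) -> str:
    """Render a problem in which every number carries a meaning and a unit."""
    for q in quantities:
        if not q.name.strip() or not q.unit.strip():
            raise ValueError(f"quantity {q!r} is missing a name or a unit")
    lines = [question, "", "Given:"]
    lines += [f"- {q.name}: {q.value} {q.unit}" for q in quantities]
    lines += ["", f"Answer format: {answer_format}"]
    return "\n".join(lines)
```

Whether a stranger would read the rendered text unambiguously still needs a human eye; the helper only guarantees that the unit, meaning, and format items were not skipped.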

While You Ask: Prompting for the Answer

How you request the calculation determines whether the model's reasoning is reliable and auditable.

The Prompting Checks

Step-by-step reasoning is requested. Visible steps turn one hard prediction into several easy ones and give you an audit trail.
Compound calculations are split into stages. One prompt doing four operations gives four invisible chances to fail.
The formula or approach is stated before the numbers. Separating logic from arithmetic catches method errors before they hide in a result.
Exact arithmetic is offloaded to code or a tool when available. Deterministic computation sidesteps the model's main weakness, which is exact arithmetic.

The reasoning behind offloading arithmetic is in Getting Language Models to Do Math They Can Actually Trust.
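Staging and offloading fit together naturally in code. As a rough sketch, assuming a hypothetical order-total task (the function name, inputs, and rates are illustrative), each operation below is its own named stage, and the arithmetic is done with Python's `decimal` module rather than by the model. Every intermediate value is returned so it can be audited.

```python
from decimal import Decimal, ROUND_HALF_UP


def order_total(unit_price: str, quantity: int,
                discount_rate: str, tax_rate: str) -> dict[str, Decimal]:
    """Compute an order total in explicit stages, keeping each stage visible."""
    # Stage 1: subtotal before any adjustment
    subtotal = Decimal(unit_price) * quantity
    # Stage 2: discount amount, then discounted subtotal
    discount = subtotal * Decimal(discount_rate)
    discounted = subtotal - discount
    # Stage 3: tax on the discounted subtotal
    tax = discounted * Decimal(tax_rate)
    # Stage 4: total, rounded to cents only at the very end
    total = (discounted + tax).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return {
        "subtotal": subtotal,
        "discount": discount,
        "discounted": discounted,
        "tax": tax,
        "total": total,
    }
```

Passing the rates as strings keeps them exact; building a `Decimal` from a float such as `0.15` would carry the float's binary rounding error into the result.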

After You Ask: Verifying the Result

A number is not trustworthy until it has survived a check proportional to its stakes.

The Verification Checks

The result passes a plausibility glance. An answer ten times too large or with the wrong sign is caught in seconds.
The result obeys obvious constraints. No negative counts, and no percentages over 100 unless the quantity can legitimately exceed it, as a growth rate can.
High-stakes numbers are recomputed independently. Two methods that agree give real confidence; two that disagree have exposed an error.
Numbers inside prose are computed and verified separately. Figures buried in narrative hide their errors behind fluent writing.

The failure modes these checks guard against are catalogued in 7 Mistakes That Wreck Numerical Reasoning Prompts.
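Independent recomputation is easiest to see with a concrete case. The sketch below assumes a compound-growth figure as the number under test; the function names and the tolerance are illustrative choices. It computes the value two different ways, requires the two methods to agree, and only then compares the model's reported figure against them.

```python
import math


def compound_closed_form(principal: float, rate: float, years: int) -> float:
    """Method 1: the closed-form formula principal * (1 + rate) ** years."""
    return principal * (1 + rate) ** years


def compound_iterative(principal: float, rate: float, years: int) -> float:
    """Method 2: apply one year of growth at a time."""
    balance = principal
    for _ in range(years):
        balance += balance * rate
    return balance


def reported_value_holds(reported: float, principal: float, rate: float,
                         years: int, rel_tol: float = 1e-6) -> bool:
    """Return True only if both methods agree and the reported figure matches."""
    a = compound_closed_form(principal, rate, years)
    b = compound_iterative(principal, rate, years)
    if not math.isclose(a, b, rel_tol=1e-9):
        raise ValueError(f"recomputation methods disagree: {a} vs {b}")
    return math.isclose(reported, a, rel_tol=rel_tol)
```

The tolerance has to be loose enough to accept a figure the model rounded to cents and tight enough to reject a real error; an answer ten times too large fails at any sensible setting.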

Running the Checklist by Stakes

The full list is for figures that matter; lighter work needs only part of it.

Tiering the Effort

For a throwaway estimate, the setup and prompting checks alone catch the errors worth catching. For a number going to a client, into a contract, or driving a decision with money attached, run the verification checks too, including independent recomputation. Match the depth of checking to the cost of being wrong rather than applying everything everywhere.
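The tiering rule is simple enough to write down as data. This is one possible encoding, assuming just two tiers named `LOW` and `HIGH` (the names and the short check labels are illustrative): low-stakes work gets the setup and prompting checks, and high-stakes work adds the verification checks.

```python
from enum import Enum


class Stakes(Enum):
    LOW = "throwaway estimate"
    HIGH = "client, contract, or money attached"


SETUP = ["explicit units", "stated meanings", "answer format", "unambiguous to a stranger"]
PROMPTING = ["step-by-step reasoning", "staged calculations",
             "formula before numbers", "arithmetic offloaded"]
VERIFICATION = ["plausibility glance", "obvious constraints",
                "independent recomputation", "prose numbers verified separately"]


def checks_for(stakes: Stakes) -> list[str]:
    """Return the checks to run, scaled to the cost of being wrong."""
    checks = SETUP + PROMPTING
    if stakes is Stakes.HIGH:
        checks = checks + VERIFICATION
    return checks
```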

Making It a Habit

Run the full list deliberately a few times and the items start happening automatically. The goal is internalized practice, at which point the written checklist becomes a backstop for the high-stakes cases rather than a constant reference. The practices behind these habits are in Field Practices That Make Model Math Dependable.

Adapting the Checklist to Your Own Tasks

A generic checklist is a starting point; the version that actually serves you is one you have tuned to the calculations you do repeatedly.

Add Domain-Specific Constraints

The plausibility and constraint checks become far more powerful when you encode the rules of your own domain. A pricing task can check that a discounted price is never higher than list; a forecasting task can check that a projection does not exceed a known ceiling; a payroll task can check that withholdings never exceed gross. These domain rules catch errors that generic checks miss, because they encode knowledge only you have.

List the impossible outcomes for your task. Anything that simply cannot be true in your domain becomes a constraint check.
Encode the relationships that must hold. If two figures must sum to a third, or one must always exceed another, make that an explicit check.
Note the typical range for each output. A figure far outside its usual band is a flag worth investigating even when no hard constraint is violated.
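The three kinds of domain rule above translate directly into code. Here is a minimal sketch for the payroll case, assuming a hypothetical `payroll_flags` helper and a typical net-pay band that you supply yourself; it returns a list of human-readable flags instead of raising, so that a soft range warning can sit alongside hard violations.

```python
from decimal import Decimal


def payroll_flags(gross: Decimal, withholdings: Decimal, net: Decimal,
                  typical_net: tuple[Decimal, Decimal]) -> list[str]:
    """Check one payroll result against domain rules and return any flags."""
    flags = []
    # Impossible outcomes: amounts that cannot occur in this domain
    if gross < 0 or withholdings < 0:
        flags.append("negative gross or withholdings")
    if withholdings > gross:
        flags.append("withholdings exceed gross")
    # Relationships that must hold between figures
    if gross - withholdings != net:
        flags.append("net does not equal gross minus withholdings")
    # Typical range: worth investigating, not necessarily wrong
    low, high = typical_net
    if not (low <= net <= high):
        flags.append("net outside typical range")
    return flags
```

An empty list means the figure passed every encoded rule, not that it is correct; the rules only rule out what your domain knowledge says cannot be true.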

Prune What You Never Use

A checklist nobody runs is worse than no checklist, and the fastest way to make one ignored is to bloat it. If certain items never catch anything for the tasks you actually do, drop them. The list should feel lean enough that running it is no burden, which is what keeps it in use. The discipline of tuning rather than accumulating is part of the encoding practice in The FRAME Method for Numerical Reasoning Prompts.

Turning the Checklist Into a Shared Standard

A checklist you run alone helps you; a checklist a team shares raises the floor for everyone and stops errors from depending on who happened to do the work.

Making It a Team Default

When several people produce numerical output, individual diligence is uneven, and the weakest moment determines the quality clients see. A shared checklist fixes the standard so the result does not depend on whoever was least careful that day:

Agree on which tier applies to which kind of work. Everyone should know that client-facing figures get the full list and internal estimates do not.
Bake the prompting checks into shared templates. If the standard prompts already request step-by-step reasoning and staged calculations, those items happen automatically.
Review against the list, not against taste. Checking work against named items is faster and fairer than vague judgments about whether a number looks right.
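A shared template is one way to make the prompting checks automatic. The wording below is only an illustration, and the `CALC_PROMPT` and `render_calc_prompt` names are hypothetical; the point is that anyone on the team who fills in the three blanks gets the formula-first, step-by-step, staged instructions without having to remember them.

```python
from string import Template

# Team-wide prompt for any calculation request (illustrative wording).
CALC_PROMPT = Template(
    "Task: $task\n"
    "Given values (with units): $givens\n"
    "\n"
    "Instructions:\n"
    "1. State the formula or approach before using any numbers.\n"
    "2. Work step by step, one operation per numbered stage.\n"
    "3. Show every intermediate value with its unit.\n"
    "4. Give the final answer as: $answer_format\n"
)


def render_calc_prompt(task: str, givens: str, answer_format: str) -> str:
    """Fill in the shared template; raises KeyError if a field is forgotten."""
    return CALC_PROMPT.substitute(task=task, givens=givens, answer_format=answer_format)
```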

Keeping It Alive

A shared checklist decays if nobody maintains it. As models change and new task types appear, revisit the items, add domain constraints that proved useful, and prune the ones that never fire. Treating the list as a living document rather than a fixed artifact is what keeps it trusted and used. This maintenance mindset mirrors the encoding stage in The FRAME Method for Numerical Reasoning Prompts.

Frequently Asked Questions

Do I really need to run all twelve checks every time?

No. The full list is calibrated for numbers that carry real consequence. For casual estimates, the setup and prompting items catch the meaningful errors, and you can skip the heavier verification. The skill is dropping items deliberately based on stakes, not abandoning the whole list because it feels long.

Which checks matter most if I only do a few?

Requesting step-by-step reasoning and stating quantities unambiguously prevent the two most common failures, so start there. Adding a plausibility glance catches the worst remaining errors cheaply. Those three items alone cover a large share of what goes wrong, and they take almost no extra effort.

Why is computing numbers inside prose called out separately?

Because it is uniquely dangerous. A wrong figure embedded in fluent writing looks as trustworthy as the correct sentences around it, with no visual cue that it was the weak link. Computing and verifying such numbers outside the narrative, then folding in the confirmed result, removes that camouflage.
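One way to put this into practice is to compute the figures first, build the sentence from them, and then scan the finished text for any number that was not verified. The sketch below assumes a simple regular expression for numeric tokens and a hypothetical `unverified_numbers` helper; the revenue figures are made up for illustration.

```python
import re
from decimal import Decimal

# Matches tokens such as 25.0, 1,500,000, or 2024 (digits, commas, one decimal part).
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def unverified_numbers(prose: str, verified: set[Decimal]) -> list[Decimal]:
    """Return every number in the prose that is not in the verified set."""
    found = [Decimal(m.group().replace(",", "")) for m in NUMBER.finditer(prose)]
    return [n for n in found if n not in verified]


# Compute outside the narrative, then fold the confirmed values in.
previous = Decimal("1200000")   # USD, illustrative
current = Decimal("1500000")    # USD, illustrative
growth_pct = (current - previous) / previous * 100

sentence = f"Revenue grew {growth_pct:.1f}% to ${current:,} in 2024."
leftovers = unverified_numbers(sentence, {growth_pct, current, Decimal("2024")})
```

Every numeric token counts, including years and list numbering, so anything that belongs in the text has to be added to the verified set deliberately. That friction is the purpose: no number reaches the reader without someone having vouched for it.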

How does this checklist handle tasks without code execution?

Most items work regardless of tooling. The offload-arithmetic check becomes "compute the exact arithmetic in a spreadsheet or calculator yourself" when code execution is unavailable. The framing, reasoning, and verification checks do not depend on tools at all, so the list stays useful even in a plain chat interface.

When can I stop using the checklist?

Once the items have become automatic through repetition, you will run most of them without thinking, and the written list becomes a backstop for high-stakes work rather than a constant companion. That internalization is the goal. Until the practices are habit, keep the list visible while you work.

Key Takeaways

Treat the list as a working tool to run before trusting a number, not a reference to read once.
Setup checks prevent the most errors by removing ambiguity before any calculation happens.
Prompting checks make the model's reasoning reliable and auditable through visible steps, staged calculations, and tool offloading.
Verification checks scale with stakes, from a quick plausibility glance to independent recomputation for consequential figures.
Run the full list deliberately until the items become habit, then keep it as a backstop for high-stakes numbers.

Agency Script Editorial

