Loose advice about confidence, such as "ask the model to be honest," falls apart under pressure because there is no structure to fall back on when an answer looks wrong. A named, repeatable method gives you that structure. This guide presents one: the GROUND method, five stages you apply in order to make a model's expressed confidence track its actual reliability. The name is a mnemonic and also the core idea, since the whole point is to ground confidence in evidence rather than tone.
The five stages are Gauge the stakes, Restrict to evidence, Order the reasoning, Uncover uncertainty and Numb the over-hedge, and Demonstrate with measurement. The fourth stage covers two letters of the mnemonic because it balances two opposite failures. Each stage exists because a specific failure mode exists, and each tells you what to do and when it matters most. You do not invent these moves fresh for every task; you run the stages, adapting the wording to your domain.
Use GROUND as scaffolding, not scripture. On low-stakes tasks you might compress the later stages; on high-stakes ones you run all of them rigorously. The value is having a consistent structure so calibration becomes a process you repeat rather than a trick you remember.
Stage 1: Gauge the Stakes
Everything downstream depends on knowing what a confident error would cost.
Why this is first
Calibration is relative to consequences. The same overconfidence that is harmless in a brainstorm is dangerous in a contract review. Without setting stakes, you cannot decide how aggressively to bias toward explicit uncertainty.
What to do
- Name the worst-case outcome of a confidently wrong answer.
- Set how strongly the prompt should push toward abstention accordingly.
High stakes mean you accept more hedging in exchange for fewer confident errors. Low stakes let you accept the occasional confident error in exchange for more direct answers.
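One way to make this stage concrete is to record the worst-case outcome alongside a stakes level and derive an abstention bar from it. The sketch below is illustrative only: `StakesProfile`, `should_abstain`, the tier names, and the threshold numbers are all assumptions made for the example, not values the method prescribes.

```python
from dataclasses import dataclass

# Hypothetical tiers and thresholds: the numbers are placeholders to tune per domain.
ABSTAIN_BELOW = {"low": 0.30, "moderate": 0.60, "high": 0.85}


@dataclass
class StakesProfile:
    worst_case: str  # the named worst-case outcome of a confidently wrong answer
    level: str       # "low", "moderate", or "high"

    @property
    def abstain_below(self) -> float:
        """Confidence bar under which the answer should be an abstention."""
        return ABSTAIN_BELOW[self.level]


def should_abstain(profile: StakesProfile, confidence: float) -> bool:
    """Higher stakes raise the bar, so more answers turn into abstentions."""
    return confidence < profile.abstain_below


contract_review = StakesProfile("reviewer approves a clause that is not there", "high")
brainstorm = StakesProfile("one weak idea on a whiteboard", "low")

print(should_abstain(contract_review, 0.7))  # True: 0.7 is under the high-stakes bar
print(should_abstain(brainstorm, 0.7))       # False: 0.7 clears the low-stakes bar
```

The same confidence value produces different behavior in the two settings, which is the sense in which calibration is relative to consequences.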
Stage 2: Restrict to Evidence
Tie confidence to something traceable rather than to how the answer sounds.
Why this matters most
A model's default confidence is borrowed fluency. Until you anchor it to evidence, any self-rating is built on tone. This stage is the heart of the method — it is why the mnemonic is GROUND.
What to do
- Identify the evidence source: context, documents, or code execution.
- Instruct the model to mark any claim it cannot trace to that source as low confidence.
The grounding move is also the through-line of the real-world examples, where requiring quoted source text repeatedly exposed fabrications.
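The marking rule can also be enforced after the fact, outside the prompt. The sketch below assumes the model returns each claim with a supporting quote and its own confidence label; it downgrades any claim whose quote does not appear in the source. The function names, the label vocabulary, and the verbatim-match rule are assumptions for illustration, and verbatim matching will miss paraphrased support.

```python
import re
from typing import Optional


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not defeat matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def label_claim(claim: str, quote: Optional[str], source: str, model_label: str) -> dict:
    """Keep the model's label only when its quote is found in the source text."""
    needle = _normalize(quote or "")
    traceable = bool(needle) and needle in _normalize(source)
    return {
        "claim": claim,
        "traceable": traceable,
        # Untraceable claims are forced to low confidence, whatever the model said.
        "confidence": model_label if traceable else "low",
    }


source = "Employees must report incidents within 30 days.\nReports go to the compliance office."

ok = label_claim("Reporting deadline is 30 days", "report incidents within\n30 days", source, "high")
bad = label_claim("Reporting deadline is 10 days", "within 10 days", source, "high")
print(ok["confidence"], bad["confidence"])
```

A check like this separates "the model said it was sure" from "the claim points at something in the evidence," which is the distinction this stage is built on.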
Stage 3: Order the Reasoning
Sequence the prompt so reasoning comes before the verdict.
Why order changes meaning
Confidence rated after a committed answer rationalizes that answer. Confidence rated after weighing both sides reflects real support. The ordering is not cosmetic; it determines what the rating means.
What to do
- Ask for the case for and the case against before any answer.
- Place the answer and its confidence label last.
The "case against" step is the one most prompts skip, and it does the most work. The how-to process builds this ordering into its prompt template.
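The ordering can be written directly into a prompt template and then checked on the response. This is a minimal sketch under stated assumptions: the section names, the prompt wording, and both function names are invented for the example and are not the template from the how-to process.

```python
# Assumed section names; the order of this list is the order the model must follow.
SECTIONS = ["Case for", "Case against", "Answer", "Confidence"]


def build_reasoning_first_prompt(question: str, evidence: str) -> str:
    """Assemble a prompt that asks for both cases before the answer and its label."""
    steps = "\n".join(f"{i}. {name}:" for i, name in enumerate(SECTIONS, start=1))
    return (
        "Use only the evidence below.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\n\n"
        "Respond with these sections, in this order:\n"
        f"{steps}\n"
    )


def follows_order(response: str) -> bool:
    """True when every section header is present and appears in the required order."""
    positions = [response.find(f"{name}:") for name in SECTIONS]
    return all(p >= 0 for p in positions) and positions == sorted(positions)


print(build_reasoning_first_prompt("Does the policy require annual audits?", "<policy text>"))
print(follows_order("Case for: a\nCase against: b\nAnswer: c\nConfidence: low"))
print(follows_order("Answer: c\nConfidence: high\nCase for: a\nCase against: b"))
```

The order check is deliberately crude. It only confirms that the verdict came last, not that the reasoning before it was any good.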
Stage 4: Uncover Uncertainty, Then Numb the Over-Hedge
Surface genuine uncertainty without letting the model hide behind blanket hedging.
The two-sided risk
Models err in both directions: faking certainty, and qualifying everything into uselessness. This stage balances them.
What to do
- Grant explicit permission to abstain — "I cannot determine this" is allowed and preferred when evidence is thin.
- Require per-claim confidence labels so a solid fact does not mask a shaky inference.
- Then guard against over-hedging by testing on easy items the model should answer confidently.
Both failure directions are catalogued in the common mistakes guide; this stage handles them together.
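The over-hedge guard reduces to a simple rate: of the easy items the model should answer confidently, how many came back hedged or abstained? The sketch below assumes each response has already been reduced to a single label. The label names and the 10% tolerance are illustrative assumptions, not recommended values.

```python
# Assumed label vocabulary: anything in this set counts as a hedge on an easy item.
HEDGED_LABELS = {"low", "abstain"}


def over_hedge_rate(easy_item_labels: list) -> float:
    """Fraction of easy items the model hedged on or declined to answer."""
    if not easy_item_labels:
        raise ValueError("need at least one easy item")
    hedged = sum(label in HEDGED_LABELS for label in easy_item_labels)
    return hedged / len(easy_item_labels)


def needs_tightening(easy_item_labels: list, max_rate: float = 0.10) -> bool:
    """Flag the prompt for revision when too many easy items come back hedged."""
    return over_hedge_rate(easy_item_labels) > max_rate


labels = ["high", "high", "low", "abstain"]
print(over_hedge_rate(labels))   # 0.5
print(needs_tightening(labels))  # True
```

Running this on easy items only is the point: hedging on hard items is what the abstention permission is for, so it should not count against the prompt.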
Stage 5: Demonstrate With Measurement
Prove the calibration works against known answers before you trust it.
Why this closes the loop
Every prior stage can fail silently. Measurement is the only thing that confirms the labels carry information. Without it, GROUND is theater.
What to do
- Run a test set with ground truth, capturing a baseline first.
- Confirm high-confidence answers are reliably right and errors cluster in the low-confidence band.
- Re-run whenever the model, domain, or prompt changes.
This stage is what makes the method honest, and it connects directly to the release checklist.
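The check in the list above can be sketched as accuracy per confidence band over a labelled test set. The code assumes each test result is a `(confidence_label, is_correct)` pair with labels `"high"` and `"low"`; the function names and the 0.9 accuracy floor are assumptions for the example. Comparing band accuracies is a simple proxy for "errors cluster in the low-confidence band," not a full calibration metric.

```python
from collections import defaultdict


def accuracy_by_band(results) -> dict:
    """results: iterable of (confidence_label, is_correct) pairs from a test set."""
    totals = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for label, is_correct in results:
        totals[label][0] += int(is_correct)
        totals[label][1] += 1
    return {label: correct / total for label, (correct, total) in totals.items()}


def labels_carry_information(results, min_high_accuracy: float = 0.9) -> bool:
    """High-confidence answers must be reliably right and beat the low band."""
    bands = accuracy_by_band(results)
    high, low = bands.get("high"), bands.get("low")
    if high is None or low is None:
        return False  # cannot compare bands that never occurred in the test set
    return high >= min_high_accuracy and high > low


results = (
    [("high", True)] * 9 + [("high", False)]
    + [("low", True)] * 2 + [("low", False)] * 3
)
print(accuracy_by_band(results))          # {'high': 0.9, 'low': 0.4}
print(labels_carry_information(results))  # True
```

Capturing the same numbers before any prompt changes gives the baseline, and re-running the function after a model, domain, or prompt change shows whether the labels still separate right from wrong.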
Walking GROUND Through a Single Task
Stages are abstract until you see them run on one problem end to end. Take a model summarizing a policy document for a compliance reviewer.
The five stages in motion
- Gauge: a confidently wrong policy claim could mislead a compliance decision, so stakes are high and the prompt biases hard toward explicit uncertainty.
- Restrict: the evidence source is the policy text itself; the model must quote the passage behind each claim or mark the claim "not in document" with low confidence.
- Order: for any contested interpretation, the model lays out the supporting and the conflicting language before stating its reading and a confidence level.
- Uncover and numb: the model may answer "the document does not specify this" on genuine gaps, but a test set of clearly-answered questions confirms it still commits where the text is plain.
- Demonstrate: twenty questions with known answers, including some the document cannot answer, prove the high-confidence claims are reliable and the gaps surface as low confidence.
The result is a summary a reviewer can trust at a glance: confident claims carry quotes, soft claims are flagged, and gaps are named rather than papered over. Run loosely, this same sequence handles a low-stakes draft in seconds.
When to Compress the Method
GROUND is scaffolding, and scaffolding can be lightened when the load is small.
Matching depth to stakes
- Low-stakes, throwaway work: Gauge tells you the stakes are minor, so you might run only Restrict and Uncover — ground the claims, allow abstention — and skip formal measurement.
- Recurring, moderate-stakes work: run all five but keep the test set small, treating Demonstrate as a quick spot-check.
- High-stakes or production work: run every stage rigorously, with a full test set and the measurement stage as a hard release gate.
The mistake is running GROUND at maximum depth everywhere, which makes it feel heavy and tempts people to abandon it. Scaling the method to the stakes is what keeps it a habit rather than a chore, and it echoes the staged effort the trade-offs guide recommends for choosing methods at all.
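The three depth levels above can be written down as a small lookup so the choice is made once rather than renegotiated per task. This is a sketch under stated assumptions: `plan_for`, the tier names, and the `"none"`, `"small"`, and `"full"` test-set descriptors are hypothetical, and real sizes depend on your domain.

```python
ALL_STAGES = ("Gauge", "Restrict", "Order", "Uncover and numb", "Demonstrate")

# Hypothetical plans mirroring the three tiers described above.
PLANS = {
    "low": {
        "stages": ("Gauge", "Restrict", "Uncover and numb"),
        "test_set": "none",      # formal measurement is skipped
        "release_gate": False,
    },
    "moderate": {
        "stages": ALL_STAGES,
        "test_set": "small",     # Demonstrate runs as a quick spot-check
        "release_gate": False,
    },
    "high": {
        "stages": ALL_STAGES,
        "test_set": "full",      # measurement blocks release until it passes
        "release_gate": True,
    },
}


def plan_for(stakes: str) -> dict:
    """Return which stages to run, and how heavily, for a given stakes tier."""
    if stakes not in PLANS:
        raise ValueError(f"unknown stakes tier: {stakes!r}")
    return PLANS[stakes]


for tier in ("low", "moderate", "high"):
    plan = plan_for(tier)
    print(tier, len(plan["stages"]), plan["test_set"], plan["release_gate"])
```

Gauge appears in every plan because it is the stage that selects the tier in the first place.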
Frequently Asked Questions
Why does the framework spell out GROUND specifically?
Because the mnemonic encodes the central principle: grounding confidence in evidence rather than tone. The six letters map onto the five stages (Gauge stakes, Restrict to evidence, Order reasoning, Uncover uncertainty and Numb over-hedging, Demonstrate with measurement), with stage four taking two letters. The name therefore serves as both a memory aid and a statement of what calibration fundamentally requires.
Do I have to run all five stages every time?
Not literally. On low-stakes tasks you can compress the later stages, accepting lighter validation. On high-stakes work you run all five rigorously, because a confident error there is costly. The framework's value is a consistent structure to adapt, not a mandate to perform every stage at full depth regardless of context.
Which stage does the most work?
Restricting to evidence and ordering the reasoning together carry most of the improvement. Grounding breaks the link between sounding sure and being sure, and reasoning-first stops the model from rationalizing an answer it already committed to. The other stages set context, balance over-hedging, and verify, but those two change what the confidence rating actually reflects.
How is this different from just listing best practices?
The framework sequences the practices and ties each to the failure mode it addresses, so you apply them in an order that builds on itself. A flat list of best practices leaves you deciding what to do first; GROUND gives you stages with a rationale and a stopping point, which is what makes calibration repeatable rather than improvised.
What happens if I skip the measurement stage?
You get something that looks calibrated but is unverified. Every earlier stage can pass on the surface while the labels fail to track correctness. Skipping measurement means trusting confidence you have never tested, which is the exact risk calibration is meant to remove. Measurement is what turns the method from theater into a reliable process.
Can the framework handle the over-hedging problem?
Yes, that is the second half of stage four. After granting permission to abstain, you guard against the model qualifying everything by testing on easy questions it should answer confidently. If those come back hedged, you tighten the instruction. The framework deliberately addresses both overconfidence and over-hedging rather than only the more famous one.
Key Takeaways
- GROUND is a five-stage method: Gauge stakes, Restrict to evidence, Order reasoning, Uncover uncertainty while numbing the over-hedge, Demonstrate with measurement.
- The mnemonic encodes the core idea — ground confidence in evidence, not in how the answer sounds.
- Restricting to evidence and ordering reasoning before the verdict carry most of the improvement.
- Stage four balances both failure directions: faking certainty and hedging everything into uselessness.
- Measurement against known answers is what turns the framework from theater into a reliable process.
- Adapt the depth to the stakes, but keep the stages and re-run when the model, domain, or prompt changes.