Teams that adopt iterative prompting usually measure the wrong thing. They count outputs produced and celebrate the volume, while the metric that actually reflects whether the practice is working—how much effort it takes to reach a usable result—stays invisible. As a consequence, they cannot tell a healthy loop from a wasteful one, and they optimize blindly.
Good measurement here is not about dashboards. It is about tracking a few honest numbers that reveal whether your loops are converging quickly, whether quality is real or merely felt, and where effort is leaking. You can capture most of these with a spreadsheet and a habit, no platform required.
This article defines the metrics worth tracking, how to instrument each one cheaply, and how to interpret the signal. It assumes you are already running structured loops via the Draft-Diagnose-Constrain method; metrics tell you whether that process is paying off.
One principle runs through everything below: a metric earns its place only if it changes a decision. There is no value in a number you watch and never act on. So for each metric, we name not just what it measures but what you do differently when it moves. Measurement that does not drive action is just overhead dressed up as rigor, and a small set of decision-driving numbers beats a sprawling dashboard every time.
The Metric Most Teams Miss
Passes to Acceptance
The number of refinement turns it takes to reach a usable output is the truest health signal for a loop. A loop that resolves in one or two passes is converging; one that routinely takes six is spiraling. Track the average and watch the trend.
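As a rough illustration of "track the average and watch the trend," the sketch below computes both from a list of pass counts. The sample data, the function names, and the split-in-half trend measure are assumptions made for the example, not a prescribed method.

```python
from statistics import mean

# Hypothetical log: refinement passes needed per accepted task, oldest first.
passes_log = [2, 1, 3, 2, 4, 5, 4, 6]

def average_passes(passes):
    """Mean number of refinement passes per accepted output."""
    return mean(passes)

def trend(passes):
    """Crude trend: mean of the later half minus mean of the earlier half.

    A positive value means loops are taking more passes over time.
    Assumes at least two logged tasks.
    """
    half = len(passes) // 2
    return mean(passes[half:]) - mean(passes[:half])

print("average passes:", average_passes(passes_log))
print("trend (later half minus earlier half):", trend(passes_log))
```

With this sample the average sits above three passes and the trend is positive, which is the spiraling pattern described above.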
Why It Beats Volume
Output volume can rise while passes-to-acceptance also rises, which means you are producing more drafts at a higher cost per usable output. The team in "How a Three-Person Editorial Team Rebuilt Its Workflow Around Refinement Loops" discovered exactly this: volume looked healthy while total effort had barely improved.
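A small worked example makes the arithmetic concrete. The weekly figures are invented, and counting one drafting turn plus the refinement passes per output is an assumption of this sketch.

```python
# Hypothetical two weeks: usable outputs produced and average passes to acceptance.
week_1 = {"outputs": 20, "avg_passes": 2.0}
week_2 = {"outputs": 30, "avg_passes": 4.0}

def total_turns(week):
    """Total model turns: one drafting turn plus the refinement passes, per output."""
    return week["outputs"] * (1 + week["avg_passes"])

def turns_per_output(week):
    """Effort behind each usable output, measured in turns."""
    return total_turns(week) / week["outputs"]

for name, week in (("week 1", week_1), ("week 2", week_2)):
    print(name, week["outputs"], "outputs,", total_turns(week), "turns,",
          turns_per_output(week), "turns per output")
```

Here volume grows by half while total turns more than double, so a volume chart alone would report a good week.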
Quality Metrics
Acceptance Rate on First Pass
What fraction of first drafts are usable with zero refinement? This isolates the quality of your starting prompt from the quality of your loop. A rising first-pass acceptance rate means your draft prompts are getting better.
Defect Recurrence
Are the same defects showing up loop after loop? If "unsupported claim" appears in your diagnose stage every time, that is a signal to fix the starting prompt, not to keep catching it in refinement. Recurring defects are prompts asking to be improved.
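One way to spot recurrence is to count how often each defect label appears in the log. In the sketch below the labels, the sample log, and the 30 percent threshold are all illustrative assumptions; pick a threshold that suits your own volume.

```python
from collections import Counter

# Hypothetical log of the dominant defect diagnosed on each task.
defect_log = [
    "unsupported claim", "too long", "unsupported claim",
    "wrong tone", "unsupported claim", "too long",
]

def recurring_defects(defects, min_share=0.3):
    """Return (defect, share) pairs that account for at least min_share of tasks.

    Results are ordered from most to least frequent.
    """
    total = len(defects)
    counts = Counter(defects)
    return [(d, n / total) for d, n in counts.most_common() if n / total >= min_share]

for defect, share in recurring_defects(defect_log):
    print(f"{defect}: {share:.0%} of logged tasks")
```

Any defect that clears the threshold is a candidate for a fix in the starting prompt rather than another round of catching it in refinement.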
Holding Against the Bar
Of outputs you called done, how many actually met your defined quality bar on later review? A gap here means your in-the-moment "done" judgment is too loose, and your stopping rule needs tightening.
Effort and Cost Metrics
Time to Acceptance
Total minutes from first prompt to usable output. This is the metric clients and managers care about, and it is the one that improves when loops get tighter. Passes-to-acceptance is the cause; time-to-acceptance is the effect.
Effort per Turn
Some turns are quick nudges; some require careful diagnosis. If each turn is getting heavier and the number of turns is not falling, you may be over-refining. The selection logic in "Iterate, Restart, or Rewrite the Prompt When Output Disappoints" helps you cut wasteful turns.
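The two effort metrics can be computed from the same records if you also note minutes spent. The sketch below assumes a hypothetical record with `minutes` and `turns` fields and invented figures for two months.

```python
# Hypothetical records: total minutes and number of turns for each accepted output.
last_month = [{"minutes": 8, "turns": 4}, {"minutes": 10, "turns": 5}]
this_month = [{"minutes": 24, "turns": 4}, {"minutes": 30, "turns": 5}]

def time_to_acceptance(tasks):
    """Average minutes from first prompt to usable output."""
    return sum(t["minutes"] for t in tasks) / len(tasks)

def minutes_per_turn(tasks):
    """Average effort per turn across all logged tasks."""
    return sum(t["minutes"] for t in tasks) / sum(t["turns"] for t in tasks)

for label, tasks in (("last month", last_month), ("this month", this_month)):
    print(label, time_to_acceptance(tasks), "min per task,",
          minutes_per_turn(tasks), "min per turn")
```

In this sample the turn counts are unchanged while minutes per turn triple, which is the "heavier turns, not fewer of them" pattern worth investigating.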
How to Instrument Cheaply
A Spreadsheet and a Habit
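For each task, log three fields: passes to acceptance, whether the first pass was usable, and the dominant defect you diagnosed. That is enough to compute passes to acceptance, first-pass acceptance rate, and defect recurrence. Time to acceptance and the holding-against-the-bar check each need one more column (minutes spent, and the result of the later review), which you can add when you are ready for them. The discipline is logging consistently, not the tool.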
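If the spreadsheet is exported as CSV, a few lines of standard-library Python can summarize it. The column names, the `yes`/`no` encoding, and the sample rows below are assumptions chosen for the sketch; any consistent layout would work.

```python
import csv
import io
from collections import Counter

# Hypothetical export of the log: one row per sampled task.
LOG = """task,passes,first_pass_usable,dominant_defect
intro-email,0,yes,
product-faq,3,no,unsupported claim
release-notes,2,no,too long
blog-outline,4,no,unsupported claim
"""

def summarize(csv_text):
    """Compute average passes, first-pass rate, and the most common defect."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    passes = [int(r["passes"]) for r in rows]
    first_ok = [r["first_pass_usable"] == "yes" for r in rows]
    # Tasks accepted on the first pass have no defect to record.
    defects = Counter(r["dominant_defect"] for r in rows if r["dominant_defect"])
    return {
        "tasks": len(rows),
        "avg_passes": sum(passes) / len(passes),
        "first_pass_rate": sum(first_ok) / len(rows),
        "top_defect": defects.most_common(1)[0][0] if defects else None,
    }

print(summarize(LOG))
```

To read a real file instead of the inline string, pass the contents of the exported CSV to `summarize`.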
Sampling Beats Tracking Everything
You do not need to log every task. A representative sample—say one in five—reveals trends without turning measurement into a second job. The goal is signal, not surveillance.
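A simple way to sample without having to decide task by task is to log every fifth one. This sketch assumes tasks arrive in a list in the order you worked on them; the function name and the numeric task IDs are illustrative.

```python
def sample_every_nth(tasks, n=5, offset=0):
    """Systematic sample: keep every n-th task, starting at position offset."""
    return tasks[offset::n]

# Hypothetical task IDs for one week of work.
week_of_tasks = list(range(1, 21))
print(sample_every_nth(week_of_tasks))
```

Systematic sampling is easy to keep up as a habit, though it can mislead if your work follows a fixed cycle that lines up with the interval; varying `offset` from week to week is one way to guard against that.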
Reading the Signal
Healthy Loop
First-pass acceptance is climbing, passes-to-acceptance is low and stable, and defect recurrence is falling. This is a process compounding: better prompts feeding tighter loops.
Warning Signs
Passes-to-acceptance creeping up, the same defect recurring, or a gap between "called done" and "actually met the bar." Each points to a specific fix—usually a starting-prompt improvement or a tighter stopping rule, not more refinement.
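The three warning signs can be turned into a small periodic check. The thresholds and the function signature below are assumptions for illustration; the article does not prescribe specific cutoffs.

```python
def warning_signs(prev_avg_passes, curr_avg_passes, top_defect_share, hold_rate,
                  max_defect_share=0.3, min_hold_rate=0.9):
    """Return the warning signs present, each paired with its usual fix.

    top_defect_share: fraction of logged tasks showing the most common defect.
    hold_rate: fraction of outputs called done that met the bar on later review.
    """
    signs = []
    if curr_avg_passes > prev_avg_passes:
        signs.append("passes-to-acceptance is rising: review the starting prompt")
    if top_defect_share >= max_defect_share:
        signs.append("one defect keeps recurring: add a constraint to the starting prompt")
    if hold_rate < min_hold_rate:
        signs.append("done is not holding: tighten the stopping rule")
    return signs

for sign in warning_signs(2.0, 3.5, top_defect_share=0.5, hold_rate=0.95):
    print(sign)
```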
The Trap
Optimizing volume alone. If the only number you watch is drafts produced, you can make every other metric worse while feeling productive. Always pair a volume metric with an effort metric.
Start Small
You do not need all of these metrics on day one. Begin with the single most diagnostic number, passes to acceptance, logged on a sample of tasks for two weeks. That alone will tell you whether your loops are healthy or spiraling, and if you note the dominant defect next to it, your most common defect will surface almost immediately. Add first-pass acceptance and the holding-against-the-bar check only once the basic number is part of your habit. A measurement practice you actually sustain on one metric beats an ambitious dashboard you abandon after a week.
Turning Metrics Into Action
When Passes-to-Acceptance Climbs
If the trend line rises, do not respond by refining harder—that treats the symptom. Look at your defect log. A climbing pass count usually means your starting prompts have degraded, or the tasks have gotten harder without your draft prompts adapting. The fix lives in the draft stage, not the loop.
When First-Pass Acceptance Stalls
A flat first-pass rate means your draft prompts have stopped improving. The remedy is to harvest your recurring defects: whatever you diagnose most often should become a standing constraint in your starting prompt, so the model stops producing that defect in the first place. This is the single highest-leverage move the metrics reveal.
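Harvesting can be as simple as mapping each defect label to a constraint sentence and appending the most frequent ones to the starting prompt. The defect labels, the constraint wording, and the prompt layout in this sketch are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping from a diagnosed defect to a standing constraint.
CONSTRAINTS = {
    "unsupported claim": "Support every factual claim with a source from the provided material.",
    "too long": "Keep the response under 200 words.",
}

def harvest(base_prompt, defect_log, top_n=1):
    """Append constraints for the top_n most frequent defects to the prompt."""
    top = [d for d, _ in Counter(defect_log).most_common(top_n)]
    lines = [CONSTRAINTS[d] for d in top if d in CONSTRAINTS]
    if not lines:
        return base_prompt
    bullets = "\n".join(f"- {line}" for line in lines)
    return f"{base_prompt}\n\nStanding constraints:\n{bullets}"

log = ["unsupported claim", "too long", "unsupported claim"]
print(harvest("Write a product FAQ from the attached notes.", log))
```

After the change, watch whether that defect's share in the log falls and whether the first-pass rate starts moving again.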
When Done Does Not Hold
If outputs you called done keep failing later review, your in-the-moment judgment is too loose. Tighten the stopping rule into something checkable—a concrete bar rather than a feeling—so "done" means the same thing every time. This connects directly to the discipline of defining done in the Draft-Diagnose-Constrain method.
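A checkable stopping rule can be written as a short list of pass-or-fail checks. The three checks below (a word limit, required terms, and no placeholder text) are hypothetical examples of a bar, not the bar itself; substitute whatever your own definition of done requires.

```python
def meets_bar(text, max_words=200, required_terms=("pricing",)):
    """Return (passed, failed_checks) for a draft against a concrete bar."""
    lowered = text.lower()
    checks = {
        "within word limit": len(text.split()) <= max_words,
        "covers required terms": all(term.lower() in lowered for term in required_terms),
        "no placeholder text": "TODO" not in text,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return not failed, failed

print(meets_bar("Our pricing starts at ten dollars per seat."))
print(meets_bar("TODO write this section"))
```

Because the same checks run every time, "done" stops depending on how the draft happens to feel in the moment, and the failed-check names can go straight into the defect log.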
Common Measurement Mistakes
Tracking Everything
Logging every task turns measurement into a second job and the habit collapses. Sample instead—one task in five is plenty to reveal a trend. The goal is signal, not a complete record.
Confusing Activity With Progress
A busy dashboard full of draft counts feels productive and tells you almost nothing about whether your loops are healthy. Always pair any volume number with an effort number like passes-to-acceptance, or you will optimize the wrong thing.
Measuring Without Acting
The point of these numbers is to change behavior: improve a draft prompt, tighten a stopping rule, cut a wasteful turn. A metric you watch but never act on is overhead. If a number is not changing a decision, stop tracking it. The actions these metrics drive also feed the ROI case in "Putting a Dollar Figure on Tighter AI Refinement Loops."
Frequently Asked Questions
What is the single most important metric for refinement loops?
Passes to acceptance—the number of turns it takes to reach a usable output. It is the truest signal of whether a loop is converging or spiraling, and it is the metric most teams overlook in favor of raw volume.
How is passes-to-acceptance different from time-to-acceptance?
Passes is the cause; time is the effect. Fewer, tighter loops produce a usable result in less total time. Track passes to diagnose loop health, and report time-to-acceptance to managers and clients, who care about elapsed time.
What does a recurring defect tell me?
That your starting prompt, not your loop, needs work. If the same defect appears in your diagnose stage every task, stop catching it in refinement and add a constraint to the draft prompt that prevents it.
Do I need special tooling to track these?
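No. A spreadsheet logging three fields per task (passes to acceptance, first-pass usable or not, and the dominant defect) covers the core metrics here. Add a minutes column and a later-review column when you want time to acceptance and the holding-against-the-bar check. Consistent logging matters far more than the tool.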
Why is measuring volume alone a trap?
Because volume can rise while effort per usable output rises too, meaning you are working harder for the same result. Always pair a volume metric with an effort metric like passes-to-acceptance so you see the full picture.
Key Takeaways
- Passes to acceptance is the truest health signal for a loop; most teams wrongly track volume instead.
- First-pass acceptance rate isolates draft-prompt quality from loop quality.
- Recurring defects signal a starting-prompt fix, not more refinement.
- A gap between "called done" and "actually met the bar" means your stopping rule is too loose.
- A spreadsheet logging three fields per task captures the core metrics; sample rather than track everything.