Which Numbers Tell You a Refinement Loop Is Actually Healthy

Teams that adopt iterative prompting usually measure the wrong thing. They count outputs produced and celebrate the volume, while the metric that actually reflects whether the practice is working—how much effort it takes to reach a usable result—stays invisible. As a consequence, they cannot tell a healthy loop from a wasteful one, and they optimize blindly.

Good measurement here is not about dashboards. It is about tracking a few honest numbers that reveal whether your loops are converging quickly, whether quality is real or merely felt, and where effort is leaking. You can capture most of these with a spreadsheet and a habit, no platform required.

This article defines the metrics worth tracking, how to instrument each one cheaply, and how to interpret the signal. It assumes you are already running structured loops via the Draft-Diagnose-Constrain method; metrics tell you whether that process is paying off.

One principle runs through everything below: a metric earns its place only if it changes a decision. There is no value in a number you watch and never act on. So for each metric, we name not just what it measures but what you do differently when it moves. Measurement that does not drive action is just overhead dressed up as rigor, and a small set of decision-driving numbers beats a sprawling dashboard every time.

The Metric Most Teams Miss

Passes to Acceptance

The number of refinement turns it takes to reach a usable output is the truest health signal for a loop. A loop that resolves in one or two passes is converging; one that routinely takes six is spiraling. Track the average and watch the trend.
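As a minimal sketch of "track the average and watch the trend," assume you keep a chronological list of pass counts, one per accepted task. The function names and the window size are illustrative choices, not part of any published method.

```python
from statistics import mean


def average_passes(passes_per_task):
    """Average number of passes needed to reach a usable output."""
    return mean(passes_per_task)


def passes_trend(passes_per_task, window=3):
    """Mean of the most recent `window` tasks minus mean of the earliest `window`.

    Positive means loops are taking more passes over time; negative means fewer.
    """
    if len(passes_per_task) < 2 * window:
        raise ValueError("need at least 2 * window logged tasks")
    earliest = mean(passes_per_task[:window])
    recent = mean(passes_per_task[-window:])
    return recent - earliest


# Hypothetical log, oldest task first.
logged_passes = [2, 1, 3, 2, 4, 5, 6]
print(round(average_passes(logged_passes), 2))
print(passes_trend(logged_passes))
```

A positive trend value on this sample would suggest the loop is drifting toward spiraling even though the overall average still looks moderate.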

Why It Beats Volume

Output volume can rise while passes-to-acceptance also rises, which means you are producing more drafts at higher total cost. The team in How a Three-Person Editorial Team Rebuilt Its Workflow Around Refinement Loops discovered exactly this: volume looked healthy while total effort had barely improved.
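The arithmetic behind this is simple enough to sketch. The example assumes each logged number is the passes one accepted output needed, with the first draft counted as pass one; the weekly figures are invented for illustration.

```python
def effort_summary(passes_per_task):
    """Return (usable outputs, total turns spent, turns per usable output)."""
    outputs = len(passes_per_task)
    total_turns = sum(passes_per_task)
    return outputs, total_turns, total_turns / outputs


# Hypothetical weeks: each number is the passes one accepted output needed.
week_a = [2, 2, 1, 3, 2]              # 5 outputs
week_b = [4, 3, 5, 4, 4, 3, 5, 4]     # 8 outputs

print(effort_summary(week_a))
print(effort_summary(week_b))
```

In this made-up comparison, volume rises from five outputs to eight while the turns spent per usable output double, which is the pattern a volume-only count hides.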

Quality Metrics

Acceptance Rate on First Pass

What fraction of first drafts are usable with zero refinement? This isolates the quality of your starting prompt from the quality of your loop. A rising first-pass acceptance rate means your draft prompts are getting better.
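A minimal sketch of the calculation, assuming you record one yes/no flag per task. The two sample periods are hypothetical.

```python
def first_pass_rate(first_pass_usable):
    """Fraction of tasks whose first draft was accepted without refinement."""
    if not first_pass_usable:
        raise ValueError("no tasks logged")
    return sum(first_pass_usable) / len(first_pass_usable)


# Hypothetical logs: True means the first draft was usable as-is.
earlier_period = [False, False, True, False, False]
later_period = [True, False, True, True, False]

print(first_pass_rate(earlier_period))
print(first_pass_rate(later_period))
```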

Defect Recurrence

Are the same defects showing up loop after loop? If "unsupported claim" appears in your diagnose stage every time, that is a signal to fix the starting prompt, not to keep catching it in refinement. Recurring defects are prompts asking to be improved.
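One way to make recurrence visible is to count how often each label is the dominant defect. This sketch assumes free-text defect labels that you apply consistently; the labels and the log are illustrative.

```python
from collections import Counter


def defect_recurrence(dominant_defects):
    """Map each defect label to the share of logged tasks where it was dominant.

    Tasks with no defect are logged as None and count toward the total only.
    """
    counts = Counter(d for d in dominant_defects if d)
    total = len(dominant_defects)
    return {defect: n / total for defect, n in counts.most_common()}


# Hypothetical log of the dominant defect diagnosed on each task.
defect_log = [
    "unsupported claim",
    "too long",
    "unsupported claim",
    None,
    "unsupported claim",
    "wrong tone",
]

for defect, share in defect_recurrence(defect_log).items():
    print(f"{defect}: {share:.0%}")
```

A label that dominates half the sampled tasks, as "unsupported claim" does here, is the kind of defect that likely belongs in the starting prompt as a constraint.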

Holding Against the Bar

Of outputs you called done, how many actually met your defined quality bar on later review? A gap here means your in-the-moment "done" judgment is too loose, and your stopping rule needs tightening.

Effort and Cost Metrics

Time to Acceptance

Total minutes from first prompt to usable output. This is the metric clients and managers care about, and it is the one that improves when loops get tighter. Passes-to-acceptance is the cause; time-to-acceptance is the effect.

Effort per Turn

Some turns are quick nudges; some require careful diagnosis. If your turns are getting heavier without fewer of them, you may be over-refining. The selection logic in Iterate, Restart, or Rewrite the Prompt When Output Disappoints helps you cut wasteful turns.
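To see whether turns are getting heavier, divide minutes by passes. The sketch below assumes you log elapsed minutes alongside the pass count for each task; the two months of data are invented.

```python
def summarize(tasks):
    """Summarize (minutes_to_acceptance, passes_to_acceptance) pairs."""
    total_minutes = sum(minutes for minutes, _ in tasks)
    total_passes = sum(passes for _, passes in tasks)
    return {
        "avg_minutes": total_minutes / len(tasks),
        "avg_passes": total_passes / len(tasks),
        "minutes_per_turn": total_minutes / total_passes,
    }


# Hypothetical (minutes, passes) pairs for a sample of tasks.
last_month = [(20, 4), (15, 3), (20, 4)]
this_month = [(40, 4), (30, 3), (40, 4)]

print(summarize(last_month))
print(summarize(this_month))
```

Here the pass count is unchanged while minutes per turn double, so time to acceptance worsens even though passes to acceptance looks stable. That combination is a hint that individual turns may be over-refined.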

How to Instrument Cheaply

A Spreadsheet and a Habit

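For each task, log three fields: passes to acceptance, whether the first pass was usable, and the dominant defect you diagnosed. That is enough to compute passes to acceptance, first-pass acceptance rate, and defect recurrence. Add elapsed minutes and a later-review result when you also want time to acceptance and the done-holds check. The discipline is logging consistently, not the tool.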
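If you prefer a CSV file to a spreadsheet application, the same log can be sketched in a few lines. The column names and the extra "task" label column are assumptions made for this example, and an in-memory buffer stands in for a real file.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; "task" is only a label for the row.
FIELDS = ["task", "passes", "first_pass_usable", "dominant_defect"]

rows = [
    {"task": "blog intro", "passes": 1, "first_pass_usable": "yes", "dominant_defect": ""},
    {"task": "product page", "passes": 3, "first_pass_usable": "no", "dominant_defect": "unsupported claim"},
    {"task": "newsletter", "passes": 2, "first_pass_usable": "no", "dominant_defect": "too long"},
    {"task": "case study", "passes": 4, "first_pass_usable": "no", "dominant_defect": "unsupported claim"},
]

# Write the log (an in-memory buffer here; a real file works the same way).
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

# Read it back and compute the metrics those fields support.
buffer.seek(0)
logged = list(csv.DictReader(buffer))

avg_passes = sum(int(r["passes"]) for r in logged) / len(logged)
first_pass = sum(r["first_pass_usable"] == "yes" for r in logged) / len(logged)
defects = Counter(r["dominant_defect"] for r in logged if r["dominant_defect"])

print(avg_passes, first_pass, defects.most_common(1))
```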

Sampling Beats Tracking Everything

You do not need to log every task. A representative sample—say one in five—reveals trends without turning measurement into a second job. The goal is signal, not surveillance.
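A one-in-five sample can be as simple as logging every fifth task. This sketch assumes you number tasks as you start them; the function name is illustrative.

```python
def should_log(task_number, every=5):
    """Log every `every`-th task, counting tasks from 1."""
    return task_number % every == 0


sampled = [n for n in range(1, 21) if should_log(n)]
print(sampled)  # [5, 10, 15, 20]
```

A fixed interval is easy to remember, but it can mislead if your work follows a repeating pattern (for example, every fifth task is always the same kind of job). In that case, picking tasks at random at roughly the same rate is a reasonable alternative.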

Reading the Signal

Healthy Loop

First-pass acceptance is climbing, passes-to-acceptance is low and stable, and defect recurrence is falling. This is a process compounding: better prompts feeding tighter loops.

Warning Signs

Passes-to-acceptance creeping up, the same defect recurring, or a gap between "called done" and "actually met the bar." Each points to a specific fix—usually a starting-prompt improvement or a tighter stopping rule, not more refinement.
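These three checks can be written down as a small rule set. The thresholds below are placeholders chosen for the example, not recommended values; set them from your own baseline.

```python
def warning_signs(passes_trend, top_defect_share, hold_rate,
                  recurrence_limit=0.5, hold_floor=0.9):
    """Return (sign, suggested fix) pairs; an empty list means no warning fired.

    passes_trend: change in average passes (positive means rising).
    top_defect_share: share of sampled tasks with the most common defect.
    hold_rate: share of "done" outputs that met the bar on later review.
    """
    signs = []
    if passes_trend > 0:
        signs.append(("passes-to-acceptance is rising",
                      "improve the starting prompt"))
    if top_defect_share >= recurrence_limit:
        signs.append(("the same defect keeps recurring",
                      "add a constraint to the starting prompt"))
    if hold_rate < hold_floor:
        signs.append(("outputs called done fail later review",
                      "tighten the stopping rule"))
    return signs


for sign, fix in warning_signs(passes_trend=0.8, top_defect_share=0.6, hold_rate=0.7):
    print(f"{sign} -> {fix}")
```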

The Trap

Optimizing volume alone. If the only number you watch is drafts produced, you can make every other metric worse while feeling productive. Always pair a volume metric with an effort metric.

Start Small

You do not need all of these metrics on day one. Begin with the single most diagnostic number—passes to acceptance—logged on a sample of tasks for two weeks. That alone will tell you whether your loops are healthy or spiraling, and it will surface your most common defect almost immediately. Add first-pass acceptance and the done-holds check only once the basic number is part of your habit. A measurement practice you actually sustain on one metric beats an ambitious dashboard you abandon after a week.

Turning Metrics Into Action

When Passes-to-Acceptance Climbs

If the trend line rises, do not respond by refining harder—that treats the symptom. Look at your defect log. A climbing pass count usually means your starting prompts have degraded, or the tasks have gotten harder without your draft prompts adapting. The fix lives in the draft stage, not the loop.

When First-Pass Acceptance Stalls

A flat first-pass rate means your draft prompts have stopped improving. The remedy is to harvest your recurring defects: whatever you diagnose most often should become a standing constraint in your starting prompt, so the model stops producing that defect in the first place. This is the single highest-leverage move the metrics reveal.
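Harvesting can be sketched as a lookup from recurring defect labels to constraint sentences. The mapping, the wording of each constraint, and the recurrence threshold are all hypothetical; in practice you would write constraints that fit your own tasks.

```python
from collections import Counter

# Hypothetical mapping from a diagnosed defect label to a standing constraint.
CONSTRAINTS = {
    "unsupported claim": "Support every factual claim with the provided source material.",
    "too long": "Keep the draft under 300 words.",
}


def harvest_constraints(defect_log, min_count=3):
    """Return constraints for defects seen at least `min_count` times."""
    counts = Counter(defect_log)
    return [
        CONSTRAINTS[defect]
        for defect, n in counts.most_common()
        if n >= min_count and defect in CONSTRAINTS
    ]


def add_standing_constraints(base_prompt, constraints):
    """Append constraints to the starting prompt as a short list."""
    if not constraints:
        return base_prompt
    lines = "\n".join(f"- {c}" for c in constraints)
    return f"{base_prompt}\n\nStanding constraints:\n{lines}"


defect_log = ["unsupported claim"] * 3 + ["too long"] + ["wrong tone"] * 4
prompt = add_standing_constraints(
    "Draft a product summary.", harvest_constraints(defect_log)
)
print(prompt)
```

Note that "wrong tone" recurs most often in this sample but has no mapped constraint, so nothing is added for it. A gap like that is a prompt to write the missing constraint.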

When Done Does Not Hold

If outputs you called done keep failing later review, your in-the-moment judgment is too loose. Tighten the stopping rule into something checkable—a concrete bar rather than a feeling—so "done" means the same thing every time. This connects directly to the discipline of defining done in the Draft-Diagnose-Constrain method.
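A checkable bar can be as plain as a list of tests a draft must pass. The specific checks here (a word limit, required phrases, banned phrases) are invented for illustration; your own bar will depend on the task.

```python
def check_done(draft, max_words=200, required_phrases=(), banned_phrases=()):
    """Return the list of failed checks; an empty list means the draft meets the bar."""
    failures = []
    word_count = len(draft.split())
    if word_count > max_words:
        failures.append(f"too long: {word_count} words (limit {max_words})")
    lowered = draft.lower()
    for phrase in required_phrases:
        if phrase.lower() not in lowered:
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in banned_phrases:
        if phrase.lower() in lowered:
            failures.append(f"contains banned phrase: {phrase!r}")
    return failures


draft = "Our pricing starts at $20 per month. Studies show everyone loves it."
print(check_done(
    draft,
    max_words=50,
    required_phrases=["pricing"],
    banned_phrases=["studies show"],
))
```

Because the rule returns named failures rather than a feeling, two people (or the same person on two different days) should reach the same verdict on the same draft.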

Common Measurement Mistakes

Tracking Everything

Logging every task turns measurement into a second job and the habit collapses. Sample instead—one task in five is plenty to reveal a trend. The goal is signal, not a complete record.

Confusing Activity With Progress

A busy dashboard full of draft counts feels productive and tells you almost nothing about whether your loops are healthy. Always pair any volume number with an effort number like passes-to-acceptance, or you will optimize the wrong thing.

Measuring Without Acting

The point of these numbers is to change behavior—improve a draft prompt, tighten a stopping rule, cut a wasteful turn. A metric you watch but never act on is overhead. If a number is not changing a decision, stop tracking it. The actions these metrics drive also feed the ROI case in Putting a Dollar Figure on Tighter AI Refinement Loops.

Frequently Asked Questions

What is the single most important metric for refinement loops?

Passes to acceptance—the number of turns it takes to reach a usable output. It is the truest signal of whether a loop is converging or spiraling, and it is the metric most teams overlook in favor of raw volume.

How is passes-to-acceptance different from time-to-acceptance?

Passes is the cause; time is the effect. Fewer, tighter loops produce a usable result in less total time. Track passes to diagnose loop health, and report time-to-acceptance to managers and clients, who care about the elapsed effort.

What does a recurring defect tell me?

That your starting prompt, not your loop, needs work. If the same defect appears in your diagnose stage every task, stop catching it in refinement and add a constraint to the draft prompt that prevents it.

Do I need special tooling to track these?

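No. A spreadsheet logging three fields per task (passes to acceptance, first-pass usable or not, and the dominant defect) covers the core metrics here. Time to acceptance and the done-holds check need two more columns: elapsed minutes and a later-review result. Consistent logging matters far more than the tool.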

Why is measuring volume alone a trap?

Because volume can rise while effort per usable output rises too, meaning you are working harder for the same result. Always pair a volume metric with an effort metric like passes-to-acceptance so you see the full picture.

Key Takeaways

Passes to acceptance is the truest health signal for a loop; most teams wrongly track volume instead.
First-pass acceptance rate isolates draft-prompt quality from loop quality.
Recurring defects signal a starting-prompt fix, not more refinement.
A gap between "called done" and "actually met the bar" means your stopping rule is too loose.
A spreadsheet logging three fields per task captures the core metrics; sample rather than track everything.

Agency Script Editorial
