A sequential decision prompt either reaches the goal or it does not, so it is tempting to measure only that: success rate. But final success is a lagging, low-resolution signal. It tells you the chain failed without telling you where, and two chains with the same success rate can differ wildly in cost, reliability, and how close the failures came to being catastrophic. If success rate is all you measure, you are flying with one instrument.
The point of measuring sequential decision making is to localize quality and failure to specific steps and specific causes. That requires step-level instrumentation, not just outcome capture, and it requires KPIs chosen to expose the failure modes these chains actually have: drift, premature commitment, looping, and silent compounding error.
This article defines the metrics worth tracking, shows how to instrument them without crushing your latency budget, and explains how to read the resulting signal. The aim is a dashboard that points at the problem rather than merely confirming there is one.
Outcome Metrics: Necessary but Not Sufficient
Start with the obvious metrics, but treat them as the beginning of measurement rather than the end.
The Core Outcome KPIs
- Goal success rate. Did the chain reach a correct terminal state? The headline number, and the least diagnostic.
- Cost per resolved task. Total inference and tool cost across all runs, failed ones included, divided by the number of successful completions. A chain that succeeds expensively is not winning.
- Steps to completion. How many decisions the chain took. Rising step counts often signal drift before success rate moves.
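
As a rough illustration, the three outcome KPIs above can be computed from per-run records. This is a minimal sketch: the `ChainRun` fields and the `outcome_kpis` helper are hypothetical names chosen for this example, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChainRun:
    """Hypothetical per-run record; field names are assumptions."""
    succeeded: bool   # reached a correct terminal state
    cost_usd: float   # total inference + tool cost for this run
    steps: int        # number of decisions the chain took


def outcome_kpis(runs: List[ChainRun]) -> Dict[str, Optional[float]]:
    if not runs:
        raise ValueError("need at least one run")
    successes = sum(r.succeeded for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "success_rate": successes / len(runs),
        # Failed runs still cost money, so the numerator covers every run.
        "cost_per_resolved_task": total_cost / successes if successes else None,
        "mean_steps": sum(r.steps for r in runs) / len(runs),
    }


runs = [
    ChainRun(True, 0.40, 6),
    ChainRun(True, 0.60, 8),
    ChainRun(False, 1.00, 12),
    ChainRun(True, 0.50, 6),
]
kpis = outcome_kpis(runs)
```

Note that the single failed run here accounts for 40 percent of total spend, which success rate alone would not reveal.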
Why They Are Not Enough
Outcome metrics aggregate away the place where the chain went wrong. A 90 percent success rate hides whether the 10 percent failed early and cheaply or late and expensively after touching real systems.
Step-Level Quality Metrics
This is where the real diagnostic value lives. Grade individual decisions, not just final outcomes.
Per-Decision Metrics
- Decision correctness. On known cases, was each individual action the right one? This localizes which step type fails.
- Rationale quality. Did the model's stated reason actually justify its action? A right action for a wrong reason is fragile.
- Information sufficiency. Did the model have the facts it needed before acting, or did it commit prematurely? This metric catches the most common failure mode, the one that Vetting Each Step Before You Chain Decision Prompts targets.
Grading at Scale
- Use known-answer cases for objective per-step grading.
- Use model-assisted grading for subjective dimensions like rationale quality, calibrated against human labels.
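
One way to sketch both grading approaches: score decisions against known-answer cases grouped by step type, and check a model grader's labels against human labels on the same items. The function names, the tuple layout, and the use of raw agreement as the calibration measure are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def correctness_by_step_type(
    graded_steps: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Each item is (step_type, expected_action, actual_action)."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for step_type, expected, actual in graded_steps:
        totals[step_type] += 1
        correct[step_type] += int(expected == actual)
    return {t: correct[t] / totals[t] for t in totals}


def grader_agreement(model_labels: List[str], human_labels: List[str]) -> float:
    """Fraction of items where the model grader matches the human label."""
    if not human_labels or len(model_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)


steps = [
    ("lookup", "search", "search"),
    ("lookup", "search", "answer"),   # acted instead of gathering facts
    ("act", "refund", "refund"),
]
by_type = correctness_by_step_type(steps)
agreement = grader_agreement(
    ["good", "bad", "good", "good"],
    ["good", "bad", "bad", "good"],
)
```

Raw agreement is the simplest possible calibration check; a chance-corrected statistic may be a better fit when labels are heavily imbalanced.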
Reliability and Failure-Mode Metrics
Sequential chains have characteristic ways of breaking. Measure each directly.
The Failure Modes to Track
- Drift rate. How often the chain loses the goal mid-run. Detect it by checking whether restated goals still match the original.
- Loop rate. How often chains hit the step budget without resolving. A rising loop rate means stop conditions are weak.
- Premature-action rate. How often the model acts before the sufficiency check passes.
- Recovery rate. When the chain makes an error, how often it backtracks rather than rationalizing forward. The patterns behind recovery are in Edge Cases That Break Long Decision-Prompt Chains.
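
The four rates above might be derived from per-run summaries along these lines. The `RunSummary` fields are hypothetical, and the sketch assumes that some upstream process has already flagged drift, premature actions, errors, and backtracks for each run.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunSummary:
    """Hypothetical per-run summary; all field names are assumptions."""
    drifted: bool           # a restated goal diverged from the original
    resolved: bool          # reached a success or failure exit
    hit_step_budget: bool   # ran out of steps
    actions: int            # actions taken in the run
    premature_actions: int  # actions taken before the sufficiency check passed
    errors: int             # errors the chain made
    backtracks: int         # errors followed by an explicit backtrack


def failure_mode_rates(runs: List[RunSummary]) -> Dict[str, Optional[float]]:
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    actions = sum(r.actions for r in runs)
    errors = sum(r.errors for r in runs)
    premature = sum(r.premature_actions for r in runs)
    backtracks = sum(r.backtracks for r in runs)
    return {
        "drift_rate": sum(r.drifted for r in runs) / n,
        "loop_rate": sum(r.hit_step_budget and not r.resolved for r in runs) / n,
        "premature_action_rate": premature / actions if actions else None,
        "recovery_rate": backtracks / errors if errors else None,
    }


rates = failure_mode_rates([
    RunSummary(False, True, False, 5, 0, 1, 1),
    RunSummary(True, False, True, 10, 2, 2, 0),
    RunSummary(False, True, False, 4, 1, 0, 0),
    RunSummary(False, True, False, 6, 0, 1, 1),
])
```

Drift and loop rates are per run, while premature-action and recovery rates are per action and per error, so the denominators differ on purpose.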
Instrumentation Without Wrecking Latency
You cannot measure what you do not capture, but heavy capture degrades the thing you are measuring.
How to Instrument
- Log structured step records. Each step emits its stage, decision, rationale, and result as structured data, not free text.
- Sample deep traces. Capture full detail on a sample rather than every run to control overhead.
- Tag chains for retrieval, so you can find the failing chain among thousands. The tooling categories for this are surveyed in Which Software Actually Helps You Orchestrate Decision Prompts.
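
A minimal sketch of these three practices, assuming JSON as the structured format and hash-based sampling keyed on a chain ID. `StepRecord`, `keep_deep_trace`, `emit`, and the 5 percent default sample rate are illustrative choices, not an established API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class StepRecord:
    """Hypothetical structured step record."""
    chain_id: str
    step_index: int
    stage: str
    decision: str
    rationale: str
    result: str
    tags: List[str] = field(default_factory=list)  # for later retrieval


def keep_deep_trace(chain_id: str, sample_rate: float) -> bool:
    """Deterministic per-chain sampling: every step of a sampled chain is kept."""
    digest = hashlib.sha256(chain_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < sample_rate


def emit(record: StepRecord, deep_trace: Optional[dict] = None,
         sample_rate: float = 0.05) -> str:
    payload = asdict(record)
    # Attach the expensive detail only for chains that fall in the sample.
    if deep_trace is not None and keep_deep_trace(record.chain_id, sample_rate):
        payload["deep_trace"] = deep_trace
    return json.dumps(payload, sort_keys=True)


rec = StepRecord("chain-001", 2, "orient", "lookup_order",
                 "need order status before refunding", "found",
                 tags=["refund", "v3-prompt"])
line = emit(rec, deep_trace={"prompt_tokens": 812}, sample_rate=0.0)
```

Sampling on a hash of the chain ID rather than a random draw per step keeps traces whole: a chain is either fully captured or not captured at all.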
Reading the Signal
Metrics are only useful if you know what to do when they move.
Interpreting Common Patterns
- Success flat, steps rising. Drift is creeping in; re-grounding is weakening. Strengthen the Orient stage.
- Success flat, cost rising. The chain is succeeding less efficiently, often through more retries or longer chains. Investigate before it gets worse.
- High loop rate. Stop conditions are ambiguous. Tighten success and failure exits.
- Low recovery rate. The chain rationalizes errors forward. Strengthen the Verify stage and add explicit backtrack permission.
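
These patterns can be encoded as simple rules over a baseline and a current snapshot. The sketch below is one possible encoding; every threshold in it (what counts as "flat", "rising", "high", or "low") is an assumption to be tuned per system, and the metric keys are hypothetical.

```python
from typing import Dict, List, Tuple


def read_signal(
    baseline: Dict[str, float],
    current: Dict[str, float],
    flat_tol: float = 0.02,        # assumed: success within 2 points is "flat"
    rise_tol: float = 0.10,        # assumed: a 10% relative increase is "rising"
    loop_ceiling: float = 0.10,    # assumed: above this, loop rate is "high"
    recovery_floor: float = 0.50,  # assumed: below this, recovery rate is "low"
) -> List[Tuple[str, str]]:
    def rose(key: str) -> bool:
        # Relative change; assumes the baseline value is non-zero.
        return (current[key] - baseline[key]) / baseline[key] > rise_tol

    success_flat = abs(current["success_rate"] - baseline["success_rate"]) <= flat_tol
    findings = []
    if success_flat and rose("mean_steps"):
        findings.append(("drift", "strengthen the Orient stage"))
    if success_flat and rose("cost_per_resolved_task"):
        findings.append(("inefficiency", "investigate retries and chain length"))
    if current["loop_rate"] > loop_ceiling:
        findings.append(("ambiguous_stop_conditions", "tighten success and failure exits"))
    if current["recovery_rate"] < recovery_floor:
        findings.append(("errors_rationalized_forward",
                         "strengthen the Verify stage and permit backtracking"))
    return findings


baseline = {"success_rate": 0.90, "mean_steps": 8.0,
            "cost_per_resolved_task": 0.50, "loop_rate": 0.03, "recovery_rate": 0.70}
current = {"success_rate": 0.91, "mean_steps": 9.6,
           "cost_per_resolved_task": 0.51, "loop_rate": 0.12, "recovery_rate": 0.70}
findings = read_signal(baseline, current)
```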
Avoiding the Metrics That Mislead
Some metrics actively point you the wrong way. Knowing which to distrust is as important as knowing which to track.
Vanity and Trap Metrics
- Raw step count alone. Fewer steps is not better if it means the chain skipped necessary information gathering. Read step count alongside success and premature-action rate, never on its own.
- Aggregate success without segmentation. A healthy overall number can hide a category of inputs that fails completely. Segment by input type so a failing subgroup is not averaged into invisibility.
- Model-graded scores without calibration. Letting a model grade chain quality is useful but seductive. Without periodic checks against human labels, the grader can drift and you optimize toward its blind spots rather than real quality.
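
To make the segmentation point concrete, here is a small sketch that breaks success rate out by input type. The segment names and the `success_by_segment` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_by_segment(
    runs: Iterable[Tuple[str, bool]]
) -> Dict[str, Tuple[float, int]]:
    """Each item is (segment, succeeded). Returns segment -> (success_rate, run_count)."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for segment, succeeded in runs:
        totals[segment] += 1
        wins[segment] += int(succeeded)
    return {s: (wins[s] / totals[s], totals[s]) for s in totals}


# Nine billing runs succeed; the single refund run fails.
runs = [("billing", True)] * 9 + [("refund", False)]
overall = sum(ok for _, ok in runs) / len(runs)
segments = success_by_segment(runs)
```

Here the overall rate is 90 percent while the refund segment sits at zero. Returning the run count alongside each rate helps avoid over-reading segments with very few samples.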
Building a Dashboard That Points at Problems
- Pair every outcome metric with a diagnostic one: success with information sufficiency, cost with steps to completion. Pairs localize; singletons only alarm.
- Set a baseline and watch deltas. Absolute numbers matter less than movement. A metric trending the wrong way is a signal even when its current value looks acceptable.
- Connect metrics to the design they reflect. A rising drift rate maps to the Orient stage; a rising loop rate maps to stop conditions; a low recovery rate maps to Verify. Metrics that route to a fixable stage drive improvement; metrics that float free only generate anxiety.
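
The baseline-and-routing idea could look something like the sketch below. The `ROUTES` table mirrors the metric-to-stage mapping described above; the `min_delta` threshold and the function name are assumptions.

```python
from typing import Dict, List, Tuple

# metric -> (stage it routes to, direction that counts as worse: +1 rising, -1 falling)
ROUTES: Dict[str, Tuple[str, int]] = {
    "drift_rate": ("Orient stage", +1),
    "loop_rate": ("stop conditions", +1),
    "recovery_rate": ("Verify stage", -1),
}


def regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    min_delta: float = 0.01,  # assumed noise floor; tune per metric in practice
) -> List[Tuple[str, float, str]]:
    """Return (metric, delta, stage) for each metric moving the wrong way."""
    flagged = []
    for metric, (stage, worse_sign) in ROUTES.items():
        delta = current[metric] - baseline[metric]
        if delta * worse_sign > min_delta:
            flagged.append((metric, round(delta, 4), stage))
    return flagged


baseline = {"drift_rate": 0.04, "loop_rate": 0.02, "recovery_rate": 0.70}
current = {"drift_rate": 0.07, "loop_rate": 0.02, "recovery_rate": 0.62}
flagged = regressions(baseline, current)
```

A 7 percent drift rate might look acceptable in isolation; it is the movement from 4 percent that gets flagged and routed to a stage someone can fix.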
Frequently Asked Questions
What is the single most important metric?
If forced to pick one, cost per resolved task, because it combines correctness and efficiency. But it is a poor diagnostic on its own. The most useful single addition beyond outcomes is information sufficiency, since premature commitment is the dominant failure mode and that metric catches it directly.
How do I grade individual decisions objectively?
Maintain a set of cases where you know the correct action at each step, and grade the chain's decisions against them. For subjective dimensions like rationale quality, use model-assisted grading calibrated against a sample of human labels, and re-check that calibration periodically.
Won't step-level logging slow everything down?
Structured logging of decisions and rationales is cheap. The expensive part is full deep traces, which you should sample rather than capture on every run. Sampling gives you the diagnostic depth you need without paying the overhead on every chain.
How do I detect drift?
Have the model restate the goal periodically and compare it to the original. When restated goals diverge, drift is occurring. A rising divergence rate across runs is your drift metric, and it usually moves before success rate does.
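
A rough sketch of that comparison follows. It uses word-set overlap (Jaccard similarity) purely as a stand-in for whatever comparison you actually use, such as embedding similarity or a model-assisted check; the 0.5 threshold and all function names are assumptions.

```python
import re
from typing import Iterable, List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def goal_similarity(original: str, restated: str) -> float:
    """Jaccard overlap of word sets; a crude placeholder for a real comparison."""
    a, b = _tokens(original), _tokens(restated)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_drifted(original: str, restatements: List[str], threshold: float = 0.5) -> bool:
    """A run counts as drifted if any restated goal falls below the threshold."""
    return any(goal_similarity(original, r) < threshold for r in restatements)


def drift_rate(runs: Iterable[Tuple[str, List[str]]], threshold: float = 0.5) -> float:
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one run")
    return sum(run_drifted(o, rs, threshold) for o, rs in runs) / len(runs)


goal = "refund the duplicate charge on order 1182"
rate = drift_rate([
    (goal, ["refund the duplicate charge for order 1182"]),
    (goal, ["refund the duplicate charge on order 1182",
            "upgrade the customer to the premium plan"]),
])
```

Lexical overlap will miss paraphrases and flag harmless rewordings, so treat it only as a placeholder for the shape of the check.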
What is a good loop rate?
Low and stable. There is no universal number, but a rising loop rate is the signal that matters: it means chains increasingly hit the budget without resolving, which points at weak stop conditions rather than hard problems.
How often should I review these metrics?
Watch outcome and cost metrics continuously in production. Review step-level and failure-mode metrics whenever you change a prompt, and on a regular cadence regardless, because chains can degrade as inputs shift even when the prompt is unchanged.
Should I instrument every step or only the final outcome?
Instrument both, but treat them differently. Final-outcome metrics tell you whether the chain delivered value; step-level metrics tell you why it did or did not. If you only measure the outcome, a chain that succeeds for the wrong reasons, getting lucky despite a broken middle step, looks identical to one that succeeds by sound reasoning, and you will not notice the fragility until the inputs shift. Start with outcome metrics so you know whether there is a problem at all, then layer in per-step success, latency, and token cost for the steps that sit on the critical path. Steps that are cheap, reliable, and rarely implicated in failures can be sampled rather than measured on every run, which keeps your instrumentation overhead proportional to the risk each step actually carries.
Key Takeaways
- Final success rate is a lagging, low-resolution signal; it confirms failure without localizing it.
- Track outcome metrics (success, cost per resolved task, steps to completion) but treat them as a starting point.
- Step-level metrics (decision correctness, rationale quality, information sufficiency) localize where chains fail.
- Measure characteristic failure modes directly: drift rate, loop rate, premature-action rate, recovery rate.
- Instrument with structured step records and sampled deep traces to avoid wrecking latency.
- Read the signal by pattern: rising step counts mean drift, rising cost means inefficiency, and a high loop rate means weak stop conditions.