Shipped an Agent and Can't Tell If It Works?

Teams ship AI agents and then discover they have no idea whether the agent is working. The model returns plausible text, the demo looked good, and now it is in production making decisions that nobody can audit. The problem is not that agents are unmeasurable. The problem is that most teams measure the wrong things, or measure nothing at all until something breaks loudly.

An AI agent is a system that loops — model decides, tool acts, result returns, repeat. That loop generates a rich stream of observable events, and each event is a measurement opportunity. The trick is knowing which measurements predict real outcomes and which are vanity.

This guide defines the metrics that matter for agentic systems, explains how to instrument them without drowning in noise, and shows how to read the signal so you can tell improvement from regression.

Outcome Metrics Come First

Before any internal metric, decide what success means for the task. Everything else is diagnostic.

Task success rate

The single most important number: what fraction of runs actually accomplished the goal? This requires a definition of "accomplished" that a human or a reliable check can verify. If you cannot define success, you cannot measure the agent, and you should stop and fix that before anything else.
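
As a minimal sketch of that calculation, assume each run is stored with a boolean set by a human or an automated check rather than by the agent itself. The `Run` record and the `success_rate` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    task_id: str
    succeeded: bool  # set by a verifiable check, not by the agent's own claim


def success_rate(runs: list[Run]) -> float:
    """Fraction of runs whose goal was verifiably accomplished."""
    if not runs:
        raise ValueError("no runs to score")
    return sum(r.succeeded for r in runs) / len(runs)


runs = [Run("t1", True), Run("t2", False), Run("t3", True), Run("t4", True)]
print(f"task success rate: {success_rate(runs):.0%}")
```

The empty-input check is deliberate: a success rate over zero runs is undefined, and silently returning 0 or 1 would hide a broken pipeline.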

Resolution without escalation

For agents that hand off to humans, measure how often the agent resolves the task on its own versus escalating. A rising escalation rate is an early warning that the agent is hitting cases it cannot handle.

Quality of outcome

Success is binary; quality is graded. A support agent might resolve a ticket but with a tone that annoys the customer. Sample outputs and grade them on a rubric, ideally with a second model or a human as judge.
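
One way to turn rubric grades into a single trendable number is a weighted average. The sketch below assumes a 1 to 5 scale and an invented rubric with three dimensions; the dimension names, the weights, and the `rubric_score` helper are all illustrative assumptions.

```python
# Hypothetical rubric: dimension -> weight. The weights are assumed to sum to 1.0.
RUBRIC = {"correctness": 0.5, "tone": 0.3, "completeness": 0.2}


def rubric_score(grades: dict[str, int], rubric: dict[str, float] = RUBRIC) -> float:
    """Weighted average of per-dimension grades on a 1-5 scale."""
    missing = set(rubric) - set(grades)
    if missing:
        raise ValueError(f"ungraded dimensions: {sorted(missing)}")
    for dim in rubric:
        if not 1 <= grades[dim] <= 5:
            raise ValueError(f"grade for {dim} is outside 1-5")
    return sum(weight * grades[dim] for dim, weight in rubric.items())


# A ticket that was resolved correctly but with a tone that annoys the customer.
grades = {"correctness": 5, "tone": 2, "completeness": 4}
print(round(rubric_score(grades), 2))
```

Whether the grades come from a second model or a human, keeping the per-dimension values alongside the combined score lets you see which dimension moved when the score changes.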

If you are still establishing what good looks like, our best practices guide covers how to define success criteria before you build.

Process Metrics Diagnose the Loop

Outcome metrics tell you whether the agent works. Process metrics tell you why.

Steps per task. How many loop iterations did the agent take? A creeping average signals the agent is struggling or looping unnecessarily.
Tool call accuracy. When the agent calls a tool, did it pick the right tool with valid arguments? Failed or malformed tool calls are a leading cause of wasted steps.
Loop termination behavior. Does the agent stop when done, or does it run to the step cap? Hitting the cap is rarely a good sign.
Recovery rate. When a tool returns an error, does the agent recover gracefully or spiral? This separates robust agents from brittle ones.

These process metrics are where most debugging happens. The step-by-step guide shows how the loop produces each of these signals.
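
The four process metrics above can be computed from per-step records. This is a sketch under stated assumptions: the `Step` fields are hypothetical, the step cap is an invented constant, and "recovered" is defined narrowly as "the step right after a tool error is a valid call that does not error". Your own definition of recovery may be broader.

```python
from dataclasses import dataclass

STEP_CAP = 3  # assumed maximum loop iterations per task, kept small for the example


@dataclass
class Step:
    tool: str
    valid_call: bool  # right tool, with arguments that passed validation
    tool_error: bool  # the tool itself returned an error


def process_metrics(tasks: dict[str, list[Step]], step_cap: int = STEP_CAP) -> dict[str, float]:
    all_steps = [step for steps in tasks.values() for step in steps]
    errors = recovered = 0
    for steps in tasks.values():
        for i, step in enumerate(steps):
            if not step.tool_error:
                continue
            errors += 1
            following = steps[i + 1] if i + 1 < len(steps) else None
            if following is not None and following.valid_call and not following.tool_error:
                recovered += 1
    return {
        "steps_per_task": len(all_steps) / len(tasks),
        "tool_call_accuracy": sum(s.valid_call for s in all_steps) / len(all_steps),
        "cap_hit_rate": sum(len(steps) >= step_cap for steps in tasks.values()) / len(tasks),
        "recovery_rate": recovered / errors if errors else 1.0,
    }


tasks = {
    "a": [Step("search", True, False), Step("fetch", True, True), Step("fetch", True, False)],
    "b": [Step("search", False, False), Step("search", True, False)],
}
metrics = process_metrics(tasks)
print(metrics)
```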

Cost and Latency Are Not Optional

Agents can be expensive and slow, and both compound at scale.

Cost per task

Track total token spend per completed task, not per model call. An agent that takes twenty calls to do a job costs roughly ten times as much as one that takes two, and often more, because each call typically re-sends a growing context. Cost per task is the number your finance team will ask about, so instrument it from day one.
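
A minimal sketch of rolling per-call token usage up to a per-task cost. The prices are placeholders, not any provider's real rates, and the record fields (`task_id`, `input_tokens`, `output_tokens`) are assumed names.

```python
from collections import defaultdict

# Hypothetical prices in dollars per million tokens; substitute your provider's rates.
PRICE_IN = 3.00
PRICE_OUT = 15.00

calls = [
    {"task_id": "t1", "input_tokens": 1_000, "output_tokens": 200},
    {"task_id": "t1", "input_tokens": 1_400, "output_tokens": 100},
    {"task_id": "t2", "input_tokens": 2_000, "output_tokens": 500},
]


def cost_per_task(calls: list[dict]) -> dict[str, float]:
    """Sum the dollar cost of every model call belonging to the same task."""
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        dollars = (call["input_tokens"] * PRICE_IN + call["output_tokens"] * PRICE_OUT) / 1_000_000
        totals[call["task_id"]] += dollars
    return dict(totals)


costs = cost_per_task(calls)
for task_id, dollars in costs.items():
    print(f"{task_id}: ${dollars:.4f}")
```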

Latency distribution

Report the full distribution, not the average. Agents have long tails: most tasks finish fast, but a few spin for many steps. The 95th percentile is the latency that one task in twenty exceeds, so it reflects what your slowest users actually feel, and it is usually far worse than the mean.
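
To see how far the tail can sit from the center, here is a small sketch using Python's standard `statistics` module. The latency values are invented for illustration: eighteen fast tasks and two that spun for many steps.

```python
import statistics

# Invented per-task latencies in seconds: mostly fast, with a long tail.
latencies = [2.0] * 18 + [60.0, 90.0]

# quantiles(n=100) returns 99 cut points; index 49 is p50 and index 94 is p95.
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
p50, p95 = cuts[49], cuts[94]
mean = statistics.mean(latencies)

print(f"p50={p50:.1f}s  mean={mean:.1f}s  p95={p95:.1f}s")
```

Here the median says the agent is fast, the mean blurs the picture, and only the 95th percentile shows what the unlucky tasks experience.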

Cost-to-success ratio

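Combine cost with success: take the average cost among successful tasks. A stricter variant divides total spend, failed runs included, by the number of successes. Either number catches the trap where you improve success rate by letting the agent take more steps, quietly tripling cost.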
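
The trap is easiest to see with numbers. This sketch compares two invented agent versions. It reports the average cost among successful tasks and, as an added assumption, total spend divided by successes, which also charges failed runs to the agent. The `cost_to_success` helper and its field names are hypothetical.

```python
def cost_to_success(tasks: list[tuple[float, bool]]) -> dict[str, float]:
    """tasks is a list of (cost_in_dollars, succeeded) pairs."""
    success_costs = [cost for cost, ok in tasks if ok]
    if not success_costs:
        raise ValueError("no successful tasks to average over")
    return {
        "success_rate": len(success_costs) / len(tasks),
        "mean_cost_of_successes": sum(success_costs) / len(success_costs),
        "spend_per_success": sum(cost for cost, _ in tasks) / len(success_costs),
    }


# Invented data: the second version succeeds more often but takes more steps per task.
before = cost_to_success([(0.10, True), (0.10, True), (0.10, False), (0.10, False)])
after = cost_to_success([(0.30, True), (0.30, True), (0.30, True), (0.30, False)])
print(before)
print(after)
```

Success rate rises from 50 to 75 percent, which looks like a win until the cost columns are read next to it.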

How to Instrument Without Drowning

Measurement is worthless if it is too noisy to read or too sparse to trust.

Log every loop iteration as a structured event

Each step should emit the model's chosen action, the tool called, the arguments, the result, and a timestamp. Structured logs let you reconstruct any run after the fact, which is essential when a failure is hard to reproduce.
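
A sketch of what one such event could look like as a JSON line. The field names follow the list above; the `step_event` helper, the tool name, and the argument values are made up for the example.

```python
import json
from datetime import datetime, timezone


def step_event(step: int, action: str, tool: str, arguments: dict, result: str) -> str:
    """Serialize one loop iteration as a single JSON line."""
    event = {
        "step": step,
        "action": action,        # what the model decided to do
        "tool": tool,            # which tool was called
        "arguments": arguments,  # the arguments passed to the tool
        "result": result,        # what the tool returned
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event, sort_keys=True)


line = step_event(1, "call_tool", "search_orders", {"order_id": "A123"}, "1 match")
print(line)
```

One JSON object per line keeps the log greppable and lets any log pipeline parse it without a custom format.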

Attach a trace ID to every task

A single task spans many model calls. A trace ID ties them together so you can see the whole loop as one unit. Without this, you are staring at disconnected calls with no story.
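
A minimal sketch of the idea, assuming every logged event carries a `trace_id` and a `step` field (both names are assumptions). Events from concurrent tasks arrive interleaved; grouping by trace ID restores each task's story.

```python
import uuid
from collections import defaultdict


def new_trace_id() -> str:
    """Generate one ID per task; every model call in that task reuses it."""
    return uuid.uuid4().hex


def group_by_trace(events: list[dict]) -> dict[str, list[dict]]:
    """Rebuild each task's loop from a flat, interleaved event stream."""
    traces: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        traces[event["trace_id"]].append(event)
    for steps in traces.values():
        steps.sort(key=lambda e: e["step"])
    return dict(traces)


t1, t2 = new_trace_id(), new_trace_id()
events = [
    {"trace_id": t1, "step": 2, "tool": "refund"},
    {"trace_id": t2, "step": 1, "tool": "search"},
    {"trace_id": t1, "step": 1, "tool": "lookup"},
]
traces = group_by_trace(events)
print([e["tool"] for e in traces[t1]])
```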

Sample for human review

You cannot grade every output by hand. Sample a fixed percentage, plus all failures and all escalations. This gives you a stable quality signal without overwhelming reviewers. Our piece on measuring trade-offs explains why sampling beats exhaustive review.
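
One possible way to implement that rule is to hash the task's identifier, so the same task always gets the same decision no matter which process evaluates it. The 5 percent rate and the `needs_review` helper are illustrative assumptions.

```python
import hashlib

SAMPLE_PERCENT = 5  # assumed baseline review rate for ordinary successful runs


def needs_review(trace_id: str, succeeded: bool, escalated: bool,
                 percent: int = SAMPLE_PERCENT) -> bool:
    """Review every failure and escalation, plus a fixed share of everything else."""
    if not succeeded or escalated:
        return True
    # Map the ID to a stable bucket in 0-99 and keep the lowest `percent` buckets.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < percent


print(needs_review("task-001", succeeded=False, escalated=False))
print(needs_review("task-002", succeeded=True, escalated=True))
```

Hash-based selection is a design choice, not a requirement. A random draw works too, but it cannot be recomputed later to check why a given run was or was not reviewed.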

Reading the Signal

Numbers only help if you interpret them correctly.

Watch trends, not snapshots. A 90 percent success rate says little in isolation; a success rate falling from 90 to 80 percent over a week tells you something broke.
Segment by input type. Aggregate metrics hide problems. Break down by task category, customer tier, or input length to find where the agent struggles.
Correlate process with outcome. When success drops, check whether steps-per-task rose or tool accuracy fell. The process metric usually explains the outcome metric.
Beware Goodhart's law. Once a metric becomes a target, it tends to stop measuring what you care about. Keep a few metrics you do not optimize directly as honest checks.
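
Segmentation is the easiest of these habits to show in code. The sketch below uses invented data and hypothetical category names to show an aggregate success rate hiding one struggling segment.

```python
from collections import defaultdict


def success_by_segment(runs: list[tuple[str, bool]]) -> dict[str, float]:
    """runs is a list of (segment, succeeded) pairs."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # segment -> [successes, total]
    for segment, ok in runs:
        counts[segment][0] += int(ok)
        counts[segment][1] += 1
    return {segment: wins / total for segment, (wins, total) in counts.items()}


# Invented data: billing tasks mostly succeed, refund tasks succeed half the time.
runs = (
    [("billing", True)] * 9 + [("billing", False)]
    + [("refunds", True)] * 5 + [("refunds", False)] * 5
)
overall = sum(ok for _, ok in runs) / len(runs)
by_segment = success_by_segment(runs)
print(f"overall: {overall:.0%}", by_segment)
```

The overall figure of 70 percent describes neither segment well; the breakdown shows where to look.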

Building a Metrics Dashboard

A practical dashboard has three rows. The top row shows outcome metrics — success rate, escalation rate, quality score — trended over time. The middle row shows process metrics — steps per task, tool accuracy, recovery rate. The bottom row shows cost and latency distributions. With these three rows you can answer the only two questions that matter: is the agent working, and if not, why. For a fuller treatment of operationalizing this, see our team rollout guide.

Offline Evaluation Versus Production Monitoring

The metrics you track split into two regimes, and conflating them causes confusion.

Offline evaluation

Before launch, you run the agent against a fixed test set of representative tasks and measure success rate. This is a controlled experiment — same inputs every time — so you can compare versions cleanly. When you change a prompt or a tool, you rerun the offline suite to see whether you improved or regressed. Build this test set early; it becomes the regression guard that lets you change the agent without fear.
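
A sketch of comparing two versions on the same fixed test set, assuming each case yields a simple pass or fail. The `compare_versions` helper and the case IDs are hypothetical.

```python
def compare_versions(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Each dict maps a test-case ID to pass/fail on the same fixed test set."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both versions must run the identical test set")
    return {
        "baseline_rate": sum(baseline.values()) / len(baseline),
        "candidate_rate": sum(candidate.values()) / len(candidate),
        # Cases that used to pass and now fail, and the reverse.
        "regressions": sorted(c for c in baseline if baseline[c] and not candidate[c]),
        "fixes": sorted(c for c in baseline if not baseline[c] and candidate[c]),
    }


baseline = {"c1": True, "c2": True, "c3": False, "c4": True}
candidate = {"c1": True, "c2": False, "c3": True, "c4": True}
report = compare_versions(baseline, candidate)
print(report)
```

In this invented example both versions score 75 percent, yet the candidate broke one case while fixing another. Listing per-case changes keeps a flat headline number from hiding a regression.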

Production monitoring

Once live, you measure real traffic, which is messier and unrepeatable. Production metrics catch the inputs your test set never imagined and reveal drift over time. The two regimes are complementary: offline evaluation tells you whether a change is good before you ship it, and production monitoring tells you whether reality matches your expectation after you ship.

Closing the loop

The most valuable pattern is feeding production failures back into the offline test set. Every real failure becomes a permanent test case, so the agent never regresses on a problem you have already seen. This loop is how mature teams compound reliability over time, and it is why the two regimes belong together.
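
A minimal sketch of that feedback step, assuming the test set is keyed by a normalized form of the input and that a reviewer supplies the expected outcome. The `add_failure` helper and the record fields are illustrative, not a prescribed schema.

```python
def add_failure(test_set: dict[str, dict], failure: dict) -> bool:
    """Add a production failure as a permanent test case.

    Returns False if an equivalent input is already in the test set.
    """
    key = failure["input"].strip().lower()  # crude normalization, assumed good enough here
    if key in test_set:
        return False
    test_set[key] = {
        "input": failure["input"],
        "expected": failure["expected"],  # supplied by a human reviewer
        "source": "production",
    }
    return True


test_set: dict[str, dict] = {}
failure = {"input": "Refund order A123 twice", "expected": "refuse duplicate refund"}
first = add_failure(test_set, failure)
second = add_failure(test_set, failure)  # same failure seen again
print(first, second, len(test_set))
```

Deduplicating on input keeps one noisy failure mode from dominating the suite and skewing the offline success rate.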

Frequently Asked Questions

What is the single most important agent metric?

Task success rate, defined against a verifiable criterion. Everything else is diagnostic. If you can only track one number, track the fraction of runs that actually accomplished the goal, because without it you are flying blind.

How do I measure success when there is no clear right answer?

Use a rubric and a judge. Define the dimensions of a good outcome, then have a second model or a human grade samples against that rubric. This turns a fuzzy notion of quality into a number you can trend over time.

Why measure steps per task?

It is the clearest diagnostic of loop health. A rising step count usually means the agent is struggling, looping, or recovering from errors. It also directly drives cost and latency, so it is a leading indicator of two problems at once.

How much should I sample for human review?

Enough for a stable signal plus full coverage of failures and escalations. A fixed percentage of all runs gives you a quality baseline, while reviewing every failure ensures you catch new failure modes early. Adjust the percentage based on volume: at high volume a smaller percentage still yields enough samples and keeps reviewer workload manageable.

How do I avoid optimizing the wrong metric?

Keep honest checks. Pick a few metrics you deliberately do not optimize and use them to validate that your improvements are real. When a metric becomes a target it degrades as a measurement, so protect a couple from that pressure.

Key Takeaways

Start with outcome metrics — task success rate, escalation rate, and graded quality — before anything internal.
Process metrics like steps per task, tool accuracy, and recovery rate diagnose why outcomes move.
Track cost per completed task and the full latency distribution, not averages.
Instrument with structured per-step logs and trace IDs, then sample for human review.
Read trends and segments, correlate process with outcome, and guard against Goodhart's law.

Agency Script Editorial
