The metrics most teams use to evaluate AI coding assistants are the ones the tools hand them: acceptance rate, suggestions shown, lines generated. These numbers are easy to collect and reliably go up, which makes them feel like evidence. They are not. They measure how much the tool is used, not whether using it made anything better. A team can post a rising acceptance rate while shipping slower and buggier, and the dashboard will look like a success.
Measuring an AI coding assistant honestly means instrumenting outcomes, not activity. The question is never how many suggestions were accepted; it is whether the work got faster, the code got safer, and the team got more done without paying for it later. Those are harder to measure, which is exactly why they are worth measuring: the easy numbers are easy because they dodge the real question.
This piece defines the KPIs that actually carry signal, explains how to instrument them without distorting behavior, and shows how to read the results past the noise. The goal is a measurement practice that tells you the truth even when the truth is that the tool is not helping a particular kind of work.
Why Activity Metrics Mislead
The tool-provided metrics are seductive and hollow.
What Acceptance Rate Actually Measures
Acceptance rate measures how often developers press accept. A high rate can mean the suggestions are excellent, or it can mean developers are accepting unread code, which is a failure mode rather than a success. The metric cannot tell the two apart, so it carries no reliable signal on its own. This is one of the traps described in "Seven Failure Modes That Quietly Wreck AI Pair Programming."
The Goodhart Problem
When acceptance rate becomes a target, it stops measuring anything. Developers optimize toward it by accepting more, including low-value suggestions, and the number rises while quality falls. Any activity metric used as a goal corrupts the behavior it measures.
The Outcome Metrics That Matter
A few outcome KPIs carry the real signal.
Cycle Time
The time from starting a task to merging it. If the assistant helps, cycle time falls on the task types where it is strong. Cycle time is hard to game and directly tied to value, which makes it the anchor metric.
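As a minimal sketch, assuming each task record carries ISO 8601 start and merge timestamps (the record layout and task IDs below are hypothetical), cycle time reduces to a subtraction and a median:

```python
from datetime import datetime
from statistics import median


def cycle_time_hours(started_at: str, merged_at: str) -> float:
    """Hours elapsed between task start and merge (ISO 8601 strings)."""
    start = datetime.fromisoformat(started_at)
    merged = datetime.fromisoformat(merged_at)
    return (merged - start).total_seconds() / 3600


# Hypothetical records: (task id, started, merged).
tasks = [
    ("T-1", "2024-03-04T09:00:00", "2024-03-04T17:00:00"),
    ("T-2", "2024-03-05T10:00:00", "2024-03-07T10:00:00"),
    ("T-3", "2024-03-06T09:00:00", "2024-03-06T21:00:00"),
]

hours = [cycle_time_hours(start, merged) for _, start, merged in tasks]
# The median is less distorted by one unusually long task than the mean.
median_cycle_time = median(hours)
print(f"median cycle time: {median_cycle_time:.1f} h")
```

The median is used here rather than the mean so that a single stalled task does not dominate the reading. Which event counts as "starting" is a choice your team has to make once and then keep fixed.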
Defect Escape Rate
The share of defects that reach later stages (review, staging, production) rather than being caught early. If the assistant is producing plausible-but-wrong code that slips through, this rate rises even as velocity appears to climb. It is the counterweight that keeps speed honest.
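One way to compute this, assuming each defect is tagged with the stage where it was first found (the stage labels, including "local" for defects the author caught before review, are hypothetical):

```python
# Stages counted as "escaped", following the definition above.
ESCAPED_STAGES = {"review", "staging", "production"}


def defect_escape_rate(found_stages: list[str]) -> float:
    """Fraction of defects first detected at an escaped stage."""
    if not found_stages:
        return 0.0  # no defects recorded, so nothing escaped
    escaped = sum(1 for stage in found_stages if stage in ESCAPED_STAGES)
    return escaped / len(found_stages)


# Hypothetical stage labels for four defects found in one period.
period = ["local", "local", "review", "production"]
rate = defect_escape_rate(period)
print(f"defect escape rate: {rate:.0%}")
```

The set of escaped stages is the part worth agreeing on as a team. Changing it midway makes earlier readings incomparable with later ones.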
Review Turnaround and Burden
How long reviews take and how much rework they generate. Healthy adoption keeps changes small and reviewable; unhealthy adoption floods review with large generated diffs. Rising review burden is an early warning sign.
Rework Rate
How often merged code is reverted or substantially rewritten soon after. High rework means the speed was illusory, with the saved time reappearing as later correction.
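A rough sketch of the calculation, assuming you can pair each merged change with the date it was later reverted or substantially rewritten, if it was. The 14-day window standing in for "soon after" is an illustrative assumption, not a recommendation from this piece:

```python
from datetime import date
from typing import Optional

Change = tuple[date, Optional[date]]  # (merged_on, reworked_on or None)


def rework_rate(changes: list[Change], window_days: int = 14) -> float:
    """Fraction of merged changes reworked within window_days of merging."""
    if not changes:
        return 0.0
    reworked = sum(
        1
        for merged_on, reworked_on in changes
        if reworked_on is not None
        and 0 <= (reworked_on - merged_on).days <= window_days
    )
    return reworked / len(changes)


# Hypothetical history of four merged changes.
changes = [
    (date(2024, 3, 1), None),
    (date(2024, 3, 2), date(2024, 3, 5)),   # reworked after 3 days
    (date(2024, 3, 3), date(2024, 4, 20)),  # reworked, but outside the window
    (date(2024, 3, 4), None),
]
rate = rework_rate(changes)
print(f"rework rate: {rate:.0%}")
```

What counts as "substantially rewritten" is a judgment call this sketch leaves to whoever labels the data.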
How to Instrument Honestly
Good instrumentation avoids distorting the behavior it observes.
Segment by Task Type
Aggregate numbers hide the truth because the assistant tends to help with boilerplate and hurt on architecture work. Segmenting by task type reveals where it actually helps, matching the pattern in "Where AI Coding Assistants Shine and Where They Stumble." A single blended figure can mask both a big win and a quiet loss.
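The masking effect is easy to reproduce. In this sketch the task-type labels and cycle times are made up for illustration, and are chosen so that the blended median does not move at all while both segments do:

```python
from collections import defaultdict
from statistics import median


def median_by_task_type(records):
    """Median cycle time per task type from (task_type, hours) pairs."""
    groups = defaultdict(list)
    for task_type, hours in records:
        groups[task_type].append(hours)
    return {task_type: median(values) for task_type, values in groups.items()}


# Hypothetical cycle times in hours, before and after adoption.
before = [("boilerplate", 10), ("boilerplate", 12),
          ("architecture", 30), ("architecture", 34)]
after = [("boilerplate", 5), ("boilerplate", 7),
         ("architecture", 35), ("architecture", 39)]

blended_before = median(hours for _, hours in before)
blended_after = median(hours for _, hours in after)
print("blended:", blended_before, "->", blended_after)
print("before by type:", median_by_task_type(before))
print("after by type: ", median_by_task_type(after))
```

The blended figure reads 21 hours in both periods, yet boilerplate work got faster and architecture work got slower. Only the segmented view shows either movement.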
Establish a Baseline First
You cannot detect improvement without a pre-adoption baseline. Measure cycle time, defect escape rate, and rework rate before rolling out, so the comparison is real rather than imagined. The case study in "How One Five-Person Studio Shipped Faster With Coding AI" shows the cost of skipping this step.
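With a baseline recorded, the comparison is a relative change per metric. A minimal sketch, in which the metric names and all values are hypothetical:

```python
def percent_change(baseline: float, current: float) -> float:
    """Relative change from baseline to current, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative change")
    return (current - baseline) / baseline * 100


# Hypothetical pre-adoption baseline and a later reading.
baseline = {"cycle_time_hours": 20.0, "defect_escape_rate": 0.10, "rework_rate": 0.08}
current = {"cycle_time_hours": 15.0, "defect_escape_rate": 0.12, "rework_rate": 0.08}

changes = {name: percent_change(baseline[name], current[name]) for name in baseline}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

A zero baseline raises an error rather than returning a number, because a relative change from zero is undefined. In that case, report the absolute values instead.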
Avoid Targeting Activity Metrics
Track acceptance rate if you like, but as a leading diagnostic only, never as a goal. The moment it becomes a target, it stops telling the truth.
Reading the Signal
The numbers only help if you interpret them together.
Pair Speed With Quality
Never read cycle time alone. A drop in cycle time paired with a rise in defect escape rate is not a win; it is borrowing speed against future incidents. The two metrics must be read as a pair.
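The pairing rule can be written down as a small decision function. This is a sketch only: the 5% tolerance for treating a change as noise and the label strings are illustrative assumptions, not thresholds from this piece.

```python
def read_pair(cycle_time_change_pct: float,
              escape_rate_change_pct: float,
              tolerance_pct: float = 5.0) -> str:
    """Interpret a cycle time change and a defect escape rate change together."""
    faster = cycle_time_change_pct < -tolerance_pct
    leakier = escape_rate_change_pct > tolerance_pct
    if faster and leakier:
        return "borrowed speed"      # quicker merges, more escaped defects
    if faster:
        return "real gain"           # quicker merges, quality held
    if leakier:
        return "quality regression"  # no speedup, more escaped defects
    return "no clear change"


print(read_pair(-25.0, 20.0))
print(read_pair(-25.0, 0.0))
```

The first call prints "borrowed speed" and the second prints "real gain", even though the cycle time change is identical in both. The function refuses to call anything a gain until it has seen the quality reading.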
Watch the Trend, Not the Point
A single period's numbers are noisy. Trends over several periods reveal whether adoption is improving, stable, or quietly degrading. The shape of the curve matters more than any one reading.
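One simple way to summarize the shape of the curve is a least-squares slope over the period index. The sketch below assumes evenly spaced reporting periods, and the readings are invented:

```python
def trend_slope(values: list[float]) -> float:
    """Least-squares slope of values against their period index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical median cycle time (hours) over five reporting periods.
readings = [20, 18, 19, 16, 15]
slope = trend_slope(readings)
print(f"change per period: {slope:+.2f} h")
```

The third reading ticks upward, but the fitted slope is still negative, about 1.2 hours faster per period. A slope from a handful of points is itself a rough estimate, so treat it as a direction rather than a precise rate.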
Connect Metrics to Practice Changes
When a metric moves, ask what practice changed. Often the lever is not the tool but a habit, such as smaller increments or more rigorous test review, as the framework in "The Draft, Review, and Verify Loop for Working With Coding AI" makes explicit.
Building a Lightweight Measurement Practice
You do not need a data team to measure honestly. A few disciplined habits, run consistently, beat an elaborate dashboard run sporadically.
Start Small and Consistent
Pick two metrics, cycle time and defect escape rate, and track them faithfully rather than tracking a dozen inconsistently. Consistency over several periods is what produces a trustworthy trend, and two well-tracked numbers tell you more than ten noisy ones.
Pull From Tools You Already Have
Your version control history gives you cycle time. Your issue tracker and review comments give you defect location and rework. The data already exists; the work is in extracting and segmenting it, not in new instrumentation. Resist the urge to buy tooling before you have exhausted what your existing systems already record.
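As an illustration of how little extraction is needed, the sketch below parses commit timestamps from `git log` output and uses the span between the earliest and latest commit on a branch as a proxy for cycle time. The proxy, the branch names in the comment, and the shortened hashes in the sample are assumptions; `%H` and `%cI` are Git's format placeholders for the commit hash and the strict ISO 8601 committer date.

```python
from datetime import datetime


def parse_git_log(output: str) -> list[tuple[str, datetime]]:
    """Parse lines of '<hash>|<ISO 8601 committer date>' into (hash, datetime)."""
    commits = []
    for line in output.strip().splitlines():
        commit_hash, _, stamp = line.partition("|")
        commits.append((commit_hash, datetime.fromisoformat(stamp)))
    return commits


def branch_cycle_time_hours(output: str) -> float:
    """Hours between the earliest and latest commit in the log output."""
    stamps = [stamp for _, stamp in parse_git_log(output)]
    return (max(stamps) - min(stamps)).total_seconds() / 3600


# Sample text in the shape produced by something like:
#   git log --format='%H|%cI' main..feature-branch
sample = """\
c3d4|2024-03-06T15:30:00+00:00
b2c3|2024-03-05T11:00:00+00:00
a1b2|2024-03-05T09:30:00+00:00
"""

print(f"branch cycle time: {branch_cycle_time_hours(sample):.1f} h")
```

First commit to last commit understates true cycle time, since work usually begins before the first commit. If your issue tracker records when a task was picked up, that timestamp is a better starting point.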
Make Measurement a Team Ritual
Review the numbers together on a fixed cadence and discuss what moved and why. A metric nobody looks at changes nothing. The ritual of reading the trend together is what converts measurement into decisions.
When the Signal Says Stop
Honest measurement sometimes recommends against the tool, and a good practice acts on that.
Recognizing a Bad Fit
If cycle time does not improve on the task types where the assistant should help, or defect escape rate climbs and stays up, the tool may not fit that work. This is a real finding, not a measurement error, and it deserves to change behavior rather than be explained away.
Acting on the Finding
Narrow the assistant's use to the task types where the metrics show benefit, and stop forcing it onto work where they do not. Restricting usage based on evidence is a more sophisticated outcome than blanket adoption or blanket rejection, and it is only available to teams that measure.
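Expressed as a rule, narrowing means keeping only the task types where the evidence supports the tool. In this sketch the segment names, field names, and figures are hypothetical, and "benefit" is taken to mean that cycle time fell without the defect escape rate rising:

```python
def task_types_to_keep(segments: dict[str, dict[str, float]]) -> list[str]:
    """Task types where cycle time fell and defect escape rate did not rise."""
    return sorted(
        task_type
        for task_type, change in segments.items()
        if change["cycle_time_pct"] < 0 and change["escape_rate_pct"] <= 0
    )


# Hypothetical percent changes against the pre-adoption baseline.
segments = {
    "boilerplate": {"cycle_time_pct": -30.0, "escape_rate_pct": 0.0},
    "tests": {"cycle_time_pct": -15.0, "escape_rate_pct": 12.0},
    "architecture": {"cycle_time_pct": 8.0, "escape_rate_pct": 5.0},
}

keep = task_types_to_keep(segments)
print("keep using the assistant for:", keep)
```

Here only "boilerplate" passes. The "tests" segment got faster but leaked more defects, so it is excluded until the quality reading recovers.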
Frequently Asked Questions
Is acceptance rate completely useless?
Not completely. Read as a diagnostic alongside outcome metrics, it can hint at problems: for example, a very high rate paired with rising defects suggests developers are accepting suggestions unread. As a standalone success metric or a target, it is worse than useless.
What is the single most important metric?
Cycle time, anchored against defect escape rate. Speed without the quality counterweight is misleading, so the pair is the real answer.
How long before metrics show a reliable signal?
Several reporting periods, because single-period numbers are noisy. Trends are trustworthy where point readings are not.
Do I need special tooling to measure this?
Mostly no. Cycle time, defect location, and rework come from your existing version control and issue tracker. The discipline is in segmenting and baselining, not in new tools.
Why segment by task type?
Because the assistant helps some work and hurts other work, a blended number hides both effects. Segmentation is what turns a flat average into an actionable signal.
What if the metrics show no improvement?
That is a valid and useful result. It may mean the tool does not fit your work, or that practices need to change. Honest measurement that reports no gain is doing its job.
Key Takeaways
- Activity metrics like acceptance rate measure usage, not value, and corrupt behavior when targeted.
- Cycle time, defect escape rate, review burden, and rework rate carry the real signal.
- Always pair a speed metric with a quality metric; speed alone is misleading.
- Segment by task type, because the assistant helps some work and hurts other work.
- Establish a pre-adoption baseline so improvement can be detected rather than assumed.
- Read trends over several periods and connect metric moves to specific practice changes.