You cannot manage AI compliance drafting on vibes. "It seems faster" and "the drafts look good" are not signals; they are the absence of signals. When a hallucinated citation slips into a filed document, the post-mortem always reveals that the warning signs were present and unmeasured. The work of instrumenting this is not glamorous, but it is the difference between catching drift early and explaining it to a regulator later.
The good news is that the metrics that matter here are few and cheap to capture. You do not need a measurement platform. You need a handful of numbers recorded consistently and read with a skeptical eye. This piece names those metrics, shows how to instrument them with little more than a spreadsheet, and explains how to interpret each one, because a number you cannot read is just noise.
What You Are Actually Measuring
Two things, really: whether the drafting is producing correct output, and whether it is producing it efficiently. Most teams measure only the second and discover the first the hard way.
The Two Families
- Quality signals: are the drafts correct, grounded, and defensible?
- Efficiency signals: is the process actually faster and cheaper than the alternative?
- The trap is optimizing efficiency while quality silently degrades, which is the most expensive way to be fast.
Quality Metrics
These are the ones that protect you. Track them even when everything seems fine, and especially then.
Hard-Stop Rate
- The share of drafts that hit a blocking finding in review (invented citation, unauthorized commitment).
- A rising rate means your inputs are thinning or your prompts are drifting. Read it as an early warning, not a verdict.
- These hard-stop findings come straight from A Working Review List for AI-Drafted Legal and Compliance Text.
Citation Verification Failure Rate
- Of all citations the model produced, what fraction failed verification against a primary source?
- This number should trend toward zero as your grounding improves. If it does not, your Reference stage is weak.
Post-Approval Defect Rate
- Errors found after a document was approved, per hundred documents.
- The most important and most uncomfortable metric, because it measures what your review missed.
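The three quality metrics above reduce to simple ratios. The following is a minimal sketch, not a prescribed implementation: the `ReviewedDraft` record and its field names are assumptions made for illustration, and the defect rate is assumed to be counted per hundred approved documents.

```python
from dataclasses import dataclass


@dataclass
class ReviewedDraft:
    """One reviewed draft. All field names are illustrative assumptions."""
    hard_stop_findings: int     # blocking findings raised in review
    citations_total: int        # citations the model produced
    citations_failed: int       # citations that failed primary-source verification
    approved: bool              # whether the draft was eventually approved
    post_approval_defects: int  # errors found after approval


def hard_stop_rate(drafts):
    """Share of drafts that hit at least one blocking finding."""
    if not drafts:
        return None
    return sum(d.hard_stop_findings > 0 for d in drafts) / len(drafts)


def citation_failure_rate(drafts):
    """Failed citations as a fraction of all citations produced."""
    total = sum(d.citations_total for d in drafts)
    if total == 0:
        return None
    return sum(d.citations_failed for d in drafts) / total


def post_approval_defects_per_hundred(drafts):
    """Defects found after approval, per hundred approved documents."""
    approved = [d for d in drafts if d.approved]
    if not approved:
        return None
    return 100 * sum(d.post_approval_defects for d in approved) / len(approved)
```

Each function returns `None` rather than zero when there is nothing to measure, so an empty week does not read as a clean one.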
Efficiency Metrics
These justify the investment, but only in the context of the quality numbers above.
Time to Approved Draft
- Median time from request to a draft that clears review.
- Improving while quality holds is the win; improving while the defect rate rises is a warning.
Human Edit Volume
- How much a reviewer had to change per draft.
- Falling edit volume with stable quality signals a maturing process. Falling edit volume with rising defects signals reviewer fatigue, not improvement.
Rework Rate
- Drafts sent back for regeneration rather than patched.
- High rework points upstream to thin inputs, echoing the trade-offs in Speed Versus Defensibility When AI Drafts Compliance Language.
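The efficiency metrics can be computed just as cheaply. This sketch makes one assumption worth flagging: it measures edit volume as a word-level difference between the model's draft and the approved text, using Python's standard `difflib`. That is one possible proxy for "how much a reviewer had to change", not a definition taken from this piece.

```python
from difflib import SequenceMatcher
from statistics import median


def edit_volume(model_draft: str, approved_text: str) -> float:
    """Rough share of the draft that changed (0.0 means untouched).

    Assumption: a word-level similarity ratio is a good enough proxy.
    """
    before = model_draft.split()
    after = approved_text.split()
    return 1.0 - SequenceMatcher(None, before, after).ratio()


def median_hours_to_approval(hours):
    """Median time from request to a draft that clears review."""
    return median(hours) if hours else None


def rework_rate(regenerated_flags):
    """Share of drafts sent back for regeneration rather than patched."""
    if not regenerated_flags:
        return None
    return sum(regenerated_flags) / len(regenerated_flags)
```

The median is used rather than the mean so that one draft stuck in review for a week does not distort the picture.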
Instrumenting Without Heavy Tooling
You can run all of this from a single shared sheet if you capture the right fields at the right moments.
Minimum Viable Instrumentation
- Log per draft: document type, model used, hard-stop findings, citation failures, edit volume, approval time.
- Capture at the moment of review, not from memory afterward, or the numbers will flatter you.
- Tag the prompt or template version, so you can attribute a change in metrics to a change in process.
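If the shared sheet is a CSV file, the per-draft log can be a single append at review time. This is a sketch under assumptions: the column names, the `log_review` helper, and the choice of hours as the time unit are all hypothetical, chosen to mirror the fields listed above.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical column names mirroring the fields listed above.
FIELDS = [
    "recorded_at", "document_type", "model", "prompt_version",
    "hard_stop_findings", "citation_failures", "edit_volume", "approval_hours",
]


def log_review(path, **values):
    """Append one row at review time, writing the header only for a new file."""
    row = {"recorded_at": datetime.now(timezone.utc).isoformat(), **values}
    missing = [name for name in FIELDS if name not in row]
    if missing:
        # Refuse partial rows: gaps filled in later from memory flatter the numbers.
        raise ValueError(f"missing fields: {missing}")
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

A reviewer would call it once per draft, for example `log_review("reviews.csv", document_type="privacy_notice", model="model-a", prompt_version="v3", hard_stop_findings=0, citation_failures=1, edit_volume=0.12, approval_hours=4.5)`, where every value shown is made up. The timestamp is stamped by the function, which keeps the record tied to the moment of review.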
Keeping It Honest
- Have the reviewer, not the drafter, record quality findings, to remove the optimism bias.
- Sample post-approval defects through periodic re-review rather than waiting for someone external to find them.
Leading Versus Lagging Signals
Not all metrics warn you at the same time. Some tell you a problem is coming; others tell you it already arrived. Knowing which is which changes how you react.
Leading Signals
- Citation verification failure rate and hard-stop rate move before defects reach approved documents.
- A rising leading signal is an invitation to fix grounding now, while it is still cheap.
- These are the numbers to watch weekly, because their whole value is early warning.
Lagging Signals
- Post-approval defect rate tells you what already escaped, so it arrives too late to prevent those particular defects.
- A bad lagging number means a leading signal was ignored or missing; treat it as a prompt to find which.
- Lagging signals confirm whether your leading signals are actually predictive, closing the loop on your measurement itself.
The discipline is to drive decisions off leading signals and use lagging signals to check whether the leading ones are telling the truth. A team that only watches defects is always reacting; a team that watches grounding quality is preventing.
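Watching a leading signal weekly means bucketing review events by week before computing the rate. A minimal sketch, assuming each event is a `(review_date, flagged)` pair where `flagged` marks a draft with a hard stop or a failed citation; the `weekly_rate` name and the input shape are illustrative.

```python
from collections import defaultdict
from datetime import date


def weekly_rate(events):
    """Return {(iso_year, iso_week): share of flagged drafts}, in week order."""
    counts = defaultdict(lambda: [0, 0])  # week -> [flagged, total]
    for day, flagged in events:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        counts[week][0] += int(flagged)
        counts[week][1] += 1
    return {week: hit / total for week, (hit, total) in sorted(counts.items())}


# Made-up events: one flagged draft in the first week, two in the second.
events = [
    (date(2024, 1, 2), False), (date(2024, 1, 3), True),
    (date(2024, 1, 4), False), (date(2024, 1, 5), False),
    (date(2024, 1, 9), True), (date(2024, 1, 10), True),
    (date(2024, 1, 11), False), (date(2024, 1, 12), False),
]
rates = weekly_rate(events)
```

With this sample, the rate doubles from one week to the next, which is the kind of movement worth acting on before any defect reaches an approved document.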
Segmenting the Numbers
An aggregate metric can hide a serious problem in one slice. Break the numbers down before you trust them.
Useful Cuts
- By document type, because a privacy notice and an internal memo have very different defect profiles.
- By model and prompt version, so you can attribute a change in quality to a change you made.
- By drafter, not to assign blame but to find where a workflow needs reinforcement or a template is missing.
A citation failure rate that looks healthy in aggregate may be concentrated entirely in one high-exposure document type, which is exactly the slice you cannot afford to let drift. The trade-off logic for which slices deserve the most scrutiny is laid out in Speed Versus Defensibility When AI Drafts Compliance Language.
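To make the aggregate-versus-slice point concrete, here is a small sketch with made-up numbers. The row layout and the `failure_rate_by` helper are assumptions for illustration; the same grouping works for model, prompt version, or drafter by changing the key.

```python
from collections import defaultdict


def failure_rate_by(rows, key):
    """Citation failure rate for each distinct value of rows[...][key]."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [failed, total]
    for row in rows:
        totals[row[key]][0] += row["citations_failed"]
        totals[row[key]][1] += row["citations_total"]
    return {name: failed / total
            for name, (failed, total) in totals.items() if total}


# Made-up data: every failure sits in the smaller, higher-exposure slice.
rows = [
    {"document_type": "internal_memo", "citations_total": 90, "citations_failed": 0},
    {"document_type": "privacy_notice", "citations_total": 10, "citations_failed": 4},
]

aggregate = (sum(r["citations_failed"] for r in rows)
             / sum(r["citations_total"] for r in rows))
by_type = failure_rate_by(rows, "document_type")
```

Here the aggregate rate is 4 failures in 100 citations, which looks tolerable, while the privacy notice slice is failing on 4 of its 10.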
Reading the Signal
Numbers without interpretation cause as many bad decisions as no numbers. A few reading rules.
Interpretation Rules
- Watch direction, not absolute level. A hard-stop rate of ten percent is fine if it is falling and alarming if it is climbing.
- Read quality and efficiency together. Speed gains that coincide with rising defects are not gains.
- Treat any post-approval defect as a process question, not an individual one: what check would have caught it, and why was it missing?
- Connect the metrics to the business picture in What AI-Assisted Compliance Drafting Saves, and What It Costs.
- Resist the urge to celebrate a single good month; trends earn trust, and one quiet stretch can hide an input that is quietly thinning beneath it.
- When two metrics disagree, believe the quality signal over the efficiency signal, because the cost of misreading quality is the one this whole discipline exists to avoid.
The point of reading rules is to keep the numbers honest. A metric program that flatters the team is worse than no program, because it manufactures confidence exactly where caution is warranted. Treat every favorable trend as a hypothesis to keep testing, not a conclusion to relax on.
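Two of the reading rules, direction over level and quality before efficiency, are mechanical enough to sketch. The half-versus-half comparison below is a deliberately crude assumption rather than a statistical test, and the function names and labels are hypothetical.

```python
from statistics import mean


def direction(series, tolerance=0.0):
    """Compare the mean of the later half of a series with the earlier half."""
    if len(series) < 4:
        return "insufficient data"
    half = len(series) // 2
    delta = mean(series[half:]) - mean(series[:half])
    if delta > tolerance:
        return "rising"
    if delta < -tolerance:
        return "falling"
    return "flat"


def read_together(defect_rates, hours_to_approval):
    """Read the quality trend first, then the speed trend."""
    quality = direction(defect_rates)
    speed = direction(hours_to_approval)
    if "insufficient data" in (quality, speed):
        return "insufficient data"
    if quality == "rising":
        # Quality wins any disagreement, whatever speed is doing.
        return "warning"
    if speed == "falling":
        return "win"
    return "no clear change"
```

For example, `read_together([1.0, 1.0, 2.0, 3.0], [6, 5, 4, 3])` returns `"warning"` even though approval time is dropping, because the defect series is climbing. A `"win"` from a handful of periods is still only a hypothesis to keep testing.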
Frequently Asked Questions
Which single metric matters most?
Post-approval defect rate. It measures what your entire process let through, which is the thing that actually hurts you. The other quality metrics are early indicators of where that number is heading.
How many drafts do I need before the numbers mean anything?
Enough that a single bad draft does not swing the rate wildly, which is usually a few dozen. Below that, track the raw findings rather than rates and watch for patterns rather than trends.
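The reasoning is arithmetic: with n drafts, one additional bad draft moves the rate by roughly 1/n, so ten drafts swing by ten points and forty by about two and a half. A guard like the one below can enforce the rule; the threshold of 30 is an assumption standing in for "a few dozen", and `report_rate` is a hypothetical helper.

```python
def report_rate(flags, min_drafts=30):
    """Report raw findings below the threshold, and a rate only above it.

    flags: one boolean per draft, True if the draft had a finding.
    min_drafts: assumed cutoff; tune it to your own volume.
    """
    drafts = len(flags)
    findings = sum(flags)
    if drafts < min_drafts:
        return {"drafts": drafts, "findings": findings, "rate": None}
    return {
        "drafts": drafts,
        "findings": findings,
        "rate": findings / drafts,
        "swing_per_draft": 1 / drafts,  # how far one more bad draft moves the rate
    }
```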
Won't tracking defects discourage the team?
Only if defects are treated as individual failures. Frame every defect as a missing check in the process, and the metric becomes a tool for improving the system rather than a stick. Reviewers who feel safe reporting produce honest numbers.
Can I automate any of this measurement?
Citation verification and edit volume can be partly automated. Quality findings that depend on knowing what the business agreed to still need a human, so plan for the measurement to be partly manual indefinitely.
What is a healthy citation verification failure rate?
Trending toward zero. Any nonzero rate that is not falling means your grounding is too weak, because a well-grounded model should rarely produce citations it cannot support. Treat a stuck rate as a Reference-stage problem.
How often should I review these metrics?
Review quality signals weekly while the process is young and monthly once it is stable. For efficiency signals, monthly is usually enough. The cadence should be fast enough to catch drift before it reaches a filed document.
Key Takeaways
- "It seems faster" is the absence of a signal; instrument quality and efficiency as separate families.
- The most important metric is post-approval defect rate, because it measures what your review missed.
- Watch direction over absolute level, and always read quality and efficiency together.
- A single shared sheet captured at review time is enough; the reviewer, not the drafter, records quality findings.
- Treat every post-approval defect as a missing check in the process, not as an individual failure.