You cannot improve what you do not measure, and you cannot measure speech recognition with a single number. The industry trained everyone to quote word error rate as if it settled the question, but WER alone has shipped countless systems that score well on paper and fail the people who use them. A transcript can have a low error rate and still drop the one word your workflow depends on.
The reason this matters is that the number you optimize becomes the number you get, even when it is the wrong number. A team that chases aggregate WER will produce a system with good aggregate WER and no guarantee that it captures the names, numbers, and terms its users actually depend on. Choosing the right metrics is not a measurement chore; it is the act of defining what "working" means for your product.
This article defines the metrics that actually predict production success, how to instrument them, and how to read the signal once you have it. It assumes you understand the pipeline; if not, our step-by-step guide to how AI speech recognition works walks through how audio becomes text in the first place. Measurement only makes sense once you know what the system is doing.
The goal is not to collect every possible number. It is to instrument the few metrics that change your decisions and ignore the vanity metrics that do not. A dashboard with twenty charts that nobody acts on is worse than three numbers that trigger real decisions, because the noise hides the signal. Before adding any metric, ask what action a change in it would prompt. If the answer is nothing, do not track it.
Word Error Rate and Its Limits
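Word error rate counts substitutions, insertions, and deletions, divided by the number of words in the reference transcript, that is, the words actually spoken. Because insertions count as errors, WER can exceed 100 percent. It is genuinely useful as a coarse signal of overall quality, and you should track it. But understand what it hides.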
WER weights every word equally. Missing "the" costs the same as missing a patient's medication name. For most real applications, those errors are not equal at all. WER also averages across your whole dataset, which means a model that is excellent on clean audio and terrible on phone calls can post a respectable overall score while failing your hardest segment completely.
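As a minimal sketch of the definition above, the following computes WER as a word-level edit distance divided by the reference length. It assumes whitespace tokenization and text that has already been normalized; the function names and the example sentences are illustrative, not taken from any particular toolkit.

```python
def word_errors(ref_words, hyp_words):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn the reference words into the hypothesis words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit operations divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


reference = "give the patient ten milligrams of warfarin"
dropped_article = "give patient ten milligrams of warfarin"
wrong_medication = "give the patient ten milligrams of war"

# Both hypotheses contain exactly one error against seven reference words,
# so they receive the same score despite very different consequences.
print(wer(reference, dropped_article))
print(wer(reference, wrong_medication))
```

Both calls return the same value, one error in seven words, which is the equal-weighting problem in concrete form.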
Track WER, but never report it as a single aggregate. Break it down by audio condition, speaker accent, and content type. The breakdown is where the truth lives.
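One way to produce that breakdown, sketched under the assumption that each utterance has already been scored and labeled with a stratum. The field names (`stratum`, `errors`, `ref_words`) and the counts are hypothetical.

```python
from collections import defaultdict


def wer_by_stratum(utterances):
    """Return (overall_wer, {stratum: wer}) from per-utterance error counts.

    Errors and reference words are summed before dividing, so long
    utterances carry proportionally more weight than short ones.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for utt in utterances:
        errors[utt["stratum"]] += utt["errors"]
        words[utt["stratum"]] += utt["ref_words"]
    per_stratum = {name: errors[name] / words[name] for name in words}
    overall = sum(errors.values()) / sum(words.values())
    return overall, per_stratum


# Illustrative numbers only: mostly clean audio plus a small phone segment.
scored = [
    {"stratum": "clean", "errors": 20, "ref_words": 900},
    {"stratum": "clean", "errors": 25, "ref_words": 900},
    {"stratum": "phone", "errors": 60, "ref_words": 200},
]
overall, per_stratum = wer_by_stratum(scored)
print(overall)      # about 0.05 overall
print(per_stratum)  # clean is about 0.025, phone is about 0.30
```

In this made-up sample the aggregate looks respectable while the phone stratum loses nearly a third of its words.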
Metrics That Matter More Than WER for Real Products
For most production systems, the following metrics predict user satisfaction better than overall WER does.
Entity error rate
Measure accuracy specifically on the words that matter: names, numbers, product SKUs, medication names, addresses. A system can have a fine overall WER and still mangle the entities your downstream workflow depends on. This is usually the number that decides whether users trust the output.
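A rough sketch of one way to compute this, assuming entity positions have been tagged in the reference by hand. It uses Python's `difflib.SequenceMatcher` for alignment, which is a heuristic aligner and not guaranteed to be a minimum-edit alignment, and it counts only substituted or deleted entity tokens; insertions are deliberately left out of this simplified version. The function name and example sentence are illustrative.

```python
from difflib import SequenceMatcher


def entity_error_rate(ref_words, hyp_words, entity_positions):
    """Fraction of tagged reference tokens that were substituted or deleted.

    entity_positions: set of indexes into ref_words that mark entity tokens.
    """
    if not entity_positions:
        raise ValueError("at least one entity position is required")
    matcher = SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    wrong = 0
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            # Count only the reference tokens in this span that are entities.
            wrong += sum(1 for i in range(i1, i2) if i in entity_positions)
    return wrong / len(entity_positions)


ref = "send forty units of insulin to doctor nakamura".split()
entities = {1, 4, 7}  # "forty", "insulin", "nakamura"

name_mangled = "send forty units of insulin to doctor nakamoto".split()
article_dropped = "send forty units insulin to doctor nakamura".split()

print(entity_error_rate(ref, name_mangled, entities))     # one of three entities wrong
print(entity_error_rate(ref, article_dropped, entities))  # no entity touched
```

Both hypotheses contain a single word error, so their WER is identical, yet only the first one damages an entity.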
Keyword recall
If your application searches transcripts, what matters is whether the searched-for terms were captured. Keyword recall measures exactly that, and it can diverge sharply from WER when errors cluster on rare but important terms.
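A minimal sketch of keyword recall, assuming single-word keywords, lowercase whitespace tokenization, and counting each keyword at most once per utterance. Multi-word phrases and fuzzy matching would need more care; the keyword list and transcripts are made up.

```python
def keyword_recall(pairs, keywords):
    """pairs: iterable of (reference, hypothesis) transcript strings.

    Returns the fraction of keyword occurrences in the references
    (one per keyword per utterance) that also appear in the hypotheses.
    """
    expected = 0
    found = 0
    for reference, hypothesis in pairs:
        ref_words = set(reference.lower().split())
        hyp_words = set(hypothesis.lower().split())
        for keyword in keywords:
            if keyword in ref_words:
                expected += 1
                if keyword in hyp_words:
                    found += 1
    if expected == 0:
        raise ValueError("no keyword occurs in any reference")
    return found / expected


pairs = [
    ("i want a refund for this order", "i want a refund for this order"),
    ("the bank filed a chargeback yesterday", "the bank filed a charge back yesterday"),
    ("please process the refund today", "please process the refund today"),
]
# Two of the three keyword occurrences survive transcription.
print(keyword_recall(pairs, {"refund", "chargeback"}))
```

The split of "chargeback" into two words costs little in WER terms across the whole sample but makes that utterance unfindable by a search for the term.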
Latency percentiles
Average latency lies. Track the 95th and 99th percentiles, because the slow tail is what users actually notice. A system with a great average and an ugly p99 will feel broken during exactly the moments that matter most.
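To make the gap between the average and the tail concrete, here is a small sketch using the nearest-rank percentile definition (other definitions interpolate and give slightly different values). The latency samples are invented for illustration.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 100 hypothetical request latencies in milliseconds: mostly fast, a slow tail.
latencies_ms = [120] * 90 + [400] * 6 + [2500, 3000, 3500, 4000]

print(mean(latencies_ms))            # 262
print(percentile(latencies_ms, 50))  # 120
print(percentile(latencies_ms, 95))  # 400
print(percentile(latencies_ms, 99))  # 3500
```

The mean suggests a system that responds in about a quarter of a second, while one request in a hundred waits several seconds.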
Instrumenting Measurement Correctly
Good metrics come from good data discipline. Three practices separate teams that measure well from teams that fool themselves.
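- Hold out a real evaluation set. Curate a fixed sample of your actual production audio with verified human transcripts. Never evaluate on a public benchmark and assume the result transfers to your audio; it usually does not, because benchmark recordings rarely match your microphones, vocabulary, and speakers.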
- Stratify the set. Include clean and noisy audio, multiple accents, and your hardest content. Report metrics per stratum so an average cannot hide a failing segment.
- Normalize reference and hypothesis identically. Casing, punctuation, and number formatting can move WER substantially without any change in what the system actually recognized. Apply the same normalization to both sides before scoring, or your scores are noise.
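The normalization practice in the last bullet can be sketched as a single function applied to both sides before scoring. This version only lowercases and strips punctuation; it is an assumption about what a reasonable baseline looks like, and it does not handle number formatting such as "five" versus "5", which typically needs a dedicated rule set.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces,
    and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


reference = "Dr. Patel, please call back at 5 PM."
hypothesis = "dr patel please call back at 5 pm"

# Compared raw, three of the eight reference tokens differ purely in formatting.
raw_mismatches = sum(r != h for r, h in zip(reference.split(), hypothesis.split()))
print(raw_mismatches)  # 3

# After identical normalization the two transcripts are the same string.
print(normalize(reference) == normalize(hypothesis))  # True
```

Whatever rules you choose, the important property is that the same function runs on both the reference and the hypothesis.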
These practices echo the discipline in our best practices guide, where a held-out, stratified evaluation set is treated as non-negotiable infrastructure.
Reading the Signal
Numbers without interpretation are just decoration. Here is how to turn metrics into decisions.
A rising overall WER with stable per-stratum WER usually means your traffic mix shifted, not that the model degraded; you are simply seeing more hard audio. A spike in entity error rate with flat WER means errors are concentrating on the words that matter, which is a sign to deploy vocabulary biasing rather than to swap models. A stable WER paired with falling keyword recall means rare terms are being lost, which points to the same fix.
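The first pattern is easy to verify with arithmetic: overall WER is a weighted average of per-stratum WER, so it moves when the weights move. A small sketch with invented numbers, where the shares are assumed to be each stratum's fraction of reference words:

```python
def overall_wer(stratum_wer, traffic_share):
    """Weighted average of per-stratum WER.

    traffic_share: each stratum's share of total reference words.
    Shares are renormalized, so they need not sum exactly to 1.
    """
    total_share = sum(traffic_share.values())
    weighted = sum(stratum_wer[name] * traffic_share[name] for name in stratum_wer)
    return weighted / total_share


# Hypothetical per-stratum quality that does not change between months.
stratum_wer = {"clean": 0.04, "phone": 0.20}

last_month = overall_wer(stratum_wer, {"clean": 0.8, "phone": 0.2})
this_month = overall_wer(stratum_wer, {"clean": 0.5, "phone": 0.5})

print(round(last_month, 3))  # 0.072
print(round(this_month, 3))  # 0.12
```

The model is identical in both months; only the proportion of phone audio changed, and the headline number rose by two thirds.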
When latency p99 climbs while p50 stays flat, you have a tail problem, often GPU contention or batching, not a model problem. Reading these patterns correctly is what keeps you from spending a month optimizing the wrong layer. Our common mistakes post covers what happens when teams misread these signals.
Production Monitoring Versus Offline Evaluation
There are two distinct measurement contexts, and conflating them causes confusion. Offline evaluation runs your model against a fixed, human-verified reference set; it gives you ground-truth accuracy but only on the audio you curated. Production monitoring observes the live system on real traffic, where you usually do not have reference transcripts and therefore cannot compute true WER.
In production, lean on proxy signals: confidence distributions, the rate at which users correct or reject output, downstream task success, and latency. A drop in average confidence or a spike in user corrections is an early warning that quality has shifted, even without a reference transcript. Use offline evaluation to establish the ground truth and validate changes, and use production monitoring to catch drift between formal evaluations. Neither replaces the other; teams that rely only on offline numbers miss live regressions, and teams that rely only on production proxies never know their true accuracy.
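A sketch of how two of these proxy signals might be compared against a baseline window. The window structure, field names, and both tolerance values are assumptions chosen for illustration; appropriate tolerances depend on your traffic volume and how noisy the signals are.

```python
from statistics import mean


def proxy_alerts(baseline, current,
                 max_confidence_drop=0.05, max_correction_rise=0.02):
    """Compare two monitoring windows and return a list of warning strings.

    Each window is a dict with:
      'confidences': per-utterance confidence scores in [0, 1]
      'served':      number of transcripts shown to users
      'corrected':   number of those that users edited or rejected
    """
    alerts = []
    drop = mean(baseline["confidences"]) - mean(current["confidences"])
    if drop > max_confidence_drop:
        alerts.append(f"mean confidence fell by {drop:.3f}")
    rise = (current["corrected"] / current["served"]
            - baseline["corrected"] / baseline["served"])
    if rise > max_correction_rise:
        alerts.append(f"correction rate rose by {rise:.3f}")
    return alerts


baseline = {"confidences": [0.92, 0.90, 0.94, 0.88], "served": 1000, "corrected": 30}
current = {"confidences": [0.80, 0.85, 0.78, 0.83], "served": 1000, "corrected": 70}

for message in proxy_alerts(baseline, current):
    print(message)
```

Neither signal tells you the true WER; a warning here is a prompt to pull a sample, have it transcribed, and run the offline evaluation.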
Setting Thresholds and Alerts
Metrics are only operationally useful when they trigger action. Define a target for each metric per stratum, alert when any stratum crosses its threshold, and review the breakdown rather than the aggregate during incidents. Set the entity error rate threshold tighter than the WER threshold, because entity errors are the ones users punish you for. Wire these alerts into the same monitoring you use for the rest of your stack so speech quality is not a separate, neglected dashboard.
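As a sketch of the per-stratum check described above, the function below compares measured metrics against limits and returns every breach. The threshold values are placeholders, not recommendations, and the metric and stratum names are hypothetical; note that the entity limits are set tighter than the WER limits.

```python
def find_breaches(metrics, thresholds):
    """Return a sorted list of (stratum, metric, value, limit) tuples
    for every metric whose value exceeds its limit.

    Metrics missing from a stratum are skipped rather than treated as breaches.
    """
    breaches = []
    for stratum, limits in thresholds.items():
        for name, limit in limits.items():
            value = metrics.get(stratum, {}).get(name)
            if value is not None and value > limit:
                breaches.append((stratum, name, value, limit))
    return sorted(breaches)


# Placeholder limits for illustration only.
thresholds = {
    "clean": {"wer": 0.08, "entity_error_rate": 0.03},
    "phone": {"wer": 0.20, "entity_error_rate": 0.06},
}
measured = {
    "clean": {"wer": 0.05, "entity_error_rate": 0.04},
    "phone": {"wer": 0.18, "entity_error_rate": 0.05},
}

for stratum, name, value, limit in find_breaches(measured, thresholds):
    print(f"{stratum}: {name} = {value} exceeds limit {limit}")
```

Here both WER figures pass, and the only alert is the entity error rate on clean audio, which an aggregate WER alarm would never have raised.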
Frequently Asked Questions
Is a lower WER always better?
Not necessarily. A model with slightly higher WER but much better entity accuracy can be the better product if your workflow depends on names and numbers. Optimize for the metric that maps to user value, not the headline number.
How big should my evaluation set be?
Large enough that per-stratum scores are stable run to run, which usually means at least a few hours of audio per stratum. The key is stratification and realism, not raw size; a small set of genuinely representative audio beats a large set of mismatched data.
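One way to check whether a stratum is large enough is to estimate how much its score would move if you had drawn a different sample. The sketch below uses a bootstrap percentile interval, which is a technique introduced here for illustration rather than one the sizing guidance above prescribes; the resample count and the 95 percent level are conventional but arbitrary choices.

```python
import random


def bootstrap_wer_interval(utterances, n_resamples=1000, seed=0):
    """Approximate 95% interval for a stratum's WER.

    utterances: list of (errors, ref_words) tuples, one per utterance,
    each with ref_words > 0. Utterances are resampled with replacement
    and WER is recomputed on every resample.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        sample = rng.choices(utterances, k=len(utterances))
        errors = sum(e for e, _ in sample)
        words = sum(w for _, w in sample)
        scores.append(errors / words)
    scores.sort()
    low = scores[int(0.025 * n_resamples)]
    high = scores[int(0.975 * n_resamples) - 1]
    return low, high


# Hypothetical stratum with only eight scored utterances.
small_stratum = [(0, 12), (1, 15), (5, 10), (0, 8), (2, 20), (0, 11), (6, 14), (1, 9)]
low, high = bootstrap_wer_interval(small_stratum)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

If the interval is wider than the difference between the models you are comparing, the stratum needs more audio before its score can drive a decision.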
Why track latency percentiles instead of the average?
Because users experience the slow requests, not the average. A great p50 with an ugly p99 still produces frustrated users during peak load. The tail is where the real experience lives.
What is entity error rate and how do I compute it?
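It is the error rate measured only on the tokens you care about, such as names, numbers, and domain terms. Tag those tokens in your reference transcripts, align each hypothesis to its reference, and count the tagged tokens that were substituted or deleted, divided by the total number of tagged tokens. Insertions are harder to attribute to a specific entity, so decide on a rule for them up front and apply it consistently.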
How often should I re-evaluate?
Continuously in production via sampled monitoring, and formally whenever you change the model, the audio pipeline, or your traffic mix. Speech systems drift as your user base and recording conditions change.
Key Takeaways
- Word error rate is a useful coarse signal but hides per-segment failures and treats every word as equally important.
- Entity error rate, keyword recall, and latency percentiles often predict real user satisfaction better than aggregate WER.
- Measure on a held-out, stratified set of your actual production audio, with identical normalization on reference and hypothesis.
- Read metrics as patterns: diverging entity error and WER point to vocabulary fixes, not model swaps.
- Wire per-stratum thresholds and alerts into your existing monitoring so speech quality is actively governed, not occasionally checked.