Most teams measure token usage the way they read a utility bill: one big number at the end of the month, vaguely alarming, impossible to act on. They know they spent more than last month. They do not know which feature drove it, whether the extra spend bought anything, or which request types are quietly bleeding margin. A monthly total is an accounting artifact. It is not a metric you can optimize against.
The gap is instrumentation. The token data you need already flows through every API response — input tokens, output tokens, cached tokens — but unless you capture it per request and tag it with context, it evaporates into an aggregate. The teams that actually control token spend are the ones that turned that raw stream into a handful of well-chosen signals they watch the way they watch latency or error rate.
This article defines those signals, explains what each one tells you, and shows how to instrument them so the numbers mean something. The goal is not a dashboard with forty charts. It is four metrics that, read together, tell you whether your tokens are working.
What to Measure and Why
Tokens per request, broken down
The atomic unit is tokens per request, split into input, output, and cached. Aggregates hide the story; the split tells you where to act. A request that is 90 percent input tokens is a context problem solved by retrieval or caching. One that is 90 percent output is a generation problem solved by output control. Capture all three on every call.
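A minimal sketch of that split, assuming your provider reports cached prompt tokens separately from uncached input tokens (providers differ on this, so check how yours counts them). The `TokenUsage` class and its field names are illustrative, not any SDK's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one request (hypothetical field names)."""
    input_tokens: int    # uncached prompt tokens
    output_tokens: int   # generated tokens
    cached_tokens: int   # prompt tokens served from cache

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens + self.cached_tokens

    def shares(self) -> dict[str, float]:
        """Fraction of this request's tokens that falls in each bucket."""
        if self.total == 0:
            return {"input": 0.0, "output": 0.0, "cached": 0.0}
        return {
            "input": self.input_tokens / self.total,
            "output": self.output_tokens / self.total,
            "cached": self.cached_tokens / self.total,
        }


# Made-up request: input dominates, so this is a context problem.
usage = TokenUsage(input_tokens=9000, output_tokens=600, cached_tokens=400)
print(usage.shares())
```

Storing the three counts rather than a single total is what lets you compute the shares later for any slice of traffic.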
Cost per accepted output
This is the metric that matters most and the one almost nobody tracks. Take the total token cost and divide by the number of outputs a human or downstream system actually accepted. A change that cuts tokens but raises rejection rate makes this number worse even as the raw bill drops. It is the honest measure of whether optimization helped.
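The arithmetic is simple enough to sketch. The figures below are invented purely to illustrate the failure mode described above: the bill falls by 20 percent while the cost of each accepted output rises.

```python
def cost_per_accepted_output(total_cost: float, accepted: int) -> float:
    """Total token cost divided by the number of accepted outputs."""
    if accepted == 0:
        # Nothing was accepted, so every unit of spend was wasted.
        return float("inf")
    return total_cost / accepted


# Hypothetical numbers, before and after a token-cutting change.
before = cost_per_accepted_output(total_cost=100.0, accepted=800)
after = cost_per_accepted_output(total_cost=80.0, accepted=500)

print(f"before: {before:.3f}  after: {after:.3f}")
print("regression" if after > before else "improvement")
```

Returning infinity for zero acceptances is one design choice among several; raising an error or returning `None` may suit your pipeline better.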
Cache hit rate
If you use prompt caching, hit rate decides whether the feature pays for itself. A low hit rate usually means your cacheable prefix is unstable: someone is injecting a timestamp or a per-user value into what should be a shared prefix. Watching this number catches cache-busting regressions the day they ship.
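One way to compute the rate is token-weighted: cached prompt tokens over all prompt tokens. That definition is an assumption of this sketch (a request-weighted rate is equally defensible), and the record keys are hypothetical.

```python
def cache_hit_rate(records: list[dict[str, int]]) -> float:
    """Token-weighted hit rate: cached prompt tokens / all prompt tokens."""
    cached = sum(r["cached_tokens"] for r in records)
    uncached = sum(r["input_tokens"] for r in records)
    prompt_total = cached + uncached
    return cached / prompt_total if prompt_total else 0.0


# Invented samples: a stable prefix versus one broken by a per-request value.
healthy = [{"input_tokens": 200, "cached_tokens": 1800}] * 3
busted = [{"input_tokens": 2000, "cached_tokens": 0}] * 3

print(cache_hit_rate(healthy), cache_hit_rate(busted))
```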
Token efficiency ratio
This is useful output tokens divided by total tokens consumed. If you send 8,000 input tokens to get a 200-token answer, your ratio is about 2.4 percent (200 of 8,200), and retrieval is probably the fix. Tracked over time, this ratio shows whether your prompts are bloating as people add instructions and examples without removing anything.
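As a worked version of that example, here is a small sketch. It treats every output token as useful, which is a simplifying assumption; if you can separate boilerplate from substance in the output, pass only the substantive count.

```python
def token_efficiency(useful_output_tokens: int, total_tokens: int) -> float:
    """Useful output tokens divided by total tokens consumed."""
    return useful_output_tokens / total_tokens if total_tokens else 0.0


# 8,000 input tokens in, a 200-token answer out.
ratio = token_efficiency(useful_output_tokens=200, total_tokens=8_000 + 200)
print(f"{ratio:.1%}")  # prints 2.4%
```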
How to Instrument Without a Heavy Stack
You do not need a new observability platform. You need discipline at the boundary where you call the model.
Log at the call site
Wrap every model call so it emits a structured record: timestamp, feature or route name, model, input tokens, output tokens, cached tokens, latency, and an outcome flag once known. This single log line is the raw material for every metric above.
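A wrapper along these lines could look like the sketch below. Everything about the model client is assumed: `call_model` stands in for whatever function you already use, and the shape of its return value (`text` plus a `usage` dict) is hypothetical, so adapt the field lookups to your provider's actual response.

```python
import json
import time
import uuid
from datetime import datetime, timezone
from typing import Any, Callable


def logged_call(
    call_model: Callable[[str], dict[str, Any]],
    prompt: str,
    *,
    feature: str,
    model: str,
    sink: Callable[[str], None] = print,
) -> tuple[str, dict[str, Any]]:
    """Call the model and emit one structured JSON record for the call."""
    start = time.perf_counter()
    response = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000

    usage = response["usage"]  # assumed response shape, not a real SDK's
    record = {
        "request_id": str(uuid.uuid4()),  # lets you attach the outcome later
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "feature": feature,
        "model": model,
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "cached_tokens": usage.get("cached_tokens", 0),
        "latency_ms": round(latency_ms, 1),
        "outcome": None,  # unknown at call time; filled in once known
    }
    sink(json.dumps(record))
    return response["text"], record


def fake_model(prompt: str) -> dict[str, Any]:
    """Stand-in for a real client; returns a made-up usage payload."""
    return {
        "text": "ok",
        "usage": {"input_tokens": len(prompt.split()), "output_tokens": 1},
    }


lines: list[str] = []
text, record = logged_call(
    fake_model,
    "summarize this document",
    feature="doc-summarization",
    model="example-model",
    sink=lines.append,
)
```

Passing the sink as a parameter keeps the wrapper independent of any logging stack: `print` works on day one, and a queue or log shipper can replace it later without touching call sites.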
Tag by feature, not by app
A per-app total is useless for optimization. Tag each call with the specific feature or workflow that triggered it. When spend spikes, you want to know it was the document-summarization path, not just that the app cost more.
Capture the outcome
Cost per accepted output requires an acceptance signal. That might be a thumbs-up, a downstream validation pass, or the absence of a human edit. Even a coarse signal beats none. Without it you are flying on raw volume, which is exactly the trap the common mistakes article warns about.
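For products with structured output, a downstream validation pass can be as small as the sketch below. The function name and the required keys are hypothetical; the point is only that a coarse, automatic check is cheap to collect on every request.

```python
import json


def accepted_by_schema(raw_output: str, required_keys: set[str]) -> bool:
    """Coarse acceptance proxy: output parses as a JSON object with the keys we need."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    # A list or bare string parses fine but is not the object we asked for.
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


required = {"title", "summary"}
print(accepted_by_schema('{"title": "Q3", "summary": "Up 4%"}', required))
print(accepted_by_schema("Sorry, I cannot help with that.", required))
```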
Reading the Signal
Numbers only help if you know what a healthy reading looks like and what a bad one is telling you.
Trends beat snapshots
A single day's token count means little. The slope matters. A token efficiency ratio drifting down over three weeks is a prompt slowly bloating. A cache hit rate that fell off a cliff on a Tuesday is a deploy that broke your prefix. Watch the direction.
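To watch direction rather than level, a least-squares slope over a daily series is often enough. The sketch below assumes one value per day in chronological order; the sample ratios are invented.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (change per day)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Hypothetical daily token efficiency ratios: a prompt slowly bloating.
daily_ratio = [0.10, 0.09, 0.08, 0.07]
trend = slope(daily_ratio)
print("drifting down" if trend < 0 else "flat or rising")
```

A slope smooths over day-to-day noise, but it will not show a sudden cliff as clearly as a day-over-day difference does, so the two checks complement each other.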
Segment before you conclude
An aggregate spike usually hides a single misbehaving segment. Always break the number down by feature before drawing conclusions. The fix is almost always local — one route, one prompt, one bad retrieval config — not a system-wide problem.
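Segmenting can be sketched as a per-feature delta between two periods. The record shape (`feature`, `tokens`) and the sample numbers are assumptions made for illustration.

```python
from collections import Counter


def tokens_by_feature(records: list[dict]) -> Counter:
    """Sum tokens per feature tag."""
    totals: Counter = Counter()
    for r in records:
        totals[r["feature"]] += r["tokens"]
    return totals


def biggest_movers(before: list[dict], after: list[dict]) -> list[tuple[str, int]]:
    """Per-feature change in tokens between two periods, largest increase first."""
    b, a = tokens_by_feature(before), tokens_by_feature(after)
    # Counter returns 0 for missing keys, so new or retired features still work.
    deltas = {f: a[f] - b[f] for f in set(a) | set(b)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


last_week = [
    {"feature": "doc-summarization", "tokens": 1000},
    {"feature": "chat", "tokens": 500},
]
this_week = [
    {"feature": "doc-summarization", "tokens": 4000},
    {"feature": "chat", "tokens": 520},
]
print(biggest_movers(last_week, this_week))
```

Here the aggregate rose by 3,020 tokens, and almost all of it sits in one route.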
Tie metrics to decisions
Every metric should map to an action. High input ratio means add retrieval. Low cache hit rate means stabilize the prefix. Rising cost per accepted output means a quality regression is eating your savings. If a metric does not change what you do, stop tracking it. That connection between measurement and action is what separates the metrics you watch continuously from the trade-offs you decide once.
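The mapping from reading to action can be written down explicitly, which also forces you to pick thresholds. Every threshold and key name below is a placeholder to be tuned against your own baselines, not a recommended value.

```python
def recommended_actions(
    metrics: dict[str, float], limits: dict[str, float]
) -> list[str]:
    """Translate current readings into the actions they should trigger."""
    actions = []
    if metrics["input_share"] > limits["max_input_share"]:
        actions.append("add retrieval or caching")
    if metrics["cache_hit_rate"] < limits["min_cache_hit_rate"]:
        actions.append("stabilize the prefix")
    allowed_cost = limits["baseline_cost_per_accepted"] * limits["cost_tolerance"]
    if metrics["cost_per_accepted"] > allowed_cost:
        actions.append("investigate a quality regression")
    return actions


# Placeholder thresholds and readings, for illustration only.
limits = {
    "max_input_share": 0.90,
    "min_cache_hit_rate": 0.60,
    "baseline_cost_per_accepted": 0.125,
    "cost_tolerance": 1.10,
}
today = {"input_share": 0.93, "cache_hit_rate": 0.75, "cost_per_accepted": 0.16}
print(recommended_actions(today, limits))
```

A metric with no branch in a function like this is a candidate for removal from the dashboard.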
Building the Loop
Instrumentation is not a one-time project. Wire the logging, build a small set of charts, set thresholds that page or alert, and review them on a regular cadence. When you ship a prompt change, you should be able to read its token impact within a day, not discover it on next month's invoice. That feedback loop is what makes everything in the token budget checklist enforceable rather than aspirational.
Avoiding Metric Pitfalls
Measurement done carelessly is worse than no measurement, because it produces confident wrong conclusions. A few pitfalls catch most teams.
Vanity over decision metrics
A dashboard full of impressive-looking charts that nobody acts on is decoration, not instrumentation. The discipline is to track only metrics that change a decision. If you cannot name the action a metric triggers, drop it. Total tokens consumed is the classic vanity metric — large, alarming, and useless without a per-feature breakdown behind it.
Confusing correlation with cause
A token spike that coincides with a deploy is not proof the deploy caused it; traffic mix, a viral input, or a retry storm can all masquerade as a code change. Before concluding, segment the data and confirm the spike lives where you think it does. The habit of segmenting first prevents the most common false diagnosis.
Ignoring the quality side of the ledger
Watching cost metrics without watching quality metrics is how silent regressions ship. Cost per accepted output guards against this by construction, but only if your acceptance signal is honest. A proxy that marks everything as accepted is worse than useless because it makes a quality regression look like a pure win. Audit your acceptance signal periodically to make sure it still reflects reality; the risks of optimization make this concern concrete.
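One way to run that audit is to hand-review a small sample and compare the proxy against the human verdicts. The sketch below assumes you have such a sample as `(proxy_accepted, human_accepted)` pairs; the sample data is invented to show a proxy that accepts everything.

```python
def audit_acceptance(pairs: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compare the automatic proxy with human review on the same outputs."""
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one reviewed sample")
    accepted_by_proxy = sum(1 for proxy, _ in pairs if proxy)
    agreements = sum(1 for proxy, human in pairs if proxy == human)
    false_accepts = sum(1 for proxy, human in pairs if proxy and not human)
    return {
        "proxy_accept_rate": accepted_by_proxy / n,
        "agreement": agreements / n,
        "false_accept_rate": false_accepts / n,
    }


# Invented sample: the proxy accepted all ten, reviewers rejected four.
sample = [(True, True)] * 6 + [(True, False)] * 4
print(audit_acceptance(sample))
```

A proxy accept rate near 100 percent paired with a high false accept rate is the signature of a signal that has stopped measuring anything.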
Measuring too rarely to act
A metric reviewed monthly is a postmortem, not a control. The point of instrumentation is to shorten the loop between a change and its consequence to the point where you can still cheaply reverse a bad decision. If your cadence does not let you catch a regression within a day or two of shipping it, the instrumentation is not doing its job.
Frequently Asked Questions
What is the single most important token metric?
Cost per accepted output. It folds cost and quality into one number and is the only metric that catches the failure where you cut tokens but quietly broke the result. Raw token count alone will mislead you.
How do I measure acceptance if my product has no thumbs-up button?
Use a proxy. A downstream validation pass, a successful schema parse, or the absence of a human edit all work as acceptance signals. A coarse, automatic signal is more useful than a precise one you never collect.
Why is my cache hit rate low?
Almost always because something dynamic is contaminating what should be a stable prefix — a timestamp, a session ID, or per-user text placed before the cacheable content. Move dynamic values after the stable prefix and the rate recovers.
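A sketch of that ordering, with a fingerprint you can log to spot prefix drift. The prompt text, function names, and the idea of logging a truncated hash are illustrative assumptions; how much of the prompt actually gets cached depends on your provider's caching rules.

```python
import hashlib

# Shared across all users and requests: this is the part worth caching.
STABLE_PREFIX = "You are a support assistant. Follow the policy below.\n"


def build_prompt(user_name: str, timestamp: str, question: str) -> str:
    """Stable content first, per-request values after it."""
    dynamic = f"User: {user_name}\nTime: {timestamp}\nQuestion: {question}\n"
    return STABLE_PREFIX + dynamic


def prefix_fingerprint(prompt: str, prefix_len: int = len(STABLE_PREFIX)) -> str:
    """Short hash of the leading characters; it should not vary between requests."""
    head = prompt[:prefix_len].encode("utf-8")
    return hashlib.sha256(head).hexdigest()[:12]


a = build_prompt("ana", "2024-01-01T10:00:00Z", "Where is my order?")
b = build_prompt("raj", "2024-01-02T15:30:00Z", "How do I reset my password?")
print(prefix_fingerprint(a) == prefix_fingerprint(b))  # prints True
```

If the fingerprint logged alongside each request starts changing between calls, something dynamic has crept into the prefix.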
How often should I review token metrics?
Watch trends weekly and review on every significant prompt or model change. Token impact should be visible within a day of a deploy, not deferred to a monthly bill where the cause is long forgotten.
Key Takeaways
- Replace the monthly total with per-request, per-feature metrics you can act on.
- Track cost per accepted output above all — it captures cost and quality together.
- Watch cache hit rate and token efficiency ratio to catch regressions the day they ship.
- Instrument at the call site with structured logs tagged by feature and outcome.
- Read trends and segments, not snapshots, and tie every metric to a specific action.