The decision between zero-shot and few-shot is empirical, which means it lives or dies on measurement. The mistake almost everyone makes is measuring only headline accuracy on a handful of inputs. That single number hides bias, ignores cost, and tells you nothing about stability. This article defines the metrics that actually matter, how to instrument each, and — the part most guides skip — how to read the signal once you have it.
The organizing principle is that every metric exists to answer one question: do these examples earn their tokens? Accuracy is the benefit side. Cost and latency are the price side. Stability metrics tell you whether the benefit is real or an artifact of how you tested.
Accuracy, But Per Category
Aggregate accuracy is the most common metric and the most misleading on its own. A classifier can post 90% overall while failing badly on a minority category that happens to be the one you care about.
How to read it
Always break accuracy down per category or field. The per-category view is what tells you whether a zero-shot failure is concentrated (a missing definition for one category, fixable with a better instruction) or diffuse (genuine need for demonstration). This distinction drives the whole decision, as our framework explains.
Instrument it with a labeled eval set of real inputs, refreshed when your distribution shifts, scored by category.
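A minimal sketch of the per-category breakdown, assuming your eval results are available as (true label, predicted label) pairs. The function name, label names, and counts below are illustrative and do not come from any real dataset.

```python
from collections import defaultdict

def per_category_accuracy(records):
    """records: iterable of (true_label, predicted_label) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for true_label, predicted in records:
        totals[true_label] += 1
        if predicted == true_label:
            correct[true_label] += 1
    return {label: correct[label] / totals[label] for label in totals}

# Hypothetical eval results: a large category that always succeeds and
# a small one that always fails.
records = [("billing", "billing")] * 18 + [("churn_risk", "billing")] * 2

aggregate = sum(t == p for t, p in records) / len(records)
by_category = per_category_accuracy(records)
print(aggregate)     # headline number
print(by_category)   # the view that exposes the concentrated failure
```

In this made-up set the aggregate is 90% while the minority category scores zero, which is exactly the situation the headline number conceals.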
Output Distribution and Label Bias
This is the metric teams forget, and it catches the bias that accuracy hides. Compare the distribution of predicted labels against the true distribution in a balanced test set.
If your classifier predicts one label more often than the balanced test set warrants, especially the label that dominates your examples, you have label or recency bias from your example set. That is exactly the failure mode in our common mistakes guide. A model can be reasonably accurate overall while systematically skewing on ambiguous inputs, and only the distribution view reveals it.
Read it by plotting predicted versus true label frequencies. A skew toward whichever label was most common or most recent in your examples is the signal to rebalance.
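Before plotting, the comparison can be reduced to a per-label skew number. The sketch below is one way to do it, assuming parallel lists of true and predicted labels. The helper name `label_skew` and the sample data are hypothetical.

```python
from collections import Counter

def label_skew(true_labels, predicted_labels):
    """Predicted share minus true share for each label.

    Positive values mean the label is over-predicted.
    """
    n = len(true_labels)
    true_freq = Counter(true_labels)
    pred_freq = Counter(predicted_labels)
    labels = set(true_freq) | set(pred_freq)
    return {label: (pred_freq[label] - true_freq[label]) / n for label in labels}

# Hypothetical balanced test set with predictions that lean toward "billing".
true_labels = ["billing"] * 5 + ["bug"] * 5
predicted_labels = ["billing"] * 7 + ["bug"] * 3

skew = label_skew(true_labels, predicted_labels)
print(skew)
```

A label with a clearly positive skew that is also the most common or most recent label in your examples is the rebalancing signal described above.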
Stability Under Perturbation
Few-shot prompts can be fragile. A genuinely good prompt gives the same answer when you reorder the examples; a fragile one flips.
How to instrument
Take a sample of ambiguous inputs and run each under several example orderings. Measure how often the prediction changes. High variance means order or recency bias you must fix before trusting any accuracy number — because the accuracy you measured depended on one arbitrary ordering.
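One way to sketch the reordering test is a flip rate: the fraction of inputs whose prediction changes across orderings. Everything here is an assumption for illustration. `classify` stands in for whatever function builds your prompt from an ordered example list and calls the model, and the two stub classifiers only simulate the extremes.

```python
import random

def flip_rate(classify, examples, inputs, n_shuffles=3, seed=0):
    """Fraction of inputs whose prediction changes when examples are reordered.

    Always tests the original and reversed orderings, then adds random shuffles.
    """
    rng = random.Random(seed)
    orderings = [list(examples), list(reversed(examples))]
    orderings += [rng.sample(examples, len(examples)) for _ in range(n_shuffles)]
    flipped = 0
    for text in inputs:
        predictions = {classify(ordering, text) for ordering in orderings}
        if len(predictions) > 1:
            flipped += 1
    return flipped / len(inputs)

# Hypothetical few-shot examples and ambiguous inputs.
examples = [("refund please", "billing"), ("app crashes", "bug"), ("cancel my plan", "churn_risk")]
inputs = ["charged twice after a crash", "thinking of leaving"]

def recency_biased(ordering, text):
    # Stub that mimics pure recency bias: echoes the label of the last example.
    return ordering[-1][1]

def order_insensitive(ordering, text):
    # Stub that ignores example order entirely.
    return "billing"

fragile = flip_rate(recency_biased, examples, inputs)
stable = flip_rate(order_insensitive, examples, inputs)
print(fragile, stable)
```

With a real model, `classify` would make one call per ordering, so the sample of inputs and the number of orderings set the cost of this test.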
Token Cost Per Call
The price side of the trade-off. Few-shot examples add tokens to every single request, and at volume this is a real cost line.
Instrument per-call input token count, attributed to prompt version, via your observability layer. Read it as cost-per-decision: an example set that adds 1,400 tokens at your volume has a specific monthly dollar figure. That figure is what you weigh against the accuracy gain, and it is what eventually signals a move to fine-tuning, as the trade-offs guide covers.
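The cost-per-decision arithmetic is simple enough to sketch. The 1,400 added tokens come from the text above; the call volume and the per-token price are placeholder assumptions, so substitute your own figures.

```python
def monthly_example_cost(added_tokens, calls_per_month, usd_per_million_input_tokens):
    """Monthly cost attributable to the extra input tokens the examples add."""
    return added_tokens * calls_per_month * usd_per_million_input_tokens / 1_000_000

# 1,400 added tokens per call (from the text).
# Volume and price below are hypothetical placeholders.
cost = monthly_example_cost(
    added_tokens=1_400,
    calls_per_month=2_000_000,
    usd_per_million_input_tokens=3.0,
)
print(f"${cost:,.2f} per month")
```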
Latency, at Percentiles
Examples inflate time-to-first-token because the model processes a longer prompt before responding. Average latency hides the tail; users feel the tail.
Track p50, p95, and p99 latency by prompt version. If few-shot pushes your p95 past what the user experience tolerates, the accuracy gain may not be worth it regardless of cost. Read latency alongside cost — both are the price you pay for examples.
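A small sketch of the percentile report using Python's standard `statistics.quantiles`. It assumes you have already collected latency samples per prompt version; the version names and the sample values are invented for illustration.

```python
import statistics

def latency_percentiles(latencies_ms):
    """Return p50, p95, and p99 from a list of latency samples."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical samples in milliseconds, keyed by prompt version.
latencies_by_version = {
    "zero_shot": [200 + i for i in range(100)],
    # Few-shot: slightly slower body plus a slow tail.
    "few_shot": [260 + i for i in range(95)] + [900, 950, 1000, 1100, 1200],
}

report = {name: latency_percentiles(samples) for name, samples in latencies_by_version.items()}
for name, percentiles in report.items():
    print(name, percentiles)
```

In this invented data the few-shot median moves only modestly while p99 moves a great deal, which is the kind of tail an average would smooth over.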
The Composite Read: Accuracy-per-Token
The metric that ties it together is accuracy gain per added token. A few-shot variant that lifts accuracy two points while adding 1,500 tokens is a different proposition from one that lifts it ten points while adding the same number of tokens.
Compute it by comparing each variant against the zero-shot baseline: the change in accuracy divided by the change in tokens. This single ratio captures the core trade-off and makes variants directly comparable. The variant with the best accuracy-per-token that clears your latency budget is usually the right one to ship. For real examples of this read in action, see the case study.
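The ratio can be sketched as follows, expressed as accuracy points gained per 1,000 added tokens so the numbers are easy to read. The two-point and ten-point lifts at 1,500 added tokens mirror the example in the text; the baseline accuracy and token counts are assumptions.

```python
def accuracy_per_1k_tokens(baseline, variant):
    """Accuracy points gained per 1,000 input tokens added over the baseline."""
    delta_accuracy = variant["accuracy_pct"] - baseline["accuracy_pct"]
    delta_tokens = variant["input_tokens"] - baseline["input_tokens"]
    if delta_tokens <= 0:
        raise ValueError("variant must add tokens relative to the baseline")
    return delta_accuracy / (delta_tokens / 1000)

# Hypothetical baseline; both variants add 1,500 tokens.
zero_shot = {"accuracy_pct": 80.0, "input_tokens": 300}
variant_a = {"accuracy_pct": 82.0, "input_tokens": 1800}  # +2 points
variant_b = {"accuracy_pct": 90.0, "input_tokens": 1800}  # +10 points

ratio_a = accuracy_per_1k_tokens(zero_shot, variant_a)
ratio_b = accuracy_per_1k_tokens(zero_shot, variant_b)
print(round(ratio_a, 2), round(ratio_b, 2))
```

The ratio only ranks variants. Whether the better one ships still depends on the latency budget.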
Confidence and Calibration
A metric most teams ignore: how well the model's confidence matches its actual accuracy. When a model returns a probability or you can elicit a confidence score, you want high confidence to correspond to high accuracy and low confidence to flag the inputs a human should review.
Few-shot and zero-shot can differ sharply here. A few-shot prompt biased by its example set may be confidently wrong on inputs that resemble its examples but differ in a way that matters. Measure calibration by bucketing predictions by stated confidence and checking accuracy within each bucket. If your 90%-confidence predictions are only 70% accurate, the model is overconfident, and any routing logic that trusts confidence will misfire. This matters most when you use the model's confidence to decide what to escalate to humans — poor calibration there means you either review too much or trust too much.
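A rough sketch of the bucketing check, assuming each prediction carries a stated confidence between 0 and 1 and a flag for whether it was correct. The bucket count and the sample data are illustrative choices.

```python
def calibration_table(predictions, n_buckets=5):
    """predictions: iterable of (confidence in [0, 1], was_correct) pairs.

    Returns rows of (bucket_low, bucket_high, count, accuracy) for non-empty buckets.
    """
    buckets = [[0, 0] for _ in range(n_buckets)]  # [count, correct] per bucket
    for confidence, was_correct in predictions:
        # Clamp so that confidence == 1.0 lands in the top bucket.
        idx = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[idx][0] += 1
        buckets[idx][1] += int(was_correct)
    rows = []
    for i, (count, correct) in enumerate(buckets):
        if count:
            rows.append((i / n_buckets, (i + 1) / n_buckets, count, correct / count))
    return rows

# Hypothetical overconfident model: ten predictions at 0.9 confidence, seven correct.
predictions = [(0.9, True)] * 7 + [(0.9, False)] * 3

rows = calibration_table(predictions)
for low, high, count, accuracy in rows:
    print(f"confidence {low:.1f}-{high:.1f}: n={count}, accuracy={accuracy:.2f}")
```

A bucket whose accuracy sits well below its confidence range, as in this invented case, is the overconfidence pattern that breaks confidence-based routing.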
Cost of Errors, Not Just Error Rate
Accuracy treats every mistake equally, but your business does not. Misrouting a billing ticket to the bug queue may cost minutes; misclassifying a churn-risk account as healthy may cost a renewal. The metric that matters is error cost, not error count.
Instrument this by weighting your eval set's errors by their real consequence. A few-shot variant that improves overall accuracy while increasing errors on your highest-cost category is a regression, even though the headline number went up. This reframing changes which prompt you ship: sometimes the lower-accuracy variant is correct because its mistakes are cheaper. Build a simple cost matrix per category and score variants against it alongside raw accuracy.
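The cost matrix can be sketched as a lookup keyed by (true label, predicted label). The cost values, label names, and the two variants below are hypothetical. They are chosen only to show a higher-accuracy variant losing on error cost.

```python
def mean_error_cost(records, cost_matrix, default_cost=1.0):
    """Average cost per decision; correct predictions cost nothing."""
    total = 0.0
    for true_label, predicted in records:
        if predicted != true_label:
            total += cost_matrix.get((true_label, predicted), default_cost)
    return total / len(records)

def accuracy(records):
    return sum(t == p for t, p in records) / len(records)

# Hypothetical relative costs: a missed churn risk is far worse than a misrouted ticket.
cost_matrix = {("billing", "bug"): 1.0, ("churn_risk", "healthy"): 50.0}

# Variant A: 98% accurate, but both errors are missed churn risks.
variant_a = ([("billing", "billing")] * 90 + [("churn_risk", "churn_risk")] * 8
             + [("churn_risk", "healthy")] * 2)
# Variant B: 95% accurate, and all errors are cheap misroutes.
variant_b = ([("billing", "billing")] * 85 + [("billing", "bug")] * 5
             + [("churn_risk", "churn_risk")] * 10)

cost_a = mean_error_cost(variant_a, cost_matrix)
cost_b = mean_error_cost(variant_b, cost_matrix)
print(accuracy(variant_a), cost_a)
print(accuracy(variant_b), cost_b)
```

Under these assumed costs, variant B is the better ship despite its lower headline accuracy.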
How Often to Measure
Metrics are not a one-time gate. The right cadence:
- On every prompt change: re-run the full eval before shipping any change to instruction or examples.
- On every model upgrade: re-baseline zero-shot to find examples the new model no longer needs and to catch regressions the upgrade introduces.
- On a recurring schedule: refresh the eval set from production when your input distribution shifts, so the metrics keep reflecting reality.
A metric measured once and never again is a snapshot of a moving target. The teams that win treat their eval harness as living infrastructure, not a launch checklist.
Building a Metric Dashboard That Drives Decisions
Individual metrics are only useful if they sit together where you can read them against each other. The dashboard that actually drives the zero-shot-versus-few-shot decision puts five things side by side, per prompt variant: per-category accuracy, output label distribution, stability score under reordering, token cost per call, and p95 latency.
The reason they belong together is that no single number is sufficient and several are in tension. A variant might win on accuracy while losing on cost and latency; another might look stable until you notice it skews the label distribution. Reading them in isolation leads to shipping the wrong prompt. Reading them together — and computing the accuracy-per-token ratio across them — turns a pile of numbers into a decision. Build the dashboard so that comparing two variants is a glance, not a spreadsheet exercise, because the comparison is something you will do on every prompt change and every model upgrade.
The Trap of Optimizing a Single Metric
The most common metrics mistake is not measuring too little but optimizing one number too hard. A team fixates on aggregate accuracy and tunes a prompt that posts an impressive headline figure while quietly inflating token cost, skewing the label distribution, and pushing p95 latency past the user's tolerance. The metric went up; the system got worse.
This is why the composite read matters. Every metric is a constraint as much as a target. Accuracy must clear your bar, but so must latency and cost, and the distribution must stay balanced. The right prompt is the one that satisfies all the constraints at once, not the one that maximizes any single metric. Whenever you find yourself celebrating a number, check what the other four did before you ship.
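Treating metrics as constraints can be sketched as a gate that lists which limits a variant breaks. The field names, limit values, and both variants are assumptions for illustration, so set the limits from your own accuracy bar, budget, and UX tolerance.

```python
from dataclasses import dataclass

@dataclass
class VariantMetrics:
    name: str
    min_category_accuracy: float  # worst per-category accuracy
    max_label_skew: float         # largest absolute predicted-vs-true share gap
    flip_rate: float              # share of inputs that flip under reordering
    tokens_per_call: int
    p95_latency_ms: float

# Hypothetical limits.
LIMITS = {
    "min_category_accuracy": 0.85,
    "max_label_skew": 0.05,
    "flip_rate": 0.02,
    "tokens_per_call": 2000,
    "p95_latency_ms": 1200.0,
}

def violations(m, limits=LIMITS):
    """Return the names of every constraint the variant fails."""
    failed = []
    if m.min_category_accuracy < limits["min_category_accuracy"]:
        failed.append("per-category accuracy")
    if m.max_label_skew > limits["max_label_skew"]:
        failed.append("label distribution")
    if m.flip_rate > limits["flip_rate"]:
        failed.append("stability")
    if m.tokens_per_call > limits["tokens_per_call"]:
        failed.append("token cost")
    if m.p95_latency_ms > limits["p95_latency_ms"]:
        failed.append("p95 latency")
    return failed

# Hypothetical variants: one tuned hard for accuracy, one balanced.
tuned = VariantMetrics("accuracy_tuned", 0.93, 0.12, 0.01, 3400, 1900.0)
balanced = VariantMetrics("balanced", 0.88, 0.03, 0.01, 1500, 900.0)

print(tuned.name, violations(tuned))
print(balanced.name, violations(balanced))
```

A variant ships only when the list is empty. Among the variants that pass, accuracy-per-token can break the tie.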
Frequently Asked Questions
Why is aggregate accuracy not enough?
It hides per-category failures and label bias. A prompt can score well overall while failing on the one category you care about or systematically skewing on ambiguous inputs. Per-category accuracy and output distribution reveal what the aggregate conceals.
How do I measure label bias?
Compare predicted label frequencies against the true distribution on a balanced test set. A skew toward the most common or most recent example label signals bias from your example set that you should rebalance before shipping.
What is the stability metric actually telling me?
How much your measured accuracy depends on an arbitrary example ordering. If predictions flip when you reorder examples, your accuracy number is unreliable, and you need to fix the order or recency bias in the example set before you trust any other metric.
How do I weigh accuracy against token cost?
Compute accuracy gain per added token versus the zero-shot baseline, and read it against a real dollar figure at your volume. A small accuracy lift that adds many tokens often fails this test, especially at high request volumes.
Which latency metric matters most?
p95, not the average. The average hides the slow tail that users actually feel. If few-shot pushes p95 past your UX tolerance, the accuracy gain may not justify it regardless of cost.
Key Takeaways
- Measure accuracy per category, never just the aggregate — it hides concentrated failures.
- Check the output label distribution to catch bias the accuracy number masks.
- Test stability under example reordering before trusting any accuracy figure.
- Track per-call token cost and p95 latency as the price side of the trade-off.
- Use accuracy-per-added-token as the composite metric to compare variants.