The decision is rarely whether to classify text. It is how. You have at least four credible options: zero-shot prompting with no examples, few-shot prompting with a handful of examples, fine-tuning a model on your own labeled data, and a traditional supervised classifier. Each wins under different conditions, and choosing the wrong one wastes either money or accuracy, sometimes both.
This article lays out the competing approaches, the axes on which they genuinely differ, and a decision rule you can apply to your own situation. The goal is not to crown zero-shot as the universal answer. It is to help you recognize the conditions under which zero-shot is the right tool and the conditions under which reaching for it is a mistake.
The honest summary up front: zero-shot wins when you have no labeled data, need to move fast, and your categories are reasonably distinct. The moment you have labeled data and stable, high-volume needs, the other options start to pull ahead. The rest of this piece makes that precise.
The Axes That Actually Differentiate
Labeled data availability
This is the dominant axis. Zero-shot needs zero labeled examples. Few-shot needs a handful. Fine-tuning and supervised classifiers need hundreds to thousands. If you have no labels and no budget to create them, the field narrows immediately.
Accuracy ceiling
A fine-tuned model on plentiful data generally reaches a higher accuracy ceiling than zero-shot, because it has learned your specific distribution. Zero-shot's ceiling is set by how well the categories can be described in language.
Cost shape and volume
Zero-shot and few-shot are billed per call, so their total cost scales with volume. Fine-tuning has a large upfront cost and cheaper inference. At very high volume the two cost curves cross, a calculation detailed in Defending the Spreadsheet When You Skip the Labeling Budget.
- Labeled data: the dominant constraint
- Accuracy ceiling: fine-tuning highest with enough data
- Cost shape: per-call versus upfront, crossing at high volume
- Time to first result: zero-shot fastest, fine-tuning slowest
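The cost crossing can be sketched as a break-even calculation. The following is a minimal illustration, assuming a constant price per call for each approach and a single upfront figure that lumps labeling and training together; the function name and every number in it are hypothetical, not real prices.

```python
def break_even_calls(upfront_cost: float,
                     prompt_cost_per_call: float,
                     tuned_cost_per_call: float) -> float:
    """Number of calls at which prompting and fine-tuning cost the same in total."""
    saving_per_call = prompt_cost_per_call - tuned_cost_per_call
    if saving_per_call <= 0:
        # Fine-tuned inference is not cheaper per call, so it never catches up.
        return float("inf")
    return upfront_cost / saving_per_call


# Hypothetical figures, for illustration only (not real prices).
calls = break_even_calls(
    upfront_cost=5000.0,          # labeling plus training, assumed
    prompt_cost_per_call=0.002,   # assumed
    tuned_cost_per_call=0.0005,   # assumed
)
print(f"Break-even at roughly {calls:,.0f} calls")
```

With these assumed inputs the crossing sits at about 3.3 million calls. Below that volume the per-call approach is cheaper in total; above it the upfront investment starts to pay back.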
Zero-Shot Prompting
Where it wins
Zero-shot wins when you have no labeled data and need the fastest route to a working result. Changing categories is trivial: you just edit the prompt. For distinct categories and short, intent-rich text, accuracy is often more than adequate. The examples in Classifying Support Tickets Without a Single Labeled Example show the sweet spot.
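To make the "edit the prompt" point concrete, here is a minimal sketch of a zero-shot prompt builder. The function name, the prompt wording, and the example categories are all assumptions for illustration; the call to an actual model is deliberately left out.

```python
from typing import Dict


def build_zero_shot_prompt(categories: Dict[str, str], text: str) -> str:
    """Assemble a classification prompt from category names and descriptions."""
    lines = ["Classify the text into exactly one of these categories:"]
    for name, description in categories.items():
        lines.append(f"- {name}: {description}")
    lines.append("Answer with the category name only.")
    lines.append(f"Text: {text}")
    return "\n".join(lines)


# Hypothetical categories; adding or renaming one is a one-line change.
categories = {
    "billing": "questions about invoices, charges, or refunds",
    "bug": "reports that something is broken or behaving unexpectedly",
}
prompt = build_zero_shot_prompt(categories, "I was charged twice this month.")
print(prompt)
```

Because the categories live in a plain dictionary, adding a third category requires no relabeling and no retraining, only a new entry and a fresh audit of the results.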
Where it loses
Zero-shot loses on subtle category boundaries, at very high volume where per-call cost dominates, and on tasks where the accuracy ceiling matters more than speed. It also offers less consistency than a fine-tuned model on repeated runs.
Few-Shot Prompting
Where it wins
When you have even a handful of labeled examples, few-shot often lifts accuracy on ambiguous categories meaningfully over zero-shot, while keeping the flexibility of prompting. It is the natural next step when zero-shot validation reveals a stubborn category.
Where it loses
You must curate balanced, representative examples, which is real work, and each example consumes tokens on every call, raising cost. Poorly chosen examples can hurt rather than help.
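One small part of that curation, keeping the example set balanced across categories, can be sketched as follows. This is an illustrative helper under the assumption that examples arrive as (text, label) pairs; the names are hypothetical, and it does nothing to judge whether an example is representative, which remains manual work.

```python
from collections import defaultdict
from typing import List, Tuple

Example = Tuple[str, str]  # (text, label)


def select_balanced(examples: List[Example], per_category: int) -> List[Example]:
    """Keep at most `per_category` examples for each label, preserving order."""
    kept: List[Example] = []
    counts = defaultdict(int)
    for text, label in examples:
        if counts[label] < per_category:
            kept.append((text, label))
            counts[label] += 1
    return kept


def format_examples(examples: List[Example]) -> str:
    """Render examples as the block that is prepended to every call."""
    return "\n".join(f"Text: {t}\nCategory: {l}" for t, l in examples)


pool = [("a", "x"), ("b", "x"), ("c", "x"), ("d", "y")]
chosen = select_balanced(pool, per_category=2)
block = format_examples(chosen)
# The block is sent with every request, so its length is a recurring cost.
print(len(block), "characters added to each call")
```

Character count is only a rough proxy for tokens, but it makes the trade-off visible: every extra example is paid for on every call.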
Fine-Tuning and Supervised Classifiers
Where they win
With hundreds to thousands of labeled examples, stable categories, and high volume, fine-tuning delivers the highest accuracy ceiling and the cheapest per-call inference. Traditional classifiers can be even cheaper and faster for simple, well-bounded problems.
Where they lose
They demand labeled data and engineering effort, and they are slow to change: a new category means relabeling and retraining. For a problem that shifts or has no labels, this rigidity is disqualifying.
The Decision Rule
A practical sequence
Start with the data question. No labels and need speed? Zero-shot. Validation shows a weak category and you can produce a few examples? Few-shot. Stable categories, plentiful labels, and high volume? Fine-tuning. This ladder lets you escalate only when the evidence demands it, which is the loop described in Naming the Stages That Turn Raw Labels Into Reliable Sorting.
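The ladder can be written down as a small function. This is a sketch of the rule as stated above, not a definitive policy: the parameter names are hypothetical, and the 1000-label threshold is an assumed stand-in for "plentiful labels" that you should replace with your own figure.

```python
def choose_approach(labeled_examples: int,
                    weak_category_found: bool,
                    categories_stable: bool,
                    high_volume: bool,
                    min_finetune_labels: int = 1000) -> str:
    """Return the rung of the ladder that the current evidence points to."""
    if (labeled_examples >= min_finetune_labels
            and categories_stable and high_volume):
        return "fine-tuning"
    if weak_category_found and labeled_examples > 0:
        return "few-shot"
    # Default rung: no labels, or no evidence yet that more is needed.
    return "zero-shot"


print(choose_approach(0, False, False, False))
print(choose_approach(20, True, False, False))
print(choose_approach(5000, False, True, True))
```

The function only names the rung the evidence supports. It does not replace validating the cheaper rungs first, which is where the `weak_category_found` input should come from.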
Do not skip rungs
Jumping straight to fine-tuning before validating that you even need it is the most expensive mistake. Zero-shot is cheap enough that proving the problem with it first almost always pays for itself, even if you eventually move on.
The Hidden Axes People Forget
Maintenance burden over time
The upfront comparison usually ignores what each approach costs to keep alive. A zero-shot classifier changes with a prompt edit and a re-audit. A fine-tuned model changes with relabeling and retraining. If your categories evolve even occasionally, the maintenance gap compounds into a large difference that no launch-day cost comparison captures.
Consistency and auditability
Some domains care less about peak accuracy than about applying the same criterion every time, defensibly. Fine-tuned models tend to be more consistent across runs, while zero-shot output can vary slightly from one run to the next. Where an output must be auditable, the consistency axis can outweigh a small accuracy difference, and it rarely appears in a naive comparison.
Time to adapt to a new requirement
When the business asks for a new category next week, how fast can each approach deliver it? Zero-shot answers in minutes, few-shot in hours, fine-tuning in days or weeks. For fast-moving requirements this responsiveness axis can dominate every other consideration.
- Maintenance burden compounds when categories evolve
- Consistency and auditability can outweigh peak accuracy
- Time to adapt favors prompting heavily
Common Mismatches and How to Avoid Them
Choosing fine-tuning for an unstable problem
Teams sometimes pick fine-tuning for its accuracy ceiling, then discover their categories shift quarterly, forcing repeated retraining that erases the advantage. If your taxonomy is not stable, the rigidity of fine-tuning is a liability regardless of its accuracy. How taxonomies shift is explored in What Shifts in Labelless Text Sorting Through 2026.
Choosing zero-shot for a subtle, high-volume task
The opposite mismatch is reaching for zero-shot on a task with subtle category boundaries and enormous volume. Here the accuracy ceiling and per-call cost both work against you. The discipline is to catch the signal early through measurement, which is what the Verify stage of the framework in Naming the Stages That Turn Raw Labels Into Reliable Sorting is designed to do.
Letting sunk cost dictate the choice
Once a team has invested in labeling, it often feels obligated to fine-tune even when zero-shot would now suffice. The labels are not wasted; they make an excellent audit set. Let the current evidence, not the past investment, drive the decision.
Ignoring the cost of being wrong slowly
A subtler mismatch is choosing the approach that looks cheapest today without accounting for how its errors accumulate. A classifier that drifts quietly because no one budgeted for re-audits can cost more in bad decisions than a slightly pricier approach that stays accurate. Factor the cost of undetected errors into the comparison, not just the cost of running the thing.
Making the Decision Defensible
Write down the axis weights
Before comparing options, decide which axes matter most for this specific problem and write the weighting down. A team that agrees in advance that maintenance and adaptability outweigh peak accuracy will not be seduced later by a fine-tuned model's headline number. The weighting is the decision; the comparison just applies it.
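One way to make the written-down weighting operational is a simple weighted score. The sketch below assumes each option is rated per axis on a 1 to 5 scale; the axis names, weights, and ratings are hypothetical placeholders, not measurements.

```python
from typing import Dict


def weighted_scores(weights: Dict[str, float],
                    ratings: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Weighted average rating per option, using the agreed axis weights."""
    total = sum(weights.values())
    return {
        option: sum(weights[axis] * axis_ratings[axis] for axis in weights) / total
        for option, axis_ratings in ratings.items()
    }


# Agreed in advance: maintenance and adaptability outweigh peak accuracy.
weights = {"maintenance": 3, "adaptability": 3, "peak_accuracy": 1}

# Hypothetical 1-5 ratings, for illustration only.
ratings = {
    "zero-shot":   {"maintenance": 5, "adaptability": 5, "peak_accuracy": 3},
    "fine-tuning": {"maintenance": 2, "adaptability": 1, "peak_accuracy": 5},
}

scores = weighted_scores(weights, ratings)
for option, score in scores.items():
    print(f"{option}: {score:.2f}")
```

Under these assumed weights, fine-tuning's higher accuracy rating does not overturn the result, which is the point of fixing the weights before looking at the numbers.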
Pilot the cheap option first
When the choice is genuinely close, run a zero-shot pilot before committing to anything heavier. It costs little, produces a measured error rate, and either confirms the cheap option suffices or proves you need more, while generating the audit set the heavier option would require anyway. This pilot-first discipline mirrors the build sequence in Naming the Stages That Turn Raw Labels Into Reliable Sorting.
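The measured error rate from such a pilot can be computed per category with a few lines. This is a minimal sketch that assumes you have the pilot's predictions alongside a small hand-checked audit set; the function name and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def error_rates(predictions: List[str], gold: List[str]) -> Dict[str, float]:
    """Fraction of audited items in each true category that were misclassified."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    totals: Counter = Counter()
    errors: Counter = Counter()
    for pred, true in zip(predictions, gold):
        totals[true] += 1
        if pred != true:
            errors[true] += 1
    return {label: errors[label] / totals[label] for label in totals}


# Hypothetical audit set of four hand-checked items.
gold = ["billing", "billing", "bug", "bug"]
preds = ["billing", "bug", "bug", "bug"]
rates = error_rates(preds, gold)
print(rates)
```

Breaking the rate down by true category matters because an acceptable overall number can hide one stubborn category, which is the signal that decides whether the cheap option suffices.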
Revisit the decision on a schedule
A choice that was correct at one volume and model price may not stay correct. Put a date on the calendar to re-examine the trade-off as conditions change, a habit that matters more as model costs keep falling.
Frequently Asked Questions
Is fine-tuning always more accurate than zero-shot?
Only with enough quality labeled data. On small or noisy datasets, fine-tuning can underperform a well-written zero-shot prompt while costing far more. The accuracy advantage is real but conditional on data volume and quality.
Can I mix approaches?
Yes, and many strong systems do. A common pattern uses zero-shot for the easy, high-volume categories and routes ambiguous cases to few-shot or human review. Hybrid designs often beat any single approach.
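A hybrid of this kind can be sketched as a small router. The version below is one possible design, assuming the easy categories are known in advance and that both classifiers are supplied as plain callables; the stub classifiers and label names are hypothetical stand-ins for real model calls or a human review queue.

```python
from typing import Callable, Set, Tuple


def route(text: str,
          zero_shot: Callable[[str], str],
          fallback: Callable[[str], str],
          easy_labels: Set[str]) -> Tuple[str, str]:
    """Accept zero-shot labels for easy categories; send the rest to a fallback."""
    label = zero_shot(text)
    if label in easy_labels:
        return label, "zero-shot"
    # Anything outside the easy set is treated as ambiguous.
    return fallback(text), "fallback"


# Stub classifiers standing in for real model calls (assumptions).
def stub_zero_shot(text: str) -> str:
    return "billing" if "invoice" in text else "other"


def stub_fallback(text: str) -> str:
    return "needs_review"


easy = {"billing"}
print(route("Where is my invoice?", stub_zero_shot, stub_fallback, easy))
print(route("It feels slow lately", stub_zero_shot, stub_fallback, easy))
```

Returning the path taken alongside the label makes it easy to track how much traffic reaches the more expensive fallback.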
How much labeled data justifies fine-tuning?
There is no universal number, but below a few hundred clean examples per category, fine-tuning rarely justifies its cost over few-shot prompting. Above several thousand with stable categories, fine-tuning's ceiling and cheap inference start to dominate.
Does zero-shot consistency matter for production?
It can. Zero-shot outputs vary slightly more across runs than a fine-tuned model. For high-stakes consistency, lower the model's randomness, add structure, or move to fine-tuning once you have the data.
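Two of these ideas can be illustrated without calling any model. The sketch below reads "add structure" as constraining answers to a closed label set, which is one interpretation rather than the only one, and it measures run-to-run agreement so that consistency becomes a number you can track. All names are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Set


def normalize_label(raw: str, allowed: Set[str]) -> Optional[str]:
    """Map a raw model answer onto the closed label set, or None if it does not fit."""
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in allowed else None


def agreement_rate(labels: List[str]) -> float:
    """Share of repeated runs that returned the most common label."""
    if not labels:
        raise ValueError("need at least one run")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


allowed = {"billing", "bug"}
print(normalize_label(" Billing. ", allowed))
print(normalize_label("maybe billing", allowed))
print(agreement_rate(["billing", "billing", "bug", "billing"]))
```

Answers that normalize to None are worth logging rather than guessing at, and a low agreement rate on a given input is a reasonable trigger for review.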
Key Takeaways
- The dominant axis is labeled data availability; with none and a need for speed, zero-shot is the natural choice.
- Fine-tuning reaches the highest accuracy ceiling but only with hundreds to thousands of clean labels and stable categories.
- Cost shape differs: zero-shot and few-shot cost per call, fine-tuning costs upfront, and they cross at high volume.
- Escalate up the ladder only when validation demands it; jumping straight to fine-tuning is the most expensive mistake.
- Hybrid designs, zero-shot for easy categories and few-shot or human review for hard ones, often beat any single approach.