Most advice about zero-shot versus few-shot prompting is generic enough to be useless: "use few-shot for complex tasks." That tells you nothing actionable. The practices below come from shipping prompts that run thousands of times a day, where a 5% accuracy difference or a doubled token bill is a number someone has to answer for.
The throughline of every practice here is the same: treat the choice as an engineering decision with measurable trade-offs, not a stylistic preference. You earn the right to add examples; you do not add them by default.
Start Zero-Shot, Earn Your Way to Few-Shot
The default should always be zero-shot with a precise instruction. Examples are an expense — tokens, latency, and curation effort — and you should add them only when an eval proves they pay for themselves.
Why this order matters
Starting few-shot lets you skip the work of writing a clear instruction, and that debt compounds. When you eventually need to scale, change models, or trim cost, you discover the real specification lived in the examples and nobody wrote it down. Zero-shot first forces clarity early.
In practice: write the sharpest instruction you can, run it against a labeled set, and only reach for examples on the inputs it fails.
Make the Instruction Carry the Specification
Whether or not you use examples, the instruction should fully describe the task: the output format, the edge-case handling, the tone, the constraints. Examples reinforce; they should never be the sole source of truth.
A good test: hand your instruction to a competent human with no examples. If they can produce the right output, your instruction is doing its job. If they need to see examples to understand the task, the instruction is underspecified.
This discipline pays off when you switch models, because a strong instruction transfers far better than a hand-tuned example set.
Choose Few-Shot Examples Like a Statistician
When you do use examples, treat the set as a sample of your real input distribution, not a highlight reel.
- Pull from production data, including the ugly cases — typos, mixed formats, ambiguous inputs.
- Balance the labels so the model does not inherit a majority bias.
- Cover the edges deliberately; two hard examples teach more than six easy ones.
- Cap the count — find the minimum number that captures the accuracy gain, usually two to five.
The common mistakes guide details how unrepresentative examples quietly degrade accuracy.
Control for Order and Recency Bias
Few-shot models weight recent and majority examples more heavily. If your last example is "positive," ambiguous inputs drift positive. This is not noise; it is a systematic bias you can measure.
The practice: shuffle example order across calls when you can, and audit the output distribution against a balanced test set. If reordering examples changes your answers, fix the balance before shipping.
Instrument Before You Optimize
You cannot improve what you do not measure. Before tuning anything, stand up:
- A labeled eval set of real inputs, refreshed from production periodically.
- Per-call token counts so you see the cost of every example.
- Latency tracking for time-to-first-token, which examples directly inflate.
Without this, "few-shot is better" is an opinion. With it, you know the exact accuracy, cost, and latency delta of each example. The metrics guide walks through what to instrument and how to read the signal.
Re-Evaluate on Every Model Change
Prompts rot. A five-shot prompt tuned for an older model is frequently over-engineered for a newer one that solves the task zero-shot. Carrying forward stale prompts means paying the example tax forever.
The re-test ritual
Each time you upgrade models, strip the examples and re-run your zero-shot baseline. If accuracy holds, delete the examples. We routinely cut prompts in half this way after a model upgrade — pure cost savings for equal quality.
Know When to Stop Prompting and Start Fine-Tuning
Prompting has a ceiling. When a task is narrow, stable, and extremely high-volume, the cost of re-sending long example prompts on every call eventually exceeds the cost of fine-tuning. Fine-tuning also gives more consistent behavior than examples, which the model re-interprets on every request.
The practice: track your monthly token spend on few-shot examples specifically. When that line gets large and the task is stable, model a fine-tune. For the full decision logic, see the trade-offs guide and A Framework for Zero Shot vs Few Shot Learning.
Target Examples at Failures, Not at the Whole Task
When zero-shot falls short, the reflex is to add examples across the board. The better practice is surgical: identify exactly which inputs fail, and add examples aimed only at those. If a support classifier nails seven categories zero-shot and only fumbles a rare eighth, you add one or two examples for the eighth category — not a balanced set spanning all eight.
This keeps the prompt lean and the cost proportional to the problem. Blanket example sets pay tokens to re-teach categories the model already handles. The discipline is to treat each example as a fix for a named failure, with the failing input documented. If you cannot point to the specific input an example addresses, you probably do not need that example.
Write for Portability Across Models
Models churn. The prompt you ship today may run on three different model versions over its life. The practice that survives this is portability: keep the specification in the instruction, where it transfers, rather than in hand-tuned examples, which often do not.
A prompt whose accuracy depends on five carefully ordered examples is brittle — change the model and the careful tuning may no longer hold. A prompt carried by an explicit instruction degrades gracefully and frequently improves on a newer model with no changes at all. When you must use examples, prefer the smallest set that captures the gain, so there is less tuning to break. Portability is why the instruction-first habit pays compounding dividends, a point our framework builds its whole maintenance loop around.
Document Why Each Example Exists
The most under-practiced discipline is writing down, next to each example, what it teaches and which failure it fixes. Months later, when someone wants to trim the prompt for cost or adapt it to a new model, this note is the difference between a confident cut and a fearful one.
Without documentation, example sets become untouchable — nobody remembers why each one is there, so nobody dares remove any, and the prompt only ever grows. A one-line comment per example ("resolves relative dates," "covers the rare security-disclosure category") turns the example set from a black box into a maintainable asset. This single habit prevents the slow bloat that afflicts every long-lived few-shot prompt.
Frequently Asked Questions
What is the single most important best practice?
Start zero-shot and require evidence before adding examples. This one habit prevents the majority of cost and stability problems, because it forces a clear instruction and avoids paying for examples that do not help.
How do I write an instruction strong enough to skip examples?
Specify the output format explicitly, name the edge cases and how to handle them, and constrain tone and length. Validate by having a human follow the instruction with no examples — if they succeed, the model usually will too.
When are examples genuinely worth the cost?
When the task has implicit formatting rules, lives in a niche domain the model knows poorly, or requires an unusual output structure that is hard to describe in words. In those cases two or three well-chosen examples can lift accuracy meaningfully.
Should I shuffle examples on every call?
Shuffle when your application can tolerate non-determinism and you want to neutralize order bias. If you need reproducibility, fix the order but balance the labels carefully and audit for recency bias.
How often should I refresh my eval set?
Refresh whenever your input distribution shifts — new customer segments, new formats, seasonal language. A stale eval set hides regressions. Pull fresh real inputs at least quarterly for active systems.
Key Takeaways
- Default to zero-shot; add examples only when an eval proves they pay off.
- The instruction, not the examples, should carry the task specification.
- Treat few-shot examples as a representative sample — balance labels, include hard cases, cap the count.
- Audit for order and recency bias before shipping.
- Instrument accuracy, tokens, and latency before optimizing anything.
- Re-test zero-shot on every model upgrade and delete examples you no longer need.