Decide Few Shot Versus Zero Shot by Measurement, Not Hunch

Most explanations of zero shot and few shot learning stop at definitions and leave you to figure out the doing. This one is a workflow. Follow the steps in order and you'll end with a prompt you've actually tested, not one you hope works.

The whole process rests on a single principle: decide with measurement, not intuition. You will write a zero shot prompt, score it against a tiny test set, and only add examples if the numbers say to. That discipline is what keeps you from paying for few shot tokens you don't need or shipping a brittle prompt that breaks on real inputs.

Set aside 60 to 90 minutes for your first pass. After you've done it once, it becomes a 20-minute habit you'll run for every new task. Let's go step by step.

Step 1: Write the task down as a spec

Before any prompting, write one or two sentences describing exactly what success looks like. Name the input, the output, and the format. "Take a customer review and return exactly one label: positive, neutral, or negative" is a spec. "Analyze reviews" is not.

Why this comes first

If you can't write the spec, no amount of examples will save you — the model can't hit a target you haven't defined. A precise spec is also what makes zero shot work, since most zero shot failures are actually vague-instruction failures in disguise.
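One way to keep the spec honest is to write it down as data and generate the prompt from it. The sketch below is a minimal illustration, not a prescribed format: `TaskSpec`, its field names, and the prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container naming the input, the allowed outputs, and the format."""
    input_name: str
    labels: Tuple[str, ...]
    instruction: str

    def zero_shot_prompt(self, text: str) -> str:
        # Spec plus input, no examples.
        allowed = ", ".join(self.labels)
        return (
            f"{self.instruction}\n"
            f"Allowed labels: {allowed}\n"
            "Respond with the label only.\n\n"
            f"{self.input_name}: {text}\n"
            "Label:"
        )


REVIEW_SPEC = TaskSpec(
    input_name="Review",
    labels=("positive", "neutral", "negative"),
    instruction="Take a customer review and return exactly one label.",
)
```

If a field is hard to fill in, that is usually a sign the spec is still too vague to test.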

Step 2: Build a tiny evaluation set

Hand-pick 15 to 25 representative inputs and write the correct output for each yourself. Include the easy cases and the awkward ones — sarcasm, edge cases, empty inputs. This is your answer key.

Keep it small enough to label by hand in 20 minutes.
Include at least a few hard cases on purpose; they're where techniques diverge.
Store inputs and expected outputs side by side so scoring is fast.

This set is the most valuable artifact in the whole process. Skip it and you're back to guessing.
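As a rough sketch of what "side by side" can look like, the eval set can be a list of input and expected-output pairs stored one per line. The reviews, the labels, and the JSON Lines layout below are illustrative assumptions; a real set would hold 15 to 25 cases.

```python
import json

# Illustrative cases only.
EVAL_SET = [
    {"input": "Love it, works perfectly.", "expected": "positive"},
    {"input": "It arrived. It is a toaster.", "expected": "neutral"},
    {"input": "Broke after two days.", "expected": "negative"},
    # Hard case on purpose: sarcasm.
    {"input": "Oh great, another charger that doesn't charge.", "expected": "negative"},
    # Hard case on purpose: empty input. Treating it as neutral is an assumption.
    {"input": "", "expected": "neutral"},
]


def to_jsonl(cases):
    """Serialize one case per line so the answer key is easy to diff and review."""
    return "\n".join(json.dumps(case) for case in cases)


def from_jsonl(text):
    """Load cases back, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```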

Step 3: Run zero shot and score it

Write a clean zero shot prompt: your spec plus the input, no examples. Run it against all 15 to 25 cases and count how many match your answer key. Write down the score.

If you hit 90% or better and the failures are minor, you may be done — zero shot is cheaper and easier to maintain, so don't add complexity you don't need. If you're below that, note what kind of error you're seeing, because that determines your next move.
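Scoring by hand works, but the counting is simple enough to sketch in code. In the example below, `score_prompt` and `stub_model` are hypothetical names; `stub_model` stands in for whatever model call you actually use, and the exact-match comparison after trimming and lowercasing is an assumption about what counts as correct.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, str]


def score_prompt(
    build_prompt: Callable[[str], str],
    model: Callable[[str], str],
    cases: List[Case],
) -> Tuple[float, List[Case]]:
    """Return (accuracy, failures) for one prompt over the whole eval set."""
    failures = []
    for case in cases:
        raw = model(build_prompt(case["input"]))
        got = raw.strip().lower()  # assumption: case and outer whitespace don't matter
        if got != case["expected"]:
            failures.append({**case, "got": raw})
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call; it always answers "positive".
    return "positive"


cases = [
    {"input": "Love it.", "expected": "positive"},
    {"input": "Broke in a week.", "expected": "negative"},
    {"input": "It is a kettle.", "expected": "neutral"},
    {"input": "Works great.", "expected": "positive"},
]
accuracy, failures = score_prompt(
    lambda text: f"Label this review: {text}", stub_model, cases
)
```

Keeping the failures alongside the score matters, because the next step depends on what kind of errors they are.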

Step 4: Diagnose the failures

Look at what went wrong and sort it into one of two buckets:

Content errors — the model misunderstood the task or reasoned wrong. Fix the instruction: add clarity, definitions, or constraints. Examples won't fix a misunderstanding.
Format or edge-case errors — the model gets the gist but mangles the output shape or fumbles tricky inputs. This is the signal to add examples.

This diagnosis is the hinge of the entire workflow. The framework article formalizes this decision if you want it as a reusable diagram.
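A first-pass triage can be partly mechanical. The sketch below assumes a fixed label set and splits failures into outputs that are not a valid label at all (clearly format errors) and outputs that are a valid but wrong label. The second group still needs a human read, since it can hold either content errors or edge cases. The names `triage` and `ALLOWED` are hypothetical.

```python
from typing import Dict, List

ALLOWED = {"positive", "neutral", "negative"}


def triage(failures: List[Dict[str, str]], allowed=ALLOWED) -> Dict[str, List[Dict[str, str]]]:
    """Split failures into off-format outputs and valid-but-wrong labels."""
    buckets: Dict[str, List[Dict[str, str]]] = {"format": [], "wrong_label": []}
    for failure in failures:
        got = failure["got"].strip().lower()
        key = "wrong_label" if got in allowed else "format"
        buckets[key].append(failure)
    return buckets


# Illustrative failures, shaped as input, expected label, and what the model returned.
failures = [
    {"input": "Love it.", "expected": "positive", "got": "The review is positive."},
    {"input": "Oh great, it broke again.", "expected": "negative", "got": "neutral"},
    {"input": "Stopped working.", "expected": "negative", "got": "Negative."},
]
buckets = triage(failures)
```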

Step 5: Add few shot examples deliberately

If step 4 pointed to format or edge cases, add two to three examples. Choose them on purpose:

Pick examples that demonstrate the exact failure you saw — if it fumbled sarcasm, include a sarcastic example labeled correctly.
Make them representative, not cherry-picked easy cases.
Keep your labels perfectly consistent; one wrong example poisons the pattern.
Balance the categories so you don't accidentally teach a default.

Re-run the full eval set and compare the new score to your zero shot baseline.
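For reference, a few shot prompt is just the instruction, the labeled examples, and the new input in one consistent layout. The following is a minimal sketch; the `Review:` and `Label:` layout, the function name, and the example reviews are assumptions chosen for illustration.

```python
from typing import List, Tuple


def few_shot_prompt(instruction: str, examples: List[Tuple[str, str]], text: str) -> str:
    """Instruction, then labeled examples, then the new input in the same layout."""
    shots = "\n\n".join(f"Review: {inp}\nLabel: {label}" for inp, label in examples)
    return f"{instruction}\n\n{shots}\n\nReview: {text}\nLabel:"


# One example per category, including the sarcasm case that failed in zero shot.
EXAMPLES = [
    ("Oh great, another charger that doesn't charge.", "negative"),
    ("Does what the box says.", "neutral"),
    ("Best purchase this year.", "positive"),
]

prompt = few_shot_prompt(
    "Take a customer review and return exactly one label: positive, neutral, or negative.",
    EXAMPLES,
    "Stopped working after a week.",
)
```

Ending the prompt on a bare `Label:` mirrors the examples, which nudges the model toward a bare-label answer.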

Step 6: Test two, then four examples

Don't assume more is better. Run the eval at two examples, then four, and chart the scores. You'll usually see accuracy climb then plateau — or even dip if examples start conflicting. Pick the smallest example count that captures most of the gain, because every example adds cost and latency to every future call.

If four examples barely beat two, ship two. If accuracy is still climbing when you extend the test to six, your task may be a fine-tuning candidate rather than a few shot one. The complete guide covers when to make that jump.
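"Smallest count that captures most of the gain" can be made explicit with a tolerance. In the sketch below, the two-point tolerance and the sweep scores are hypothetical values, not recommendations.

```python
from typing import Dict


def smallest_sufficient_count(scores: Dict[int, float], tolerance: float = 0.02) -> int:
    """Smallest example count whose score is within `tolerance` of the best score."""
    best = max(scores.values())
    return min(n for n, score in scores.items() if score >= best - tolerance)


# Hypothetical sweep results: example count -> accuracy on the eval set.
sweep = {0: 0.80, 2: 0.92, 4: 0.93}
chosen = smallest_sufficient_count(sweep)  # 2: four examples add only one point
```

On a 20-case eval set a single case is worth five points, so a tolerance tighter than one case is not meaningful.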

Step 7: Lock it in and document

Save the winning prompt, your eval set, and the scores together. When the task changes later, you re-run the same eval and know immediately whether your change helped. This turns prompting from art into something you can maintain.
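A single file per task is one simple way to keep the three pieces together. The sketch below assumes a JSON layout with `prompt`, `eval_set`, and `scores` keys; the layout, the file name, and the values are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def save_run(path: Path, prompt: str, eval_set: list, scores: dict) -> None:
    """Write the prompt, the answer key, and the scores into one file."""
    record = {"prompt": prompt, "eval_set": eval_set, "scores": scores}
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")


def load_run(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


# Round trip in a temporary directory; a real project would use a versioned path.
with tempfile.TemporaryDirectory() as tmp:
    run_file = Path(tmp) / "sentiment_v1.json"
    save_run(
        run_file,
        prompt="Label this review: {text}",
        eval_set=[{"input": "Love it.", "expected": "positive"}],
        scores={"zero_shot": 0.8, "two_examples": 0.92},
    )
    restored = load_run(run_file)
```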

For patterns that consistently hold up in production, read the article on best practices that actually work. And before you finalize, run the 2026 checklist to catch anything you missed.

Step 8: Watch for the silent failure modes

A prompt that scores well on your eval set can still misbehave in ways the set didn't catch. Two are worth a deliberate check before you call it done.

Order sensitivity. Models can weight later examples more heavily than earlier ones. If all your "positive" examples sit at the bottom of the prompt, the model may drift toward predicting positive. Shuffle the order of your examples and re-run the eval; if the score moves, you have an order-bias problem and should balance the arrangement.
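The shuffle check can be sketched as a small helper that re-runs the eval under several example orderings and reports how far the score moves. Here `evaluate` is a hypothetical callable that takes the ordered examples and returns the eval score; the trial count and seed are arbitrary choices.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]


def order_spread(
    examples: Sequence[Example],
    evaluate: Callable[[List[Example]], float],
    trials: int = 5,
    seed: int = 0,
) -> float:
    """Score several random orderings and return max score minus min score."""
    rng = random.Random(seed)  # seeded so the check is repeatable
    scores = []
    for _ in range(trials):
        shuffled = list(examples)  # copy, so the caller's order is untouched
        rng.shuffle(shuffled)
        scores.append(evaluate(shuffled))
    return max(scores) - min(scores)


EXAMPLES = [
    ("Best purchase this year.", "positive"),
    ("Does what the box says.", "neutral"),
    ("Broke after two days.", "negative"),
]

# Stand-in evaluator that ignores order, so the spread comes out as zero.
spread = order_spread(EXAMPLES, evaluate=lambda ordered: 0.9)
```

A spread larger than one eval case is the signal to rebalance the arrangement.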

Label imbalance. If four of your five examples share one category, you've quietly taught the model a default. It'll lean toward that category on ambiguous inputs. Count your example labels and even them out unless your real-world distribution genuinely is that lopsided.
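Counting the example labels takes a few lines. The sketch below uses hypothetical example data that reproduces the four-of-five case described above.

```python
from collections import Counter
from typing import Dict, List, Tuple


def label_shares(examples: List[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of few shot examples carrying each label."""
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Illustrative examples: four of five share one category.
examples = [
    ("Broke after two days.", "negative"),
    ("Never arrived.", "negative"),
    ("Battery died in an hour.", "negative"),
    ("Too loud to use.", "negative"),
    ("Best purchase this year.", "positive"),
]
shares = label_shares(examples)  # {'negative': 0.8, 'positive': 0.2}
```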

Re-checking after task changes

Tasks drift. A label gets renamed, a new edge case appears, the format changes slightly. When that happens, don't eyeball it — re-run your saved eval set against the updated prompt and compare to the recorded baseline. This is the entire payoff of building the eval set in step 2: change becomes measurable instead of scary. The common mistakes article catalogs the failures that slip past teams who skip this re-check.
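The re-check itself reduces to comparing two counts on the same eval set. A minimal sketch, with hypothetical names and numbers:

```python
from typing import Tuple


def compare_to_baseline(new_correct: int, baseline_correct: int, total: int) -> Tuple[str, float]:
    """Return a verdict and the change in percentage points on the same eval set."""
    delta = new_correct - baseline_correct
    points = 100 * delta / total
    if delta > 0:
        return "improved", points
    if delta < 0:
        return "regressed", points
    return "unchanged", points


# Hypothetical: the recorded baseline was 18 of 20, the updated prompt gets 16 of 20.
verdict, points = compare_to_baseline(new_correct=16, baseline_correct=18, total=20)
```

Comparing raw counts rather than percentages avoids reading meaning into differences smaller than one case.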

A quick worked pass

Say you're labeling support tickets. Spec: one of three labels. Zero shot scores 78%, and the failures are all formatting: the model writes sentences instead of a label. That's a format error, so you add three examples showing bare-label output. Score jumps to 94% at three examples, 95% at five. You ship three examples, document the 16-point gain, and move on. That's the whole loop in real terms.

Frequently Asked Questions

How big does my evaluation set really need to be?

Fifteen to 25 hand-labeled cases is enough to make a confident first decision. The point is to catch the difference between techniques, not to be statistically rigorous. You can expand it later if a task becomes high-stakes.

What if zero shot and few shot score the same?

Ship zero shot. It's cheaper, faster, and easier to maintain. Add examples only when they buy a clear, measurable improvement — a tie means the examples are dead weight.

Should I fix the instruction or add examples first?

Fix the instruction first when failures are about content or understanding. Only add examples when the model understands the task but gets the output format or edge cases wrong. Diagnosing this in step 4 saves you from solving the wrong problem.

How do I know when to stop adding examples?

When the score stops climbing meaningfully. Test two and four examples and watch for the plateau. Pick the smallest count that captures most of the gain, since each example costs tokens on every call.

Can I automate this whole workflow?

Yes. Once you have an eval set with expected outputs, you can script the runs and scoring so every prompt change is graded automatically. Start by hand to build intuition, then automate the repetitive parts.

Key Takeaways

Write a precise spec first — most zero shot failures are really vague-instruction failures.
Build a small hand-labeled eval set; it's the most valuable artifact in the process.
Run zero shot, score it, and only add examples if failures are about format or edge cases.
Test two then four examples to find the plateau; ship the smallest count that captures the gain.
Save the prompt, eval set, and scores together so future changes can be measured, not guessed.

Agency Script Editorial
