Cargo-Cult Advice Has Buried What Actually Works in Prompts

Few-shot prompting has a reputation problem — not because it doesn't work, but because the conversation around it has accumulated a thick layer of half-truths, cargo-cult advice, and wishful thinking. Practitioners copy examples from Twitter threads, notice mixed results, and either over-trust the technique or abandon it entirely. Neither response is right.

The core idea is simple: you give a language model a handful of input-output examples before your actual request, and the model learns your intent from the pattern rather than from explicit instructions alone. That simplicity is part of the problem. Because the concept is easy to state, people assume it's easy to reason about — and they stop questioning the folk wisdom that surrounds it.
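To make the pattern concrete, here is a minimal sketch of assembling such a prompt. The function name `build_few_shot_prompt`, the `Input:`/`Output:` labels, and the sentiment examples are illustrative assumptions, not a required convention.

```python
from typing import List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Render (input, output) pairs followed by the unanswered query."""
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    # The query reuses the same labels and leaves the output open for the model.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

examples = [
    ("The checkout page crashed twice today.", "negative"),
    ("Support replied within five minutes.", "positive"),
]
prompt = build_few_shot_prompt(examples, "The new dashboard is confusing.")
print(prompt)
```

The resulting string is what gets sent to the model; the examples do their work purely by appearing ahead of the open query.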

This article goes through the most consequential myths about few-shot prompting, explains what the evidence and practical experience actually show, and gives you a more accurate mental model for using the technique well. If you're building prompts for client deliverables, internal tools, or AI-assisted workflows, getting this right will save you hours of debugging and a lot of embarrassment.

Myth 1: More Examples Always Improve Performance

This is the single most pervasive misconception. The intuition feels airtight (more data means better generalization), but few-shot prompting is not fine-tuning. You're not updating the model's weights. You're conditioning its next output on a limited context, and every example you add competes with everything else in that context.

What actually happens: performance gains from adding examples typically plateau, often somewhere between three and eight shots depending on task complexity and model size. Beyond that plateau, additional examples can hurt. They dilute the signal, push critical instruction text further from the generation point, and in models with limited context windows, they crowd out the actual content you need processed.

The Quality-Quantity Trade-off

A single high-quality, representative example often outperforms five mediocre ones. "High-quality" means:

The example matches the exact format, tone, and scope of outputs you want
The input is realistic, not idealized
Edge cases are represented, not just clean-room scenarios

If you're seeing inconsistent outputs, the instinct to "add more examples" is usually wrong. The right move is to audit the examples you already have for clarity and coverage.
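One way to check where the plateau sits for a given task is to score the same evaluation set at several shot counts. The sketch below assumes a hypothetical `call_model` callable (prompt in, completion out) and uses a stand-in model so it runs offline; the exact-match scoring is also an assumption that suits classification-style tasks only.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected output)

def render(shots: Sequence[Example], query: str) -> str:
    """Format the shots and the open query with identical labels."""
    blocks = [f"Input: {i}\nOutput: {o}" for i, o in shots]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

def accuracy_by_shot_count(
    call_model: Callable[[str], str],  # hypothetical: prompt in, completion out
    examples: Sequence[Example],
    eval_set: Sequence[Example],
    shot_counts: Sequence[int],
) -> Dict[int, float]:
    """Score the same evaluation set using the first k examples for each k."""
    scores = {}
    for k in shot_counts:
        shots = examples[:k]
        hits = sum(
            call_model(render(shots, query)).strip() == expected
            for query, expected in eval_set
        )
        scores[k] = hits / len(eval_set)
    return scores

# Stand-in model so the sketch runs offline; swap in a real client call.
def fake_model(prompt: str) -> str:
    return "positive"

examples = [("Great service.", "positive"), ("Never again.", "negative"), ("Loved it.", "positive")]
eval_set = [("Fast delivery.", "positive"), ("Arrived broken.", "negative")]
scores = accuracy_by_shot_count(fake_model, examples, eval_set, [1, 2, 3])
print(scores)
```

With a real model behind `call_model`, a flat or falling score as the shot count grows is a signal to audit the examples rather than add more.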

Myth 2: Example Order Doesn't Matter

It does, sometimes dramatically. Published research and practitioner testing both indicate that models tend to be recency-biased: the final example before the query often exerts disproportionate influence on the output. If your last example is an outlier in tone or structure, the model is likely to lean toward it.

This isn't a bug you can ignore. It means your example ordering is an active design decision, not an afterthought.

What Good Ordering Looks Like

Put your most representative, "default" example last
If you're covering multiple subtypes of a task, don't cluster them — interleave so no single type dominates the recency window
When testing, rotate example order and check whether outputs shift significantly; if they do, your examples need to be more consistent with each other
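The interleaving and rotation checks above can be sketched as follows. Everything here is illustrative: `call_model` is a hypothetical callable, and `last_shot_echo` is a deliberately exaggerated stand-in that always copies the final example's output, which makes the recency effect visible without a real model.

```python
from itertools import zip_longest
from typing import Callable, Dict, Iterator, List, Set, Tuple

Example = Tuple[str, str]

def interleave_by_subtype(groups: Dict[str, List[Example]]) -> List[Example]:
    """Round-robin across subtypes so no single type fills the final slots."""
    mixed: List[Example] = []
    for row in zip_longest(*groups.values()):
        mixed.extend(ex for ex in row if ex is not None)
    return mixed

def rotations(examples: List[Example]) -> Iterator[List[Example]]:
    """Yield every cyclic rotation of the example list."""
    for i in range(len(examples)):
        yield examples[i:] + examples[:i]

def render(shots: List[Example], query: str) -> str:
    blocks = [f"Input: {i}\nOutput: {o}" for i, o in shots]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

def distinct_outputs(call_model: Callable[[str], str],
                     examples: List[Example], query: str) -> Set[str]:
    """Collect the different answers produced across all rotations."""
    return {call_model(render(order, query)).strip() for order in rotations(examples)}

# Stand-in that mimics extreme recency bias by echoing the last shot's output.
def last_shot_echo(prompt: str) -> str:
    return prompt.split("Output: ")[-1].split("\n")[0]

groups = {
    "refund": [("Charged twice.", "billing"), ("Refund not received.", "billing")],
    "shipping": [("Package is late.", "logistics")],
}
ordered = interleave_by_subtype(groups)
outputs = distinct_outputs(last_shot_echo, ordered, "Where is my parcel?")
print(ordered)
print(outputs)
```

If `distinct_outputs` returns more than one answer for the same query on a real model, the ordering is steering the result and the examples likely need to be made more consistent with each other.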

This is one of the practical details that The Few-shot Prompting Playbook covers in depth, particularly for structured output tasks where format consistency is non-negotiable.

Myth 3: Few-shot Prompting Replaces Clear Instructions

Some practitioners treat examples as a substitute for writing explicit instructions. The reasoning: "Instead of explaining what I want, I'll just show it." This works in narrow cases and fails badly in others.

Examples teach format and surface patterns. They do not reliably teach:

Reasoning rules: why two similar-looking inputs should produce different outputs
Constraints: things the model should never do, which are hard to demonstrate by omission
Prioritization: which goal wins when two goals conflict

A model shown five examples of concise, friendly customer emails will produce concise, friendly emails — until it hits a situation not well-represented in your examples. At that point, it will generalize from the pattern, and that generalization may not align with your actual policy.

The Hybrid Approach

The most robust prompts combine explicit instruction with examples. Instructions carry the rules; examples carry the tone, format, and register. Neither alone is as reliable as both together. If you've been relying entirely on examples and getting unpredictable edge-case outputs, adding a tight instruction block above your examples will often solve it immediately.
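A minimal sketch of that layout, with the rules placed above the examples. The `Rules:`/`Examples:` headings, the `Customer:`/`Reply:` labels, and the sample policy lines are assumptions made for illustration.

```python
from typing import List, Tuple

def build_hybrid_prompt(rules: List[str], examples: List[Tuple[str, str]], query: str) -> str:
    """Put explicit rules first, then examples that carry tone and format."""
    rule_block = "\n".join(f"- {rule}" for rule in rules)
    shot_block = "\n\n".join(f"Customer: {msg}\nReply: {reply}" for msg, reply in examples)
    return (
        "Rules:\n" + rule_block
        + "\n\nExamples:\n\n" + shot_block
        + f"\n\nCustomer: {query}\nReply:"
    )

rules = [
    "Never promise a refund; offer to open a review instead.",
    "If speed and accuracy conflict, accuracy wins.",
]
examples = [
    ("My invoice looks wrong.", "Thanks for flagging this. I've opened a review of your invoice."),
]
prompt = build_hybrid_prompt(rules, examples, "I want my money back today.")
print(prompt)
```

The rules state the constraint and the priority outright, which the single example could not demonstrate on its own.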

Myth 4: Zero-shot Is Simpler, So Few-shot Is Only Worth It for Hard Tasks

The implied logic here is that you should escalate to few-shot only when zero-shot fails. This frames the techniques as rungs on a difficulty ladder. They're not — they're different tools for different problems.

Zero-shot is genuinely better when:

The task is well within the model's training distribution (e.g., basic summarization, common translation pairs)
You need the model to exercise open-ended judgment without anchoring it to specific patterns
You're operating in a context where prompt length is tightly constrained

Few-shot is better when:

Format adherence is critical and non-obvious
You're working in a specialized domain with idiosyncratic conventions
Output variability is a real problem and you need to narrow the distribution of responses

The mistake is treating zero-shot as the default and few-shot as the exception. For production workflows where consistency matters — agency deliverables, client-facing automation, data extraction pipelines — few-shot is often the baseline you should start with, not fall back on.

Myth 5: Your Examples Don't Need to Be from Real Data

Crafting synthetic examples from imagination is tempting. It's fast, and you have full control. But synthetic examples carry a specific failure mode: they tend to be too clean, too cooperative, and too representative of the task as you imagine it rather than as it actually arrives.

Real inputs are messy. They contain ambiguous phrasing, missing context, formatting inconsistencies, and edge cases you didn't anticipate. If your examples are all polished hypotheticals, the model learns to handle polished hypotheticals — and stumbles when it encounters what your users actually send.

How to Source Better Examples

Pull from actual past outputs you've verified as correct, even if you have to anonymize them
If you have no real data yet, stress-test synthetic examples by deliberately introducing the kinds of messiness you expect in production
Treat your example set as a living asset, not a one-time artifact — update it as you discover failure modes
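When no real data exists yet, one rough way to stress-test a synthetic input is to generate deliberately degraded variants of it. The four perturbations below are assumptions about what "messy" might mean; the right set depends on what your production inputs actually look like.

```python
from typing import Dict

def messy_variants(text: str) -> Dict[str, str]:
    """Return degraded copies of a clean input, one per perturbation."""
    cut = max(1, (len(text) * 2) // 3)
    return {
        "lowercase": text.lower(),
        "no_final_punctuation": text.rstrip(".!?"),
        "irregular_spacing": "  " + text.replace(". ", ".   "),
        "truncated": text[:cut].rstrip(),  # message cut off mid-thought
    }

clean = "Hi, my order #1234 never arrived. Can you help?"
variants = messy_variants(clean)
for name, variant in variants.items():
    print(f"{name}: {variant!r}")
```

Running each variant through the prompt shows which kinds of messiness the current example set fails to cover.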

This connects directly to Building a Repeatable Workflow for Few-shot Prompting, which lays out a systematic approach for collecting, curating, and versioning examples over time.

Myth 6: Few-shot Prompting Works the Same Across All Models

Model architecture, size, and training regimen all affect how a model responds to few-shot examples. What works well on GPT-4o may produce flat or inconsistent results on a smaller open-source model — not because the technique is wrong, but because the model's capacity for in-context learning differs.

Smaller models tend to be more sensitive to example formatting and order. They may also be more likely to parrot example phrasing verbatim rather than generalizing from it. Frontier models handle messier, more abstract examples better, but they also have stronger prior opinions about format and style that can resist your examples if those examples conflict with common patterns in training data.

Practical Implications

Always validate a prompting strategy on the specific model you're deploying, not only on the one you used during development
If you're migrating a prompt from one model to another, treat re-tuning the examples as a required step rather than assuming they will transfer
Pay attention to how the model handles the delimiter and separator formatting in your examples — this is surprisingly model-specific
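To compare delimiter conventions on a target model, it helps to render the same examples in several styles and evaluate each. The three styles below are illustrative assumptions, not a list of formats any particular model prefers.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Style:
    example: str    # template for a completed shot
    query: str      # template for the open query
    separator: str  # text placed between blocks

STYLES: Dict[str, Style] = {
    "labels": Style("Input: {inp}\nOutput: {out}", "Input: {inp}\nOutput:", "\n\n"),
    "hash_rule": Style("{inp}\n=> {out}", "{inp}\n=>", "\n###\n"),
    "xml_tags": Style(
        "<input>{inp}</input>\n<output>{out}</output>",
        "<input>{inp}</input>\n<output>",
        "\n",
    ),
}

def render(style: Style, examples: List[Tuple[str, str]], query: str) -> str:
    """Render identical content under one delimiter convention."""
    blocks = [style.example.format(inp=i, out=o) for i, o in examples]
    blocks.append(style.query.format(inp=query))
    return style.separator.join(blocks)

examples = [("2+2", "4")]
prompts = {name: render(style, examples, "3+3") for name, style in STYLES.items()}
for name, text in prompts.items():
    print(f"--- {name} ---\n{text}")
```

Because only the delimiters change between variants, any difference in output quality can be attributed to formatting rather than to the examples themselves.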

Myth 7: Chain-of-Thought and Few-shot Are the Same Thing

These techniques overlap but are not synonymous. Few-shot prompting provides input-output examples to establish a pattern. Chain-of-thought (CoT) prompting uses examples where the reasoning steps are made explicit — the intermediate thinking, not just the final answer.

You can use chain-of-thought as a few-shot technique (showing worked examples with reasoning), but you can also use it zero-shot (simply prompting the model to "think step by step" without any examples). Conflating the two leads to muddled prompt design.
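The three combinations can be laid side by side as plain prompt strings. The arithmetic questions and the `Q:`/`A:` labels are made up for illustration.

```python
QUESTION = "A crate holds 12 bottles. How many bottles are in 7 crates?"

# Few-shot without reasoning: the example shows only the final answer.
few_shot_plain = (
    "Q: A box holds 6 eggs. How many eggs are in 4 boxes?\n"
    "A: 24\n\n"
    f"Q: {QUESTION}\nA:"
)

# Few-shot chain-of-thought: the example spells out the intermediate steps.
few_shot_cot = (
    "Q: A box holds 6 eggs. How many eggs are in 4 boxes?\n"
    "A: Each box holds 6 eggs and there are 4 boxes, so 6 * 4 = 24. The answer is 24.\n\n"
    f"Q: {QUESTION}\nA:"
)

# Zero-shot chain-of-thought: no examples, only a cue to reason first.
zero_shot_cot = f"Q: {QUESTION}\nA: Let's think step by step."

for name, text in [("few_shot_plain", few_shot_plain),
                   ("few_shot_cot", few_shot_cot),
                   ("zero_shot_cot", zero_shot_cot)]:
    print(f"--- {name} ---\n{text}\n")
```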

The distinction matters because they solve different problems. Few-shot examples without reasoning traces are excellent for format and tone consistency. CoT examples are necessary when the task requires multi-step inference that the model won't get right without a scaffolded reasoning path. The Complete Guide to Chain-of-thought Prompting walks through when and how to deploy CoT effectively — it's worth reading in conjunction with this article if your tasks involve reasoning-heavy outputs.

Myth 8: If Few-shot Isn't Working, You Need a Better Model

When few-shot prompting produces poor results, the common escalation is to assume model inadequacy and reach for a more expensive option. This is often the wrong diagnosis.

Before blaming the model, run through this checklist:

Are your examples actually representative? Garbage examples produce garbage generalization regardless of model capability.
Is your input format consistent with your examples? Even subtle formatting mismatches — different delimiters, different label styles — can confuse the model's pattern-matching.
Are you presenting too many distinct patterns? If your examples cover multiple subtasks, the model may not know which pattern to apply.
Have you tested with fewer examples? As noted above, more is not always better.
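The format-consistency item on this checklist is easy to automate in rough form. The sketch below assumes blocks are separated by a blank line and that each field starts a line with a single-word label followed by a colon; both are assumptions about your prompt layout, and content lines that happen to match that shape would produce false positives.

```python
import re
from typing import List, Tuple

def label_pattern(block: str) -> Tuple[str, ...]:
    """Return the line-leading labels in a block, e.g. ('Input', 'Output')."""
    return tuple(re.findall(r"^(\w+):", block, flags=re.MULTILINE))

def find_format_mismatches(prompt: str, separator: str = "\n\n") -> List[int]:
    """List the indexes of blocks whose labels differ from the first block's."""
    blocks = prompt.split(separator)
    reference = label_pattern(blocks[0])
    return [i for i, block in enumerate(blocks) if label_pattern(block) != reference]

consistent = "Input: a\nOutput: x\n\nInput: b\nOutput: y\n\nInput: c\nOutput:"
drifted = "Input: a\nOutput: x\n\nInput: b\nOutput: y\n\nQ: c\nA:"

print(find_format_mismatches(consistent))
print(find_format_mismatches(drifted))
```

An empty list means every block, including the open query, uses the same labels as the first example.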

In many cases where practitioners switch to a larger model and see improvement, the improvement comes from the larger model's greater tolerance for imperfect prompts, not from the task actually requiring that model. Fixing the prompt would often have solved the problem on the original model, cheaper and faster.

Frequently Asked Questions

How many examples should I use in a few-shot prompt?

There is no universal answer, but three to six examples cover the majority of tasks well. Start with three, evaluate output quality and consistency, and add examples only if you're seeing genuine gaps in coverage, not as a reflex. Going beyond eight examples rarely improves performance and often degrades it by introducing noise.

Can few-shot prompting teach a model a new factual domain it wasn't trained on?

No. Few-shot prompting is an in-context learning technique, not a knowledge injection method. It shapes how the model responds — format, tone, reasoning style — but it cannot reliably teach facts the model doesn't already know. If domain knowledge is the gap, retrieval-augmented generation or fine-tuning is the right tool.

Does few-shot prompting work better for some task types than others?

Yes. It performs best on tasks with clear, learnable output patterns: classification, structured extraction, reformatting, style transfer. It is less reliably helpful for open-ended creative tasks, complex multi-step reasoning (where chain-of-thought is often more useful), and tasks where variability is actually desirable. See Few-shot Prompting: The Questions Everyone Asks, Answered for a more complete breakdown by task type.

Will few-shot prompting remain relevant as models continue to improve?

Almost certainly, though the technique will evolve. Stronger instruction-following in frontier models reduces the gap between zero-shot and few-shot performance on common tasks — but for idiosyncratic, format-critical, or domain-specific work, examples will continue to provide value that instructions alone don't. The Future of Few-shot Prompting explores how the role of examples is likely to shift as model capabilities advance.

Is it safe to include real customer data in few-shot examples?

Not without careful handling. Examples are part of your prompt and may be logged, cached, or otherwise processed by your model provider. Use anonymized or synthetic stand-ins for any personally identifiable or confidential information, even if the underlying example structure came from real data.
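As a rough first pass, obvious identifiers can be replaced with placeholders before an example enters a prompt. The two regular expressions below are simplified assumptions that catch common email and phone formats only; names, addresses, and account numbers need separate handling, so this is not a complete anonymization step.

```python
import re

# Simplified patterns: they cover common shapes, not every valid format.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace email addresses and phone numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

raw = "Reach me at jane.doe@example.com or +1 (555) 010-9999."
cleaned = redact(raw)
print(cleaned)
```

Keeping the placeholders consistent across examples preserves the structure of the original message while removing the identifying values.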

Key Takeaways

More examples do not reliably improve few-shot performance; quality and representativeness matter more than quantity.
Example order is a design decision — recency bias means your last example exerts the most influence.
Examples teach format and pattern; they cannot replace explicit instructions for communicating rules, constraints, and priorities.
Zero-shot and few-shot are different tools, not rungs on a difficulty ladder — choose based on task type, not as an escalation path.
Synthetic examples carry a cleanness bias; real, messy examples from actual use cases produce more robust generalization.
Few-shot behavior is model-specific: always validate on the model you deploy, not only on the one you developed against.
Chain-of-thought and few-shot overlap but solve different problems; understand the distinction before combining them.
Poor few-shot results usually indicate a prompt problem, not a model inadequacy — fix the examples before reaching for a larger model.

Agency Script Editorial
