Your Examples Were Teaching the Model the Wrong Thing

Few-shot prompting is deceptively easy to get wrong. You drop in a couple of examples, the model produces something that looks roughly right, and you ship it. Then two weeks later you notice the outputs have drifted, edge cases are failing, and your team is manually fixing 30% of what the model generates. The examples were doing something—just not what you intended.

The core promise of few-shot prompting is real: giving a language model a small set of demonstrations guides it toward a specific format, tone, reasoning style, or output structure far more reliably than instructions alone. But that guidance is only as good as the examples you choose and the way you arrange them. Most failures trace back to a handful of repeating mistakes that are easy to diagnose once you know what to look for.

This article names seven of those failure modes, explains why each one happens mechanically, what it costs you in production, and the specific corrective practice to apply. Whether you're building client-facing automations or internal workflows, fixing these mistakes is the fastest leverage point in your prompting work.

Mistake 1: Using Examples That Don't Represent Your Actual Distribution

Why it happens

The most natural way to write few-shot examples is to invent clean, idealized cases. You think of a typical input, write a polished output, and move on. The problem is that real inputs are messy—abbreviated, ambiguous, off-topic, or grammatically mangled. Your examples teach the model what you wish inputs looked like, not what they actually look like.

The cost

The model performs well on inputs that resemble its examples and degrades sharply on everything else. In practice that means your failure rate clusters exactly where your users live.

The fix

Pull your examples from real logs, real customer messages, or real documents. If you're pre-launch, simulate realistic variation deliberately: include short inputs and long ones, formal and informal registers, cases with missing information. Aim for examples that span the actual range, not the ideal case. The Few-shot Prompting: Real-World Examples and Use Cases article goes deeper on how to source and categorize examples by input type.

Mistake 2: Letting Your Examples Carry Accidental Patterns

Why it happens

Language models are pattern-matching machines. If every example in your prompt shares a surface feature that isn't part of the intended behavior, the model will generalize that feature as if it were a rule. Common accidental patterns include: all outputs starting with the same word, all inputs being roughly the same length, all examples using the same punctuation style, or all examples coming from the same domain.

The cost

You get brittle outputs that follow the accidental pattern even when it's wrong—outputs that always open with "Sure," for instance, or that truncate at roughly 80 words because all your examples did.

The fix

Audit your examples for unintended regularities before finalizing the prompt. Read your outputs as a column: if you can spot a consistent feature across all of them that you didn't consciously design, that feature will leak into live responses. Vary sentence structures, output lengths, and opening phrases deliberately. If you need outputs to be consistent in some way, make that explicit in the system instruction rather than encoding it implicitly through example uniformity.

Mistake 3: Too Few Examples for the Complexity of the Task

Why it happens

One or two examples feel like enough, especially when you're prototyping fast. And for simple, well-defined tasks—like converting a date to a specific format—they often are. But for tasks that require nuanced judgment, involve multiple output fields, or need to handle diverse input types, two examples leave massive ambiguity.

The cost

The model fills that ambiguity with its own priors, which may or may not match yours. You get inconsistency: the right structure 70% of the time, then something completely off on the 30% of inputs your examples didn't cover.

The fix

A useful heuristic: the number of examples should scale with the number of distinct output behaviors you need the model to demonstrate. If your task has three output categories, include at least two examples per category. If tone can vary, show both ends of the spectrum. For complex structured tasks, six to twelve examples is often a reasonable floor. Review the Few-shot Prompting: Best Practices That Actually Work guide for a systematic way to size your example sets.

Mistake 4: Wrong Example Order

Why it happens

Most practitioners treat example order as arbitrary. It isn't. Research on in-context learning consistently shows that models are disproportionately influenced by examples that appear late in the prompt—the recency effect. If your last example is an outlier, an edge case, or stylistically unusual, it will punch above its weight in shaping outputs.

The cost

Outputs skew toward whatever appeared last, even if that example was meant to represent a rare scenario. You might spend time debugging inconsistency that's caused entirely by prompt order, not example quality.

The fix

Place your most representative, prototypical examples last. Put edge cases and unusual scenarios early. If you're using the prompt as a template where inputs will vary widely, consider randomizing example order during testing to detect order sensitivity before it bites you in production. The A Framework for Few-shot Prompting covers ordering as part of a structured build process.

Mistake 5: Conflating Examples with Instructions

Why it happens

When people are learning few-shot prompting, they often try to teach the model through examples alone—without a clear system-level instruction. They assume the examples will communicate everything: the task, the constraints, the format, the tone. Sometimes they do. Often they don't.

Examples show the model what to do; instructions tell it why and under what conditions. Without instructions, the model has to infer your intent from demonstrations alone, which works for simple tasks and fails for anything conditional or nuanced.

The cost

You get outputs that look right on the surface but violate unstated constraints. The model does what your examples showed, not what your examples meant. For instance, examples of helpful customer service responses won't teach the model to avoid specific liability language unless you say so explicitly.

The fix

Use both. Write a clear system instruction that states the task, any hard constraints, and the decision rules for ambiguous cases. Then use your examples to demonstrate execution, not to carry the full weight of intent. Think of examples as the "how" and instructions as the "what and why." The Few-shot Prompting Checklist for 2026 includes a pre-flight check for exactly this separation.

Mistake 6: Including Inconsistent or Low-Quality Output Examples

Why it happens

Examples get assembled over time, often by multiple people, or adapted from different use cases. The individual examples may each look fine in isolation, but together they set contradictory expectations: one uses bullet points, another uses prose; one is formal, another is casual; one includes a disclaimer, another doesn't.

The cost

The model averages across the inconsistency or oscillates between styles unpredictably. You cannot get consistent outputs from inconsistent demonstrations. This is one of the most common causes of the "works sometimes, fails sometimes" frustration.

The fix

Treat your example set as a product artifact, not a scratch pad. Before deploying, read every example as if you were a model trying to infer a single coherent rule from all of them. If two examples would generate conflicting inferences, revise or remove one. Establish a style guide for your examples the same way you'd establish one for human writers—format, register, length range, required and forbidden elements. Then audit against that guide on any change. See the Case Study: Few-shot Prompting in Practice for a real workflow that treats example management as a repeating process, not a one-time task.

Mistake 7: Not Testing for Label Imbalance

Why it happens

If your task involves classification, routing, or any kind of categorical output, the distribution of labels across your examples matters. If four of your five examples result in "Category A," the model learns that Category A is the default answer. This happens naturally because writers tend to reach for common or easy cases when constructing examples.

The cost

The model over-predicts the majority label even when minority-label inputs are clearly different. In customer support routing, for example, an imbalanced example set might funnel 80% of tickets to the wrong team because the prompt unconsciously taught that bias.

The fix

Count your labels before finalizing any classification prompt. Aim for rough parity across output categories unless you have a deliberate reason to weight otherwise. If your real task has genuinely imbalanced classes—where one outcome is 10x more common—consider whether that imbalance belongs in the examples or in a separate explicit instruction about prior probabilities. Implicit imbalance in examples is a hidden bug; explicit acknowledgment in instructions is a design choice.

Frequently Asked Questions

How many examples do I actually need for few-shot prompting to work?

There's no universal answer, but a practical range for most professional tasks is three to ten examples. Simple formatting tasks can often work with two; complex classification or multi-field extraction tasks usually need six or more. The right number is the minimum that produces consistent, representative outputs across your actual input distribution—test incrementally rather than assuming more is always better.

Does the quality of my examples matter more than the quantity?

Quality matters more, but quantity is still necessary for coverage. A single brilliant example is better than three contradictory ones, but even a perfect example cannot demonstrate enough variation to handle a diverse input set on its own. You need both: coherent, high-quality examples and enough of them to cover the real range of inputs.

Can few-shot prompting fix a fundamentally bad system prompt?

No. Examples amplify and demonstrate what the system prompt defines; they don't substitute for it. If your system prompt is vague or missing, examples will fill the gap with unpredictable generalizations. Fix the instruction layer first, then use examples to sharpen execution.

Should I update my examples over time as I learn more about the task?

Yes, and this is one of the most overlooked practices. Few-shot example sets should be treated as living artifacts: reviewed when you see consistent failures, updated when the task definition changes, and audited periodically for patterns that no longer match real inputs. Set a recurring review cadence, especially for high-volume or client-facing automations.

Why do my outputs seem fine during testing but drift in production?

The most common cause is that your test inputs were similar to your examples and production inputs are not. Another frequent cause is that prompt changes—even small ones like adding a new example—alter the implicit patterns and create unintended side effects. Test with a diverse set that includes inputs you haven't seen before, not just cases that look like your examples.

Key Takeaways

Use real inputs as the basis for your examples, not idealized ones—representative distribution beats polished simplicity.
Audit examples for accidental patterns: uniformity in length, structure, or phrasing that you didn't consciously design will become a rule the model follows.
Scale example count to task complexity; two examples is rarely enough for anything with multiple output types or conditional behavior.
Order matters: put prototypical examples late in the prompt, edge cases early.
Examples show execution; instructions carry intent. Use both, explicitly.
Treat your example set as a maintained artifact—inconsistent examples produce inconsistent outputs, every time.
For classification tasks, count your label distribution and balance it deliberately before the prompt goes live.

Mistake 1: Using Examples That Don't Represent Your Actual Distribution

Why it happens

The cost

The model performs well on inputs that resemble its examples and degrades sharply on everything else. In practice that means your failure rate clusters exactly where your users live.

The fix

Mistake 2: Letting Your Examples Carry Accidental Patterns

Why it happens

The cost

The fix

Mistake 3: Too Few Examples for the Complexity of the Task

Why it happens

The cost

The fix

Mistake 4: Wrong Example Order

Why it happens

The cost

The fix

Mistake 5: Conflating Examples with Instructions

Why it happens

The cost

The fix

Mistake 6: Including Inconsistent or Low-Quality Output Examples

Why it happens

The cost

The fix

Mistake 7: Not Testing for Label Imbalance

Why it happens

The cost

The fix

Frequently Asked Questions

How many examples do I actually need for few-shot prompting to work?

Does the quality of my examples matter more than the quantity?

Can few-shot prompting fix a fundamentally bad system prompt?

Should I update my examples over time as I learn more about the task?

Why do my outputs seem fine during testing but drift in production?

Key Takeaways

Use real inputs as the basis for your examples, not idealized ones—representative distribution beats polished simplicity.
Audit examples for accidental patterns: uniformity in length, structure, or phrasing that you didn't consciously design will become a rule the model follows.
Scale example count to task complexity; two examples is rarely enough for anything with multiple output types or conditional behavior.
Order matters: put prototypical examples late in the prompt, edge cases early.
Examples show execution; instructions carry intent. Use both, explicitly.
Treat your example set as a maintained artifact—inconsistent examples produce inconsistent outputs, every time.
For classification tasks, count your label distribution and balance it deliberately before the prompt goes live.

Your Examples Were Teaching the Model the Wrong Thing

Mistake 1: Using Examples That Don't Represent Your Actual Distribution

Why it happens

The cost

The fix

Mistake 2: Letting Your Examples Carry Accidental Patterns

Why it happens

The cost

The fix

Mistake 3: Too Few Examples for the Complexity of the Task

Why it happens

The cost

The fix

Mistake 4: Wrong Example Order

Why it happens

The cost

The fix

Mistake 5: Conflating Examples with Instructions

Why it happens

The cost

The fix

Mistake 6: Including Inconsistent or Low-Quality Output Examples

Why it happens

The cost

The fix

Mistake 7: Not Testing for Label Imbalance

Why it happens

The cost

The fix

Frequently Asked Questions

How many examples do I actually need for few-shot prompting to work?

Does the quality of my examples matter more than the quantity?

Can few-shot prompting fix a fundamentally bad system prompt?

Should I update my examples over time as I learn more about the task?

Why do my outputs seem fine during testing but drift in production?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Your Examples Were Teaching the Model the Wrong Thing

Mistake 1: Using Examples That Don't Represent Your Actual Distribution

Why it happens

The cost

The fix

Mistake 2: Letting Your Examples Carry Accidental Patterns

Why it happens

The cost

The fix

Mistake 3: Too Few Examples for the Complexity of the Task

Why it happens

The cost

The fix

Mistake 4: Wrong Example Order

Why it happens

The cost

The fix

Mistake 5: Conflating Examples with Instructions

Why it happens

The cost

The fix

Mistake 6: Including Inconsistent or Low-Quality Output Examples

Why it happens

The cost

The fix

Mistake 7: Not Testing for Label Imbalance

Why it happens

The cost

The fix

Frequently Asked Questions

How many examples do I actually need for few-shot prompting to work?

Does the quality of my examples matter more than the quantity?

Can few-shot prompting fix a fundamentally bad system prompt?

Should I update my examples over time as I learn more about the task?

Why do my outputs seem fine during testing but drift in production?

Key Takeaways

Agency Script Editorial

Related Articles