Why Most People Get Mediocre Results From Their Examples

Few-shot prompting is one of the highest-leverage techniques available to anyone working with large language models — and it's consistently underused, mostly because people try it once, get mediocre results, and move on. The reason results are mediocre usually isn't the technique. It's that the examples chosen were vague, inconsistent, or accidentally taught the model the wrong pattern.

The core idea is simple: instead of just telling a model what to do, you show it. You include two to six completed input-output pairs in your prompt before the actual task. The model reads the pattern implicit in those pairs and continues it. What makes few-shot prompting powerful isn't the instruction — it's the demonstration. And what makes demonstrations powerful or useless depends almost entirely on the specific choices you make when constructing them.
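
As a minimal sketch of that structure, the snippet below assembles a prompt from completed pairs followed by the open task. The `Example` class, the `build_few_shot_prompt` helper, the `Input:`/`Output:` labels, and the sample tickets are all illustrative assumptions, not a required format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One completed input-output pair used as a demonstration."""
    source: str  # the input the model would receive
    target: str  # the output you want it to produce


def build_few_shot_prompt(instruction: str, examples: list[Example], new_input: str) -> str:
    """Place the demonstrations before the real task, all in one format."""
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts += [f"Input: {ex.source}", f"Output: {ex.target}", ""]
    # The final block repeats the pattern but leaves the output open,
    # so the model continues it.
    parts += [f"Input: {new_input}", "Output:"]
    return "\n".join(parts)


# Hypothetical demonstrations, invented for illustration.
examples = [
    Example("The checkout page times out on mobile.", "bug"),
    Example("Can you add dark mode?", "feature request"),
    Example("I was charged twice this month.", "billing"),
]
prompt = build_few_shot_prompt(
    "Classify each support ticket.", examples, "I can't reset my password."
)
print(prompt)
```

Every pair uses the same labels and spacing, so the only thing that varies from one demonstration to the next is the content.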

This article walks through concrete scenarios across different professional contexts: what worked, what failed, and exactly why. If you've been getting inconsistent output and aren't sure where your prompts are breaking down, the specifics here should make the diagnosis obvious.

Why Examples Outperform Instructions Alone

A well-written instruction tells the model what category of response you want. A well-chosen example shows the model the texture of the response — the tone, the level of detail, the format, the vocabulary register, what gets included and what gets left out.

Instructions tend to be ambiguous in ways neither you nor the model notices until the output arrives wrong. "Write a concise summary" is genuinely ambiguous: concise means three sentences to one person and three paragraphs to another. An example resolves this immediately. If your example summary is three sentences with a specific structure — problem, evidence, implication — the model understands not just the length but the logical architecture you expect.

This is why few-shot prompting works even when your instruction is imprecise. The examples carry the specification weight. For a deeper look at how to structure this intentionally, see A Framework for Few-shot Prompting.

Scenario 1: Client Intake Emails (Agency Operations)

What the team was trying to do

A boutique marketing agency wanted to automate the first-response email when a new client submitted a contact form. They had roughly eight different service categories and wanted each response to feel personalized, reference the specific service the client mentioned, and end with a soft call to action.

The first attempt (failed)

Their initial prompt was instruction-only: "Write a professional email responding to a potential client who submitted a contact form asking about [service]. Be warm, professional, and include a call to action."

Output was generic. It could have come from any agency in any industry. When they tried adding "be specific to our brand," nothing changed — because the model had no examples of what their brand voice actually sounded like.

What worked

They pulled three actual emails a senior account manager had written — one for a branding inquiry, one for paid media, one for web design — and cleaned them into prompt-ready examples. Each example followed the same structure: acknowledge the specific ask, demonstrate expertise with one concrete observation, propose a call.

The new prompt included those three examples before asking the model to generate a fourth for a different service category. The outputs matched brand voice closely enough that the account manager's edits dropped from 20 minutes per email to under five minutes. The key variables that made it work:

Consistent structure across all examples — the model learned a three-beat pattern
Each example addressed a different service — signaling that the pattern generalized across context
Actual language from real emails — the register, vocabulary, and tone were authentic rather than described

Scenario 2: Contract Clause Summarization (Legal/Ops)

The use case

An operations team at a mid-size SaaS company needed to extract liability caps, termination clauses, and payment terms from vendor contracts and output them in a structured format for a master spreadsheet.

Why instruction-only failed

When prompted with "Extract the liability cap, termination clause, and payment terms from this contract and format them as a table," the model was inconsistent. Some outputs included the raw clause text; others summarized. Some summaries were one sentence; some were three. Column headers varied. The downstream spreadsheet was a mess.

Few-shot fix

They constructed two examples: one from a simple SaaS subscription agreement, one from a more complex enterprise contract with nested indemnification clauses. Each example showed the exact input text (the clause as it appeared in the contract) alongside the exact desired output (three-column row: field name, clause summary in one sentence max, page/section reference).

With those two examples anchoring the prompt, the share of first-pass output that was usable rose from roughly 40% to over 90%. Two things drove that outcome:

Output format shown, not described — showing a formatted table row is clearer than "format as a table"
One hard example included — the enterprise contract example taught the model how to handle ambiguity and nesting, which generalized to other complex contracts

This is a pattern worth noting: when your real-world inputs vary in complexity, include at least one example from the harder end of the range.
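
One way to keep the demonstrated format honest is to render every example row with the same function and check model output against it. The sketch below assumes a pipe-separated row and approximates "one sentence max" with a simple punctuation test; the field names, the `format_row` and `is_valid_row` helpers, and the sample clause are hypothetical.

```python
# Assumed field names for the first column; adjust to your spreadsheet.
FIELDS = ("Liability cap", "Termination", "Payment terms")


def format_row(field: str, summary: str, reference: str) -> str:
    """Render one output row exactly the way every example shows it."""
    return f"{field} | {summary} | {reference}"


def is_valid_row(row: str) -> bool:
    """Check a model-produced row against the demonstrated format."""
    cells = [c.strip() for c in row.split("|")]
    if len(cells) != 3 or cells[0] not in FIELDS:
        return False
    summary = cells[1]
    # "One sentence max" is approximated as: ends with a period and
    # contains no earlier sentence break (a period followed by a space).
    return summary.endswith(".") and summary.count(". ") == 0 and bool(cells[2])


# Invented clause summary, for illustration only.
row = format_row(
    "Liability cap", "Capped at fees paid in the prior 12 months.", "Section 9.2"
)
print(row, is_valid_row(row))
```

Rows that fail the check can be routed to a human instead of landing in the master spreadsheet.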

Scenario 3: Social Media Caption Generation (Content Teams)

The task

A DTC food brand wanted 30 captions per week across Instagram and LinkedIn. Different tone per platform — casual and playful on Instagram, more considered and brand-forward on LinkedIn.

The failure mode: tone bleed

Their first approach used a single prompt with two examples, one per platform, and asked for both at once. Output suffered from tone bleed — LinkedIn posts were too casual, Instagram posts were too corporate. The model was averaging across the examples rather than distinguishing the platform variable.

The fix: separate prompts, platform-specific examples

They split into two prompts. The Instagram prompt used three Instagram examples only. The LinkedIn prompt used three LinkedIn examples only. The examples were for the same products — same subject matter, different execution. This taught the model that the variable was platform, not content.

After the split, tone was consistent. The more subtle lesson: when you want the model to treat a variable (platform, audience, register) as a differentiator, keep your examples clean on that variable. Mixing examples across categories forces the model to average rather than distinguish.
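
The split can be sketched as a filter applied before the prompt is built, so no example from the other platform can leak in. The pool structure, the `build_caption_prompt` helper, and the captions below are invented for illustration, and the sketch uses two examples per platform only to stay short.

```python
# Hypothetical example pool; products and captions are invented placeholders.
EXAMPLE_POOL = [
    {"platform": "instagram", "product": "oat bars", "caption": "Desk snack, sorted."},
    {"platform": "instagram", "product": "cold brew", "caption": "Monday called. We answered."},
    {"platform": "linkedin", "product": "oat bars",
     "caption": "Why we moved our oat bars to a single-origin supplier."},
    {"platform": "linkedin", "product": "cold brew",
     "caption": "What a year of cold brew sales taught us about repeat buyers."},
]


def build_caption_prompt(platform: str, product: str, pool: list[dict]) -> str:
    """Build a prompt whose examples all come from one platform."""
    shots = [ex for ex in pool if ex["platform"] == platform]
    if not shots:
        raise ValueError(f"no examples for platform {platform!r}")
    lines = [f"Write one {platform} caption per product.", ""]
    for ex in shots:
        lines += [f"Product: {ex['product']}", f"Caption: {ex['caption']}", ""]
    lines += [f"Product: {product}", "Caption:"]
    return "\n".join(lines)


instagram_prompt = build_caption_prompt("instagram", "granola", EXAMPLE_POOL)
linkedin_prompt = build_caption_prompt("linkedin", "granola", EXAMPLE_POOL)
print(instagram_prompt)
```

Both prompts cover the same products, so the only thing that differs between them is the platform.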

For teams building this into a repeatable workflow, The Few-shot Prompting Checklist for 2026 covers exactly this kind of example-hygiene review before you scale a prompt.

Scenario 4: Financial Narrative Writing (Reporting)

The context

A CFO's office needed to turn raw variance data from a monthly P&L into a one-paragraph narrative for the board deck. Finance staff could do it, but it took 45 minutes per section and required a senior reviewer.

Why this use case rewards few-shot specifically

Financial narrative is a genre with its own conventions: quantify the variance, explain the driver, forward-project if material, hedge appropriately. This isn't something you can fully capture in instructions without writing a mini style guide. But you can capture it in three examples.

They used narratives from three prior board decks — one favorable variance, one unfavorable, one mixed — and formatted each as: [Raw variance data] → [Board narrative paragraph]. With three examples, the model understood the register, the quantification convention (actual dollars, then percent, then driver), and the hedging language that was appropriate versus alarmist.

The prompt reduced first-draft time from 45 minutes to roughly 8 minutes, with senior review dropping from substantive editing to light approval. The examples did the work that a style guide would have done — but more efficiently.
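
If the raw variance data is rendered the same way in every example, the model sees the quantification convention on the input side too. The sketch below is one possible rendering, assuming simple actual-versus-budget inputs; the function names, the layout labels, and the figures are all invented for illustration.

```python
def variance_line(item: str, actual: float, budget: float, driver: str) -> str:
    """Render one line of raw variance data: dollars, then percent, then driver."""
    if budget == 0:
        raise ValueError("percent variance is undefined for a zero budget")
    delta = actual - budget
    pct = delta / budget * 100
    direction = "above" if delta >= 0 else "below"
    return f"{item}: ${abs(delta):,.0f} {direction} budget ({abs(pct):.1f}%), driver: {driver}"


def format_example(raw_lines: list[str], narrative: str) -> str:
    """Pair raw variance data with its approved narrative in one fixed layout."""
    return "Raw variance data:\n" + "\n".join(raw_lines) + "\nBoard narrative:\n" + narrative


# Invented figures, not taken from any real P&L.
line = variance_line("Cloud hosting", 412_000, 380_000, "unplanned load-test environment")
print(line)
# Cloud hosting: $32,000 above budget (8.4%), driver: unplanned load-test environment
```

The narrative side of each example should still come from an approved board deck; only the input formatting is automated here.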

For a detailed walkthrough of a similar implementation, Case Study: Few-shot Prompting in Practice covers the setup, iteration, and measurement process in depth.

Scenario 5: Customer Support Triage (SaaS Helpdesk)

The task

A support team wanted to classify inbound tickets into five categories (billing, bug, feature request, account access, general inquiry) and assign a priority level (high/medium/low) with a one-line rationale.

What made this tricky

Some tickets were ambiguous — a customer complaining that a feature "stopped working" could be a bug or an account access issue. The team needed the model to make a defensible call rather than punt.

Few-shot design

They selected six examples: one clean case for each category, plus one genuinely ambiguous ticket that the model should classify as bug/high because a keyword ("production environment") was present. The ambiguous example was critical: it taught the model the decision rule for that edge case without anyone having to write out an explicit rule.

Output accuracy on a 50-ticket test set went from 67% (instruction-only) to 88% (six examples). The remaining errors were mostly genuine ambiguity cases that even human reviewers disagreed on.

The principle: include at least one example from the distribution's hard edge. Clean, easy examples are necessary but insufficient. The model needs to see how to handle the difficult cases.
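
Measuring accuracy on a labeled test set, as the team did, can be sketched as below. The `category/priority: rationale` output format, the `parse_label` and `accuracy` helpers, and the sample tickets are assumptions made for illustration; the source does not specify how the team scored its outputs.

```python
CATEGORIES = {"billing", "bug", "feature request", "account access", "general inquiry"}
PRIORITIES = {"high", "medium", "low"}


def parse_label(line: str) -> tuple[str, str]:
    """Parse 'category/priority: rationale' into (category, priority)."""
    label, _, _rationale = line.partition(":")
    category, _, priority = label.partition("/")
    category, priority = category.strip().lower(), priority.strip().lower()
    if category not in CATEGORIES or priority not in PRIORITIES:
        raise ValueError(f"unrecognized label: {line!r}")
    return category, priority


def accuracy(predicted: list[str], gold: list[tuple[str, str]]) -> float:
    """Fraction of tickets where both category and priority match the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and the same length")
    hits = 0
    for line, expected in zip(predicted, gold):
        try:
            hits += parse_label(line) == expected
        except ValueError:
            pass  # an unparseable answer counts as a miss
    return hits / len(gold)


# Invented model outputs and gold labels.
predicted = [
    "bug/high: mentions production environment",
    "billing/low: asks for an invoice copy",
    "Bug / medium: export button fails",
    "not sure",
]
gold = [("bug", "high"), ("billing", "low"), ("account access", "medium"), ("general inquiry", "low")]
print(accuracy(predicted, gold))
```

Counting unparseable answers as misses keeps the score from hiding format drift behind a smaller denominator.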

Common Failure Modes, Systematized

Across these scenarios, the same errors recur:

Example inconsistency: examples that don't share a common format teach the model that format is flexible when it isn't
Too few examples for high-variance tasks: two examples for five output categories leave at least three categories with no demonstration at all
Mixing variables: examples that differ on more than one dimension force the model to guess which variable matters
Examples from the easy end only: a model shown only clean examples fails on edge cases it has never seen demonstrated
Example-instruction mismatch: when the instruction says one thing and the examples implicitly model another, examples usually win; make sure they're aligned

Understanding which failure mode you're hitting tells you exactly what to fix. For teams deciding how many examples to use and in what format, Few-shot Prompting: Trade-offs, Options, and How to Decide covers the decision logic in detail.
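
Some of these failure modes can be caught mechanically before a prompt ships. The sketch below checks three of them on a list of example records; the `audit_examples` helper and the `label` and `hard` field names are hypothetical, and mixing variables or example-instruction mismatch still need a human read.

```python
def audit_examples(examples: list[dict], categories: set[str]) -> list[str]:
    """Return a list of problems found in a set of few-shot examples."""
    problems = []
    # Example inconsistency: every example should expose the same fields.
    field_sets = {frozenset(ex) for ex in examples}
    if len(field_sets) > 1:
        problems.append("examples do not share a common set of fields")
    # Too few examples: every category should be demonstrated at least once.
    covered = {ex.get("label") for ex in examples}
    missing = sorted(categories - covered)
    if missing:
        problems.append("no example for: " + ", ".join(missing))
    # Easy end only: at least one example should be flagged as a hard case.
    if not any(ex.get("hard", False) for ex in examples):
        problems.append("no hard or ambiguous example included")
    return problems


# Invented example set with all three problems.
flawed = [
    {"text": "Export button fails.", "label": "bug", "hard": False},
    {"text": "I was charged twice.", "label": "billing"},
]
print(audit_examples(flawed, {"bug", "billing", "feature request"}))
```

An empty list means none of the three checks fired, not that the examples are good.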

Choosing and Curating Your Examples

The best few-shot examples come from real prior outputs that someone already approved. Not invented examples, not idealized examples — real ones, because they carry authentic signal about tone, format, and vocabulary that invented examples often miss.

A practical curation process:

1. Identify 10–15 real examples of the target output from your existing work
2. Sort by quality and use your top tier only
3. Check for format consistency across those top examples; if format varies, standardize it
4. Ensure coverage of the main input variations the model will encounter
5. Include at least one difficult or edge-case example
6. Test with 3 examples first; add more only if output is inconsistent

The right number of examples is typically 3–6. Below 3, patterns are often underspecified, although a narrow, tightly formatted task can work with two. Above 6, you're often adding noise or hitting context window costs without improvement. Some tasks, such as highly structured classification, benefit from more examples when category coverage matters.
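
The selection steps in that process can be sketched as a small function over a pool of candidate examples. The record fields (`quality`, `variation`, `hard`), the 1–5 quality scale, and the default thresholds are assumptions made for illustration.

```python
def select_examples(candidates: list[dict], k: int = 3, min_quality: int = 4) -> list[dict]:
    """Pick up to k examples: top tier only, one hard case, varied input types."""
    top = sorted(
        (c for c in candidates if c["quality"] >= min_quality),
        key=lambda c: c["quality"],
        reverse=True,
    )
    chosen = []
    # Reserve a slot for the best-rated hard case, if there is one.
    hard = next((c for c in top if c["hard"]), None)
    if hard is not None:
        chosen.append(hard)
    # Fill the remaining slots, preferring input variations not yet covered.
    seen = {c["variation"] for c in chosen}
    for c in top:
        if len(chosen) == k:
            break
        if c is not hard and c["variation"] not in seen:
            chosen.append(c)
            seen.add(c["variation"])
    # If distinct variations ran out, fill with the best remaining candidates.
    for c in top:
        if len(chosen) == k:
            break
        if all(c is not picked for picked in chosen):
            chosen.append(c)
    return chosen


# Invented candidate pool.
candidates = [
    {"id": "a", "quality": 5, "variation": "branding", "hard": False},
    {"id": "b", "quality": 5, "variation": "branding", "hard": False},
    {"id": "c", "quality": 4, "variation": "paid media", "hard": True},
    {"id": "d", "quality": 4, "variation": "web design", "hard": False},
    {"id": "e", "quality": 2, "variation": "seo", "hard": True},
]
picked = select_examples(candidates)
print([c["id"] for c in picked])
```

Starting with `k=3` matches the advice to test with three examples first and add more only if output stays inconsistent.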

For tooling that helps manage and version example sets at scale, The Best Tools for Few-shot Prompting covers the current landscape.

Frequently Asked Questions

How many examples should a few-shot prompt include?

Three to six is the practical range for most professional use cases. Fewer than three often leaves the pattern underspecified; more than six rarely improves quality and adds context window cost. The exception is classification tasks with many categories — there you may need one solid example per category to ensure adequate coverage.

Can I use few-shot prompting with any large language model?

Yes. The technique works across GPT-4 class models, Claude, Gemini, and open-source alternatives like Llama variants. Smaller models benefit more from clean, consistent examples because they have less implicit knowledge to fill in gaps, so example quality matters even more at the lower-capability end.

What's the difference between few-shot prompting and fine-tuning?

Few-shot prompting places examples directly in the prompt at inference time; fine-tuning bakes examples into the model's weights through additional training. Few-shot is faster and cheaper to iterate; fine-tuning produces more consistent behavior at scale. For most agency workflows, start with few-shot prompting and only consider fine-tuning if you need the same behavior across thousands of daily calls.

What if my examples are confidential or proprietary?

Sanitize them before use — replace client names, financial figures, and identifying details with realistic stand-ins. The model needs the structure and register of the example, not the literal private data. Most enterprises developing internal AI tooling have a sanitization step as standard practice before any real document becomes a prompt example.
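
A first-pass sanitization step might look like the sketch below, which swaps known names for stand-ins and masks email addresses and dollar figures with regular expressions. The `sanitize` helper, the patterns, and the stand-in values are illustrative assumptions; pattern matching will miss things, so a human review of each sanitized example is still worthwhile.

```python
import re


def sanitize(text: str, names: dict[str, str]) -> str:
    """Replace known client names, then mask email addresses and dollar figures."""
    for real, stand_in in names.items():
        text = re.sub(re.escape(real), stand_in, text, flags=re.IGNORECASE)
    # Email addresses become a fixed placeholder address.
    text = re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "contact@example.com", text)
    # Dollar amounts become one fixed stand-in figure.
    text = re.sub(r"\$\d[\d,]*(?:\.\d+)?", "$10,000", text)
    return text


# Invented sentence and stand-in name.
raw = "Acme Corp owes $48,250.00; email jo@acme.com."
print(sanitize(raw, {"Acme Corp": "Northwind Ltd"}))
# Northwind Ltd owes $10,000; email contact@example.com.
```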

Why do my few-shot prompts work well in testing but degrade in production?

Production inputs are usually more varied than your test set. If your examples only covered a narrow slice of the input distribution, the model will perform well on inputs similar to those examples and worse on anything outside that range. Review your production failures, identify which input characteristics they share, and add examples that cover those characteristics.

How do I know if my examples are causing tone bleed or format drift?

Run the same prompt against 10–15 varied inputs and look for variance in the output. If format or tone shifts across outputs in ways your examples don't predict, you likely have inconsistent examples or examples that inadvertently model variation as acceptable. Check whether your examples share a consistent structure, and look for any place where two examples handle the same element differently.
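
One rough way to spot drift across a batch is to reduce each output to a coarse structural fingerprint and count how many distinct fingerprints appear. The sketch below uses paragraph and sentence counts, which is an assumed and deliberately crude measure; the function names and sample outputs are invented.

```python
import re
from collections import Counter


def signature(output: str) -> tuple[int, int]:
    """Coarse structural fingerprint: (paragraph count, sentence count)."""
    paragraphs = [p for p in output.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s]
    return len(paragraphs), len(sentences)


def drift_report(outputs: list[str]) -> Counter:
    """Count how many outputs share each structural fingerprint."""
    return Counter(signature(o) for o in outputs)


# Invented outputs from the same prompt on three different inputs.
outputs = [
    "Thanks for reaching out. We can help. Shall we talk Tuesday?",
    "Great to hear from you. Here is one idea. Can we book a call?",
    "Hi!\n\nWe do that.\n\nThanks.",
]
report = drift_report(outputs)
print(report)
```

A single dominant fingerprint suggests the format is holding; several competing ones are a cue to reread the examples for inconsistencies.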

Key Takeaways

Few-shot examples carry specification weight that instructions alone cannot — they show texture, not just category
Inconsistent examples are usually the root cause of inconsistent output; audit your examples before adding more of them
Include at least one example from the hard or ambiguous end of your input distribution, not just the clean cases
When the variable you care about (platform, audience, format) is a differentiator, keep examples clean on that variable — don't mix across it
Real prior outputs, sanitized if necessary, make better examples than invented ones because they carry authentic register
The right number of examples is usually 3–6; reach for more only when category coverage demands it
Examples and instructions must align — when they conflict, the examples usually win

Agency Script Editorial
