Concrete Scenarios Where Model Architecture Changed the Prompt

Principles only stick when you see them operate on something real. This article walks through specific scenarios where a single task met different model architectures and the prompt had to change. Each example shows the task, what the naive prompt did, and the adjustment that fixed it, so you can recognize the pattern when it appears in your own work.

The scenarios are illustrative rather than tied to any one product, and the details are kept concrete so the lesson is unmistakable. The aim is to build the kind of pattern recognition that lets you predict, before you run anything, roughly how a prompt will behave on an unfamiliar model.

For the underlying principles these examples demonstrate, see Cross-Model Prompting Principles Worth Defending. Here we make those principles tangible.

Scenario One: Extracting Fields From an Invoice

The Task

Pull vendor name, invoice number, total, and due date from messy invoice text into structured fields. The same task, run across a verbose chat model and a terse one.

What Happened

On the verbose model, the naive prompt returned the four fields wrapped in three paragraphs of friendly explanation. On the terse model, the same prompt returned the fields cleanly but occasionally dropped one when the source text was ambiguous, offering no placeholder.

The Adjustment

An explicit output contract, return exactly these four named fields as structured data, with null for anything missing, fixed both. The verbose model dropped the preamble; the terse model stopped omitting fields. One explicit contract neutralized two opposite default behaviors, the exact payoff of the explicit-contract practice.

The contract removed dependence on each model's formatting habit
Specifying null-on-missing closed the dropped-field gap
The same contract worked across both verbosity profiles

Scenario Two: A Multi-Step Logic Puzzle

The Task

Solve a constraint problem requiring several reasoning steps, run on a standard chat model and a reasoning-optimized one.

What Happened

On the chat model, a step-by-step instruction helped; without it the model jumped to a wrong answer. On the reasoning model, the same step-by-step instruction produced a longer, more tangled response that was occasionally worse than when the instruction was removed.

The Adjustment

Splitting the prompt by family solved it. The chat model kept the explicit step-by-step cue; the reasoning model got a clean problem statement with the cue removed. This is the over-instruction failure mode from Seven Ways Cross-Model Prompts Quietly Break, caught and corrected.

Scenario Three: Routing a Support Ticket

The Task

Decide which of five teams a support ticket belongs to, run on a chat model and an embedding-based classifier.

What Happened

The chat model accepted a plain-English instruction listing the five teams and classified reasonably. The embedding model rejected that approach entirely; it does not follow instructions, it represents text for similarity comparison. The naive instruction produced an error, not a classification.

The Adjustment

For the embedding model, the team supplied the ticket text and a set of labeled example tickets per team, classifying by nearest match rather than by instruction. Recognizing that the specialized model consumes input, not commands, was the whole fix, an instance of not treating specialized models like chat models.

The chat model took an instruction; the embedding model took example text
The fix was matching the prompt to what each model consumes
Reading the model card upfront would have predicted this

Scenario Four: Summarizing a Long Document

The Task

Summarize a long report, with a specific instruction to preserve the three key recommendations, run on two models with different context behavior.

What Happened

When the preserve-the-recommendations instruction sat in the middle of a long prompt after the document, one model honored it and the other ignored it, dropping the recommendations from the summary. The models attended to mid-context content differently.

The Adjustment

Moving the critical instruction to the very start, before the document, made both models honor it. Position alone fixed the inconsistency, a direct demonstration of placing critical content where attention is reliable across architectures.

Scenario Five: Enforcing a Refusal

The Task

A prompt meant to refuse requests outside a defined scope, run across several models to confirm consistent refusal behavior.

What Happened

Some models refused firmly, others complied with out-of-scope requests when the user phrased them cleverly. The refusal instruction was identical; the models' willingness to be talked out of it varied by architecture and instruction-following strength.

The Adjustment

Strengthening the refusal language and adding an adversarial test set, inputs that try to talk the model out of the rule, exposed which models held and which needed reinforcement. This is where example-driven work meets robustness testing, detailed in Stress-Testing Prompts Before They Reach a Client.

What the Scenarios Have in Common

The Same Diagnostic Loop

Every scenario followed one loop: run the naive prompt, observe the model-specific gap, apply the minimal architecture-aware fix, re-test. The fixes differed, but the diagnostic process was identical, which is the transferable skill.

Architecture Predicts the Gap

In each case the gap was predictable from the model's family. Verbosity from a verbose model, over-reasoning from a reasoning model, instruction rejection from a specialized model. Knowing the family gave a head start on the fix every time.

Empirical Confirmation Closed It

No scenario was resolved by theory alone. Each fix was confirmed by re-running the task and observing the corrected output. The procedure for that confirmation is laid out in A Step-by-Step Approach to Prompting Across Different Model Architectures.

Scenario Six: Generating Marketing Copy in a Fixed Voice

The Task

Produce short product blurbs in a strict brand voice, run on two chat models with different default tones, one chatty and one formal.

What Happened

With only a voice description in the prompt, the chatty model leaned playful and the formal model leaned stiff. Neither matched the brand voice precisely, and the two outputs read as if written by different companies. The voice description alone was too weak a constraint across models with strong, opposite tonal defaults.

The Adjustment

Adding two or three example blurbs in the exact target voice anchored both models far better than any amount of voice description. The examples gave each model a concrete target to imitate, pulling the chatty one toward restraint and the formal one toward warmth. This is the few-shot principle doing the heavy lifting where prose instruction could not.

Examples constrain tone more reliably than adjectives across models
Two or three on-voice samples outperformed a long voice description
The same examples worked on both tonal defaults

Scenario Seven: A Reasoning Model on a Trivial Task

The Task

Capitalize the first letter of each sentence in a paragraph, accidentally routed to a reasoning model instead of a cheap chat model.

What Happened

The reasoning model produced the correct result but spent visible effort deliberating over a task that required none, adding latency and cost for no benefit. It was a reminder that matching the model to the task matters as much as matching the prompt to the model.

The Adjustment

The fix was not a prompt change at all but a routing change: send trivial formatting tasks to a cheap fast model and reserve the reasoning model for problems that actually need reasoning. The lesson connects architecture choice to cost, echoing the routing decision in One Team Migrated a Prompt Across Three Models.

Frequently Asked Questions

Why did an explicit output contract help across opposite verbosity defaults?

Because both verbose and terse defaults are the model formatting on its own. An explicit contract overrides whatever default a model has, so the verbose model stops adding preamble and the terse model stops dropping fields. One contract neutralizes opposite behaviors by removing the dependence on defaults.

When does removing a step-by-step instruction improve results?

On reasoning-optimized models, which already reason internally. The explicit cue forces a visible process that can be worse than the model's built-in one, producing longer and sometimes tangled answers. Removing it and stating the problem cleanly lets the model use its better internal reasoning.

Why did the embedding model reject a plain-English instruction?

Because embedding models represent text for similarity comparison rather than following instructions. A plain-English command does not fit what they consume, so it errors. The fix is to supply the text to classify plus labeled examples and classify by nearest match instead.

How did instruction placement change the summary results?

A critical instruction placed mid-context, after a long document, was honored by one model and ignored by another, because architectures attend to the middle of long inputs differently. Moving it to the start, where attention is reliable, made both models honor it. Position was the entire fix.

A diagnostic loop: run the naive prompt, observe the model-specific gap, apply the minimal architecture-aware fix, and re-test. The specific fixes differ by scenario, but the loop is identical, and learning the loop is more valuable than memorizing any individual fix.

Can I predict the gap before running the prompt?

Often, yes. The gap usually follows from the model's family: verbosity from verbose models, over-reasoning from reasoning models, instruction rejection from specialized models. Knowing the family gives you a strong head start, though empirical confirmation is still required to be sure.

Key Takeaways

An explicit output contract neutralized opposite verbosity defaults across two chat models in the same task.
A step-by-step cue helped a chat model on a logic puzzle but hurt a reasoning model on the same one.
An embedding classifier rejected plain-English instructions and needed labeled example text instead.
Moving a critical instruction to the start fixed inconsistent mid-context attention across models.
Every scenario used the same loop: run naive, observe the family-predictable gap, fix minimally, re-test.
A few on-voice examples constrained tone better than description; a trivial task on a reasoning model called for routing, not prompting.

For the underlying principles these examples demonstrate, see Cross-Model Prompting Principles Worth Defending. Here we make those principles tangible.

Scenario One: Extracting Fields From an Invoice

The Task

Pull vendor name, invoice number, total, and due date from messy invoice text into structured fields. The same task, run across a verbose chat model and a terse one.

What Happened

The Adjustment

The contract removed dependence on each model's formatting habit
Specifying null-on-missing closed the dropped-field gap
The same contract worked across both verbosity profiles

Scenario Two: A Multi-Step Logic Puzzle

The Task

Solve a constraint problem requiring several reasoning steps, run on a standard chat model and a reasoning-optimized one.

What Happened

The Adjustment

Scenario Three: Routing a Support Ticket

The Task

Decide which of five teams a support ticket belongs to, run on a chat model and an embedding-based classifier.

What Happened

The Adjustment

The chat model took an instruction; the embedding model took example text
The fix was matching the prompt to what each model consumes
Reading the model card upfront would have predicted this

Scenario Four: Summarizing a Long Document

The Task

Summarize a long report, with a specific instruction to preserve the three key recommendations, run on two models with different context behavior.

What Happened

The Adjustment

Scenario Five: Enforcing a Refusal

The Task

A prompt meant to refuse requests outside a defined scope, run across several models to confirm consistent refusal behavior.

What Happened

The Adjustment

What the Scenarios Have in Common

The Same Diagnostic Loop

Architecture Predicts the Gap

Empirical Confirmation Closed It

Scenario Six: Generating Marketing Copy in a Fixed Voice

The Task

Produce short product blurbs in a strict brand voice, run on two chat models with different default tones, one chatty and one formal.

What Happened

The Adjustment

Examples constrain tone more reliably than adjectives across models
Two or three on-voice samples outperformed a long voice description
The same examples worked on both tonal defaults

Scenario Seven: A Reasoning Model on a Trivial Task

The Task

Capitalize the first letter of each sentence in a paragraph, accidentally routed to a reasoning model instead of a cheap chat model.

What Happened

The Adjustment

Frequently Asked Questions

Why did an explicit output contract help across opposite verbosity defaults?

When does removing a step-by-step instruction improve results?

Why did the embedding model reject a plain-English instruction?

How did instruction placement change the summary results?

Can I predict the gap before running the prompt?

Key Takeaways

An explicit output contract neutralized opposite verbosity defaults across two chat models in the same task.
A step-by-step cue helped a chat model on a logic puzzle but hurt a reasoning model on the same one.
An embedding classifier rejected plain-English instructions and needed labeled example text instead.
Moving a critical instruction to the start fixed inconsistent mid-context attention across models.
Every scenario used the same loop: run naive, observe the family-predictable gap, fix minimally, re-test.
A few on-voice examples constrained tone better than description; a trivial task on a reasoning model called for routing, not prompting.

Concrete Scenarios Where Model Architecture Changed the Prompt

Scenario One: Extracting Fields From an Invoice

The Task

What Happened

The Adjustment

Scenario Two: A Multi-Step Logic Puzzle

The Task

What Happened

The Adjustment

Scenario Three: Routing a Support Ticket

The Task

What Happened

The Adjustment

Scenario Four: Summarizing a Long Document

The Task

What Happened

The Adjustment

Scenario Five: Enforcing a Refusal

The Task

What Happened

The Adjustment

What the Scenarios Have in Common

The Same Diagnostic Loop

Architecture Predicts the Gap

Empirical Confirmation Closed It

Scenario Six: Generating Marketing Copy in a Fixed Voice

The Task

What Happened

The Adjustment

Scenario Seven: A Reasoning Model on a Trivial Task

The Task

What Happened

The Adjustment

Frequently Asked Questions

Why did an explicit output contract help across opposite verbosity defaults?

When does removing a step-by-step instruction improve results?

Why did the embedding model reject a plain-English instruction?

How did instruction placement change the summary results?

What single process do all these examples share?

Can I predict the gap before running the prompt?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Concrete Scenarios Where Model Architecture Changed the Prompt

Scenario One: Extracting Fields From an Invoice

The Task

What Happened

The Adjustment

Scenario Two: A Multi-Step Logic Puzzle

The Task

What Happened

The Adjustment

Scenario Three: Routing a Support Ticket

The Task

What Happened

The Adjustment

Scenario Four: Summarizing a Long Document

The Task

What Happened

The Adjustment

Scenario Five: Enforcing a Refusal

The Task

What Happened

The Adjustment

What the Scenarios Have in Common

The Same Diagnostic Loop

Architecture Predicts the Gap

Empirical Confirmation Closed It

Scenario Six: Generating Marketing Copy in a Fixed Voice

The Task

What Happened

The Adjustment

Scenario Seven: A Reasoning Model on a Trivial Task

The Task

What Happened

The Adjustment