Principles only stick when you see them operate on something real. This article walks through specific scenarios where a single task met different model architectures and the prompt had to change. Each example shows the task, what the naive prompt did, and the adjustment that fixed it, so you can recognize the pattern when it appears in your own work.
The scenarios are illustrative rather than tied to any one product, and the details are kept concrete so the lesson is unmistakable. The aim is to build the kind of pattern recognition that lets you predict, before you run anything, roughly how a prompt will behave on an unfamiliar model.
For the underlying principles these examples demonstrate, see Cross-Model Prompting Principles Worth Defending. Here we make those principles tangible.
Scenario One: Extracting Fields From an Invoice
The Task
Pull vendor name, invoice number, total, and due date from messy invoice text into structured fields. The same task, run across a verbose chat model and a terse one.
What Happened
On the verbose model, the naive prompt returned the four fields wrapped in three paragraphs of friendly explanation. On the terse model, the same prompt returned the fields cleanly but occasionally dropped one when the source text was ambiguous, offering no placeholder.
The Adjustment
An explicit output contract, return exactly these four named fields as structured data, with null for anything missing, fixed both. The verbose model dropped the preamble; the terse model stopped omitting fields. One explicit contract neutralized two opposite default behaviors, the exact payoff of the explicit-contract practice.
- The contract removed dependence on each model's formatting habit
- Specifying null-on-missing closed the dropped-field gap
- The same contract worked across both verbosity profiles
Scenario Two: A Multi-Step Logic Puzzle
The Task
Solve a constraint problem requiring several reasoning steps, run on a standard chat model and a reasoning-optimized one.
What Happened
On the chat model, a step-by-step instruction helped; without it the model jumped to a wrong answer. On the reasoning model, the same step-by-step instruction produced a longer, more tangled response that was occasionally worse than when the instruction was removed.
The Adjustment
Splitting the prompt by family solved it. The chat model kept the explicit step-by-step cue; the reasoning model got a clean problem statement with the cue removed. This is the over-instruction failure mode from Seven Ways Cross-Model Prompts Quietly Break, caught and corrected.
Scenario Three: Routing a Support Ticket
The Task
Decide which of five teams a support ticket belongs to, run on a chat model and an embedding-based classifier.
What Happened
The chat model accepted a plain-English instruction listing the five teams and classified reasonably. The embedding model rejected that approach entirely; it does not follow instructions, it represents text for similarity comparison. The naive instruction produced an error, not a classification.
The Adjustment
For the embedding model, the team supplied the ticket text and a set of labeled example tickets per team, classifying by nearest match rather than by instruction. Recognizing that the specialized model consumes input, not commands, was the whole fix, an instance of not treating specialized models like chat models.
- The chat model took an instruction; the embedding model took example text
- The fix was matching the prompt to what each model consumes
- Reading the model card upfront would have predicted this
Scenario Four: Summarizing a Long Document
The Task
Summarize a long report, with a specific instruction to preserve the three key recommendations, run on two models with different context behavior.
What Happened
When the preserve-the-recommendations instruction sat in the middle of a long prompt after the document, one model honored it and the other ignored it, dropping the recommendations from the summary. The models attended to mid-context content differently.
The Adjustment
Moving the critical instruction to the very start, before the document, made both models honor it. Position alone fixed the inconsistency, a direct demonstration of placing critical content where attention is reliable across architectures.
Scenario Five: Enforcing a Refusal
The Task
A prompt meant to refuse requests outside a defined scope, run across several models to confirm consistent refusal behavior.
What Happened
Some models refused firmly, others complied with out-of-scope requests when the user phrased them cleverly. The refusal instruction was identical; the models' willingness to be talked out of it varied by architecture and instruction-following strength.
The Adjustment
Strengthening the refusal language and adding an adversarial test set, inputs that try to talk the model out of the rule, exposed which models held and which needed reinforcement. This is where example-driven work meets robustness testing, detailed in Stress-Testing Prompts Before They Reach a Client.
What the Scenarios Have in Common
The Same Diagnostic Loop
Every scenario followed one loop: run the naive prompt, observe the model-specific gap, apply the minimal architecture-aware fix, re-test. The fixes differed, but the diagnostic process was identical, which is the transferable skill.
Architecture Predicts the Gap
In each case the gap was predictable from the model's family. Verbosity from a verbose model, over-reasoning from a reasoning model, instruction rejection from a specialized model. Knowing the family gave a head start on the fix every time.
Empirical Confirmation Closed It
No scenario was resolved by theory alone. Each fix was confirmed by re-running the task and observing the corrected output. The procedure for that confirmation is laid out in A Step-by-Step Approach to Prompting Across Different Model Architectures.
Scenario Six: Generating Marketing Copy in a Fixed Voice
The Task
Produce short product blurbs in a strict brand voice, run on two chat models with different default tones, one chatty and one formal.
What Happened
With only a voice description in the prompt, the chatty model leaned playful and the formal model leaned stiff. Neither matched the brand voice precisely, and the two outputs read as if written by different companies. The voice description alone was too weak a constraint across models with strong, opposite tonal defaults.
The Adjustment
Adding two or three example blurbs in the exact target voice anchored both models far better than any amount of voice description. The examples gave each model a concrete target to imitate, pulling the chatty one toward restraint and the formal one toward warmth. This is the few-shot principle doing the heavy lifting where prose instruction could not.
- Examples constrain tone more reliably than adjectives across models
- Two or three on-voice samples outperformed a long voice description
- The same examples worked on both tonal defaults
Scenario Seven: A Reasoning Model on a Trivial Task
The Task
Capitalize the first letter of each sentence in a paragraph, accidentally routed to a reasoning model instead of a cheap chat model.
What Happened
The reasoning model produced the correct result but spent visible effort deliberating over a task that required none, adding latency and cost for no benefit. It was a reminder that matching the model to the task matters as much as matching the prompt to the model.
The Adjustment
The fix was not a prompt change at all but a routing change: send trivial formatting tasks to a cheap fast model and reserve the reasoning model for problems that actually need reasoning. The lesson connects architecture choice to cost, echoing the routing decision in One Team Migrated a Prompt Across Three Models.
Frequently Asked Questions
Why did an explicit output contract help across opposite verbosity defaults?
Because both verbose and terse defaults are the model formatting on its own. An explicit contract overrides whatever default a model has, so the verbose model stops adding preamble and the terse model stops dropping fields. One contract neutralizes opposite behaviors by removing the dependence on defaults.
When does removing a step-by-step instruction improve results?
On reasoning-optimized models, which already reason internally. The explicit cue forces a visible process that can be worse than the model's built-in one, producing longer and sometimes tangled answers. Removing it and stating the problem cleanly lets the model use its better internal reasoning.
Why did the embedding model reject a plain-English instruction?
Because embedding models represent text for similarity comparison rather than following instructions. A plain-English command does not fit what they consume, so it errors. The fix is to supply the text to classify plus labeled examples and classify by nearest match instead.
How did instruction placement change the summary results?
A critical instruction placed mid-context, after a long document, was honored by one model and ignored by another, because architectures attend to the middle of long inputs differently. Moving it to the start, where attention is reliable, made both models honor it. Position was the entire fix.
What single process do all these examples share?
A diagnostic loop: run the naive prompt, observe the model-specific gap, apply the minimal architecture-aware fix, and re-test. The specific fixes differ by scenario, but the loop is identical, and learning the loop is more valuable than memorizing any individual fix.
Can I predict the gap before running the prompt?
Often, yes. The gap usually follows from the model's family: verbosity from verbose models, over-reasoning from reasoning models, instruction rejection from specialized models. Knowing the family gives you a strong head start, though empirical confirmation is still required to be sure.
Key Takeaways
- An explicit output contract neutralized opposite verbosity defaults across two chat models in the same task.
- A step-by-step cue helped a chat model on a logic puzzle but hurt a reasoning model on the same one.
- An embedding classifier rejected plain-English instructions and needed labeled example text instead.
- Moving a critical instruction to the start fixed inconsistent mid-context attention across models.
- Every scenario used the same loop: run naive, observe the family-predictable gap, fix minimally, re-test.
- A few on-voice examples constrained tone better than description; a trivial task on a reasoning model called for routing, not prompting.