You have a prompt that works on one model and a reason to run it on others: cost, speed, a client requirement, a vendor change. The question is not whether models differ, which they do, but what specific steps move your prompt from working on one to working on several. This article answers that with a sequence you can start now.
The steps are concrete and ordered. Work through them in sequence, and at the end you have a prompt that performs reliably across the models you care about, plus the evidence to prove it. There is no theory for its own sake; every step produces something you can use.
If you want the conceptual background on why architectures differ before diving into the procedure, read The Complete Guide to Prompting Across Different Model Architectures first. Otherwise, start here.
Step One: Separate the Core From the Scaffolding
Identify the Task Core
Look at your working prompt and isolate the part that defines the actual task, the instruction that would hold on any model: "Summarize this document," "Extract these fields," or "Classify this text." That is the core, and it stays constant across every model.
Identify the Scaffolding
Everything else is scaffolding: format reminders, length limits, reasoning cues, examples, tone instructions. This is the part that will change per model. Marking the boundary between core and scaffolding is the foundational move; everything later depends on it.
- Write the core as a single model-neutral statement
- List the scaffolding pieces separately
- Keep them in separate sections so you can swap scaffolding cleanly
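One way to keep the boundary visible is to store the core and the scaffolding as separate fields and join them only when the prompt is sent. The following is a minimal sketch, not a prescribed structure: the `PromptVariant` class, the model names `model_a` and `model_b`, and the example instructions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVariant:
    """A shared task core plus the scaffolding chosen for one model."""

    core: str
    scaffolding: tuple[str, ...] = ()

    def render(self) -> str:
        # Core first, then one scaffolding instruction per line.
        return "\n".join((self.core, *self.scaffolding))


# The core is written once and reused unchanged for every model.
CORE = "Summarize the document below."

# "model_a" and "model_b" are placeholder names, not real models.
variants = {
    "model_a": PromptVariant(CORE, ("Use at most three sentences.",)),
    "model_b": PromptVariant(CORE),  # no extra scaffolding needed yet
}

print(variants["model_a"].render())
```

Because every variant points at the same `CORE` string, swapping scaffolding for one model cannot silently change the task for another.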
Step Two: List Your Target Models
Enumerate the Models
Write down every model you intend to support. For each, note its family: generative chat, reasoning-optimized, or specialized. The family tells you what kind of scaffolding adjustments to expect before you even run anything.
Note Each Model's Profile
For each target, jot down what you know or can find about its verbosity, format defaults, and reasoning behavior. The model card is your fastest source. This profile becomes your prediction of how the prompt will behave, which you will confirm in later steps.
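A profile can be as small as a few labeled fields. The sketch below assumes a hypothetical `ModelProfile` record and a helper, `expected_adjustments`, that turns a profile into the kind of prediction described above. The field values and the wording of the notes are illustrative, not a standard vocabulary.

```python
from dataclasses import dataclass
from enum import Enum


class Family(Enum):
    CHAT = "generative chat"
    REASONING = "reasoning-optimized"
    SPECIALIZED = "specialized"


@dataclass
class ModelProfile:
    name: str
    family: Family
    verbosity: str       # e.g. "high", "low", "unknown"
    format_default: str  # e.g. "prose", "markdown", "unknown"
    reasoning: str       # e.g. "internal", "needs cue", "unknown"


def expected_adjustments(profile: ModelProfile) -> list[str]:
    """Turn a profile into predictions to confirm or reject in later steps."""
    notes = []
    if profile.family is Family.REASONING:
        notes.append("consider removing step-by-step cues")
    if profile.family is Family.SPECIALIZED:
        notes.append("consider reshaping the input instead of adding prose")
    if profile.verbosity == "high":
        notes.append("expect to need a length limit")
    return notes


# "model_a" is a placeholder name with made-up profile values.
profile = ModelProfile("model_a", Family.REASONING, "high", "prose", "internal")
print(expected_adjustments(profile))
```

These notes are predictions only; the baseline run is what confirms or overturns them.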
Step Three: Build a Frozen Test Set
Choose Representative Inputs
Pick a handful of inputs that span your real usage: an easy one, a hard one, an edge case, a malformed one. For each, write down what a correct output must contain. This set is how you will compare models fairly.
Freeze It
Do not change the test set between models. The whole point is an apples-to-apples comparison, which only works if every model faces identical inputs. This frozen set is also the backbone of ongoing robustness work, detailed in Building a Repeatable Workflow for Prompt Sensitivity and Robustness Testing.
- Five to ten inputs is enough to start
- Pair each input with explicit pass criteria
- Save it as a file you reuse, not throwaway notes
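If you want a mechanical guard against accidental edits, one option is to fingerprint the set when you freeze it and compare before each run. This is a sketch under assumptions: the case fields (`id`, `input`, `must_contain`) and the sample inputs are invented for illustration, and hashing is just one possible way to enforce the freeze.

```python
import hashlib
import json

# Illustrative cases only; real inputs and criteria come from your own usage.
TEST_SET = [
    {"id": "easy", "input": "Invoice total: 40 dollars, due Friday.",
     "must_contain": ["40", "Friday"]},
    {"id": "malformed", "input": "Invoice total: , due",
     "must_contain": ["unknown"]},
]


def fingerprint(cases: list[dict]) -> str:
    """Return a stable SHA-256 digest of the test set contents."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Digest taken at the moment the set is frozen.
FROZEN = fingerprint(TEST_SET)


def assert_unchanged(cases: list[dict]) -> None:
    """Raise if the set no longer matches the digest taken when it was frozen."""
    if fingerprint(cases) != FROZEN:
        raise ValueError("test set changed since it was frozen")


assert_unchanged(TEST_SET)  # passes silently while the set is intact
```

In practice the cases would live in a JSON file and the digest would be recorded alongside each model's results, so two runs can be shown to have used identical inputs.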
Step Four: Run the Core on Each Model
Start Minimal
Send just the core, with minimal scaffolding, to each target model and record the outputs. This baseline shows you each model's natural behavior on your task before you start adjusting. Often a model handles the core better or worse than you predicted.
Diagnose the Gaps
Compare each model's baseline output against your pass criteria. Where it falls short, name the gap precisely: wrong format, too verbose, missed a field, over-reasoned. The specific gap tells you the specific scaffolding to add.
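The baseline run and the gap naming can be sketched as a small loop. Everything here is an assumption made for illustration: the models are stub functions with canned replies rather than real API calls, and the two checks (required substrings and a word limit) are simple stand-ins for whatever pass criteria you wrote in step three.

```python
from typing import Callable

# Prompt in, text out. A stand-in for whatever client you use to call a model.
ModelFn = Callable[[str], str]


def diagnose(output: str, must_contain: list[str], max_words: int) -> list[str]:
    """Name each gap between one output and its pass criteria."""
    gaps = []
    missing = [s for s in must_contain if s.lower() not in output.lower()]
    if missing:
        gaps.append("missed content: " + ", ".join(missing))
    if len(output.split()) > max_words:
        gaps.append("too verbose")
    return gaps


def run_baseline(core: str, models: dict[str, ModelFn], cases: list[dict]) -> dict:
    """Send the bare core plus each input to each model and record the gaps."""
    report = {}
    for name, call in models.items():
        for case in cases:
            output = call(f"{core}\n\n{case['input']}")
            report[(name, case["id"])] = diagnose(
                output, case["must_contain"], case["max_words"]
            )
    return report


# Stub models with canned replies so the sketch runs without network access.
models = {
    "chatty": lambda prompt: "Sure! Here is a summary for you: the invoice "
                             "totals 40 dollars and is due Friday, thanks for asking.",
    "terse": lambda prompt: "Invoice: 40 dollars.",
}
cases = [{"id": "invoice", "input": "Invoice total: 40 dollars, due Friday.",
          "must_contain": ["40", "Friday"], "max_words": 10}]

report = run_baseline("Summarize the document below.", models, cases)
for key, gaps in report.items():
    print(key, gaps or "pass")
```

Each named gap maps to one candidate scaffolding change, which keeps the next step from turning into guesswork.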
Step Five: Add Scaffolding Per Model
Address Each Gap Deliberately
For each gap, make the smallest scaffolding change that closes it. A format gap gets an explicit format instruction. A verbosity gap gets a length limit. An over-reasoning gap on a reasoning model is closed by removing a step-by-step cue, not by adding one.
- Add one scaffolding change at a time and re-test
- Prefer the smallest fix that works
- Resist copying scaffolding between models without checking that it helps
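The one-change-at-a-time rule in the list above can be sketched as a loop that tries each candidate fix on its own and stops at the first one that passes every case. The function name `smallest_passing_fix`, the stub model, and the pass check are hypothetical; a real run would call your model client and use your own criteria.

```python
from typing import Callable, Optional


def smallest_passing_fix(
    core: str,
    candidates: list[str],
    model: Callable[[str], str],
    cases: list[dict],
    passes: Callable[[str, dict], bool],
) -> Optional[str]:
    """Try each candidate alone, smallest first; return the first that passes all cases."""
    for fix in candidates:
        prompt = f"{core}\n{fix}"
        outputs = [model(f"{prompt}\n\n{case['input']}") for case in cases]
        if all(passes(out, case) for out, case in zip(outputs, cases)):
            return fix
    return None


def stub_model(prompt: str) -> str:
    # Canned behavior standing in for a real model call.
    if "Return only JSON" in prompt:
        return '{"total": 40}'
    return "The total comes to 40 dollars."


def is_json_object(output: str, case: dict) -> bool:
    # Deliberately crude check, enough for the illustration.
    return output.strip().startswith("{")


cases = [{"input": "Invoice total: 40 dollars."}]
candidates = ["Be concise.", "Return only JSON."]  # ordered smallest first

chosen = smallest_passing_fix("Extract the total.", candidates, stub_model, cases, is_json_object)
print(chosen)
```

Testing each candidate in isolation is what tells you which change actually closed the gap, so nothing rides along unverified.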
Respect Architecture-Specific Rules
Remember that reasoning models often need less instruction, not more. Specialized models may need their input reshaped rather than instructed in prose. Apply the rule that fits the family rather than one universal recipe.
Step Six: Re-Test and Record
Run the Frozen Set Again
After adjusting scaffolding for a model, run the full frozen test set against it and record the results. Confirm every case now passes. If a fix broke a previously passing case, you have a regression to resolve before moving on.
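Spotting a regression amounts to comparing pass/fail results from before and after the change. A minimal sketch, assuming results are kept as a mapping from a case id to a boolean (the ids below are made up):

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Return ids of cases that passed before the change and fail after it."""
    return sorted(
        case_id
        for case_id, passed in before.items()
        if passed and not after.get(case_id, False)
    )


before = {"easy": True, "hard": False, "edge": True}
after = {"easy": True, "hard": True, "edge": False}

print(find_regressions(before, after))
```

Here the change fixed the hard case but broke the edge case, so the edge case is reported. A case missing from the later run is also treated as a regression, on the assumption that a skipped case should not count as a pass.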
Keep a Per-Model Record
Store the final prompt variant for each model alongside its test results. This record is your proof that the prompt works across architectures and your starting point next time a model changes. The brittleness this guards against is covered in Stress-Testing Prompts Before They Reach a Client.
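The record itself can be a small JSON document per model. The field names, the placeholder model name, and the date below are assumptions for illustration; keep whatever fields your team actually needs.

```python
import json
from datetime import date


def build_record(model: str, prompt: str, results: dict[str, bool], run_date: date) -> dict:
    """Bundle one model's final prompt with the test results that justify it."""
    return {
        "model": model,
        "prompt": prompt,
        "run_date": run_date.isoformat(),
        "results": results,
        "all_passed": all(results.values()),
    }


record = build_record(
    model="model_a",  # placeholder name
    prompt="Summarize the document below.\nUse at most three sentences.",
    results={"easy": True, "hard": True, "edge": True},
    run_date=date(2024, 1, 15),  # illustrative date
)

# One JSON document per model, suitable for storing next to the test set.
print(json.dumps(record, indent=2))
```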
Step Seven: Maintain Over Time
Re-Run on Change
Whenever a model updates or you add a new target, re-run the frozen set. Models drift, and a variant that passed last month can fail today. Treating the test set as a living check rather than a one-time gate keeps the whole thing trustworthy.
Grow the Test Set From Failures
Every time a model surprises you in real use, add that input to the frozen set. The set gets smarter over time, encoding exactly how your prompts break across architectures, which is the most useful documentation you can keep.
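Adding a failure can be a one-function operation that produces a new version of the set. This sketch assumes cases are dictionaries with a unique `id`; the `note` field and the sample failure are hypothetical.

```python
def add_failure_case(
    cases: list[dict], case_id: str, text: str, must_contain: list[str], note: str
) -> list[dict]:
    """Return a new test set with one more case; ids must stay unique."""
    if any(case["id"] == case_id for case in cases):
        raise ValueError(f"duplicate case id: {case_id}")
    new_case = {"id": case_id, "input": text, "must_contain": must_contain, "note": note}
    return [*cases, new_case]


v1 = [{"id": "easy", "input": "Invoice total: 40 dollars.", "must_contain": ["40"]}]

v2 = add_failure_case(
    v1,
    case_id="currency-symbol",
    text="Invoice total: €40.",
    must_contain=["40"],
    note="Observed in real use: one model dropped the amount entirely.",
)

print(len(v1), len(v2))
```

The function returns a new list rather than editing the old one, so the earlier version stays available for comparison. After a case is added, the enlarged set would presumably need a fresh run on every model so that results remain comparable.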
A Worked Mini-Example
The Setup
Suppose your task is to extract a person's name and email from a block of text and return them as two named fields. The core is simple: extract name and email as structured data. You want this to run on a verbose chat model and a terse one.
Walking the Steps
Step one isolates that core. Step two lists the two models and notes the first defaults to long, friendly output and the second to clipped output. Step three builds five inputs: a clean one, one with no email, one with two emails, one with a misspelled label, and one that is mostly noise, each paired with the correct expected result.
- The no-email case checks how each model handles a missing field
- The two-email case checks which one it picks and whether that is acceptable
- The noise case checks that the model does not invent data
Diagnosing and Fixing
Step four runs the bare core. The verbose model wraps the two fields in a paragraph; the terse model omits the email field entirely on the no-email case. Step five adds one explicit contract: "Return name and email as structured fields, null if absent." The verbose model drops its paragraph; the terse model now returns null instead of omitting the field. Step six re-runs all five inputs and confirms both models pass. The whole exercise takes under an hour and demonstrates the loop end to end, mirroring the patterns in Concrete Scenarios Where Model Architecture Changed the Prompt.
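The pass check for this example can be sketched in a few lines. Several things here are assumptions: JSON is used as the structured format (the example does not name one), the person's name is invented, and the three outputs are canned strings imitating the behaviors described above rather than real model responses.

```python
import json

CORE = "Extract the person's name and email from the text below."
CONTRACT = 'Return only a JSON object with keys "name" and "email". Use null for a missing value.'
PROMPT = f"{CORE}\n{CONTRACT}"  # core plus the single piece of scaffolding


def check(output: str, expected: dict) -> list[str]:
    """Compare one model output against the expected two-field result."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["wrong format: not valid JSON"]
    if not isinstance(data, dict):
        return ["wrong format: not a JSON object"]
    gaps = []
    for key in ("name", "email"):
        if key not in data:
            gaps.append(f"missing field: {key}")
        elif data[key] != expected[key]:
            gaps.append(f"wrong value: {key}")
    return gaps


# The no-email case; the name is made up for illustration.
expected = {"name": "Dana Reyes", "email": None}

verbose_before = "Happy to help! The person's name is Dana Reyes, and no email was given."
terse_before = '{"name": "Dana Reyes"}'
after_contract = '{"name": "Dana Reyes", "email": null}'

for label, output in [("verbose, bare core", verbose_before),
                      ("terse, bare core", terse_before),
                      ("either, with contract", after_contract)]:
    print(label, "->", check(output, expected) or "pass")
```

The check distinguishes a missing `email` key from an `email` of null, which is exactly the difference the contract is meant to enforce on the terse model.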
Frequently Asked Questions
What is the first thing to do when adapting a prompt to a new model?
Separate the task core from the scaffolding. The core is the model-neutral instruction that defines the task; the scaffolding is the format reminders, length limits, and reasoning cues around it. You keep the core constant and adjust only the scaffolding per model.
Why start with minimal scaffolding on each model?
To see each model's natural behavior on your task before you intervene. The minimal baseline reveals the actual gaps, which tells you precisely what scaffolding to add. Starting with heavy scaffolding hides what the model would have done on its own.
How big should the frozen test set be?
Five to ten inputs is a reasonable start, spanning easy, hard, edge, and malformed cases, each paired with explicit pass criteria. The set must stay frozen across models so comparisons are fair, and it should grow as real-world failures reveal new cases worth covering.
Why might I remove instructions for a reasoning model?
Because reasoning-optimized models already think through problems internally. An explicit step-by-step cue can be redundant or even degrade the answer. For those models the adjustment is often to subtract scaffolding and state the problem cleanly rather than to add more.
How do I know an adjustment did not break something else?
Re-run the full frozen test set after every change, not just the case you were fixing. If a previously passing case now fails, you have introduced a regression and must resolve it before moving on. Full re-runs are what catch these side effects.
How often should I revisit a prompt that already works across models?
Whenever a model updates, whenever you add a new target model, and on a recurring schedule even when nothing appears to have changed, because models drift on their own. Re-running the frozen set on these triggers keeps your cross-model prompt trustworthy over time.
Key Takeaways
- Begin by separating the model-neutral task core from the scaffolding you will adjust per model.
- Build a frozen test set of representative inputs with explicit pass criteria and reuse it for every model.
- Run the core with minimal scaffolding first to diagnose each model's real gaps before adjusting.
- Close gaps with the smallest scaffolding change, remembering reasoning models often need less, not more.
- Re-test the full set after every change, keep per-model records, and re-run whenever a model updates or you add a new target.
- A simple name-and-email extraction shows the full loop running in under an hour across two models.