Adapting One Prompt to Several Models, Step by Step

Agency Script Editorial

Editorial Team

January 3, 2020 · 8 min read
You have a prompt that works on one model and a reason to run it on others: cost, speed, a client requirement, a vendor change. The question is not whether models differ, which they do, but what specific steps move your prompt from working on one to working on several. This article answers that with a sequence you can start now.

Each step is concrete and ordered. You do the first, then the second, and so on, and at the end you have a prompt that performs reliably across the models you care about plus the evidence to prove it. No theory for its own sake; every step produces something you can use.

If you want the conceptual background on why architectures differ before diving into the procedure, read The Complete Guide to Prompting Across Different Model Architectures first. Otherwise, start here.

Step One: Separate the Core From the Scaffolding

Identify the Task Core

Look at your working prompt and isolate the part that defines the actual task: the instruction that would hold on any model, such as "Summarize this document," "Extract these fields," or "Classify this text." That is the core, and it stays constant across every model.

Identify the Scaffolding

Everything else is scaffolding: format reminders, length limits, reasoning cues, examples, tone instructions. This is the part that will change per model. Marking the boundary between core and scaffolding is the foundational move; everything later depends on it.

  • Write the core as a single model-neutral statement
  • List the scaffolding pieces separately
  • Keep them in separate sections so you can swap scaffolding cleanly
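One way to keep that boundary explicit is to store the core and the scaffolding as separate fields and join them only when rendering. The sketch below is illustrative: the `PromptSpec` class, the model labels, and the scaffolding wording are all hypothetical, not taken from any library.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """A prompt split into a model-neutral core and swappable scaffolding."""

    core: str  # the task itself; identical for every model
    scaffolding: list[str] = field(default_factory=list)  # per-model additions

    def render(self) -> str:
        # Core first, then the scaffolding as its own clearly separated section.
        parts = [self.core]
        if self.scaffolding:
            parts.append("\n".join(f"- {item}" for item in self.scaffolding))
        return "\n\n".join(parts)


CORE = "Extract the person's name and email from the text below."

# Hypothetical variants; the keys are placeholder labels, not real model names.
variants = {
    "verbose-chat-model": PromptSpec(CORE, ["Return only the two fields, no prose."]),
    "terse-model": PromptSpec(CORE),
}

for model, spec in variants.items():
    print(f"--- {model} ---\n{spec.render()}\n")
```

Because the core is a single shared constant, swapping scaffolding for one model cannot accidentally change the task for another.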

Step Two: List Your Target Models

Enumerate the Models

Write down every model you intend to support. For each, note its family: generative chat, reasoning-optimized, or specialized. The family tells you what kind of scaffolding adjustments to expect before you even run anything.

Note Each Model's Profile

For each target, jot what you know or can find about its verbosity, format defaults, and reasoning behavior. The model card is your fastest source. This profile becomes your prediction of how the prompt will behave, which you will confirm in later steps.
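The profile can be as simple as one small record per model. This is a minimal sketch: the field names, the `model-a` and `model-b` labels, and the wording of each expected adjustment are assumptions for illustration, and the chat-family expectation in particular is a rule of thumb to confirm by testing.

```python
from dataclasses import dataclass
from enum import Enum


class Family(Enum):
    CHAT = "generative chat"
    REASONING = "reasoning-optimized"
    SPECIALIZED = "specialized"


@dataclass(frozen=True)
class ModelProfile:
    name: str            # placeholder label, not a real model identifier
    family: Family
    verbosity: str       # e.g. "high", "low", "unknown"
    format_default: str  # e.g. "prose", "clipped", "unknown"


# What the family alone suggests before anything is run.
# These are predictions to confirm in later steps, not guarantees.
EXPECTED_ADJUSTMENT = {
    Family.CHAT: "expect to add format and length scaffolding",
    Family.REASONING: "expect to remove step-by-step cues",
    Family.SPECIALIZED: "expect to reshape the input rather than add prose",
}


def predict(profile: ModelProfile) -> str:
    return f"{profile.name}: {EXPECTED_ADJUSTMENT[profile.family]}"


targets = [
    ModelProfile("model-a", Family.CHAT, "high", "prose"),
    ModelProfile("model-b", Family.REASONING, "low", "clipped"),
]

for target in targets:
    print(predict(target))
```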

Step Three: Build a Frozen Test Set

Choose Representative Inputs

Pick a handful of inputs that span your real usage: an easy one, a hard one, an edge case, a malformed one. For each, write down what a correct output must contain. This set is how you will compare models fairly.

Freeze It

Do not change the test set between models. The whole point is an apples-to-apples comparison, which only works if every model faces identical inputs. This frozen set is also the backbone of ongoing robustness work, detailed in Building a Repeatable Workflow for Prompt Sensitivity and Robustness Testing.

  • Five to ten inputs is enough to start
  • Pair each input with explicit pass criteria
  • Save it as a file you reuse, not throwaway notes
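A sketch of what "frozen" can mean in practice: save the set as JSON and keep a fingerprint, so any later edit to the file is detected before a comparison run. The file name, the case fields, and the fingerprint approach are assumptions for illustration; the article itself does not prescribe a storage format.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Two illustrative cases; a real set would hold five to ten.
TEST_SET = [
    {"id": "clean", "input": "Contact Jane Doe at jane@example.com.",
     "expected": {"name": "Jane Doe", "email": "jane@example.com"}},
    {"id": "no-email", "input": "Jane Doe called this morning.",
     "expected": {"name": "Jane Doe", "email": None}},
]


def fingerprint(cases: list) -> str:
    """Hash a canonical JSON form so any change to the set changes the hash."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save(cases: list, path: Path) -> str:
    path.write_text(json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8")
    return fingerprint(cases)


def load(path: Path, expected_fingerprint: str) -> list:
    cases = json.loads(path.read_text(encoding="utf-8"))
    if fingerprint(cases) != expected_fingerprint:
        raise ValueError("test set changed since it was frozen")
    return cases


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    frozen_path = Path(tmp) / "frozen_test_set.json"
    frozen_fp = save(TEST_SET, frozen_path)
    loaded = load(frozen_path, frozen_fp)

print(f"{len(loaded)} cases loaded, fingerprint {frozen_fp[:12]}")
```

Record the fingerprint next to each model's results; two runs are comparable only if their fingerprints match.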

Step Four: Run the Core on Each Model

Start Minimal

Send just the core, with minimal scaffolding, to each target model and record the outputs. This baseline shows you each model's natural behavior on your task before you start adjusting. Often a model handles the core better or worse than the profile you wrote in step two predicted.

Diagnose the Gaps

Compare each model's baseline output against your pass criteria. Where it falls short, name the gap precisely: wrong format, too verbose, missed a field, over-reasoned. The specific gap tells you the specific scaffolding to add.
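The baseline run and the gap naming can be sketched as a small loop. Everything here is hypothetical: `fake_verbose` and `fake_terse` stand in for real API calls, and the pass criteria (required substrings plus a word limit) are one simple choice among many.

```python
from typing import Callable


def diagnose(output: str, must_contain: list[str], max_words: int) -> list[str]:
    """Name each gap between an output and its pass criteria."""
    gaps = [f"missing: {item}" for item in must_contain if item not in output]
    if len(output.split()) > max_words:
        gaps.append("too verbose")
    return gaps


def run_baseline(models: dict[str, Callable[[str], str]], core: str, cases: list[dict]) -> dict:
    """Send the bare core plus each input to each model and record the gaps."""
    report = {}
    for name, call in models.items():
        for case in cases:
            output = call(f"{core}\n\n{case['input']}")
            report[(name, case["id"])] = diagnose(
                output, case["must_contain"], case["max_words"]
            )
    return report


# Stand-ins for real model calls (assumptions for the demo only).
def fake_verbose(prompt: str) -> str:
    return ("Sure! I would be happy to help. The name is Jane Doe and the "
            "email is jane@example.com. Let me know if you need more.")


def fake_terse(prompt: str) -> str:
    return "name: Jane Doe"


CASES = [
    {"id": "clean", "input": "Contact Jane Doe at jane@example.com.",
     "must_contain": ["Jane Doe", "jane@example.com"], "max_words": 12},
]

report = run_baseline(
    {"verbose": fake_verbose, "terse": fake_terse},
    "Extract the person's name and email.",
    CASES,
)
for key, gaps in report.items():
    print(key, gaps or "pass")
```

Each entry in the report is a named gap, which is what the next step consumes.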

Step Five: Add Scaffolding Per Model

Address Each Gap Deliberately

For each gap, add the minimal scaffolding that closes it. A format gap gets an explicit format instruction. A verbosity gap gets a length limit. An over-reasoning gap on a reasoning model is the exception: you close it by removing a step-by-step cue, not by adding one.

  • Add one scaffolding change at a time and re-test
  • Prefer the smallest fix that works
  • Resist copying scaffolding between models without checking that it helps

Respect Architecture-Specific Rules

Remember that reasoning models often need less instruction, not more. Specialized models may need their input reshaped rather than instructed in prose. Apply the rule that fits the family rather than one universal recipe.
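The gap-to-fix mapping can be written down so that each call makes exactly one change. This is a sketch under assumptions: the gap labels, the instruction wording, and the 30-word limit are illustrative placeholders rather than recommended values.

```python
def apply_fix(scaffolding: list[str], gap: str) -> list[str]:
    """Return a new scaffolding list with one change that targets one gap."""
    updated = list(scaffolding)  # copy so the previous variant stays intact
    if gap == "wrong format":
        updated.append("Return the result as JSON with keys name and email.")
    elif gap == "too verbose":
        updated.append("Answer in at most 30 words.")  # hypothetical limit
    elif gap == "over-reasoned":
        # For reasoning models the fix is subtraction: drop step-by-step cues.
        updated = [s for s in updated if "step by step" not in s.lower()]
    else:
        raise ValueError(f"no known fix for gap: {gap}")
    return updated


chat_scaffolding = apply_fix([], "wrong format")
reasoning_scaffolding = apply_fix(["Think step by step before answering."], "over-reasoned")

print(chat_scaffolding)
print(reasoning_scaffolding)
```

Returning a new list rather than mutating the old one keeps every earlier variant available for comparison when you re-test.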

Step Six: Re-Test and Record

Run the Frozen Set Again

After adjusting scaffolding for a model, run the full frozen test set against it and record the results. Confirm every case now passes. If a fix broke a previously passing case, you have a regression to resolve before moving on.

Keep a Per-Model Record

Store the final prompt variant for each model alongside its test results. This record is your proof that the prompt works across architectures and your starting point next time a model changes. The brittleness this guards against is covered in Stress-Testing Prompts Before They Reach a Client.
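A regression check is a comparison of pass/fail results before and after a change, and the per-model record is just those results stored with the prompt that produced them. The record fields below are assumptions; use whatever your team already tracks.

```python
import json


def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Case ids that passed before the change and fail after it."""
    return sorted(
        case_id for case_id, passed in before.items()
        if passed and not after.get(case_id, False)
    )


def build_record(model: str, prompt: str, results: dict[str, bool], run_on: str) -> dict:
    """One storable record per model: the prompt variant plus its evidence."""
    return {
        "model": model,
        "prompt": prompt,
        "run_on": run_on,
        "results": results,
        "all_passed": all(results.values()),
    }


before = {"clean": True, "no-email": False, "two-emails": True}
after = {"clean": True, "no-email": True, "two-emails": False}

broken = regressions(before, after)
record = build_record("model-a", "Extract name and email.", after, "2020-01-03")

print("regressions:", broken)
print(json.dumps(record, indent=2))
```

In this made-up run the fix for the no-email case broke the two-emails case, so the variant is not ready to record as final.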

Step Seven: Maintain Over Time

Re-Run on Change

Whenever a model updates or you add a new target, re-run the frozen set. Models drift, and a variant that passed last month can fail today. Treating the test set as a living check rather than a one-time gate keeps the whole thing trustworthy.

Grow the Test Set From Failures

Every time a model surprises you in real use, add that input to the frozen set. The set gets smarter over time, encoding exactly how your prompts break across architectures, which is the most useful documentation you can keep.
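Both maintenance habits reduce to small checks. The sketch below assumes you record some version string per model when you test it; how you obtain that string depends on your provider, and the version labels here are invented.

```python
def needs_rerun(recorded_version: str, current_version: str, is_new_target: bool = False) -> bool:
    """Re-run the frozen set when the model changed or the target is new."""
    return is_new_target or recorded_version != current_version


def add_failure(cases: list[dict], new_case: dict) -> list[dict]:
    """Return a new test set that includes an input that failed in real use."""
    if any(case["id"] == new_case["id"] for case in cases):
        raise ValueError(f"case id already present: {new_case['id']}")
    return cases + [new_case]


cases = [{"id": "clean", "input": "Contact Jane Doe at jane@example.com."}]
surprise = {"id": "signature-block", "input": "Regards,\nJane Doe\njane@example.com"}

grown = add_failure(cases, surprise)

print(needs_rerun("v1", "v2"))         # model updated
print(needs_rerun("v1", "v1"))         # nothing changed
print(len(cases), "->", len(grown), "cases")
```

Growing the set produces a new frozen version, so re-run every model against it before comparing results again.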

A Worked Mini-Example

The Setup

Suppose your task is to extract a person's name and email from a block of text and return them as two named fields. The core is simple: extract name and email as structured data. You want this to run on a verbose chat model and a terse one.

Walking the Steps

Step one isolates that core. Step two lists the two models and notes the first defaults to long, friendly output and the second to clipped output. Step three builds five inputs: a clean one, one with no email, one with two emails, one with a misspelled label, and one that is mostly noise, each paired with the correct expected result.

  • The no-email case checks how each model handles a missing field
  • The two-email case checks which one it picks and whether that is acceptable
  • The noise case checks that the model does not invent data

Diagnosing and Fixing

Step four runs the bare core. The verbose model wraps the two fields in a paragraph; the terse model omits the email field entirely on the no-email case. Step five adds one explicit contract: "Return name and email as structured fields; use null if a field is absent." The verbose model drops its paragraph; the terse model now returns null instead of omitting the field. Step six re-runs all five inputs and confirms both models pass. The whole exercise takes under an hour and demonstrates the loop end to end, mirroring the patterns in Concrete Scenarios Where Model Architecture Changed the Prompt.
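The loop in this example can be simulated end to end without calling any real model. In the sketch below the two models are hand-written stand-ins that mimic the behaviors described (a paragraph wrapper, an omitted field), the contract wording is illustrative, and only two of the five inputs are included to keep it short.

```python
import json
import re

CORE = "Extract the person's name and email from the text."
CONTRACT = "Return JSON with keys name and email; use null if a field is absent."

CASES = [
    {"id": "clean", "text": "Jane Doe <jane@example.com>",
     "expected": {"name": "Jane Doe", "email": "jane@example.com"}},
    {"id": "no-email", "text": "Jane Doe",
     "expected": {"name": "Jane Doe", "email": None}},
]


def _extract(text):
    """Toy extraction shared by both stand-in models."""
    match = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
    email = match.group(0) if match else None
    name = re.sub(r"<.*?>", "", text).strip()
    return name, email


def fake_verbose(instructions, text):
    name, email = _extract(text)
    payload = json.dumps({"name": name, "email": email})
    if CONTRACT in instructions:
        return payload
    return f"Happy to help! Here is what I found: {payload}. Anything else?"


def fake_terse(instructions, text):
    name, email = _extract(text)
    fields = {"name": name}
    if email is not None or CONTRACT in instructions:
        fields["email"] = email  # without the contract, a missing email is omitted
    return json.dumps(fields)


def passes(output, expected):
    try:
        return json.loads(output) == expected
    except ValueError:  # output was not valid JSON
        return False


def run(model, instructions):
    return {c["id"]: passes(model(instructions, c["text"]), c["expected"]) for c in CASES}


MODELS = {"verbose": fake_verbose, "terse": fake_terse}
baseline = {name: run(model, CORE) for name, model in MODELS.items()}
adjusted = {name: run(model, f"{CORE}\n{CONTRACT}") for name, model in MODELS.items()}

print("baseline:", baseline)
print("adjusted:", adjusted)
```

With real models the stand-ins would be replaced by API calls, and the same `run` comparison would supply the before and after evidence.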

Frequently Asked Questions

What is the first thing to do when adapting a prompt to a new model?

Separate the task core from the scaffolding. The core is the model-neutral instruction that defines the task; the scaffolding is the format reminders, length limits, and reasoning cues around it. You keep the core constant and adjust only the scaffolding per model.

Why start with minimal scaffolding on each model?

To see each model's natural behavior on your task before you intervene. The minimal baseline reveals the actual gaps, which tells you precisely what scaffolding to add. Starting with heavy scaffolding hides what the model would have done on its own.

How big should the frozen test set be?

Five to ten inputs is a reasonable start, spanning easy, hard, edge, and malformed cases, each paired with explicit pass criteria. The set must stay frozen across models so comparisons are fair, and it should grow as real-world failures reveal new cases worth covering.

Why might I remove instructions for a reasoning model?

Because reasoning-optimized models already think through problems internally. An explicit step-by-step cue can be redundant or even degrade the answer. For those models the adjustment is often to subtract scaffolding and state the problem cleanly rather than to add more.

How do I know an adjustment did not break something else?

Re-run the full frozen test set after every change, not just the case you were fixing. If a previously passing case now fails, you have introduced a regression and must resolve it before moving on. Full re-runs are what catch these side effects.

How often should I revisit a prompt that already works across models?

Whenever a model updates, whenever you add a new target model, and on a recurring schedule even when nothing changes, because models drift on their own. Re-running the frozen set on these triggers keeps your cross-model prompt trustworthy over time.

Key Takeaways

  • Begin by separating the model-neutral task core from the scaffolding you will adjust per model.
  • Build a frozen test set of representative inputs with explicit pass criteria and reuse it for every model.
  • Run the core with minimal scaffolding first to diagnose each model's real gaps before adjusting.
  • Close gaps with the smallest scaffolding change, remembering reasoning models often need less, not more.
  • Re-test the full set after every change, keep per-model records, and re-run whenever a model drifts.
  • A simple name-and-email extraction shows the full loop running in under an hour across two models.
