Stop Forgetting Your System Prompt Exists

Most teams write a system prompt once, paste it into a config file, and forget it exists until a user does something embarrassing. A playbook fixes that. It treats the system prompt as an operational asset with named plays, clear triggers, and an owner for each, so the prompt evolves on purpose instead of by panic.

This is not a beginner explainer. If you need the basics, read What Is a System Prompt: A Beginner's Guide first. This piece is for the person who already ships an assistant and needs a repeatable way to run, change, and defend its instruction layer.

The structure below is organized by play. Each play has a trigger that tells you when to run it, an owner who is accountable, and a sequence of steps. Borrow the ones you need and ignore the rest.

Play 1: The Initial Build

Trigger: You are standing up a new assistant or replacing a placeholder prompt.

Owner: The product owner for the feature, with engineering support.

Start with role and scope, not rules. Write one sentence that names who the assistant is and what it is for. Everything else hangs off that. Then add the four blocks that nearly every strong prompt needs:

Role and scope: who the assistant is and what it will not do.
Behavior rules: tone, format, refusal conditions, escalation paths.
Domain context: the facts and policies specific to your product.
Output format: structure, length, and any required fields.

Resist the urge to anticipate every edge case on day one. You will discover the real ones in production. Ship a tight first version and let the later plays handle the rest.

Play 2: The Behavior Change Request

Trigger: Someone, often sales or support, asks the assistant to do something new or stop doing something.

Owner: A single prompt maintainer, never a committee editing live.

This is where prompts rot. A request comes in, someone edits the prompt in a hurry, and three weeks later nobody knows why a rule exists. Run the change through a sequence instead:

Write the request as a behavior statement: "When a user asks X, the assistant should do Y."
Find the existing rule it conflicts with, if any, and decide which wins.
Make the smallest edit that produces the behavior.
Run the test set before merging.

The conflict check in step two is the one people skip, and it is the one that causes regressions. The common mistakes article catalogs what happens when you do not.

Play 3: The Incident Response

Trigger: The assistant said or did something it should not have, in production, in front of a user.

Owner: On-call engineer, escalating to the prompt maintainer.

Speed matters here, but so does not overcorrecting. A single bad output often triggers a heavy-handed rule that breaks ten good behaviors.

The sequence

Capture the exact input and output. Do not paraphrase.
Reproduce it in a test harness before touching the prompt.
Decide whether this is a prompt problem or a code problem. Anything safety-critical should move to code.
Add the failing case to your permanent test set so it can never silently return.

The last step is what turns an incident into a durable fix instead of a recurring fire.

Play 4: The Periodic Audit

Trigger: A calendar interval, monthly or quarterly, plus any major model upgrade.

Owner: The prompt maintainer, with a reviewer from outside the team.

Prompts accumulate cruft. Old rules outlive the problems they solved, examples reference deprecated features, and length creeps up call by call. An audit is scheduled cleanup.

Walk the prompt top to bottom and ask of every line: is this still true, is it still needed, and is it stated once. Cut what fails. Then re-run the full test set, because cleanup can change behavior in ways that feel safe but are not.

Play 5: The Model Migration

Trigger: You are moving to a new model or provider.

Owner: Engineering, with the prompt maintainer reviewing outputs.

A system prompt is not portable. Different models weight instructions differently, handle formatting differently, and refuse differently. Assume the prompt will misbehave on the new model until proven otherwise.

Run your entire test set on the new model before you switch a single user. Pay special attention to refusals and formatting, the two areas where models diverge most. Expect to rewrite parts of the prompt, not just paste it across. Treat this as a real migration, not a config flip.

Sequencing the Plays

Plays are not independent. They feed each other, and the order matters.

How they connect

The Initial Build creates the prompt and the first test set.
Behavior Change and Incident Response both grow the test set as they run.
The Periodic Audit prunes what those two added.
Model Migration stress-tests everything the others produced.

The connective tissue across all of them is the test set. It is the institutional memory of your prompt. Every play either reads from it or writes to it. If you take one thing from this playbook, build and guard that test set. For the discipline behind it, see Best Practices That Actually Work.

Roles and Ownership

A playbook without owners is a wish list. Assign these clearly.

Prompt maintainer: one person who owns the canonical prompt and approves changes. Not a group.
Product owner: decides what the assistant should do, sets the behavior priorities.
On-call engineer: handles incidents and reproduces failures.
Outside reviewer: a fresh set of eyes for audits, to catch the rules the maintainer has gone blind to.

Diffuse ownership is the single biggest reason system prompts decay. When everyone can edit and no one is accountable, the prompt becomes a junk drawer. One maintainer, with a real review path, keeps it coherent.

Frequently Asked Questions

How often should I actually run the audit play?

Monthly for high-traffic assistants, quarterly for stable internal tools, and always after a model upgrade. The trigger is not just the calendar. If you notice the prompt has grown by a third since the last audit, or if behavior changes are getting risky, run it early. The cost of a stale prompt is silent, so do not wait for an incident to force it.

Who should own the prompt if we do not have a dedicated AI team?

Pick the person closest to the product who can also read code, usually a senior engineer or a technical product manager. The role does not require an AI specialist. It requires someone accountable who understands both the product behavior and the test set. The worst outcome is shared ownership across a whole team with no single approver.

What goes in code versus the prompt during incident response?

Anything that must never happen goes in code. Refunds above a threshold, data deletion, disclosure of regulated information: enforce these in application logic, not prompt text. The prompt handles tone, routine routing, and soft guidance. During an incident, the first question is always whether a prompt rule was the right fix or a band-aid over a missing code guardrail.

Can I run these plays without a formal test set?

You can, but you will reintroduce old bugs constantly. The test set is what makes the plays repeatable instead of heroic. Even a flat file of twenty input-output pairs that you run by hand beats nothing. Start small, add a case every time something breaks, and the set will become your most valuable prompt asset within a month.

Does every behavior change really need the full sequence?

For anything user-facing, yes. The conflict check and the test run are where regressions get caught, and a "tiny" edit is exactly the kind that quietly breaks three other behaviors. For an experimental internal tool, you can move faster. The rule of thumb: the more users see it, the more discipline the change deserves.

Key Takeaways

Treat the system prompt as an operational asset with named plays, triggers, and a single owner.
The Initial Build ships tight; later plays handle the edge cases production reveals.
Behavior changes need a conflict check and a test run, every time, to prevent silent regressions.
Incident response should add the failing case to a permanent test set, not just patch and move on.
Audits prune accumulated cruft; model migrations require full re-testing, never a config flip.
The test set is the connective tissue across all plays and the memory of your prompt.

The structure below is organized by play. Each play has a trigger that tells you when to run it, an owner who is accountable, and a sequence of steps. Borrow the ones you need and ignore the rest.

Play 1: The Initial Build

Trigger: You are standing up a new assistant or replacing a placeholder prompt.

Owner: The product owner for the feature, with engineering support.

Role and scope: who the assistant is and what it will not do.
Behavior rules: tone, format, refusal conditions, escalation paths.
Domain context: the facts and policies specific to your product.
Output format: structure, length, and any required fields.

Resist the urge to anticipate every edge case on day one. You will discover the real ones in production. Ship a tight first version and let the later plays handle the rest.

Play 2: The Behavior Change Request

Trigger: Someone, often sales or support, asks the assistant to do something new or stop doing something.

Owner: A single prompt maintainer, never a committee editing live.

This is where prompts rot. A request comes in, someone edits the prompt in a hurry, and three weeks later nobody knows why a rule exists. Run the change through a sequence instead:

Write the request as a behavior statement: "When a user asks X, the assistant should do Y."
Find the existing rule it conflicts with, if any, and decide which wins.
Make the smallest edit that produces the behavior.
Run the test set before merging.

The conflict check in step two is the one people skip, and it is the one that causes regressions. The common mistakes article catalogs what happens when you do not.

Play 3: The Incident Response

Trigger: The assistant said or did something it should not have, in production, in front of a user.

Owner: On-call engineer, escalating to the prompt maintainer.

Speed matters here, but so does not overcorrecting. A single bad output often triggers a heavy-handed rule that breaks ten good behaviors.

The sequence

Capture the exact input and output. Do not paraphrase.
Reproduce it in a test harness before touching the prompt.
Decide whether this is a prompt problem or a code problem. Anything safety-critical should move to code.
Add the failing case to your permanent test set so it can never silently return.

The last step is what turns an incident into a durable fix instead of a recurring fire.

Play 4: The Periodic Audit

Trigger: A calendar interval, monthly or quarterly, plus any major model upgrade.

Owner: The prompt maintainer, with a reviewer from outside the team.

Prompts accumulate cruft. Old rules outlive the problems they solved, examples reference deprecated features, and length creeps up call by call. An audit is scheduled cleanup.

Play 5: The Model Migration

Trigger: You are moving to a new model or provider.

Owner: Engineering, with the prompt maintainer reviewing outputs.

Sequencing the Plays

Plays are not independent. They feed each other, and the order matters.

How they connect

The Initial Build creates the prompt and the first test set.
Behavior Change and Incident Response both grow the test set as they run.
The Periodic Audit prunes what those two added.
Model Migration stress-tests everything the others produced.

Roles and Ownership

A playbook without owners is a wish list. Assign these clearly.

Prompt maintainer: one person who owns the canonical prompt and approves changes. Not a group.
Product owner: decides what the assistant should do, sets the behavior priorities.
On-call engineer: handles incidents and reproduces failures.
Outside reviewer: a fresh set of eyes for audits, to catch the rules the maintainer has gone blind to.

Frequently Asked Questions

How often should I actually run the audit play?

Who should own the prompt if we do not have a dedicated AI team?

What goes in code versus the prompt during incident response?

Can I run these plays without a formal test set?

Does every behavior change really need the full sequence?

Key Takeaways

Treat the system prompt as an operational asset with named plays, triggers, and a single owner.
The Initial Build ships tight; later plays handle the edge cases production reveals.
Behavior changes need a conflict check and a test run, every time, to prevent silent regressions.
Incident response should add the failing case to a permanent test set, not just patch and move on.
Audits prune accumulated cruft; model migrations require full re-testing, never a config flip.
The test set is the connective tissue across all plays and the memory of your prompt.

Stop Forgetting Your System Prompt Exists

Play 1: The Initial Build

Play 2: The Behavior Change Request

Play 3: The Incident Response

The sequence

Play 4: The Periodic Audit

Play 5: The Model Migration

Sequencing the Plays

How they connect

Roles and Ownership

Frequently Asked Questions

How often should I actually run the audit play?

Who should own the prompt if we do not have a dedicated AI team?

What goes in code versus the prompt during incident response?

Can I run these plays without a formal test set?

Does every behavior change really need the full sequence?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Stop Forgetting Your System Prompt Exists

Play 1: The Initial Build

Play 2: The Behavior Change Request

Play 3: The Incident Response

The sequence

Play 4: The Periodic Audit

Play 5: The Model Migration

Sequencing the Plays

How they connect

Roles and Ownership

Frequently Asked Questions

How often should I actually run the audit play?

Who should own the prompt if we do not have a dedicated AI team?

What goes in code versus the prompt during incident response?

Can I run these plays without a formal test set?

Does every behavior change really need the full sequence?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?