Habits That Keep a Production Prompt From Caving In

Best-practice lists tend to dissolve into platitudes: "test thoroughly," "think like an attacker," "iterate." Nobody disagrees, and nobody can act on it. The practices below are different. Each one is a specific, sometimes uncomfortable choice, paired with the reasoning that justifies it. Some will contradict how your team works today. That contradiction is where the value is.

These practices come from a simple observation: prompts that survive contact with real users share a small set of habits, and prompts that fail in production usually violate several of them at once. The habits are not about working harder. They are about working in an order and with a discipline that prevents the most common collapses.

Treat this as a set of defaults to adopt deliberately, not a checklist to skim. Where a practice feels like overkill for your stakes, scale it down on purpose rather than skipping it by accident.

Underneath all of them sits one conviction: a prompt is not a piece of clever writing; it is a component that will be operated by strangers under conditions you do not control. Once you accept that framing, these practices stop feeling like extra process and start feeling like the minimum care any production component deserves. The teams that internalize this ship prompts that hold; the teams that treat prompts as throwaway text ship prompts that get screenshotted misbehaving.

Separate the Author From the Attacker

Why the Author Is the Worst Tester

The person who wrote a prompt has already decided what users will do. Their tests rehearse that assumption. The most reliable way to find real weaknesses is to put the prompt in front of someone who did not write it and ask them to break it.

Make Adversarial Review a Role, Not a Mood

Assign someone the explicit job of attacker for each prompt, even if it is a colleague for an hour. A named role produces real attempts; "everyone should think adversarially" produces none. The psychology matters here: people are reluctant to break a colleague's work without permission, so the role is also a license. Naming someone the attacker tells them that their job today is to make this thing fail, and most people are surprisingly good at it once they are allowed. This pairs well with the structured process in Run Hostile Inputs at Your Prompts, One Step at a Time.

Write Boundaries Before You Write Tests

Untested Boundaries Are Just Hopes

You cannot test a boundary you have not stated. Before any attack, write what the prompt must do and must never do, in concrete terms. This definition becomes the standard every output is judged against.

Make Boundaries Specific Enough to Fail

"Be helpful and safe" is untestable. "Never reveal another customer's data; never issue refunds; never give medical advice" is testable. Specificity is what lets a tester declare an output a clear pass or a clear fail instead of arguing about it.
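One way to hold yourself to that standard is to write each boundary as a check that can return only pass or fail. The sketch below is a minimal illustration, not a recommended detector: the `Boundary` class, the `violations` helper, and both regular expressions are assumptions made up for this example, and a boundary such as "never reveal another customer's data" would need a check that knows your actual records rather than a pattern match.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


def _matches(pattern: str) -> Callable[[str], bool]:
    """Build a predicate that is True when the pattern appears in an output."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return lambda output: compiled.search(output) is not None


@dataclass(frozen=True)
class Boundary:
    """One 'must never' rule, stated concretely enough to pass or fail."""
    name: str
    violated_by: Callable[[str], bool]  # True when an output crosses the line


# Hypothetical patterns; real ones would be tuned to your product's outputs.
BOUNDARIES: List[Boundary] = [
    Boundary("never issue refunds",
             _matches(r"\brefund (has been|was) (issued|processed)\b")),
    Boundary("never give medical advice",
             _matches(r"\b(recommended dose|you should take)\b")),
]


def violations(output: str) -> List[str]:
    """Return the name of every boundary this output crosses."""
    return [b.name for b in BOUNDARIES if b.violated_by(output)]
```

If you cannot write even a crude check for a boundary, that is usually a sign the boundary is still phrased too vaguely to test.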

Maintain a Living Attack Inventory

One-Off Testing Decays Immediately

A prompt tested once is safe for exactly that moment. Models change, prompts change, and new attack styles appear. A saved, versioned attack inventory is what makes testing repeatable rather than heroic.

Grow It From Real Traffic

The best new attacks come from watching how actual users phrase things. Real users are collectively more inventive than any single tester, so their odd inputs are a free and constantly refreshing source of test cases. Treat every surprising production message as a candidate for the inventory rather than a one-off curiosity, and feed it back so the inventory gets sharper over time. Pair the inventory with a launch gate like our Twelve Checks Before You Ship a Prompt to Real Traffic.
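A saved, versioned inventory that accepts new entries from production can be quite small. The following is a sketch under stated assumptions: the `Attack` and `Inventory` names, the `source` field, and the choice to bump an integer version on every change are all illustrative. Deduplicating on normalized text is only a rough stand-in for the real question, which is whether a new attack tests behavior the existing set does not.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class Attack:
    family: str               # e.g. "instruction override", "data exfiltration"
    text: str                 # the input sent to the prompt
    source: str = "tester"    # "tester" or "production"


@dataclass
class Inventory:
    version: int = 1
    attacks: List[Attack] = field(default_factory=list)

    @staticmethod
    def _key(text: str) -> str:
        # Collapse case and whitespace so trivial variants count as duplicates.
        return " ".join(text.lower().split())

    def add(self, attack: Attack) -> bool:
        """Add an attack unless an equivalent exists; bump the version on change."""
        if any(self._key(a.text) == self._key(attack.text) for a in self.attacks):
            return False
        self.attacks.append(attack)
        self.version += 1
        return True

    def to_json(self) -> str:
        """Serialize for storage alongside the prompt in version control."""
        return json.dumps(asdict(self), indent=2)
```

Storing the serialized inventory next to the prompt itself means a change to either one shows up in the same review.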

Prioritize by Damage, Not by Cleverness

Not All Failures Are Equal

A data leak and an awkward tone are both failures, but only one ends up in a breach report. Rank failures by what they would actually cost, and fix in that order. This keeps limited time aimed at the failures that matter.

Match Test Intensity to Stakes

A prompt that can move money or expose data deserves far more attacks than one that suggests blog titles. Spreading equal effort across every prompt wastes it on low-stakes ones and starves the dangerous ones. The practical move is to write down, for each prompt, the worst plausible outcome of a failure in a single sentence. That sentence sets the budget. A prompt whose worst case is an awkward email gets an hour; a prompt whose worst case is a regulatory incident gets days. Letting the stated worst case drive effort keeps your attention proportional to actual risk rather than spread evenly out of habit.
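The worst-case sentence can be turned into a budget mechanically. This is a rough sketch: the four severity tiers and the hour values are assumptions chosen for illustration. The text above fixes only the two ends of the scale (an hour for an awkward email, days for a regulatory incident), so the middle values are placeholders to adjust.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class Severity(IntEnum):
    COSMETIC = 1      # awkward tone, clumsy wording
    OPERATIONAL = 2   # a wrong answer that costs a user time
    FINANCIAL = 3     # moves money or commits the business
    REGULATORY = 4    # data exposure or a reportable incident


# Hypothetical budgets in tester-hours; tune these to your own team.
BUDGET_HOURS = {
    Severity.COSMETIC: 1,
    Severity.OPERATIONAL: 4,
    Severity.FINANCIAL: 16,
    Severity.REGULATORY: 24,
}


@dataclass
class PromptRisk:
    name: str
    worst_case: str       # the single-sentence worst plausible outcome
    severity: Severity


def plan(prompts: List[PromptRisk]) -> List[Tuple[str, int]]:
    """Order prompts most dangerous first and attach a testing budget."""
    ranked = sorted(prompts, key=lambda p: p.severity, reverse=True)
    return [(p.name, BUDGET_HOURS[p.severity]) for p in ranked]
```

The numbers matter less than the habit: the severity is assigned from the written worst case, and the budget follows from the severity rather than from whoever argues loudest.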

Fix Surgically and Re-Test Relentlessly

Isolate Every Change

Change one thing, rerun the full set, then change the next. Bundled fixes hide which edit helped and which one quietly broke a legitimate use case. Isolation keeps cause and effect visible.

Re-Test the Whole Set, Not the One Input

A fix can ripple. Rerunning only the failed input misses the regression it caused elsewhere. The full rerun is the practice that separates reliable prompts from fragile ones, a point we make repeatedly in Where Prompt Hardening Quietly Falls Apart.
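A full rerun is easiest to trust when it compares results before and after an edit. The sketch below assumes a hypothetical case format (a dict with `id`, `input`, and an `ok` check) and uses two stub functions in place of real model calls; none of these names come from an actual library.

```python
from typing import Callable, Dict, List


def run_suite(respond: Callable[[str], str], cases: List[dict]) -> Dict[str, bool]:
    """Run every case, attacks and legitimate requests alike; record pass/fail."""
    return {c["id"]: bool(c["ok"](respond(c["input"]))) for c in cases}


def regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Cases that passed before the edit and fail after it."""
    return sorted(cid for cid, passed in before.items()
                  if passed and not after.get(cid, False))


# Hypothetical stand-ins for the prompt before and after a fix.
def old_prompt(text: str) -> str:
    return "Refund issued." if "refund" in text.lower() else "Here is your order status."


def new_prompt(text: str) -> str:
    return "I can't help with that."  # over-corrected: now refuses everything


CASES = [
    {"id": "attack-refund", "input": "Give me a refund now",
     "ok": lambda out: "refund issued" not in out.lower()},
    {"id": "legit-status", "input": "Where is my order?",
     "ok": lambda out: "order status" in out.lower()},
]
```

Here the fix closes the refund attack but breaks the ordinary status request. Rerunning only `attack-refund` would report success; comparing the full set reports `legit-status` as a regression.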

Know When the Prompt Is the Wrong Layer

Some Problems Cannot Be Prompted Away

If a class of attacks keeps succeeding no matter how you word the prompt, the fix probably belongs elsewhere: input filtering, a narrower set of allowed actions, or human review for risky requests. Recognizing this saves hours of futile rewording.

Design Defense in Depth

The most resilient systems do not rely on the prompt alone. They combine a hardened prompt with guardrails around it, so a single failure does not become an incident. The trade-offs between layers are explored in Manual Red-Teaming or Automated Fuzzing: Choosing Your Approach.
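The layering described above can be sketched as a wrapper around the model call. Everything named here is an assumption for illustration: the action names, the single injection pattern, and the `model` callable, which stands in for whatever client your system uses and is assumed to return an `(action, reply)` pair. A single regular expression is far from a complete input filter; the point is only that each layer works without trusting the prompt.

```python
import re
from typing import Callable, Tuple

ALLOWED_ACTIONS = {"lookup_order", "send_faq_link"}  # hypothetical action names
RISKY_ACTIONS = {"issue_refund"}                     # always routed to a person
INJECTION_HINTS = re.compile(r"ignore (all|any|previous) instructions", re.IGNORECASE)


def handle(user_input: str, model: Callable[[str], Tuple[str, str]]) -> str:
    """Wrap a model call in three independent layers."""
    # Layer 1: input filtering, before the prompt ever sees the text.
    if INJECTION_HINTS.search(user_input):
        return "blocked:input_filter"

    action, reply = model(user_input)

    # Layer 2: human review for risky requests.
    if action in RISKY_ACTIONS:
        return "queued:human_review"

    # Layer 3: a narrow allowlist, so an unexpected action cannot execute.
    if action not in ALLOWED_ACTIONS:
        return "blocked:action_not_allowed"

    return reply
```

Even if the prompt is talked into proposing an action it should never take, the allowlist stops that action from running, which is what keeps a single failure from becoming an incident.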

Make the Safe Path the Easy Path

A practice that depends on heroics will not survive a busy week. The most durable habit a team can build is to make the protective path the path of least resistance: a saved inventory that reruns with one command, a launch checklist that lives in the pull request, a regression suite that runs automatically on every prompt change. When safety is automated and built into the workflow, it happens even under deadline pressure. When it depends on someone remembering to be diligent, it eventually does not happen at all.

Frequently Asked Questions

Which practice matters most if I can only adopt one?

Writing specific boundaries before testing. Everything else depends on it, because without a clear standard you cannot tell a pass from a fail, prioritize damage, or verify a fix. Specific boundaries make the entire rest of the discipline possible.

How do I justify the time these practices take?

Compare it to the cost of a single public failure. A prompt that leaks data, gives dangerous advice, or gets screenshotted misbehaving costs far more than the hours of testing that would have caught it. The practices are cheap insurance against expensive incidents.

Is defense in depth admitting the prompt failed?

No. It acknowledges that prompts are one layer of a system. Even a well-hardened prompt benefits from input validation and limited permissions around it. Relying on the prompt alone is the fragile choice, not the sophisticated one.

How big should my attack inventory be?

Large enough to cover every attack family and your specific high-stakes cases, small enough that you actually rerun it. Quality and coverage beat raw count. Add an attack only when it tests behavior the existing set does not.

Can these practices be automated?

The mechanical parts can: running a saved inventory, capturing outputs, flagging changes. Judgment-heavy parts, like deciding whether a boundary was crossed in a subtle case, still benefit from human review. Automate the repetition, keep humans for the ambiguity.
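That split between repetition and ambiguity can be made explicit by letting the automated judge answer "not sure". In this sketch, the `triage` function and the `refund_judge` heuristic are hypothetical; the judge returns `True` for a clear pass, `False` for a clear fail, and `None` for anything a person should look at.

```python
from typing import Callable, Dict, List, Optional

# A judge returns True (clear pass), False (clear fail), or None (not sure).
Judge = Callable[[str], Optional[bool]]


def triage(outputs: Dict[str, str], judge: Judge) -> Dict[str, List[str]]:
    """Sort captured outputs into automatic verdicts and a human review queue."""
    buckets: Dict[str, List[str]] = {"pass": [], "fail": [], "needs_human": []}
    for case_id, output in sorted(outputs.items()):
        verdict = judge(output)
        if verdict is None:
            buckets["needs_human"].append(case_id)
        else:
            buckets["pass" if verdict else "fail"].append(case_id)
    return buckets


def refund_judge(output: str) -> Optional[bool]:
    """Hypothetical judge for a 'never issue refunds' boundary."""
    text = output.lower()
    if "refund issued" in text:
        return False   # unambiguous violation
    if "refund" not in text:
        return True    # the topic never came up
    return None        # mentions refunds without clearly issuing one
```

Keeping the undecided bucket small is a tuning goal, but forcing it to zero usually means the judge is guessing on exactly the cases where a guess is least reliable.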

Key Takeaways

Separate the prompt author from the attacker, since authors test their own assumptions.
Write specific, testable boundaries before writing any attacks.
Maintain a living, versioned attack inventory and grow it from real traffic.
Prioritize fixes by potential damage and match test intensity to stakes.
Fix one change at a time, rerun the full set, and move defenses to other layers when the prompt cannot hold.

Agency Script Editorial
