Hardening a Prompt Before It Meets Real Traffic

A checklist is only useful if you can actually run it before launch and trust that clearing it means something. This one is built to be a gate: twelve concrete checks, each with a short justification, that a prompt should pass before it meets real users. None are aspirational. Each maps to a failure that regularly reaches production when skipped.

Use it literally. Copy the twelve items, run them against your prompt, and treat any unchecked box as a launch blocker until you have a deliberate reason to waive it. The justifications matter because a checklist you do not understand becomes ritual; a checklist you understand becomes judgment.

The checks are ordered roughly by sequence, from defining the target through to scheduling reruns. You can run the whole thing against a single prompt in an afternoon, and far faster on later passes once your attack inventory exists.

A note on how to hold this checklist. It is tempting to treat any list as a box-ticking ritual, racing to the bottom so you can declare yourself done. Resist that. The justifications under each item exist to keep the check meaningful, because a box you tick without understanding protects nothing. The goal is not a complete-looking list; it is a prompt you would be comfortable putting in front of a stranger who is actively trying to misuse it. Read each justification, decide whether it applies to your stakes, and only then mark the box.

Before You Test: Define and Scope

Check 1: Boundaries Are Written Down

Confirm you have a written statement of what the prompt must do and must never do, in specific terms. You cannot test a boundary you have not named, and "be helpful and safe" is not testable. This definition is the standard for every later check.

Check 2: Stakes Are Classified

Confirm you have noted what a failure would cost. A prompt that can move money or expose data warrants far more scrutiny than one suggesting blog titles. Stakes determine how hard you push, as we argue in Habits That Keep a Production Prompt From Caving In.

Core Attack Coverage

Check 3: Instruction Override Attempts Tested

Confirm you have tried to make the model abandon its rules ("ignore your instructions," role reassignment, "the policy is wrong"). Override is the most common attack, so a prompt that has not faced it has not been tested.

Check 4: Scope Probing Tested

Confirm you have pushed reasonable-sounding requests just outside the allowed job. Scope drift rarely looks like an attack; it looks like a slightly-too-broad question the prompt should have declined. The test for this check is whether your inputs include requests a well-meaning user might genuinely make, not just obviously hostile ones. If every scope attack looks like an attack, you have only tested the easy half of the problem and missed the dangerous half that arrives disguised as ordinary curiosity.

Check 5: Indirect Injection Tested

Confirm you have pasted content containing hidden instructions and verified the model treats it as data, not commands. Any prompt that ingests user-supplied documents or URLs is exposed to injection.

Check 6: Malformed Inputs Tested

Confirm you have sent empty, oversized, mixed-language, and nonsensical inputs. Boring malformed inputs cause real outages and happen constantly by accident, as shown in When Real Users Attack: Concrete Prompt-Breaking Scenarios. This check is easy to skip because malformed inputs feel too trivial to bother with, which is exactly why they slip through to production. A blank submission, a pasted spreadsheet, a message in an unexpected language: users generate these constantly without any intent to break anything. A prompt that handles clever attacks but stumbles on an empty box still fails its first ordinary day.

Check 7: Domain-Specific Attacks Tested

Confirm most of your attacks target your specific domain's expensive failures, not just generic ones. Generic attacks find generic problems; your costly failures live in your domain. A quick way to audit this check is to count how many of your attacks would make sense against a completely different product. If most of them would, your inventory is too generic. The attacks that only make sense against your specific prompt, the refund requests, the diagnosis bait, the cross-account probes, are the ones earning their place.

Evaluation and Fixes

Check 8: Outputs Judged Against Boundaries, Not Tone

Confirm each output was labeled pass or fail against the written boundaries, and that confident-sounding answers were verified rather than trusted. Fluency is not correctness.

Check 9: Failures Logged Reproducibly

Confirm every failure was recorded with the verbatim input, output, model, and settings. A failure you cannot reproduce is one you cannot fix or verify. The full procedure lives in Run Hostile Inputs at Your Prompts, One Step at a Time.

Check 10: Fixes Applied One at a Time

Confirm fixes were isolated and the full set rerun between them. Bundled fixes hide which edit helped and which broke a legitimate use case.

Before You Ship: Verify and Schedule

Check 11: Full Inventory Re-Run Clean

Confirm a final rerun of the entire attack inventory shows zero high-severity failures and only acceptable low-severity issues. The rerun, not the first pass, is what proves the prompt is ready.

Check 12: Reruns Scheduled on Real Triggers

Confirm the inventory is saved as a regression suite and scheduled to rerun on any prompt change, model upgrade, or new capability. These triggers are when safe behavior most often regresses, a risk weighed in Manual Red-Teaming or Automated Fuzzing: Choosing Your Approach.

Using the Checklist as a Living Gate

Wire It Into the Workflow

A checklist that lives in someone's memory gets skipped under deadline pressure. Put it where the work happens: as a section in the pull request that ships a prompt change, or as a required step in your release process. The cheapest way to guarantee the checklist gets run is to make running it the path of least resistance rather than an act of discipline.

Let the Inventory Carry the Weight Over Time

After the first pass, most of these checks collapse into "rerun the saved inventory and confirm it is clean." The early checks, defining boundaries and classifying stakes, only need a fresh look when the prompt's purpose genuinely changes. This is why the recurring cost stays low: the expensive thinking happens once, and the checklist mostly verifies that nothing has quietly regressed since.

Frequently Asked Questions

Can I skip checks for a low-stakes prompt?

You can scale down, but do so deliberately rather than by accident. For a genuinely trivial prompt, the malformed-input and override checks alone catch most problems. The point of check 2 is to make the scaling decision conscious and defensible.

What counts as a launch blocker versus a waivable item?

Any high-severity failure, an unfinished boundary definition, or a missing rerun should block launch. Low-severity tone issues can often be waived with a note. The written boundaries and stakes classification tell you which category a given gap falls into.

How long does running the full checklist take?

The first pass on a single prompt typically takes an afternoon. Later passes are much faster because the attack inventory already exists and only needs rerunning. The recurring cost is low, which is what makes scheduled reruns practical.

Do I need tools to run this checklist?

No. Every check can be performed by typing inputs into the same interface users see and reading outputs. Tools help automate the repetition once your inventory grows, but the checklist itself is tool-agnostic and can start manually today.

What if I cannot pass check 11 no matter what I change in the prompt?

That usually means the fix belongs outside the prompt, such as input filtering or access scoping. A persistently failing attack family is a signal to harden the surrounding system rather than to keep rewording, and it is a legitimate reason to escalate beyond the prompt layer.

Key Takeaways

A checklist is only useful as a real launch gate, with unchecked items treated as blockers.
Define written boundaries and classify stakes before running any attacks.
Cover override, scope, injection, malformed, and domain-specific attacks every time.
Judge outputs against boundaries, log failures reproducibly, and fix one change at a time.
A clean rerun of the full inventory, plus scheduled reruns on real triggers, is what proves readiness.

Before You Test: Define and Scope

Check 1: Boundaries Are Written Down

Check 2: Stakes Are Classified

Core Attack Coverage

Check 3: Instruction Override Attempts Tested

Check 4: Scope Probing Tested

Check 5: Indirect Injection Tested

Confirm you have pasted content containing hidden instructions and verified the model treats it as data, not commands. Any prompt that ingests user-supplied documents or URLs is exposed to injection.

Check 6: Malformed Inputs Tested

Check 7: Domain-Specific Attacks Tested

Evaluation and Fixes

Check 8: Outputs Judged Against Boundaries, Not Tone

Confirm each output was labeled pass or fail against the written boundaries, and that confident-sounding answers were verified rather than trusted. Fluency is not correctness.

Check 9: Failures Logged Reproducibly

Check 10: Fixes Applied One at a Time

Confirm fixes were isolated and the full set rerun between them. Bundled fixes hide which edit helped and which broke a legitimate use case.

Before You Ship: Verify and Schedule

Check 11: Full Inventory Re-Run Clean

Confirm a final rerun of the entire attack inventory shows zero high-severity failures and only acceptable low-severity issues. The rerun, not the first pass, is what proves the prompt is ready.

Check 12: Reruns Scheduled on Real Triggers

Using the Checklist as a Living Gate

Wire It Into the Workflow

Let the Inventory Carry the Weight Over Time

Frequently Asked Questions

Can I skip checks for a low-stakes prompt?

What counts as a launch blocker versus a waivable item?

How long does running the full checklist take?

Do I need tools to run this checklist?

What if I cannot pass check 11 no matter what I change in the prompt?

Key Takeaways

A checklist is only useful as a real launch gate, with unchecked items treated as blockers.
Define written boundaries and classify stakes before running any attacks.
Cover override, scope, injection, malformed, and domain-specific attacks every time.
Judge outputs against boundaries, log failures reproducibly, and fix one change at a time.
A clean rerun of the full inventory, plus scheduled reruns on real triggers, is what proves readiness.

Hardening a Prompt Before It Meets Real Traffic

Before You Test: Define and Scope

Check 1: Boundaries Are Written Down

Check 2: Stakes Are Classified

Core Attack Coverage

Check 3: Instruction Override Attempts Tested

Check 4: Scope Probing Tested

Check 5: Indirect Injection Tested

Check 6: Malformed Inputs Tested

Check 7: Domain-Specific Attacks Tested

Evaluation and Fixes

Check 8: Outputs Judged Against Boundaries, Not Tone

Check 9: Failures Logged Reproducibly

Check 10: Fixes Applied One at a Time

Before You Ship: Verify and Schedule

Check 11: Full Inventory Re-Run Clean

Check 12: Reruns Scheduled on Real Triggers

Using the Checklist as a Living Gate

Wire It Into the Workflow

Let the Inventory Carry the Weight Over Time

Frequently Asked Questions

Can I skip checks for a low-stakes prompt?

What counts as a launch blocker versus a waivable item?

How long does running the full checklist take?

Do I need tools to run this checklist?

What if I cannot pass check 11 no matter what I change in the prompt?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Hardening a Prompt Before It Meets Real Traffic

Before You Test: Define and Scope

Check 1: Boundaries Are Written Down

Check 2: Stakes Are Classified

Core Attack Coverage

Check 3: Instruction Override Attempts Tested

Check 4: Scope Probing Tested

Check 5: Indirect Injection Tested

Check 6: Malformed Inputs Tested

Check 7: Domain-Specific Attacks Tested

Evaluation and Fixes

Check 8: Outputs Judged Against Boundaries, Not Tone

Check 9: Failures Logged Reproducibly

Check 10: Fixes Applied One at a Time

Before You Ship: Verify and Schedule

Check 11: Full Inventory Re-Run Clean

Check 12: Reruns Scheduled on Real Triggers

Using the Checklist as a Living Gate

Wire It Into the Workflow

Let the Inventory Carry the Weight Over Time

Frequently Asked Questions

Can I skip checks for a low-stakes prompt?

What counts as a launch blocker versus a waivable item?

How long does running the full checklist take?

Do I need tools to run this checklist?

What if I cannot pass check 11 no matter what I change in the prompt?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?