AGENCYSCRIPT
CoursesEnterpriseBlog
đź‘‘FoundersSign inJoin Waitlist
AGENCYSCRIPT

Governed Certification Framework

The operating system for AI-enabled agency building. Certify judgment under constraint. Standards over scale. Governance over shortcuts.

Stay informed

Governance updates, certification insights, and industry standards.

Products

  • Platform
  • Certification
  • Launch Program
  • Vault
  • The Book

Certification

  • Foundation (AS-F)
  • Operator (AS-O)
  • Architect (AS-A)
  • Principal (AS-P)

Resources

  • Blog
  • Verify Credential
  • Enterprise
  • Partners
  • Pricing

Company

  • About
  • Contact
  • Careers
  • Press
© 2026 Agency Script, Inc.·
Privacy PolicyTerms of ServiceCertification AgreementSecurity

Standards over scale. Judgment over volume. Governance over shortcuts.

On This Page

What You Need Before You StartA Real Prompt and a Clear Definition of FailureA Way to Run the Prompt RepeatedlyAn Adversarial MindsetYour First Adversarial SessionStart With the Obvious AttacksPush on Your Specific BoundariesRecord EverythingTurning One Finding Into a ResultReproduce Before You FixFix, Then Re-TestSave the AttackBuilding the HabitTest on Every Prompt ChangeGrow the Suite From RealityKnow When to Level UpA Concrete First-Hour WalkthroughPick the Highest-Exposure PromptSpend Twenty Minutes Attacking, Not PlanningTriage What You FoundAvoiding Early MistakesDo Not Confuse a Weird Output With a FailureDo Not Skip the Re-TestDo Not Try to Be Comprehensive on Day OneFrequently Asked QuestionsDo I need security expertise to start?What should my very first attack be?How do I know if an odd output counts as a failure?How many attacks make a useful first session?What do I do once I find a failure?How do I keep this from being a one-time exercise?Key Takeaways
Home/Blog/Break Your Own Prompt Before a User Does
General

Break Your Own Prompt Before a User Does

A

Agency Script Editorial

Editorial Team

·October 6, 2019·8 min read
adversarial prompt stress testingadversarial prompt stress testing getting startedadversarial prompt stress testing guideprompt engineering

The hardest part of adversarial prompt testing is not the technique. It is getting started without convincing yourself you need a research lab first. You do not. You need one production prompt, an hour, and a willingness to think like someone trying to break it. The first time you watch your own carefully written prompt produce something embarrassing under a simple attack, the value becomes obvious and the program builds itself.

The goal of a first session is not comprehensive coverage. It is a single, real, reproducible failure — proof that the prompt is more fragile than it looks and that testing finds problems before customers do. Everything else grows from that first caught failure.

This piece gives you the fastest credible path from zero to that result: what to have ready, what to actually do in your first session, and how to turn one finding into a habit.

What You Need Before You Start

A Real Prompt and a Clear Definition of Failure

Pick a prompt that does something that matters — answers customers, summarizes documents, makes a decision. Then write down what "failure" means for it. Without a definition, you will produce odd outputs and not know whether they count. Failure might mean leaking instructions, going off-topic, fabricating facts, or breaking format.

A Way to Run the Prompt Repeatedly

You need to send many inputs through the same prompt and capture the outputs. A simple script or even a spreadsheet of inputs and pasted outputs works for a first session. Do not over-tool this; the right metrics and instrumentation can come once you know the work is worth investing in.

An Adversarial Mindset

The prerequisite that matters most is intent. You are not testing whether the prompt works on cooperative input. You are testing whether it survives a user who is careless, confused, or hostile. Adopt that posture before you write a single attack.

Your First Adversarial Session

Start With the Obvious Attacks

Begin with the classic moves: ask the model to ignore its instructions, ask it to reveal its system prompt, feed it contradictory commands, and send input far outside its intended scope. These are crude, but they catch a surprising number of real failures and build your confidence that the method works.

Push on Your Specific Boundaries

Next, attack the rules unique to your prompt. If it must never quote a price, try to extract a price. If it must stay on one topic, drag it elsewhere. If it must follow a format, send input designed to break the format. Your most valuable attacks target your own constraints.

Record Everything

For each attempt, log the input, the output, and whether it was a failure by your definition. This record is the seed of a real suite and the evidence you will use to make the business case for continuing.

Turning One Finding Into a Result

Reproduce Before You Fix

When you find a failure, run the same input several times. Language models are stochastic, so confirm the failure reproduces rather than chasing a one-off. A failure that appears half the time is still a failure worth fixing.

Fix, Then Re-Test

Adjust the prompt to close the hole, then re-run the exact attack to confirm the fix holds. Crucially, re-run your earlier passing attacks too — fixes often introduce regressions elsewhere. This re-test loop is the kernel of a real program.

Save the Attack

Every confirmed failure becomes a permanent test. Keep it so that future prompt changes get checked against it automatically. A growing file of saved attacks is how a one-time session becomes an ongoing team practice.

Building the Habit

Test on Every Prompt Change

The single most valuable habit is re-running your saved attacks whenever you change the prompt. This is cheap and catches the regressions that cause most real-world surprises.

Grow the Suite From Reality

Whenever a real user surfaces a problem your suite missed, add it as a new attack. Over time your suite comes to reflect your actual exposure rather than generic textbook attacks.

Know When to Level Up

Once the basics feel routine, the advanced techniques — generated attacks, multi-turn pressure, system-level testing — give you depth. But none of that matters until you have caught and fixed your first failure by hand.

A Concrete First-Hour Walkthrough

Pick the Highest-Exposure Prompt

Do not start with the prompt that is easiest to test; start with the one that would cause the most damage if it failed. The customer-facing answer generator, the prompt that touches money or policy, the one that summarizes documents people act on. Targeting your highest-exposure prompt means even a short session produces a finding that matters.

Spend Twenty Minutes Attacking, Not Planning

The most common way a first session fails is over-planning. Resist the urge to design a perfect suite. Open the prompt, spend twenty focused minutes throwing crude attacks and your own boundary violations at it, and capture whatever breaks. Momentum beats methodology in the first hour.

Triage What You Found

At the end of the session you will likely have several odd outputs. Sort them by your severity definition — which would actually hurt a customer or the business, and which are merely cosmetic. The high-severity ones are your first fixes; the cosmetic ones go on a backlog. This triage habit is what keeps a growing program focused on what counts.

Avoiding Early Mistakes

Do Not Confuse a Weird Output With a Failure

Not every strange response is a failure. Judge each one against your written definition of failure, not against your gut reaction. A surprising but acceptable answer is not a defect, and chasing it wastes the session.

Do Not Skip the Re-Test

The most tempting shortcut is to fix a prompt and assume the fix worked. Always re-run the exact attack and your earlier passing attacks. Fixes that close one hole and open another are extremely common, and skipping the re-test is how they reach production.

Do Not Try to Be Comprehensive on Day One

A first session that aims for full coverage produces nothing. Aim for one real failure, fix it, and save the attack. Coverage is a destination you reach over many sessions, not a starting requirement.

Frequently Asked Questions

Do I need security expertise to start?

No. A first session needs a real prompt, a clear definition of failure, and the willingness to attack your own work. Security depth helps later, but the highest-value early failures come from simple, obvious attacks anyone can run.

What should my very first attack be?

Try to make the model ignore its instructions and reveal its system prompt. It is crude, but it catches a surprising number of real weaknesses and confirms the method works on your prompt.

How do I know if an odd output counts as a failure?

Define failure before you start. Decide what unacceptable looks like — leaked instructions, off-topic answers, fabricated facts, broken format — and judge each output against that definition rather than your gut.

How many attacks make a useful first session?

Enough to find one real, reproducible failure. That is the entire goal of session one. Comprehensive coverage comes later; proof of fragility comes first.

What do I do once I find a failure?

Reproduce it across multiple runs, fix the prompt, re-test the exact attack, and re-run your earlier passing attacks to catch regressions. Then save the attack as a permanent test.

How do I keep this from being a one-time exercise?

Re-run your saved attacks on every prompt change and add a new attack each time a real user surfaces a problem you missed. That single habit turns a session into a program.

Key Takeaways

  • You need a real prompt, a written definition of failure, and an adversarial mindset — not a lab.
  • The goal of session one is a single real, reproducible failure, not full coverage.
  • Start with crude attacks, then target the constraints unique to your own prompt.
  • Reproduce every failure across multiple runs before fixing, since models are stochastic.
  • After fixing, re-run earlier passing attacks to catch regressions you introduced.
  • Save every confirmed failure as a permanent test and re-run it on every prompt change.

Search Articles

Categories

OperationsSalesDeliveryGovernance

Popular Tags

prompt engineeringai fundamentalsai toolsthe difference between AIMLagency operationsagency growthenterprise sales

Share Article

A

Agency Script Editorial

Editorial Team

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.

Related Articles

General

Prompt Quality Decides Whether AI Earns Its Keep

Prompt quality is the single biggest variable in whether AI delivers real work or expensive noise. The model matters, the platform matters — but the prompt you write determines whether you get a first

A
Agency Script Editorial
June 1, 2026·10 min read
General

Counting the Real Cost of Every Token You Send

Tokens and context windows sit at the intersection of AI capability and operational cost—yet most business cases treat them as technical footnotes. That's a mistake that costs real money. Every time y

A
Agency Script Editorial
June 1, 2026·10 min read
General

Rolling Out AI Hallucinations Across a Team

Most teams discover AI hallucinations the hard way — a confident-sounding wrong answer makes it into a client deliverable, a legal brief, or a published report. The damage isn't just to the output; it

A
Agency Script Editorial
June 1, 2026·11 min read

Ready to certify your AI capability?

Join the professionals building governed, repeatable AI delivery systems.

Explore Certification