AGENCYSCRIPT
CoursesEnterpriseBlog
đź‘‘FoundersSign inJoin Waitlist
AGENCYSCRIPT

Governed Certification Framework

The operating system for AI-enabled agency building. Certify judgment under constraint. Standards over scale. Governance over shortcuts.

Stay informed

Governance updates, certification insights, and industry standards.

Products

  • Platform
  • Certification
  • Launch Program
  • Vault
  • The Book

Certification

  • Foundation (AS-F)
  • Operator (AS-O)
  • Architect (AS-A)
  • Principal (AS-P)

Resources

  • Blog
  • Verify Credential
  • Enterprise
  • Partners
  • Pricing

Company

  • About
  • Contact
  • Careers
  • Press
© 2026 Agency Script, Inc.·
Privacy PolicyTerms of ServiceCertification AgreementSecurity

Standards over scale. Judgment over volume. Governance over shortcuts.

On This Page

The SituationHow It Was BuiltThe Latent FlawThe DiscoveryTracing the IncidentThe RealizationThe DecisionThe Options on the TableWhy They Chose Re-ArchitectureThe ExecutionSplitting Read From ActAdding Validation and GatesRed-Teaming Before RelaunchThe OutcomeWhat Changed MeasurablyThe Lessons That GeneralizedHow the Team Changed Its ProcessA New Design CheckpointLogging and Testing Became DefaultsWhat Other Teams Can BorrowFind Your Version of the Same FlawApply the Same Sequence of FixesFrequently Asked QuestionsCould a better-written prompt have prevented this incident?Why was logging so important to the response?Was the re-architecture worth the extra effort over a quick patch?How did they know the fix actually worked?Key Takeaways
Home/Blog/How One Team Closed a Live Injection Hole in Their Agent
General

How One Team Closed a Live Injection Hole in Their Agent

A

Agency Script Editorial

Editorial Team

·December 7, 2023·7 min read
prompt injection defenseprompt injection defense case studyprompt injection defense guideprompt engineering

The clearest way to understand prompt injection defense is to follow one team through a real fix from start to finish. This case study traces a composite scenario drawn from the way these incidents actually unfold: an AI agent that worked beautifully in demos, quietly carried a serious vulnerability into production, got exploited, and was rebuilt to be defensible. The names and specifics are generalized, but the arc—situation, discovery, decision, execution, outcome—reflects the genuine shape of this work.

Read it as a template. The team's reasoning at each fork is more valuable than the particular tools they reached for, because your system will differ in detail while facing the same fundamental choices.

What follows is the situation they inherited, the moment they realized something was wrong, the decisions they debated, what they actually built, and the results they could measure afterward.

The Situation

A mid-sized software company built an internal AI assistant to help its operations team. The assistant could read tickets from the support queue, look up customer records, draft responses, and—the feature everyone loved—automatically apply small account adjustments like extending a trial or issuing a modest credit.

How It Was Built

The whole thing ran as a single model loop. The assistant read the ticket, pulled relevant records, decided on an action, and executed it. The system prompt instructed it to apply credits only up to a small limit and to escalate anything larger. In demos and early use, it worked flawlessly and saved the team real time.

The Latent Flaw

Tickets are written by customers. They are untrusted input. And the same model that read those tickets also held the authority to adjust accounts. Untrusted input and a powerful action lived in one undivided loop, protected only by an instruction in the prompt. Nobody had framed it that way during the build.

The Discovery

The problem surfaced when an analyst noticed a cluster of unusually large credits applied to several accounts over a single weekend.

Tracing the Incident

Reviewing the logs—which, fortunately, captured the assistant's actions—the team found that the affected tickets all contained a similar passage: text instructing the assistant to disregard its credit limit and apply a large credit "as previously authorized by the account manager." The model had read the instruction inside the ticket and followed it, treating customer-supplied text as a command.

The Realization

This was textbook indirect prompt injection. The attackers never accessed any system directly. They simply submitted support tickets, and the assistant did the rest. The credit limit in the prompt had been worthless because the same text channel carrying the data also carried the override.

The Decision

Under pressure to restore the feature safely, the team debated three paths.

The Options on the Table

The first option was to harden the prompt with stronger wording forbidding overrides. The second was to add a keyword filter that blocked tickets mentioning credits and authorization. The third, more invasive, was to re-architect so the model reading tickets could no longer apply adjustments at all.

Why They Chose Re-Architecture

The team recognized that the first two options treated symptoms. Prompt wording could be paraphrased past, and keyword filters could be evaded by rephrasing. Only the structural change addressed the root cause—untrusted input wired directly to a powerful action. They accepted the larger effort in exchange for a defense that did not depend on outguessing attackers.

The Execution

The rebuild centered on privilege separation, with supporting layers around it.

Splitting Read From Act

They divided the assistant into two stages. A reading stage processed tickets and produced a structured recommendation—action type, amount, and a justification—but had no power to execute anything. A separate acting stage took only that structured recommendation, never the raw ticket text, and applied the action against hard-coded limits enforced in code rather than in a prompt.

Adding Validation and Gates

Any recommended credit above the small limit was routed to a human queue, enforced by the acting stage's code regardless of what the recommendation claimed. Outputs from the reading stage had to pass schema validation, so malformed or out-of-range values were rejected outright. The team also kept and expanded the action logging that had made the incident traceable in the first place.

Red-Teaming Before Relaunch

Before turning the feature back on, they assembled a set of injection attempts modeled on the original attack plus variations—encoded payloads, different phrasings, instructions split across multiple tickets—and confirmed that none could push an action past the code-enforced limits.

The Outcome

The rebuilt assistant returned to production with measurably different properties.

What Changed Measurably

Unauthorized adjustments dropped to zero in the months after relaunch, because the limit was now enforced in code that untrusted input could not reach. The adversarial test suite, run on every change, caught two regressions during a later model upgrade before they shipped. The action logs, now standard practice, cut incident investigation time from days to hours.

The Lessons That Generalized

The team's takeaway was that the original feature had not been insecurely worded—it had been insecurely structured. No amount of prompt cleverness would have fixed a design that fused untrusted input with a powerful action. The durable fix was architectural, and it made future incidents survivable rather than catastrophic.

How the Team Changed Its Process

The incident reshaped more than one feature. It changed how the team built every AI capability that followed.

A New Design Checkpoint

The team added a standing question to every AI feature design review: does this component read untrusted content, and if so, what is the worst action it can take on its own? Any feature that combined the two had to justify a containment plan before it could ship. This converted the painful lesson into a repeatable gate rather than relying on anyone remembering the incident.

Logging and Testing Became Defaults

Action logging, which had been an afterthought that happened to save them, became a non-negotiable requirement for any feature that could take an action. The adversarial test suite became a shared asset that every new feature contributed to and ran against. What had been one team's hard-won fix turned into the organization's default posture.

What Other Teams Can Borrow

The specifics of this case—support tickets, account credits—are particular, but the reasoning transfers directly to almost any AI feature.

Find Your Version of the Same Flaw

Most AI applications have a place where untrusted content and a consequential action meet. It might be tickets and credits, or documents and approvals, or messages and outbound email. The exercise is to locate that meeting point in your own system and ask whether anything but a prompt instruction stands between them. If the answer is no, you have found your version of this incident before it happens.

Apply the Same Sequence of Fixes

The team's path—separate reading from acting, enforce limits in code, validate the handoff, gate high stakes to humans, and confirm with adversarial testing—is a template you can follow regardless of domain. The order matters: containment first, then detection and testing around it. Borrowing the sequence is more valuable than borrowing the particular tools, because your tools will differ while the structure stays the same.

This narrative puts the principles from Prompt Injection Defense: Best Practices That Actually Work into motion, follows the build order in A Step-by-Step Approach to Prompt Injection Defense, and avoids the traps catalogued in 7 Common Mistakes with Prompt Injection Defense (and How to Avoid Them).

Frequently Asked Questions

Could a better-written prompt have prevented this incident?

No. The attack worked precisely because prompt instructions are suggestions the model can be talked out of. A stronger limit in the prompt would have been bypassed by rephrasing. Only enforcing the limit in code, outside the model's reach, closed the hole.

Why was logging so important to the response?

The action logs were what let the team trace the incident to its source and understand the attack within hours instead of guessing for days. Without them, the cluster of large credits would have been far harder to explain. Logging turns silent compromises into investigable events.

Was the re-architecture worth the extra effort over a quick patch?

Yes. The quick patches—prompt hardening and keyword filtering—would have failed against a motivated attacker and given false confidence. The structural fix eliminated the root cause and made the system resilient to attack variations the team had not anticipated.

How did they know the fix actually worked?

They built an adversarial test suite based on the real attack plus variations and confirmed none could push an action past the code-enforced limits. Running that suite continuously also caught two later regressions during a model upgrade.

Key Takeaways

  • The assistant was insecurely structured, not insecurely worded—untrusted ticket text was wired directly to a powerful action.
  • A credit limit living in the prompt was worthless because the same channel that carried data carried the override.
  • The durable fix was privilege separation: a reading stage with no power, and an acting stage enforcing limits in code on validated input.
  • Action logging made the incident traceable in hours and became standard practice afterward.
  • A continuous adversarial test suite confirmed the fix and later caught two regressions during a model upgrade.

Search Articles

Categories

OperationsSalesDeliveryGovernance

Popular Tags

prompt engineeringai fundamentalsai toolsthe difference between AIMLagency operationsagency growthenterprise sales

Share Article

A

Agency Script Editorial

Editorial Team

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.

Related Articles

General

Prompt Quality Decides Whether AI Earns Its Keep

Prompt quality is the single biggest variable in whether AI delivers real work or expensive noise. The model matters, the platform matters — but the prompt you write determines whether you get a first

A
Agency Script Editorial
June 1, 2026·10 min read
General

Counting the Real Cost of Every Token You Send

Tokens and context windows sit at the intersection of AI capability and operational cost—yet most business cases treat them as technical footnotes. That's a mistake that costs real money. Every time y

A
Agency Script Editorial
June 1, 2026·10 min read
General

Rolling Out AI Hallucinations Across a Team

Most teams discover AI hallucinations the hard way — a confident-sounding wrong answer makes it into a client deliverable, a legal brief, or a published report. The damage isn't just to the output; it

A
Agency Script Editorial
June 1, 2026·11 min read

Ready to certify your AI capability?

Join the professionals building governed, repeatable AI delivery systems.

Explore Certification