How One Team Closed a Live Injection Hole in Their Agent

The clearest way to understand prompt injection defense is to follow one team through a real fix from start to finish. This case study traces a composite scenario drawn from the way these incidents actually unfold: an AI agent that worked beautifully in demos, quietly carried a serious vulnerability into production, got exploited, and was rebuilt to be defensible. The names and specifics are generalized, but the arc—situation, discovery, decision, execution, outcome—reflects the genuine shape of this work.

Read it as a template. The team's reasoning at each fork is more valuable than the particular tools they reached for, because your system will differ in detail while facing the same fundamental choices.

What follows is the situation they inherited, the moment they realized something was wrong, the decisions they debated, what they actually built, and the results they could measure afterward.

The Situation

A mid-sized software company built an internal AI assistant to help its operations team. The assistant could read tickets from the support queue, look up customer records, draft responses, and—the feature everyone loved—automatically apply small account adjustments like extending a trial or issuing a modest credit.

How It Was Built

The whole thing ran as a single model loop. The assistant read the ticket, pulled relevant records, decided on an action, and executed it. The system prompt instructed it to apply credits only up to a small limit and to escalate anything larger. In demos and early use, it worked flawlessly and saved the team real time.

The Latent Flaw

Tickets are written by customers. They are untrusted input. And the same model that read those tickets also held the authority to adjust accounts. Untrusted input and a powerful action lived in one undivided loop, protected only by an instruction in the prompt. Nobody had framed it that way during the build.

The Discovery

The problem surfaced when an analyst noticed a cluster of unusually large credits applied to several accounts over a single weekend.

Tracing the Incident

Reviewing the logs—which, fortunately, captured the assistant's actions—the team found that the affected tickets all contained a similar passage: text instructing the assistant to disregard its credit limit and apply a large credit "as previously authorized by the account manager." The model had read the instruction inside the ticket and followed it, treating customer-supplied text as a command.

The Realization

This was textbook indirect prompt injection. The attackers never accessed any system directly. They simply submitted support tickets, and the assistant did the rest. The credit limit in the prompt had been worthless because the same text channel carrying the data also carried the override.

The Decision

Under pressure to restore the feature safely, the team debated three paths.

The Options on the Table

The first option was to harden the prompt with stronger wording forbidding overrides. The second was to add a keyword filter that blocked tickets mentioning credits and authorization. The third, more invasive, was to re-architect so the model reading tickets could no longer apply adjustments at all.

Why They Chose Re-Architecture

The team recognized that the first two options treated symptoms. Prompt wording could be paraphrased past, and keyword filters could be evaded by rephrasing. Only the structural change addressed the root cause—untrusted input wired directly to a powerful action. They accepted the larger effort in exchange for a defense that did not depend on outguessing attackers.

The Execution

The rebuild centered on privilege separation, with supporting layers around it.

Splitting Read From Act

They divided the assistant into two stages. A reading stage processed tickets and produced a structured recommendation—action type, amount, and a justification—but had no power to execute anything. A separate acting stage took only that structured recommendation, never the raw ticket text, and applied the action against hard-coded limits enforced in code rather than in a prompt.

Adding Validation and Gates

Any recommended credit above the small limit was routed to a human queue, enforced by the acting stage's code regardless of what the recommendation claimed. Outputs from the reading stage had to pass schema validation, so malformed or out-of-range values were rejected outright. The team also kept and expanded the action logging that had made the incident traceable in the first place.

Red-Teaming Before Relaunch

Before turning the feature back on, they assembled a set of injection attempts modeled on the original attack plus variations—encoded payloads, different phrasings, instructions split across multiple tickets—and confirmed that none could push an action past the code-enforced limits.

The Outcome

The rebuilt assistant returned to production with measurably different properties.

What Changed Measurably

Unauthorized adjustments dropped to zero in the months after relaunch, because the limit was now enforced in code that untrusted input could not reach. The adversarial test suite, run on every change, caught two regressions during a later model upgrade before they shipped. The action logs, now standard practice, cut incident investigation time from days to hours.

The Lessons That Generalized

The team's takeaway was that the original feature had not been insecurely worded—it had been insecurely structured. No amount of prompt cleverness would have fixed a design that fused untrusted input with a powerful action. The durable fix was architectural, and it made future incidents survivable rather than catastrophic.

How the Team Changed Its Process

The incident reshaped more than one feature. It changed how the team built every AI capability that followed.

A New Design Checkpoint

The team added a standing question to every AI feature design review: does this component read untrusted content, and if so, what is the worst action it can take on its own? Any feature that combined the two had to justify a containment plan before it could ship. This converted the painful lesson into a repeatable gate rather than relying on anyone remembering the incident.

Logging and Testing Became Defaults

Action logging, which had been an afterthought that happened to save them, became a non-negotiable requirement for any feature that could take an action. The adversarial test suite became a shared asset that every new feature contributed to and ran against. What had been one team's hard-won fix turned into the organization's default posture.

What Other Teams Can Borrow

The specifics of this case—support tickets, account credits—are particular, but the reasoning transfers directly to almost any AI feature.

Find Your Version of the Same Flaw

Most AI applications have a place where untrusted content and a consequential action meet. It might be tickets and credits, or documents and approvals, or messages and outbound email. The exercise is to locate that meeting point in your own system and ask whether anything but a prompt instruction stands between them. If the answer is no, you have found your version of this incident before it happens.

Apply the Same Sequence of Fixes

The team's path—separate reading from acting, enforce limits in code, validate the handoff, gate high stakes to humans, and confirm with adversarial testing—is a template you can follow regardless of domain. The order matters: containment first, then detection and testing around it. Borrowing the sequence is more valuable than borrowing the particular tools, because your tools will differ while the structure stays the same.

This narrative puts the principles from Prompt Injection Defense: Best Practices That Actually Work into motion, follows the build order in A Step-by-Step Approach to Prompt Injection Defense, and avoids the traps catalogued in 7 Common Mistakes with Prompt Injection Defense (and How to Avoid Them).

Frequently Asked Questions

Could a better-written prompt have prevented this incident?

No. The attack worked precisely because prompt instructions are suggestions the model can be talked out of. A stronger limit in the prompt would have been bypassed by rephrasing. Only enforcing the limit in code, outside the model's reach, closed the hole.

Why was logging so important to the response?

The action logs were what let the team trace the incident to its source and understand the attack within hours instead of guessing for days. Without them, the cluster of large credits would have been far harder to explain. Logging turns silent compromises into investigable events.

Was the re-architecture worth the extra effort over a quick patch?

Yes. The quick patches—prompt hardening and keyword filtering—would have failed against a motivated attacker and given false confidence. The structural fix eliminated the root cause and made the system resilient to attack variations the team had not anticipated.

How did they know the fix actually worked?

They built an adversarial test suite based on the real attack plus variations and confirmed none could push an action past the code-enforced limits. Running that suite continuously also caught two later regressions during a model upgrade.

Key Takeaways

The assistant was insecurely structured, not insecurely worded—untrusted ticket text was wired directly to a powerful action.
A credit limit living in the prompt was worthless because the same channel that carried data carried the override.
The durable fix was privilege separation: a reading stage with no power, and an acting stage enforcing limits in code on validated input.
Action logging made the incident traceable in hours and became standard practice afterward.
A continuous adversarial test suite confirmed the fix and later caught two regressions during a model upgrade.

What follows is the situation they inherited, the moment they realized something was wrong, the decisions they debated, what they actually built, and the results they could measure afterward.

The Situation

How It Was Built

The Latent Flaw

The Discovery

The problem surfaced when an analyst noticed a cluster of unusually large credits applied to several accounts over a single weekend.

Tracing the Incident

The Realization

The Decision

Under pressure to restore the feature safely, the team debated three paths.

The Options on the Table

Why They Chose Re-Architecture

The Execution

The rebuild centered on privilege separation, with supporting layers around it.

Splitting Read From Act

Adding Validation and Gates

Red-Teaming Before Relaunch

The Outcome

The rebuilt assistant returned to production with measurably different properties.

What Changed Measurably

The Lessons That Generalized

How the Team Changed Its Process

The incident reshaped more than one feature. It changed how the team built every AI capability that followed.

A New Design Checkpoint

Logging and Testing Became Defaults

What Other Teams Can Borrow

The specifics of this case—support tickets, account credits—are particular, but the reasoning transfers directly to almost any AI feature.

Find Your Version of the Same Flaw

Apply the Same Sequence of Fixes

Frequently Asked Questions

Could a better-written prompt have prevented this incident?

Why was logging so important to the response?

Was the re-architecture worth the extra effort over a quick patch?

How did they know the fix actually worked?

Key Takeaways

The assistant was insecurely structured, not insecurely worded—untrusted ticket text was wired directly to a powerful action.
A credit limit living in the prompt was worthless because the same channel that carried data carried the override.
The durable fix was privilege separation: a reading stage with no power, and an acting stage enforcing limits in code on validated input.
Action logging made the incident traceable in hours and became standard practice afterward.
A continuous adversarial test suite confirmed the fix and later caught two regressions during a model upgrade.

How One Team Closed a Live Injection Hole in Their Agent

The Situation

How It Was Built

The Latent Flaw

The Discovery

Tracing the Incident

The Realization

The Decision

The Options on the Table

Why They Chose Re-Architecture

The Execution

Splitting Read From Act

Adding Validation and Gates

Red-Teaming Before Relaunch

The Outcome

What Changed Measurably

The Lessons That Generalized

How the Team Changed Its Process

A New Design Checkpoint

Logging and Testing Became Defaults

What Other Teams Can Borrow

Find Your Version of the Same Flaw

Apply the Same Sequence of Fixes

Frequently Asked Questions

Could a better-written prompt have prevented this incident?

Why was logging so important to the response?

Was the re-architecture worth the extra effort over a quick patch?

How did they know the fix actually worked?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

How One Team Closed a Live Injection Hole in Their Agent

The Situation

How It Was Built

The Latent Flaw

The Discovery

Tracing the Incident

The Realization

The Decision

The Options on the Table

Why They Chose Re-Architecture

The Execution

Splitting Read From Act

Adding Validation and Gates

Red-Teaming Before Relaunch

The Outcome

What Changed Measurably

The Lessons That Generalized

How the Team Changed Its Process

A New Design Checkpoint

Logging and Testing Became Defaults

What Other Teams Can Borrow

Find Your Version of the Same Flaw

Apply the Same Sequence of Fixes

Frequently Asked Questions

Could a better-written prompt have prevented this incident?

Why was logging so important to the response?

Was the re-architecture worth the extra effort over a quick patch?

How did they know the fix actually worked?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?