AGENCYSCRIPT

Edge Cases That Break Naive Injection Defenses

Agency Script Editorial

Editorial Team

October 24, 2023 · 6 min read
prompt injection defense · prompt injection defense advanced · prompt injection defense guide · prompt engineering

You already separate untrusted input from instructions, scope tools to least privilege, and gate dangerous actions. Those fundamentals stop most attacks. This article is about the ones they do not stop—the edge cases that defeat textbook defenses and the nuances that distinguish a security-aware team from a checklist-following one.

The pattern across all of these is the same: attackers move to whatever channel you trust by default. Once you delimit the chat box, they hide in documents. Once you sanitize documents, they exploit the boundary between agents. Advanced defense is a habit of asking, for every trusted channel, what would happen if it were hostile.

This assumes you have the foundation in place. If any of it is shaky, shore it up with "Prompt Injection Defense: Best Practices That Actually Work" first.

Indirect Injection at Depth

Multi-hop content chains

An attack does not need to land in the document your agent reads directly. It can sit in a page that document links to, which your agent fetches automatically, which references another source. Each hop is a place sanitization might be skipped because the content felt one step removed from the user.

  • Defend by treating every hop as untrusted, not just the first retrieval. Trust does not decay gracefully across links; it has to be re-established at each fetch.
  • Cap retrieval depth and breadth. An agent that follows links indefinitely is an agent that will eventually follow a hostile one.
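The two points above can be sketched as a breadth-first fetcher with hard caps. This is a minimal illustration, not a production crawler: `bounded_crawl` and its `fetch`, `extract_links`, and `sanitize` callables are hypothetical names standing in for whatever retrieval and sanitization code you already run.

```python
from collections import deque
from typing import Callable, Iterable


def bounded_crawl(
    start_url: str,
    fetch: Callable[[str], str],
    extract_links: Callable[[str], Iterable[str]],
    sanitize: Callable[[str], str],
    max_depth: int = 2,
    max_pages: int = 10,
) -> dict[str, str]:
    """Breadth-first retrieval with caps on hop depth and total pages.

    Every page goes through `sanitize`, whatever its distance from the user.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages: dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        raw = fetch(url)
        pages[url] = sanitize(raw)  # same treatment at hop 0 and hop N
        if depth >= max_depth:
            continue  # depth cap reached: keep the page, ignore its links
        for link in extract_links(raw):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

The caps are deliberately enforced in code rather than in the prompt, so an injected "keep following links" instruction has nothing to act on.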

Encoding and obfuscation

Hidden instructions arrive base64-encoded, split across tokens, embedded in invisible Unicode, or expressed in a language your filters do not scan. A classifier tuned for English plaintext misses an injection written in homoglyphs.

  • Normalize before you scan. Decode, strip zero-width characters, and canonicalize Unicode before any pattern matching runs.
  • Assume your filter has blind spots and rely on structural containment to catch what detection misses.
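A rough sketch of "normalize before you scan", using only the Python standard library. The character set and the base64 heuristic are illustrative assumptions, not a complete list, and `normalize` and `decoded_views` are hypothetical helper names. Note that NFKC folds compatibility forms such as fullwidth letters but does not map cross-script homoglyphs (a Cyrillic "а" stays Cyrillic), which is one of the blind spots containment has to cover.

```python
import base64
import binascii
import re
import unicodedata

# Zero-width and bidi-control characters often used to hide text
# (an illustrative, non-exhaustive set).
_INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff\u202a-\u202e]")
# Heuristic: long runs of base64 alphabet characters, optionally padded.
_B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def normalize(text: str) -> str:
    """Canonicalize Unicode, then strip invisible characters."""
    text = unicodedata.normalize("NFKC", text)
    return _INVISIBLE.sub("", text)


def decoded_views(text: str) -> list[str]:
    """Return the normalized text plus any base64 runs that decode to UTF-8.

    A scanner would run its patterns over every returned view.
    """
    clean = normalize(text)
    views = [clean]
    for run in _B64_RUN.findall(clean):
        try:
            decoded = base64.b64decode(run, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 text; nothing extra to scan
        views.append(normalize(decoded))
    return views
```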

Multi-Agent Trust Boundaries

Injection that propagates

When one agent's output becomes another agent's instruction, a single compromise can ripple through a chain. The planner gets injected, emits a poisoned plan, and the executor faithfully carries it out. The executor's own defenses never trigger because the instruction came from a trusted peer.

  • Re-validate at every handoff. Treat the output of an upstream agent as untrusted input to the downstream one, with the same delimiting and scoping you apply to user input.
  • Constrain what one agent can ask another to do. An executor should accept a bounded vocabulary of actions, not arbitrary natural-language commands.
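One way to picture a bounded action vocabulary at the planner-to-executor handoff is shown below. The `Action` members, the `Step` shape, and the `order_id` rule are all invented for illustration; the point is only that the executor parses the plan into a closed set of typed steps and rejects everything else.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical vocabulary; a real executor would define its own.
    LOOKUP_ORDER = "lookup_order"
    SEND_RECEIPT = "send_receipt"


@dataclass(frozen=True)
class Step:
    action: Action
    order_id: str


def parse_plan(raw_steps: list[dict]) -> list[Step]:
    """Validate an upstream agent's plan as untrusted input."""
    steps = []
    for raw in raw_steps:
        if set(raw) != {"action", "order_id"}:
            raise ValueError(f"unexpected fields: {sorted(raw)}")
        try:
            action = Action(raw["action"])
        except ValueError:
            raise ValueError(f"action not in vocabulary: {raw['action']!r}") from None
        order_id = raw["order_id"]
        if not (isinstance(order_id, str) and order_id.isalnum() and len(order_id) <= 32):
            raise ValueError("order_id must be a short alphanumeric string")
        steps.append(Step(action, order_id))
    return steps
```

Because free text never reaches the executor, a poisoned plan can at worst request an allowed action with a well-formed argument.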

Confused-deputy patterns

A privileged agent acting on behalf of a less-privileged request can be tricked into using its own elevated permissions. The agent has access the user does not, and the injection borrows it.

  • Push the user's identity through the whole chain so every action is authorized against the original requester, not the agent's service identity. This is the structural fix that "A Framework for Prompt Injection Defense" calls Limit.
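As a sketch of identity propagation, the hypothetical `RequestContext` below travels with the request and every privileged function checks it, regardless of what the agent's own service account could do. The permission string and function names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    """Identity of the original requester, carried through every handoff."""
    user_id: str
    permissions: frozenset[str]


class NotAuthorized(Exception):
    pass


def authorize(ctx: RequestContext, permission: str) -> None:
    if permission not in ctx.permissions:
        raise NotAuthorized(f"{ctx.user_id} lacks {permission}")


def export_customer_records(ctx: RequestContext, customer_id: str) -> str:
    # Checked against the requester's permissions, not the agent's.
    authorize(ctx, "records:export")
    return f"export queued for {customer_id} by {ctx.user_id}"
```

An injection that talks the agent into calling `export_customer_records` still fails if the person who started the conversation could not have exported the records themselves.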

Subtle Detection Failures

Adversarial drift

Your red-team suite tests yesterday's attacks. Attackers iterate. A suite that does not grow keeps reporting a high, even rising, block rate while your real exposure widens, because it only measures attacks you have already handled. Treat the suite as a living artifact and add every novel payload you encounter or imagine.

False-positive blindness

Aggressive detection that blocks legitimate requests damages the product silently, because users leave rather than report. Advanced teams track false positives as carefully as block rate and read the two together, a discipline detailed in "How to Measure Prompt Injection Defense: Metrics That Matter".
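Reading the two numbers together can be as simple as running one labeled suite through the detector and reporting both rates side by side. The sketch below assumes a detector that returns `True` when it would block; `Case` and `evaluate` are hypothetical names, and the suite is meant to grow as new payloads and new legitimate phrasings turn up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    text: str
    is_attack: bool


def evaluate(detector: Callable[[str], bool], suite: list[Case]) -> dict[str, float]:
    """Report block rate and false-positive rate from the same run."""
    attacks = [c for c in suite if c.is_attack]
    benign = [c for c in suite if not c.is_attack]
    blocked = sum(bool(detector(c.text)) for c in attacks)
    false_pos = sum(bool(detector(c.text)) for c in benign)
    return {
        "block_rate": blocked / len(attacks) if attacks else 0.0,
        "false_positive_rate": false_pos / len(benign) if benign else 0.0,
    }
```

A keyword detector that blocks anything containing "ignore" illustrates why neither number means much alone: it misses "Disregard the rules above" and blocks "Please ignore my last email".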

Over-trusting structured output

Pinning output to JSON helps, but a model can still emit a schema-valid object with hostile field values. Validate the semantics, not just the shape—confirm the tool arguments are within expected bounds before execution.
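To make "semantics, not just shape" concrete, here is a minimal sketch for a refund tool call. The field names and the rule that a refund may not exceed the order total are assumptions for illustration; `validate_refund` is a hypothetical helper that your own code would run before the tool executes.

```python
import json
from decimal import Decimal, InvalidOperation


def validate_refund(raw_json: str, order_total: Decimal, order_id: str) -> Decimal:
    """Check shape first, then meaning, before any refund tool is called."""
    args = json.loads(raw_json)  # malformed JSON raises a ValueError subclass
    if not isinstance(args, dict) or set(args) != {"order_id", "amount"}:
        raise ValueError("unexpected shape")
    if args["order_id"] != order_id:
        raise ValueError("refund targets a different order")
    try:
        amount = Decimal(str(args["amount"]))
    except InvalidOperation:
        raise ValueError("amount is not a number") from None
    # is_finite() runs first so NaN never reaches the ordering comparisons.
    if not amount.is_finite() or amount <= 0 or amount > order_total:
        raise ValueError("amount outside the bounds of this order")
    return amount
```

An object like `{"order_id": "A1", "amount": 5000}` passes any schema check for these two fields, yet is rejected here when the order in question totals 25.00.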

Hardening the Containment Layer

When detection inevitably misses, containment is what stands between an injection and real harm. Advanced teams invest disproportionately here, because it is the layer that holds when everything upstream fails.

Capability scoping by context

A blanket allowlist of tools is a start, but sophisticated systems scope capabilities by state. An agent mid-conversation about a refund should have the refund tool available only within that flow, with the amount bounded by the order in question, and have it disappear entirely outside that context. Static permissions are a coarse instrument; dynamic, context-bound capabilities shrink the window in which any single tool can be abused.
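The refund example might look like the sketch below, where the tool set is computed from conversation state on every turn rather than fixed at startup. `ConversationState`, `Capability`, and the tool names are hypothetical; a real system would derive state from its own session store.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class ConversationState:
    flow: str  # e.g. "refund" or "general" (illustrative values)
    order_total: Optional[Decimal] = None


@dataclass(frozen=True)
class Capability:
    tool: str
    max_amount: Optional[Decimal] = None


def capabilities_for(state: ConversationState) -> dict[str, Capability]:
    """Return only the tools that make sense in the current state."""
    caps = {"lookup_order": Capability("lookup_order")}
    if state.flow == "refund" and state.order_total is not None:
        # The refund tool exists only inside the refund flow,
        # and its ceiling is the order under discussion.
        caps["issue_refund"] = Capability("issue_refund", max_amount=state.order_total)
    return caps
```

Outside the refund flow the tool is simply absent from the returned mapping, so there is nothing for an injected instruction to invoke.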

Budgets and circuit breakers

Beyond per-call gating, impose aggregate limits—spend caps, action counts, rate ceilings—that trip automatically when crossed. A hijacked agent that can issue one questionable refund is a problem; one that can issue a thousand before anyone notices is a catastrophe. Circuit breakers convert a runaway attack into a contained, self-halting one, buying time for human response.
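A circuit breaker of this kind can be small. The sketch below keeps a sliding window of recent actions and trips, staying open until a human resets it, when either the action count or the total spend crosses its cap. `ActionBudget` and its parameters are hypothetical, and the injectable `clock` exists only to make the behavior testable.

```python
import time
from collections import deque
from decimal import Decimal


class BudgetExceeded(Exception):
    pass


class ActionBudget:
    """Trips when actions or spend inside a sliding window exceed their caps."""

    def __init__(self, max_actions: int, max_spend: Decimal,
                 window_s: float, clock=time.monotonic):
        self.max_actions = max_actions
        self.max_spend = max_spend
        self.window_s = window_s
        self.clock = clock
        self.events = deque()  # (timestamp, amount) pairs
        self.tripped = False

    def charge(self, amount: Decimal) -> None:
        """Record one action, or raise if it would cross an aggregate limit."""
        if self.tripped:
            raise BudgetExceeded("breaker is open; human reset required")
        now = self.clock()
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        spend = sum((a for _, a in self.events), Decimal(0)) + amount
        if len(self.events) + 1 > self.max_actions or spend > self.max_spend:
            self.tripped = True
            raise BudgetExceeded("aggregate limit crossed")
        self.events.append((now, amount))
```

Calling `charge` before each refund means the thousandth questionable refund never happens: the breaker opens at the cap and every later call fails until someone investigates.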

Output as data, never as command

A failure that persists even in advanced systems is letting model output flow into a privileged sink unchecked. Even schema-valid output should be treated as untrusted data that your code interprets within strict bounds, never as a command your system executes verbatim. The discipline is to keep the model advisory: it proposes, your deterministic code disposes, and the gap between the two is where containment lives.

Operating at the Edge of Knowledge

Advanced defense is partly technical and partly a stance toward the unknown. The attacks that matter most are the ones nobody has documented yet, which means the durable advantage is a way of thinking rather than a fixed list of countermeasures.

Assume your model of the system is incomplete

Every defense rests on a mental model of what the system can do and what an attacker can reach. That model is always slightly wrong, and the gap between it and reality is where injections live. Advanced practitioners periodically try to falsify their own model—mapping every path data takes, every place model output flows, every channel they assumed was safe. The exercise routinely surfaces a trusted channel that was never actually validated.

Treat novel attacks as inevitable, not exceptional

Because the threat evolves, the right posture is not to enumerate every attack but to build a system that fails safe against attacks you have not imagined. Strong containment does this: an unforeseen injection that fully hijacks the model still cannot exceed the bounds your enforcement and limit layers impose. Investing in containment is investing against the unknown, which is why mature teams weight it so heavily relative to detection.

Share and absorb attack intelligence

The frontier moves through example. Reading write-ups of novel attacks, contributing the ones you discover, and folding both into your red-team suite is how a team stays current. A defense informed only by what you personally encountered will always lag the collective ingenuity of attackers; a defense fed by the wider community keeps closer pace.

Frequently Asked Questions

Why does separation stop working at scale?

Separation works fine; attackers just relocate. Once the chat box is delimited, the hostile instruction moves to a retrieved document, then to a linked page, then to an upstream agent's output. Separation has to be applied at every trust boundary, not only the obvious one, or attackers simply use the boundary you forgot.

How do I defend a chain of agents?

Treat each agent's output as untrusted input to the next, re-applying delimiting and scoping at every handoff. Constrain downstream agents to a bounded action vocabulary rather than free natural-language commands, and propagate the original user's identity through the chain so privileges never get borrowed by a deputy.

Are encoding-based attacks worth worrying about?

Yes, especially as agents ingest content from many sources. Hidden instructions in base64, invisible Unicode, or non-English text routinely slip past filters tuned for plaintext English. Normalize and canonicalize all input before scanning, and lean on structural containment so a missed encoded attack still cannot cause real damage.

Is perfect prompt injection defense achievable?

No, and advanced practice accepts this. The realistic goal is to make casual attacks fail, make serious attacks hard, and ensure that any attack which succeeds has a bounded blast radius. Defense is risk management, not elimination, and teams that chase perfection often neglect the containment that actually limits harm.

Key Takeaways

  • Attackers migrate to whatever channel you trust by default; defend every boundary, not just the chat box.
  • Indirect injection runs deep—treat every retrieval hop and every agent handoff as untrusted.
  • Normalize and canonicalize input before scanning to catch encoded and obfuscated payloads.
  • Propagate user identity through agent chains to defeat confused-deputy attacks.
  • Validate output semantics, not just schema shape, and keep your red-team suite alive.
