Edge Cases That Break Naive Injection Defenses

You already separate untrusted input from instructions, scope tools to least privilege, and gate dangerous actions. Those fundamentals stop most attacks. This article is about the ones they do not stop—the edge cases that defeat textbook defenses and the nuances that distinguish a security-aware team from a checklist-following one.

The pattern across all of these is the same: attackers move to whatever channel you trust by default. Once you delimit the chat box, they hide in documents. Once you sanitize documents, they exploit the boundary between agents. Advanced defense is a habit of asking, for every trusted channel, what would happen if it were hostile.

This assumes you have the foundation in place. If any of it is shaky, shore it up first with "Prompt Injection Defense: Best Practices That Actually Work".

Indirect Injection at Depth

Multi-hop content chains

An attack does not need to land in the document your agent reads directly. It can sit in a page that document links to, which your agent fetches automatically, which references another source. Each hop is a place sanitization might be skipped because the content felt one step removed from the user.

Defend by treating every hop as untrusted, not just the first retrieval. Trust does not decay gracefully across links; it has to be re-established at each fetch.
Cap retrieval depth and breadth. An agent that follows links indefinitely is an agent that will eventually follow a hostile one.
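
The two rules above can be sketched as a breadth-first fetcher that sanitizes every page and stops at fixed limits. This is a minimal illustration, not a reference implementation: the limit values and the `fetch`, `extract_links`, and `sanitize` callables are assumptions standing in for whatever your retrieval stack provides.

```python
from collections import deque
from typing import Callable, Dict, Iterable

# Hypothetical limits; tune them for your own workload.
MAX_DEPTH = 2            # hops allowed beyond the starting document
MAX_LINKS_PER_PAGE = 5   # breadth cap per fetched page
MAX_TOTAL_FETCHES = 20   # overall budget for one task


def bounded_crawl(
    start_url: str,
    fetch: Callable[[str], str],
    extract_links: Callable[[str], Iterable[str]],
    sanitize: Callable[[str], str],
) -> Dict[str, str]:
    """Breadth-first retrieval that sanitizes every hop and stops at fixed limits."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages: Dict[str, str] = {}

    while queue and len(pages) < MAX_TOTAL_FETCHES:
        url, depth = queue.popleft()
        raw = fetch(url)
        # Every hop goes through the same sanitizer as the first retrieval.
        pages[url] = sanitize(raw)

        if depth >= MAX_DEPTH:
            continue  # do not follow links past the depth cap
        for link in list(extract_links(raw))[:MAX_LINKS_PER_PAGE]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

The point of the shape is that no code path returns content that skipped `sanitize`, and no input can make the loop run past the three caps.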

Encoding and obfuscation

Hidden instructions arrive base64-encoded, split across tokens, embedded in invisible Unicode, or expressed in a language your filters do not scan. A classifier tuned for English plaintext misses an injection written in homoglyphs.

Normalize before you scan. Decode, strip zero-width characters, and canonicalize Unicode before any pattern matching runs.
Assume your filter has blind spots and rely on structural containment to catch what detection misses.
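
A normalization pass along these lines might look like the following sketch, which uses only the Python standard library. The character list and the base64 heuristic are illustrative assumptions and deliberately incomplete. Note also that NFKC folds compatibility forms such as fullwidth letters but does not map cross-script homoglyphs (a Cyrillic "а" stays Cyrillic), so that case would need a separate confusables table.

```python
import base64
import binascii
import re
import unicodedata
from typing import List

# Zero-width and directional-control characters often used to hide text (not exhaustive).
INVISIBLE = dict.fromkeys(
    map(ord, "\u200b\u200c\u200d\u2060\ufeff\u202a\u202b\u202c\u202d\u202e")
)
# Heuristic: runs of 16+ base64-alphabet characters, with optional padding.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def normalize(text: str) -> str:
    """Canonicalize text before any pattern matching runs."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms
    text = text.translate(INVISIBLE)            # delete invisible characters
    return text.casefold()


def decoded_views(text: str) -> List[str]:
    """Return the normalized text plus normalized decodings of base64-looking runs."""
    views = [normalize(text)]
    for match in B64_RUN.finditer(text.translate(INVISIBLE)):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            views.append(normalize(raw.decode("utf-8")))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64-encoded text; skip it
    return views
```

A scanner would then run its patterns over every string in `decoded_views(...)` rather than over the raw input alone.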

Multi-Agent Trust Boundaries

Injection that propagates

When one agent's output becomes another agent's instruction, a single compromise can ripple through a chain. The planner gets injected, emits a poisoned plan, and the executor faithfully carries it out. The executor's own defenses never trigger because the instruction came from a trusted peer.

Re-validate at every handoff. Treat the output of an upstream agent as untrusted input to the downstream one, with the same delimiting and scoping you apply to user input.
Constrain what one agent can ask another to do. An executor should accept a bounded vocabulary of actions, not arbitrary natural-language commands.
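
One way to enforce a bounded vocabulary is to parse the upstream plan into a closed set of actions before the executor sees it. The sketch below assumes a plan arrives as a list of JSON-like objects; the action names and argument sets are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    # Hypothetical vocabulary; the executor understands nothing outside it.
    LOOKUP_ORDER = "lookup_order"
    SEND_REPLY = "send_reply"


# Argument names each action may carry (illustrative).
ALLOWED_ARGS = {
    Action.LOOKUP_ORDER: {"order_id"},
    Action.SEND_REPLY: {"ticket_id", "body"},
}


@dataclass(frozen=True)
class Step:
    action: Action
    args: dict


def parse_plan(raw_steps: list) -> List[Step]:
    """Validate an upstream agent's plan; reject anything outside the vocabulary."""
    steps = []
    for raw in raw_steps:
        if not isinstance(raw, dict):
            raise ValueError("plan step must be an object")
        try:
            action = Action(raw.get("action"))
        except ValueError:
            raise ValueError(f"unknown action: {raw.get('action')!r}") from None
        args = raw.get("args", {})
        if not isinstance(args, dict) or set(args) - ALLOWED_ARGS[action]:
            raise ValueError(f"unexpected arguments for {action.value}")
        steps.append(Step(action, args))
    return steps
```

A free-text instruction from a compromised planner fails at the `isinstance` check, and an invented action fails at the enum lookup, so neither reaches a tool.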

Confused-deputy patterns

A privileged agent acting on behalf of a less-privileged request can be tricked into using its own elevated permissions. The agent has access the user does not, and the injection borrows it.

Push the user's identity through the whole chain so every action is authorized against the original requester, not the agent's service identity. This is the structural fix that "A Framework for Prompt Injection Defense" calls Limit.
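
As a rough sketch of identity propagation, the requester can travel inside every tool call so the authorization check never falls back to the agent's own credentials. The `Requester` and `ToolCall` types and the permission strings below are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requester:
    user_id: str
    permissions: frozenset


@dataclass(frozen=True)
class ToolCall:
    name: str
    required_permission: str
    on_behalf_of: Requester  # travels with the call through every agent hop


def authorize(call: ToolCall) -> None:
    """Check the original requester's permissions, never the agent's service account."""
    if call.required_permission not in call.on_behalf_of.permissions:
        raise PermissionError(
            f"{call.on_behalf_of.user_id} may not perform {call.name}"
        )


def execute(call: ToolCall, tools: dict) -> object:
    authorize(call)  # runs in deterministic code, outside the model
    return tools[call.name]()
```

Because the check reads `on_behalf_of`, an injection that persuades the agent to call a privileged tool still fails unless the human behind the request holds that permission.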

Subtle Detection Failures

Adversarial drift

Your red-team suite tests yesterday's attacks. Attackers iterate. A suite that does not grow reports a rising block rate even as your real exposure widens, because it only measures attacks you have already fixed. Treat the suite as a living artifact and add every novel payload you encounter or imagine.

False-positive blindness

Aggressive detection that blocks legitimate requests damages the product silently: users leave rather than report the problem. Advanced teams track false positives as carefully as block rate and read the two together, a discipline detailed in "How to Measure Prompt Injection Defense: Metrics That Matter".
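
Reading the two numbers together can be as simple as computing both from the same labeled sample. The sketch below assumes you have ground-truth labels from a red-team suite or manual review; the `Outcome` type is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Outcome:
    is_attack: bool  # ground-truth label from your suite or manual review
    blocked: bool    # what the detector actually did


def rates(outcomes: Iterable[Outcome]) -> Tuple[float, float]:
    """Return (block_rate, false_positive_rate) for one labeled sample."""
    outcomes = list(outcomes)
    attacks = [o for o in outcomes if o.is_attack]
    benign = [o for o in outcomes if not o.is_attack]
    # Share of real attacks that were stopped.
    block_rate = sum(o.blocked for o in attacks) / len(attacks) if attacks else 0.0
    # Share of legitimate requests that were wrongly stopped.
    false_positive_rate = sum(o.blocked for o in benign) / len(benign) if benign else 0.0
    return block_rate, false_positive_rate
```

A detector change that lifts the first number while also lifting the second is a trade, not an improvement, and reporting them as a pair makes that visible.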

Over-trusting structured output

Pinning output to JSON helps, but a model can still emit a schema-valid object with hostile field values. Validate the semantics, not just the shape: confirm that tool arguments are within expected bounds before execution.
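
To illustrate the difference between shape and semantics, consider a hypothetical email tool whose arguments are a schema-valid pair of strings. The domain allowlist and length bound below are assumptions; the real bounds depend on your product.

```python
ALLOWED_RECIPIENT_DOMAINS = {"example.com"}  # hypothetical allowlist
MAX_BODY_CHARS = 2000                        # hypothetical bound


def validate_email_args(args: dict) -> dict:
    """Accept schema-valid arguments only if their values are also within bounds."""
    to = args.get("to")
    body = args.get("body")
    # Shape check: roughly what a JSON schema would already enforce.
    if not isinstance(to, str) or not isinstance(body, str):
        raise ValueError("'to' and 'body' must be strings")
    # Semantic checks: the part a schema does not cover.
    if any(ch.isspace() for ch in to):
        raise ValueError("recipient must not contain whitespace")
    local, sep, domain = to.rpartition("@")
    if not sep or not local or "@" in local:
        raise ValueError("recipient is not a single address")
    if domain.lower() not in ALLOWED_RECIPIENT_DOMAINS:
        raise ValueError("recipient domain is not allowed")
    if len(body) > MAX_BODY_CHARS:
        raise ValueError("body exceeds the length bound")
    return {"to": to, "body": body}
```

An object such as `{"to": "attacker@evil.test", "body": "..."}` passes any shape check and fails here, which is the gap the paragraph above describes.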

Hardening the Containment Layer

When detection inevitably misses, containment is what stands between an injection and real harm. Advanced teams invest disproportionately here, because it is the layer that holds when everything upstream fails.

Capability scoping by context

A blanket allowlist of tools is a start, but sophisticated systems scope capabilities by state. An agent mid-conversation about a refund should have the refund tool available only within that flow, with the amount bounded by the order in question, and have it disappear entirely outside that context. Static permissions are a coarse instrument; dynamic, context-bound capabilities shrink the window in which any single tool can be abused.
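
The refund example can be sketched as a function that derives the tool list from conversation state, so the refund tool exists only inside the refund flow and carries the order's own ceiling. The state keys and tool names are hypothetical.

```python
from decimal import Decimal


def available_tools(state: dict) -> dict:
    """Return the tools exposed to the model for the current conversation state."""
    tools = {"lookup_order": {}}  # always available (illustrative)
    order = state.get("order")
    if state.get("flow") == "refund" and order is not None:
        # The capability is bound to this order and capped at its total.
        tools["issue_refund"] = {
            "order_id": order["id"],
            "max_amount": Decimal(order["total"]),
        }
    return tools


def check_refund(state: dict, order_id: str, amount: Decimal) -> None:
    """Enforce the scope at call time, in code the model cannot influence."""
    scope = available_tools(state).get("issue_refund")
    if scope is None:
        raise PermissionError("refund tool is not available in this context")
    if order_id != scope["order_id"]:
        raise ValueError("refund targets a different order")
    if not (Decimal("0") < amount <= scope["max_amount"]):
        raise ValueError("refund amount is outside the order's bounds")
```

Outside the refund flow the tool is simply absent, so an injected request for it has nothing to call.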

Budgets and circuit breakers

Beyond per-call gating, impose aggregate limits—spend caps, action counts, rate ceilings—that trip automatically when crossed. A hijacked agent that can issue one questionable refund is a problem; one that can issue a thousand before anyone notices is a catastrophe. Circuit breakers convert a runaway attack into a contained, self-halting one, buying time for human response.
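
A minimal sketch of such a breaker, assuming a sliding time window and a manual reset, is shown below. The class names, limits, and injectable clock are illustrative choices rather than a prescribed design.

```python
import time
from collections import deque
from decimal import Decimal


class CircuitOpen(Exception):
    """Raised when an aggregate limit is crossed or the breaker is already open."""


class ActionBudget:
    def __init__(self, max_actions, max_spend, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.max_spend = max_spend
        self.window = window_seconds
        self.clock = clock
        self.events = deque()  # (timestamp, amount) pairs inside the window
        self.tripped = False

    def charge(self, amount: Decimal) -> None:
        """Record one action, or trip the breaker if it would exceed a limit."""
        if self.tripped:
            raise CircuitOpen("breaker is open; manual reset required")
        now = self.clock()
        # Forget events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        spend = sum((a for _, a in self.events), Decimal("0")) + amount
        if len(self.events) + 1 > self.max_actions or spend > self.max_spend:
            self.tripped = True  # stays open until a human calls reset()
            raise CircuitOpen("aggregate limit exceeded")
        self.events.append((now, amount))

    def reset(self) -> None:
        self.tripped = False
        self.events.clear()
```

Keeping the breaker open until `reset()` is called is deliberate: a hijacked agent should not be able to wait out the window and resume.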

Output as data, never as command

A failure that persists even in advanced systems is letting model output flow into a privileged sink unchecked. Even schema-valid output should be treated as untrusted data that your code interprets within strict bounds, never as a command your system executes verbatim. The discipline is to keep the model advisory: it proposes, your deterministic code disposes, and the gap between the two is where containment lives.
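
A database is one common privileged sink. In the sketch below the model may only propose a filter value; the SQL text is fixed and the value is bound as a parameter using the standard library's `sqlite3` module. The `tickets` table and the status set are hypothetical.

```python
import sqlite3

ALLOWED_STATUSES = {"open", "pending", "closed"}  # hypothetical


def tickets_by_status(conn: sqlite3.Connection, model_output: dict) -> list:
    """Use a model-proposed filter as a bound value; the SQL text itself is fixed."""
    status = model_output.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        raise ValueError("status outside the allowed set")
    # The '?' placeholder keeps the value as data; it is never spliced into the query.
    rows = conn.execute(
        "SELECT id, title FROM tickets WHERE status = ? ORDER BY id",
        (status,),
    )
    return rows.fetchall()
```

The allowlist and the placeholder are two independent bounds: even if one were loosened, the model's text would still not be executed as SQL.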

Operating at the Edge of Knowledge

Advanced defense is partly technical and partly a stance toward the unknown. The attacks that matter most are the ones nobody has documented yet, which means the durable advantage is a way of thinking rather than a fixed list of countermeasures.

Assume your model of the system is incomplete

Every defense rests on a mental model of what the system can do and what an attacker can reach. That model is always slightly wrong, and the gap between it and reality is where injections live. Advanced practitioners periodically try to falsify their own model—mapping every path data takes, every place model output flows, every channel they assumed was safe. The exercise routinely surfaces a trusted channel that was never actually validated.

Treat novel attacks as inevitable, not exceptional

Because the threat evolves, the right posture is not to enumerate every attack but to build a system that fails safe against attacks you have not imagined. Strong containment does this: an unforeseen injection that fully hijacks the model still cannot exceed the bounds your enforcement and limit layers impose. Investing in containment is investing against the unknown, which is why mature teams weight it so heavily relative to detection.

Share and absorb attack intelligence

The frontier moves through example. Reading write-ups of novel attacks, contributing the ones you discover, and folding both into your red-team suite is how a team stays current. A defense informed only by what you have personally encountered will always lag the collective ingenuity of attackers; a defense fed by the wider community keeps closer pace.

Frequently Asked Questions

Why does separation stop working at scale?

Separation works fine; attackers just relocate. Once the chat box is delimited, the hostile instruction moves to a retrieved document, then to a linked page, then to an upstream agent's output. Separation has to be applied at every trust boundary, not only the obvious one, or attackers simply use the boundary you forgot.

How do I defend a chain of agents?

Treat each agent's output as untrusted input to the next, re-applying delimiting and scoping at every handoff. Constrain downstream agents to a bounded action vocabulary rather than free natural-language commands, and propagate the original user's identity through the chain so privileges never get borrowed by a deputy.

Are encoding-based attacks worth worrying about?

Yes, especially as agents ingest content from many sources. Hidden instructions in base64, invisible Unicode, or non-English text routinely slip past filters tuned for plaintext English. Normalize and canonicalize all input before scanning, and lean on structural containment so a missed encoded attack still cannot cause real damage.

Is perfect prompt injection defense achievable?

No, and advanced practice accepts this. The realistic goal is to make casual attacks fail, make serious attacks hard, and ensure that any attack which succeeds has a bounded blast radius. Defense is risk management, not elimination, and teams that chase perfection often neglect the containment that actually limits harm.

Key Takeaways

Attackers migrate to whatever channel you trust by default; defend every boundary, not just the chat box.
Indirect injection runs deep—treat every retrieval hop and every agent handoff as untrusted.
Normalize and canonicalize input before scanning to catch encoded and obfuscated payloads.
Propagate user identity through agent chains to defeat confused-deputy attacks.
Validate output semantics, not just schema shape, and keep your red-team suite alive.

Agency Script Editorial
