Opinionated Safety Rules Learned From Real Attackers

Generic safety advice is everywhere and helps no one. "Be careful," "keep a human in the loop," "test thoroughly", true, useless, forgettable. The practices below are different. Each is opinionated, each comes with the reasoning that makes it stick, and each was learned from systems that survived contact with real users and real attackers.

These are not rules to follow blindly. They are positions, and I will tell you when each one stops applying. The goal is judgment, not compliance. If you only adopt the underlying reasoning, you will make better decisions than any checklist could give you, though we do have a checklist for when you want the compressed version.

Assume Every Input Is Adversarial Until Proven Otherwise

The reasoning: defensive design built around well-behaved input fails the moment input misbehaves, and at scale, all input eventually misbehaves. It is cheaper to design for hostility from the start than to retrofit it after an incident.

In practice this means treating uploaded files, fetched pages, and third-party data as potentially containing instructions you do not want followed. It does not mean paranoia about every keystroke from a logged-in user on a low-stakes feature. Calibrate to the channel: the more untrusted the source and the more powerful the downstream action, the more adversarial your assumptions should be.

Make the Model Propose, Never Dispose

The single most important architectural commitment. The model should generate intent; deterministic code should decide whether to act on it.

A support agent returns "issue a $40 refund," and your code checks the order exists, the amount is within policy, and the requester is authorized.
A content tool returns text, and your code validates and sanitizes it before anything renders.

This holds because it survives the model failing. When, not if, a prompt is injected or a hallucination occurs, the deterministic layer is the thing standing between a bad suggestion and a real consequence. Our step-by-step approach shows exactly how to build this wall.

Measure False Refusals as Carefully as Harmful Outputs

Most teams track one side of the ledger: did the model produce something harmful? Far fewer track the other: did it refuse something legitimate?

The reasoning: a model tuned only to minimize harm drifts toward refusing everything, which destroys the product and pushes users to unmonitored workarounds. Safety is a two-sided optimization. Keep a set of legitimate-but-sensitive requests in your evaluation suite and watch the refusal rate the way you watch the harm rate. When does this stop applying? Almost never, but in genuinely high-stakes domains, medical, legal, financial, you deliberately accept more false refusals because the cost asymmetry justifies it.

Build the Evaluation Set Before You Need It

The reasoning: you cannot improve what you cannot measure, and the moment you most need a measurement, after a near-miss, is the worst time to start building one.

A strong eval set has normal cases, edge cases, and attacks, and grows every time you find a new failure. Run it on every prompt change, model swap, and config edit. Treat a score regression the way you treat a failing test: it blocks the change. This is the discipline that separates teams that improve from teams that thrash. We treat it as the spine of our framework.

Tune Your Posture to the Stakes of the Action

A drafting assistant and a system that moves money do not deserve the same safety posture, and applying one posture to both either over-constrains the harmless tool or under-protects the dangerous one.

Low-stakes, reversible actions: lightweight controls, fast iteration.
High-stakes, irreversible actions: human checkpoints, tighter output constraints, confidence thresholds.

The mistake is uniformity. Decide the posture per action, explicitly, based on the cost of being wrong. This consciousness about trade-offs is what our common mistakes guide warns teams to preserve.

Log Enough to Reconstruct Any Decision

The reasoning: every system fails eventually, and the difference between a learning organization and a stuck one is whether you can investigate the failure afterward.

Log the prompt, the model output, the tool calls, and the final action for anything consequential. You are not just collecting data; you are building the source of your next round of eval cases. Real incidents are the best test cases you will ever get, and you only have them if you logged them. Mind the obvious constraint: log responsibly with respect to sensitive data and retention policy.

Red-Team on a Schedule, Not on a Whim

The reasoning: attackers and edge cases evolve, and a system that was probed once at launch is defended against last year's threats.

Set a recurring cadence, monthly or quarterly depending on stakes, where someone actively tries to break the system: inject prompts, elicit forbidden behavior, find the unguarded action. Feed every success back into the eval set. This turns safety from a static gate into a moving defense. See our case study for how one disciplined red-team cycle changed a deployment's trajectory.

Write the Spec Before the Prompt

The reasoning: a prompt written without a clear specification is just hopeful prose, and you cannot tell whether it succeeded because you never defined success. The spec is what makes the prompt testable.

Before you write a word of system prompt, write down the concrete behaviors the system must never exhibit and the real goal it must serve, separate from any metric. Then the prompt becomes an attempt to satisfy a known spec, and your evaluation set becomes a direct test of whether it did. Teams that skip this step end up debating prompt wording forever because they have no objective standard to settle the debate. When does this stop applying? Never. Even a throwaway tool benefits from a one-line statement of what it must never do.

Prefer Reversible Designs Over Perfect Ones

The reasoning: you will never make the model perfect, so the smarter investment is making its mistakes cheap to undo rather than chasing a failure rate of zero.

In practice this means favoring drafts over auto-sends, soft-deletes over hard-deletes, staged changes over direct writes, and rate caps over unlimited authority. A reversible mistake is an inconvenience; an irreversible one is an incident. This reframes the whole effort: instead of asking "how do I stop every failure," ask "how do I make each failure survivable." That question has answers, and they are far cheaper than perfection. The limit: some actions genuinely cannot be made reversible, and those are exactly the ones that deserve a human checkpoint and the tightest constraints you can apply.

Frequently Asked Questions

What is the one practice to adopt if I can only adopt one?

Make the model propose and deterministic code dispose. It is the practice that contains the damage from every other failure, so it gives the most protection per unit of effort. Everything else reduces the frequency of failures; this one bounds their consequences.

How adversarial is too adversarial?

When you start adding friction to genuinely trusted, low-stakes interactions, you have overshot. Calibrate to the channel: hostile assumptions for untrusted input feeding powerful actions, light touch for trusted input feeding harmless ones.

Why measure false refusals if my product is in a sensitive domain?

You still measure them; you just accept a higher number deliberately. The point is to make the trade-off consciously rather than letting the model drift into refusing everything because no one was watching that dimension.

How big should my evaluation set be to start?

Start with twenty to fifty cases spanning normal, edge, and attack categories. A small set run consistently beats a large set run never. Grow it from real incidents and red-team findings rather than trying to enumerate everything upfront.

Key Takeaways

Assume untrusted input is adversarial, calibrated to the channel and the power of the downstream action.
Make the model propose and deterministic code dispose; this bounds the cost of every failure.
Measure false refusals as carefully as harmful outputs, because over-refusal is a real failure.
Build the evaluation set before you need it and let it block regressions.
Log enough to reconstruct any decision, and red-team on a schedule to keep defenses current.

Assume Every Input Is Adversarial Until Proven Otherwise

Make the Model Propose, Never Dispose

The single most important architectural commitment. The model should generate intent; deterministic code should decide whether to act on it.

A support agent returns "issue a $40 refund," and your code checks the order exists, the amount is within policy, and the requester is authorized.
A content tool returns text, and your code validates and sanitizes it before anything renders.

Measure False Refusals as Carefully as Harmful Outputs

Most teams track one side of the ledger: did the model produce something harmful? Far fewer track the other: did it refuse something legitimate?

Build the Evaluation Set Before You Need It

The reasoning: you cannot improve what you cannot measure, and the moment you most need a measurement, after a near-miss, is the worst time to start building one.

Tune Your Posture to the Stakes of the Action

Low-stakes, reversible actions: lightweight controls, fast iteration.
High-stakes, irreversible actions: human checkpoints, tighter output constraints, confidence thresholds.

The mistake is uniformity. Decide the posture per action, explicitly, based on the cost of being wrong. This consciousness about trade-offs is what our common mistakes guide warns teams to preserve.

Log Enough to Reconstruct Any Decision

The reasoning: every system fails eventually, and the difference between a learning organization and a stuck one is whether you can investigate the failure afterward.

Red-Team on a Schedule, Not on a Whim

The reasoning: attackers and edge cases evolve, and a system that was probed once at launch is defended against last year's threats.

Write the Spec Before the Prompt

Prefer Reversible Designs Over Perfect Ones

The reasoning: you will never make the model perfect, so the smarter investment is making its mistakes cheap to undo rather than chasing a failure rate of zero.

Frequently Asked Questions

What is the one practice to adopt if I can only adopt one?

How adversarial is too adversarial?

Why measure false refusals if my product is in a sensitive domain?

How big should my evaluation set be to start?

Key Takeaways

Assume untrusted input is adversarial, calibrated to the channel and the power of the downstream action.
Make the model propose and deterministic code dispose; this bounds the cost of every failure.
Measure false refusals as carefully as harmful outputs, because over-refusal is a real failure.
Build the evaluation set before you need it and let it block regressions.
Log enough to reconstruct any decision, and red-team on a schedule to keep defenses current.

Opinionated Safety Rules Learned From Real Attackers

Assume Every Input Is Adversarial Until Proven Otherwise

Make the Model Propose, Never Dispose

Measure False Refusals as Carefully as Harmful Outputs

Build the Evaluation Set Before You Need It

Tune Your Posture to the Stakes of the Action

Log Enough to Reconstruct Any Decision

Red-Team on a Schedule, Not on a Whim

Write the Spec Before the Prompt

Prefer Reversible Designs Over Perfect Ones

Frequently Asked Questions

What is the one practice to adopt if I can only adopt one?

How adversarial is too adversarial?

Why measure false refusals if my product is in a sensitive domain?

How big should my evaluation set be to start?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?

Opinionated Safety Rules Learned From Real Attackers

Assume Every Input Is Adversarial Until Proven Otherwise

Make the Model Propose, Never Dispose

Measure False Refusals as Carefully as Harmful Outputs

Build the Evaluation Set Before You Need It

Tune Your Posture to the Stakes of the Action

Log Enough to Reconstruct Any Decision

Red-Team on a Schedule, Not on a Whim

Write the Spec Before the Prompt

Prefer Reversible Designs Over Perfect Ones

Frequently Asked Questions

What is the one practice to adopt if I can only adopt one?

How adversarial is too adversarial?

Why measure false refusals if my product is in a sensitive domain?

How big should my evaluation set be to start?

Key Takeaways

Agency Script Editorial

Related Articles

Rolling Out AI Hallucinations Across a Team

A Model Behind an API Is Only Potential

Case Study: Large Language Models in Practice

Ready to certify your AI capability?