Named Safety Plays for When an Incident Lands on Your Desk

A playbook is not a philosophy. It's a set of named plays, each with a clear trigger, a named owner, and a defined sequence, so that when a situation arises nobody has to invent a response from scratch. Most AI safety guidance fails precisely here: it explains why safety matters and then leaves you staring at a blank page when an actual incident or deployment decision lands on your desk.

This playbook fixes that. It assumes you are deploying AI into real work with real consequences, not running a research lab. Each play below is something you can assign and run. We've ordered them roughly by when you'll need them, from before you ship to after something goes wrong.

If you're earlier in the journey and want the conceptual grounding first, start with Ai Safety and Alignment Basics: A Beginner's Guide and come back here when you're ready to operationalize.

The Operating Principle Behind Every Play

Before the plays, one principle that holds them together: match oversight to stakes. A model that drafts internal meeting notes needs almost no safety infrastructure. A model that screens job applicants needs a lot. The single most common failure is applying uniform process everywhere, which leads teams to either over-govern trivial uses or under-govern dangerous ones.

Every play in this playbook starts by asking: what's the worst plausible outcome here, and how reversible is it? That answer sets the intensity dial for everything else.

Play 1: The Pre-Deployment Risk Tier

Trigger: Anyone proposes putting an AI system into a workflow.

Owner: The product or process owner, with sign-off from a designated safety reviewer.

Sequence:

Classify the use into a tier. A simple three-tier scheme works: Low (cosmetic or easily reversible), Medium (affects work quality or efficiency), High (touches money, legal exposure, safety, or people's livelihoods).
For Medium and High, write a one-paragraph "unacceptable outcomes" statement in plain language.
Attach required controls to each tier (see Play 3).

This play prevents the most expensive mistake: discovering a High-tier use was deployed with Low-tier care. Tiering takes minutes and routes effort where it belongs.
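One way to make the tiering step repeatable is to encode it as a small function. The sketch below is illustrative only: the `UseCase` fields, the `classify` rules, and the `ready_for_controls` check are assumptions layered on the three-tier scheme above, not a prescribed implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class UseCase:
    """Hypothetical intake record for a proposed AI use."""
    name: str
    touches_money_legal_safety_or_livelihoods: bool
    affects_work_quality_or_efficiency: bool
    unacceptable_outcomes: str = ""  # one plain-language paragraph


def classify(use: UseCase) -> Tier:
    """Map a proposed use onto the three tiers, checking the highest stakes first."""
    if use.touches_money_legal_safety_or_livelihoods:
        return Tier.HIGH
    if use.affects_work_quality_or_efficiency:
        return Tier.MEDIUM
    return Tier.LOW


def ready_for_controls(use: UseCase) -> bool:
    """Medium and High uses need a written unacceptable-outcomes statement."""
    if classify(use) is Tier.LOW:
        return True
    return bool(use.unacceptable_outcomes.strip())


notes = UseCase("meeting notes drafter", False, False)
screening = UseCase("applicant screening", True, True)
print(classify(notes).value, classify(screening).value)
print(ready_for_controls(screening))  # no statement written yet
```

Checking the High criteria first means a use that touches both efficiency and livelihoods lands in the stricter tier, which matches the intent of routing effort where the stakes are.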

Play 2: The Goal Specification Review

Trigger: Any use where the AI optimizes or scores something (ranking, prioritizing, recommending).

Owner: The person who defines the objective.

This is the alignment play. Whenever a system is told to maximize, rank, or score, ask the uncomfortable question: what behavior would technically satisfy this metric while violating its intent?

The Proxy Trap Checklist

Does the metric reward a measurable proxy that diverges from the real goal? (Engagement vs. genuine helpfulness.)
Could the system "win" by exploiting an edge case rather than doing the work?
Are there protected groups or outcomes the metric ignores entirely?

If any answer is yes, the objective needs revision before deployment, not after. This is the cheapest possible point to catch misalignment.
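The checklist can be recorded as structured answers so the "any yes means revise" rule is applied consistently. This is a minimal sketch; the `ProxyTrapReview` class and its field names are hypothetical labels for the three questions above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProxyTrapReview:
    """Hypothetical record of the three checklist answers for one objective."""
    objective: str
    rewards_diverging_proxy: bool
    exploitable_edge_case: bool
    ignores_protected_groups_or_outcomes: bool

    def flagged(self) -> List[str]:
        """Return the names of the checklist questions answered yes."""
        answers = {
            "rewards_diverging_proxy": self.rewards_diverging_proxy,
            "exploitable_edge_case": self.exploitable_edge_case,
            "ignores_protected_groups_or_outcomes": self.ignores_protected_groups_or_outcomes,
        }
        return [name for name, yes in answers.items() if yes]

    def needs_revision(self) -> bool:
        """Any single yes sends the objective back before deployment."""
        return bool(self.flagged())


review = ProxyTrapReview("maximize engagement", True, False, False)
print(review.needs_revision(), review.flagged())
```

Keeping the flagged question names alongside the verdict gives the objective's owner a concrete starting point for the rewrite.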

Play 3: The Guardrail Stack

Trigger: Deployment of any Medium- or High-tier use.

Owner: Engineering, with safety review.

Layer your controls. No single control is sufficient.

Input controls: validate and sanitize what reaches the model; block prompt-injection patterns.
Output controls: filter, fact-check against authoritative sources where possible, and flag low-confidence responses.
Human checkpoints: require human approval before any High-tier action becomes final.
Logging: capture inputs, outputs, and decisions for every consequential interaction.

The depth of this stack scales with the tier from Play 1. The detailed best-practice version lives in Ai Safety and Alignment Basics: Best Practices That Actually Work.
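The four layers can be sketched as a thin wrapper around a model call. Everything named here is an assumption for illustration: the phrase list, the 0.7 confidence threshold, the `run_guarded` function, and a model callable that returns text together with a confidence score. Real input filtering needs far more than substring matching.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical phrase list for illustration; real injection attempts vary widely.
SUSPICIOUS_PHRASES = ("ignore previous instructions", "disregard the system prompt")


def input_control(prompt: str) -> str:
    """Input layer: trim the prompt and block known injection phrasing."""
    cleaned = prompt.strip()
    if any(phrase in cleaned.lower() for phrase in SUSPICIOUS_PHRASES):
        raise ValueError("input blocked: possible prompt injection")
    return cleaned


def output_control(text: str, confidence: float, threshold: float = 0.7) -> dict:
    """Output layer: flag low-confidence responses for closer review."""
    return {"text": text, "low_confidence": confidence < threshold}


def run_guarded(
    prompt: str,
    model: Callable[[str], Tuple[str, float]],  # assumed to return (text, confidence)
    tier: str,
    audit_log: List[dict],
    approver: Optional[Callable[[dict], bool]] = None,
) -> dict:
    """Run one request through input, output, human-checkpoint, and logging layers."""
    cleaned = input_control(prompt)
    text, confidence = model(cleaned)
    result = output_control(text, confidence)
    if tier == "high":
        # Human checkpoint: a High-tier result is not final without approval.
        result["final"] = bool(approver is not None and approver(result))
    else:
        result["final"] = True
    audit_log.append({"tier": tier, "input": cleaned, "output": result})
    return result


def fake_model(prompt: str) -> Tuple[str, float]:
    """Stand-in for a real model call."""
    return ("Our support desk is open 9 to 5.", 0.9)


log: List[dict] = []
pending = run_guarded("What are your hours?", fake_model, "high", log)
approved = run_guarded("What are your hours?", fake_model, "high", log, approver=lambda r: True)
print(pending["final"], approved["final"], len(log))
```

In this sketch a blocked input raises before anything is logged; a fuller version would likely record the block itself as a consequential event.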

Play 4: The Red-Team Pass

Trigger: Before launching any High-tier use, and quarterly thereafter.

Owner: A reviewer who did not build the system.

This is an adversarial review by someone with fresh eyes. Have them deliberately try to make the system fail: feed it edge cases, attempt jailbreaks, push it toward biased or harmful outputs, and check what happens under ambiguity. Document every failure found and whether it was fixed or accepted as a known limitation.

The non-negotiable rule: the red-teamer cannot be the builder. People do not find the flaws in their own designs.
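The builder rule and the findings record can both be enforced in the report structure itself. The sketch below is one possible shape, assuming hypothetical names (`RedTeamPass`, `Finding`) and three statuses ("open", "fixed", "accepted") that the play implies but does not spell out.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List

VALID_STATUSES = ("open", "fixed", "accepted")  # "accepted" = known limitation


@dataclass
class Finding:
    description: str
    status: str = "open"


@dataclass
class RedTeamPass:
    """Hypothetical red-team report for one system."""
    system: str
    builders: FrozenSet[str]
    reviewer: str
    findings: List[Finding] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the rule at creation time: the reviewer must not be a builder.
        if self.reviewer in self.builders:
            raise ValueError("the red-teamer cannot be the builder")

    def record(self, description: str, status: str = "open") -> None:
        """Document a failure and its current disposition."""
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.findings.append(Finding(description, status))

    def unresolved(self) -> List[Finding]:
        """Findings that are neither fixed nor explicitly accepted."""
        return [f for f in self.findings if f.status == "open"]


report = RedTeamPass("lead responder", frozenset({"ana"}), reviewer="ben")
report.record("promises a discount when asked twice", status="fixed")
report.record("vague answer on ambiguous pricing question")
print(len(report.findings), len(report.unresolved()))
```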

Play 5: The Monitoring and Drift Watch

Trigger: Continuously, once anything is in production.

Owner: Operations, with alerts routed to a named human.

Models behave differently as inputs shift over time, and vendor model updates can change behavior overnight. Track:

Output distributions and refusal rates over time.
User-reported errors and overrides.
Any change in the underlying model version.

A sudden shift in any of these is your early warning. Without monitoring, the first sign of a problem is usually an angry customer or a regulator.
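A drift check can be as simple as comparing a current snapshot of these signals against a baseline. This is a rough sketch under stated assumptions: the `Snapshot` fields, the `drift_alerts` function, and the 0.05 tolerance are illustrative choices, not recommended thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical summary of one monitoring window."""
    model_version: str
    total_responses: int
    refusals: int
    user_overrides: int


def rate(count: int, total: int) -> float:
    """Share of responses, guarding against an empty window."""
    return count / total if total else 0.0


def drift_alerts(baseline: Snapshot, current: Snapshot, tolerance: float = 0.05) -> List[str]:
    """Return a message for each tracked signal that moved beyond the tolerance."""
    alerts = []
    if current.model_version != baseline.model_version:
        alerts.append(
            f"model version changed: {baseline.model_version} -> {current.model_version}"
        )
    refusal_shift = rate(current.refusals, current.total_responses) - rate(
        baseline.refusals, baseline.total_responses
    )
    if abs(refusal_shift) > tolerance:
        alerts.append(f"refusal rate shifted by {refusal_shift:+.3f}")
    override_shift = rate(current.user_overrides, current.total_responses) - rate(
        baseline.user_overrides, baseline.total_responses
    )
    if abs(override_shift) > tolerance:
        alerts.append(f"override rate shifted by {override_shift:+.3f}")
    return alerts


baseline = Snapshot("v1", 1000, 20, 30)
current = Snapshot("v2", 1000, 25, 140)
for alert in drift_alerts(baseline, current):
    print(alert)  # route these to the named human who owns the alert
```

Comparing rates rather than raw counts keeps the check meaningful when traffic volume differs between windows.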

Play 6: The Incident Response

Trigger: A safety failure is detected (harmful output, biased decision, breach).

Owner: A pre-named incident lead with authority to pull the system.

Sequence:

Contain: disable or roll back the system. Speed beats elegance here.
Assess: determine scope, who was affected, and whether harm is ongoing.
Notify: affected parties and, where required, regulators.
Root-cause: trace the failure to its source using your logs.
Remediate and learn: fix the cause and update the relevant play so it can't recur.

The authority to pull a system must be granted in advance. Negotiating permission mid-incident is how small failures become large ones.
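The sequence and the standing-authority rule can be captured in a small incident record that refuses out-of-order steps. The names below (`Incident`, `STEPS`, `authorized_leads`) are hypothetical; the sketch only shows the ordering and authority checks, not the actual containment mechanics.

```python
from datetime import datetime, timezone
from typing import List, Set, Tuple

STEPS = ("contain", "assess", "notify", "root_cause", "remediate_and_learn")


class Incident:
    """Hypothetical incident record that enforces the five-step order."""

    def __init__(self, system: str, lead: str, authorized_leads: Set[str]) -> None:
        # Authority must already exist; it is checked here, not negotiated.
        if lead not in authorized_leads:
            raise PermissionError(f"{lead} has no standing authority to pull {system}")
        self.system = system
        self.lead = lead
        self.log: List[Tuple[str, str, str]] = []

    def complete(self, step: str, note: str) -> None:
        """Record a finished step, rejecting anything out of sequence."""
        if len(self.log) >= len(STEPS):
            raise ValueError("all steps are already complete")
        expected = STEPS[len(self.log)]
        if step != expected:
            raise ValueError(f"expected step '{expected}', got '{step}'")
        timestamp = datetime.now(timezone.utc).isoformat()
        self.log.append((step, note, timestamp))

    def next_step(self) -> str:
        """Name the step that should happen next, or 'done'."""
        return STEPS[len(self.log)] if len(self.log) < len(STEPS) else "done"


incident = Incident("lead responder", "dana", authorized_leads={"dana"})
incident.complete("contain", "auto-replies disabled")
print(incident.next_step())
```

The timestamped log doubles as the incident-log artifact, which is useful when tracing root cause or showing a regulator what happened and when.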

Sequencing the Plays

Run them in this order for a new deployment: Play 1 (tier) → Play 2 (goals) → Play 3 (guardrails) → Play 4 (red team) → ship → Play 5 (monitor) ongoing → Play 6 (respond) as needed. For systems already live, start with Play 1 to triage your existing footprint, then backfill Plays 3 through 5 for anything Medium or High. The Framework for Ai Safety and Alignment Basics maps these plays to a reusable structure.

A worked example makes the sequence concrete. Say marketing wants an AI to auto-respond to inbound leads. Play 1 tiers it Medium: a bad reply is embarrassing and reversible, not catastrophic. Play 2 catches a proxy trap: optimizing for "fast replies" would reward speed over accuracy, so the objective is reframed around correct, on-brand responses. Play 3 adds output filtering and a human checkpoint for any reply that quotes pricing. Play 4 has a teammate try to make it promise discounts it shouldn't; two failures surface and get fixed. It ships. Play 5 monitors override rates, and when they spike after a vendor model update, the named owner investigates. No incident occurs, because the plays caught the problems in order.

What the Plays Have in Common

Step back and the six plays share a spine. Each one converts a vague worry ("could this go wrong?") into a specific, ownable action. Each assigns accountability to a person, not a committee. And each produces an artifact (a tier classification, a goal review, a guardrail config, a red-team report, a monitoring dashboard, an incident log) that becomes evidence you can show a stakeholder or regulator. The playbook works not because any single play is clever, but because together they leave no gap where "someone should have caught that" can hide.

Frequently Asked Questions

How small can a team be to run this playbook?

A single person can run a lightweight version of every play except one. The plays don't require headcount; they require named ownership and discipline. In a small team, one person may own several plays, but the red-team pass (Play 4) must always go to someone other than the builder, so a solo operator needs to borrow a reviewer for that step.

Do I need every play for every project?

No. Play 1 applies to everything, and Play 2 applies wherever the AI optimizes or scores something. Plays 3 through 5 scale with the risk tier you assign. A Low-tier use might need only tiering, basic logging, and an owner. Forcing High-tier rigor onto trivial uses burns goodwill and slows adoption.

What's the most commonly skipped play?

The red-team pass. Teams are eager to ship and reluctant to invite someone to break their work. It's also the play that catches the failures the builders are blind to, which makes skipping it especially costly.

How often should I revisit the playbook itself?

Quarterly, plus after any incident. Model capabilities, vendor behavior, and regulations all move quickly. A play that was sufficient six months ago may now miss a new failure mode, such as a novel jailbreak technique.

Who should hold the authority to pull a system in production?

A pre-named incident lead with explicit, standing authority. The point of naming them in advance is that incidents are fast and political. If shutdown requires assembling approvers in the moment, the delay itself becomes the harm.

Key Takeaways

A playbook turns AI safety from a philosophy into named plays with triggers, owners, and sequences.
Match oversight intensity to stakes; uniform process either over-governs trivial uses or under-governs dangerous ones.
Tier every use first, then review goals for proxy traps before building anything.
Layer guardrails (input, output, human checkpoint, logging) and scale their depth to the risk tier.
Red-teaming must be done by someone other than the builder; it's the most-skipped and highest-value play.
Grant shutdown authority in advance so incident response is fast, not political.

Agency Script Editorial
