AI systems fail differently from traditional software. They do not just crash or return errors. They produce wrong answers confidently, drift in quality silently, and make decisions that have real-world consequences before anyone notices.
When an AI incident occurs in a system the agency delivered, the client does not care about technical explanations. They care about three things: how fast the problem is contained, how clearly the situation is communicated, and whether the agency has a plan to prevent it from happening again.
An incident response playbook answers all three before the incident happens.
Why AI Incidents Are Different
Traditional software incidents are usually binary: the system works or it does not. AI incidents exist on a spectrum.
Silent degradation. A model's accuracy drops from 95% to 78% over several weeks because the input data distribution has shifted. No error is thrown. No alert fires. The system continues producing outputs, just worse ones.
Confident errors. The model produces an output that is completely wrong but presented with high confidence. Downstream systems or users act on the wrong output before anyone questions it.
Bias emergence. A system that performed fairly during testing begins producing biased outputs when exposed to production data that differs from the training distribution.
Cascade failures. An AI component produces unexpected output that breaks downstream systems in ways that were not anticipated during integration testing.
Adversarial exploitation. Users or external actors discover inputs that cause the model to behave in unintended or harmful ways.
Each of these failure modes requires a different detection and response approach, which is why a generic incident response plan does not work for AI systems.
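Silent degradation in particular only surfaces if output quality is measured continuously, since no error is ever thrown. As a minimal sketch of what that detection could look like, assuming a labelled feedback sample is available (class and function names here are illustrative, not part of any specific playbook):

```python
from collections import deque


class RollingAccuracy:
    """Track accuracy over a sliding window of labelled outcomes."""

    def __init__(self, window: int = 1000):
        # deque with maxlen drops the oldest outcome automatically
        self.outcomes: deque[bool] = deque(maxlen=window)

    def record(self, correct: bool) -> None:
        self.outcomes.append(correct)

    def accuracy(self) -> float:
        if not self.outcomes:
            return 1.0  # no evidence of degradation yet
        return sum(self.outcomes) / len(self.outcomes)


def degraded(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """Flag a drop of more than `tolerance` below the baseline,
    e.g. the 95% -> 78% drift described above."""
    return (baseline - current) > tolerance
```

The key design point is that the check compares against a baseline rather than an absolute number, so the same logic works for models with different starting accuracy.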
The Incident Response Playbook
Section 1: Incident Classification
Not every issue requires the same response. Classify incidents by severity to ensure proportional response.
Severity 1 - Critical
- AI system is completely unavailable
- system is producing outputs that cause financial, legal, or safety harm
- data breach or unauthorized data access involving the AI system
- response: immediate, all-hands mobilization
Severity 2 - High
- significant degradation in AI output quality
- system producing biased or discriminatory outputs
- integration failure affecting business-critical workflows
- response: within 1 hour during business hours, within 4 hours outside
Severity 3 - Medium
- noticeable quality degradation that does not affect critical decisions
- intermittent errors or timeouts
- performance degradation below SLA thresholds
- response: within 4 business hours
Severity 4 - Low
- minor quality inconsistencies
- cosmetic issues in AI outputs
- non-critical feature malfunction
- response: within 1 business day
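The classification above maps naturally to a small lookup that tooling can share, so the severity level and its response target are never out of sync. A minimal sketch (the enum and SLA values mirror the list above; names are illustrative):

```python
from enum import IntEnum


class Severity(IntEnum):
    """Incident severity levels from the classification above."""
    CRITICAL = 1  # harm, breach, or total outage
    HIGH = 2      # quality degradation, bias, critical-workflow failure
    MEDIUM = 3    # non-critical degradation, intermittent errors
    LOW = 4       # cosmetic or minor issues


# Response-time targets in minutes (business-hours values; the
# after-hours window for HIGH would be 240 minutes instead).
RESPONSE_SLA_MINUTES = {
    Severity.CRITICAL: 0,    # immediate, all-hands
    Severity.HIGH: 60,       # within 1 hour
    Severity.MEDIUM: 240,    # within 4 business hours
    Severity.LOW: 480,       # within 1 business day (8 business hours)
}


def response_deadline_minutes(severity: Severity) -> int:
    """Return the maximum time-to-response for a given severity."""
    return RESPONSE_SLA_MINUTES[severity]
```

Keeping the table in one place means alerting, dashboards, and client reporting all quote the same numbers.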
Section 2: Detection and Alerting
Incidents that are not detected quickly cannot be resolved quickly.
Monitoring requirements:
- model output quality metrics tracked continuously
- API response times and error rates
- data pipeline health and freshness
- usage patterns and anomaly detection
- cost monitoring for unexpected spikes
- user feedback and complaint tracking
Alert thresholds:
Define specific thresholds that trigger alerts for each monitoring metric. These should be calibrated to avoid alert fatigue while catching real issues.
Example thresholds:
- accuracy drops below 90% over a rolling 24-hour window
- API error rate exceeds 2% over a 15-minute period
- response latency p95 exceeds 5 seconds
- data pipeline delay exceeds 2 hours
- cost per day exceeds 150% of the trailing 7-day average
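The example thresholds above can be encoded as a single check that runs on each monitoring cycle and returns every alert that fired. A sketch, assuming the metrics are already collected into one snapshot (field names are illustrative):

```python
from dataclasses import dataclass


@dataclass
class MetricSnapshot:
    """Current values for the monitored metrics."""
    accuracy_24h: float          # rolling 24-hour accuracy (0..1)
    error_rate_15m: float        # API error rate over 15 minutes (0..1)
    latency_p95_seconds: float
    pipeline_delay_hours: float
    cost_today: float
    cost_7d_avg: float           # trailing 7-day daily average


def triggered_alerts(m: MetricSnapshot) -> list[str]:
    """Apply the example thresholds and return the alerts that fired."""
    alerts = []
    if m.accuracy_24h < 0.90:
        alerts.append("accuracy below 90% (24h window)")
    if m.error_rate_15m > 0.02:
        alerts.append("API error rate above 2% (15m window)")
    if m.latency_p95_seconds > 5.0:
        alerts.append("p95 latency above 5 seconds")
    if m.pipeline_delay_hours > 2.0:
        alerts.append("data pipeline delayed more than 2 hours")
    if m.cost_today > 1.5 * m.cost_7d_avg:
        alerts.append("daily cost above 150% of 7-day average")
    return alerts
```

Returning all fired alerts, rather than the first, matters for AI incidents: a cost spike plus an accuracy drop often points at a different root cause than either signal alone.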
Section 3: Response Procedures
Immediate actions (first 30 minutes):
- Acknowledge the incident and assign an incident commander
- Assess severity using the classification criteria
- Contain the impact (disable the system, switch to fallback, or route to manual processing)
- Notify affected stakeholders based on the communication plan
- Begin documenting in the incident log
Investigation (30 minutes to 4 hours):
- Identify the root cause or most likely contributing factors
- Determine the scope of impact (how many users, transactions, or decisions were affected)
- Evaluate whether the containment action is sufficient or needs escalation
- Develop a remediation plan with estimated timeline
- Update stakeholders on findings and expected resolution
Remediation (4 hours to resolution):
- Implement the fix in a staging environment
- Validate the fix against the original failure case and regression tests
- Deploy the fix to production with monitoring
- Verify that the issue is resolved and performance has returned to normal
- Notify stakeholders that the incident is resolved
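The phased procedure above depends on the incident log being kept as events happen, not reconstructed afterwards. A minimal structure for that log might look like the following sketch (field names are assumptions, not a prescribed schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLogEntry:
    timestamp: datetime
    phase: str      # "immediate", "investigation", or "remediation"
    actor: str      # who took the action
    action: str     # what was done


@dataclass
class Incident:
    incident_id: str
    severity: int   # 1 (critical) through 4 (low)
    commander: str
    log: list[IncidentLogEntry] = field(default_factory=list)

    def record(self, phase: str, actor: str, action: str) -> None:
        """Append a timestamped entry so the post-incident review
        can reconstruct the exact timeline."""
        self.log.append(IncidentLogEntry(
            timestamp=datetime.now(timezone.utc),
            phase=phase, actor=actor, action=action))
```

Timestamping each entry at write time is what makes the Section 5 timeline reliable; memory of event ordering degrades quickly under incident pressure.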
Section 4: Communication Plan
Communication during an AI incident must be proactive, honest, and structured.
Internal communication:
- incident channel or thread for real-time coordination
- regular status updates every 30 minutes during active incidents
- clear escalation path when the incident commander needs additional resources
Client communication:
- initial notification within the SLA timeframe with what is known
- regular updates even when the situation has not changed (silence breeds anxiety)
- technical detail level appropriate for the audience
- clear statement of impact, actions being taken, and expected timeline
- honest acknowledgment when the cause is unknown or the timeline is uncertain
Communication templates:
Prepare templates for each severity level so that communication is fast and consistent during high-pressure situations.
Initial notification template:
"We have identified an issue with [system name] that is affecting [specific functionality]. The issue was detected at [time]. Our team is actively investigating. Current impact: [description]. We will provide an update within [timeframe]."
Update template:
"Update on [system name] incident: [Current status]. Root cause: [identified/under investigation]. Actions taken: [description]. Expected resolution: [timeframe/unknown]. Next update: [timeframe]."
Resolution template:
"The incident affecting [system name] has been resolved as of [time]. Root cause: [description]. Impact: [scope]. Preventive measures: [description]. We will provide a full post-incident report within [timeframe]."
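Storing the templates as format strings with named placeholders makes them fast to fill under pressure and hard to send half-complete. A sketch of the initial-notification template in that form (the helper and its error behavior are an illustration, not a prescribed tool):

```python
INITIAL_NOTIFICATION = (
    "We have identified an issue with {system} that is affecting "
    "{functionality}. The issue was detected at {detected_at}. Our team "
    "is actively investigating. Current impact: {impact}. We will "
    "provide an update within {next_update}."
)


def render(template: str, **fields: str) -> str:
    """Fill a template. str.format raises KeyError if any placeholder
    is missing, so an incomplete notification cannot be sent."""
    return template.format(**fields)
```

The same pattern covers the update and resolution templates; the failure-loudly behavior is the point, since a bracketed placeholder leaking into a client email is worse than a short delay.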
Section 5: Post-Incident Review
Every Severity 1 and 2 incident should receive a formal post-incident review within one week of resolution.
Review content:
- timeline of events from detection to resolution
- root cause analysis
- impact assessment (users affected, decisions impacted, financial cost)
- evaluation of the response (what worked, what was slow, what was missed)
- action items to prevent recurrence
- updates to the playbook based on lessons learned
Share the review with the client for Severity 1 and 2 incidents. This demonstrates accountability and builds trust.
Section 6: Roles and Responsibilities
Incident Commander: Owns the incident from declaration to closure. Makes decisions about escalation, communication, and resource allocation.
Technical Lead: Leads the investigation and remediation effort. Provides technical updates to the incident commander.
Communication Lead: Handles all stakeholder communication. Ensures updates are timely and accurate.
On-Call Engineer: First responder for after-hours incidents. Performs initial assessment and escalation.
Define who fills each role, including backup assignments for when primary assignees are unavailable.
Section 7: Playbook Maintenance
The playbook is a living document. Update it:
- after every significant incident (incorporate lessons learned)
- when monitoring or alerting capabilities change
- when new AI systems are deployed
- when team members or roles change
- at least quarterly for a general review
Client-Facing Incident Expectations
Include incident response terms in the client agreement:
- defined SLAs for response and resolution by severity level
- communication frequency during active incidents
- post-incident reporting commitments
- escalation contact information
- exclusions (incidents caused by client actions, third-party outages, etc.)
Setting these expectations upfront prevents disputes during the stress of an actual incident.
The Trust Equation
How an agency handles incidents reveals more about its character than how it handles successes.
Agencies that respond quickly, communicate honestly, and learn systematically from failures build deeper client trust than agencies that never have incidents because they never monitor for them.
The playbook is not about preventing all failures. It is about ensuring that when failures occur, the agency's response is fast, transparent, and continuously improving.