Classifying and Responding to AI Incidents

Agency Script Editorial

Editorial Team

March 20, 2026 · 12 min read
Tags: ai incident response, ai incident classification, ai failure management, ai system monitoring

A customer service AI built by an agency in Phoenix started generating inappropriate responses to customer complaints in November 2025. The AI was deployed for a regional bank, handling routine account inquiries. A model update from the underlying LLM provider subtly changed the model's behavior, causing it to occasionally generate responses that disclosed account details of other customers in what appeared to be contextual confusion. The bank's team noticed the issue when a customer called to report receiving someone else's account information in a chat interaction. By the time the issue was escalated and investigated and the system was taken offline, fourteen customers had been affected over a seventy-two-hour window. The agency had no incident classification system, no response playbook, and no escalation procedures. The bank's compliance team handled it as a data breach, triggering regulatory notification requirements. The AI agency lost the contract, and the bank spent $340,000 on breach notification, credit monitoring, and regulatory response.

AI systems fail differently than traditional software. Traditional software bugs are usually deterministic: the same input produces the same wrong output. AI failures are often probabilistic, intermittent, and difficult to detect. A model that works correctly 99.5 percent of the time is still failing about once in every two hundred interactions, and those failures may be scattered across different failure modes. Without a systematic approach to classifying and responding to AI incidents, your agency will be reactive, slow, and unprepared when failures occur.
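
To make the 99.5 percent figure concrete, here is a small sketch of the arithmetic. It assumes failures are independent of one another, which real systems may not satisfy, and the daily volume of 10,000 interactions is a hypothetical number chosen only for illustration.

```python
def expected_failures(success_rate: float, interactions: int) -> float:
    """Expected number of failed interactions, assuming independent failures."""
    return (1.0 - success_rate) * interactions


def prob_at_least_one_failure(success_rate: float, interactions: int) -> float:
    """Probability that at least one of `interactions` fails."""
    return 1.0 - success_rate ** interactions


# About one expected failure per 200 interactions at 99.5 percent success.
print(round(expected_failures(0.995, 200), 6))

# Chance of seeing at least one failure within those 200 interactions.
print(round(prob_at_least_one_failure(0.995, 200), 3))

# Hypothetical daily volume: 10,000 interactions.
print(round(expected_failures(0.995, 10_000), 6))
```

Even a small per-interaction failure rate turns into a steady stream of failures at production volume, which is why the failures need to be classified rather than handled one at a time.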

This post provides a comprehensive incident classification and response framework designed specifically for AI systems.

Why AI Incidents Are Different

The Nature of AI Failures

Probabilistic failures: A model might generate incorrect output only for certain input patterns, at certain times, or under certain contextual conditions. These failures are harder to detect, harder to reproduce, and harder to fix than deterministic bugs.

Gradual degradation: AI systems can degrade gradually over time as the data they encounter in production drifts from the data they were trained on. Performance may decline slowly enough that no single incident triggers an alert, but cumulative degradation significantly impacts quality.
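
One way to catch degradation that no single interaction reveals is to compare a rolling average of a quality score against a fixed baseline. The sketch below is illustrative only: the class name, the window size, the 0.05 tolerance, and the idea of a single numeric quality score per interaction are all assumptions, not something the framework prescribes.

```python
from collections import deque
from statistics import mean


class RollingQualityMonitor:
    """Flags when the rolling mean of a quality score drifts below a baseline."""

    def __init__(self, baseline: float, window: int = 100, max_drop: float = 0.05):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def record(self, score: float) -> bool:
        """Add a score; return True if the rolling mean has dropped too far."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge
        return self.baseline - mean(self.scores) > self.max_drop


# Simulated slow decline: each score is only 0.002 lower than the previous one.
monitor = RollingQualityMonitor(baseline=0.90, window=10, max_drop=0.05)
first_alert = None
for i in range(100):
    if monitor.record(0.90 - 0.002 * i) and first_alert is None:
        first_alert = i
print("first alert at interaction", first_alert)
```

No individual step in the simulation looks alarming, yet the monitor still raises an alert once the accumulated decline exceeds the tolerance.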

Emergent behavior: AI systems, particularly large language models, can exhibit unexpected behaviors that were not present during testing. Prompt injection attacks, adversarial inputs, and novel input patterns can trigger responses that no one anticipated.

Cascading effects: AI systems often feed into downstream processes. An incorrect AI recommendation might trigger automated actions that compound the original error. A pricing model that generates incorrect prices might feed into automated quoting systems that generate incorrect quotes that lead to binding commitments.

Provider-originated failures: Many AI failures originate not from your code but from your AI provider. Model updates, API changes, rate limiting, and provider outages all create incidents that you must respond to even though you did not cause them.

Why Classification Matters

Without a classification system, every AI incident receives the same response—usually either panic or neglect. Classification enables:

  • Appropriate response allocation: Critical incidents get immediate attention; minor issues get scheduled review
  • Consistent communication: Clients receive consistent, appropriate communication based on incident severity
  • Resource planning: You can staff your team appropriately for the incident volume and severity you experience
  • Pattern detection: Classified incidents can be analyzed for patterns that reveal systemic issues
  • Regulatory compliance: Many regulations require incident response proportional to severity, which is only possible once severity is defined and assigned consistently

AI Incident Classification Framework

Severity Levels

Severity 1 — Critical

Impact: AI system is causing active harm to users, clients, or third parties. Data exposure, discriminatory outputs, safety-critical failures, or regulatory violations.

Examples:

  • AI system exposing personal data of users to other users
  • AI system generating discriminatory outputs in protected decision-making contexts (hiring, lending, healthcare)
  • AI system providing dangerous medical, legal, or financial advice that could cause harm
  • AI system experiencing a security breach with data exfiltration
  • AI system violating regulatory requirements in a way that triggers notification obligations

Response requirements:

  • Immediate escalation to agency leadership and client leadership
  • System taken offline or moved to safe mode within one hour
  • Incident commander assigned
  • Root cause investigation begins immediately
  • Client and affected parties notified per contractual and regulatory requirements
  • Post-incident review within 48 hours of resolution

Severity 2 — High

Impact: AI system is producing significantly incorrect or degraded outputs that affect business operations but are not causing active harm.

Examples:

  • AI system accuracy has dropped below acceptable thresholds across a significant portion of interactions
  • AI system is generating inappropriate but not harmful content
  • AI system is failing to process a significant portion of requests
  • AI model provider has announced deprecation of a model version your system depends on with a tight timeline
  • AI system is experiencing performance issues that significantly impact user experience

Response requirements:

  • Escalation to project lead and client point of contact within four hours
  • Investigation begins within four hours
  • Mitigation plan developed within 24 hours
  • Resolution within 72 hours or escalation to Severity 1
  • Post-incident review within one week of resolution

Severity 3 — Medium

Impact: AI system has a noticeable quality issue that affects some users or use cases but does not significantly impact overall operations.

Examples:

  • AI system performance has degraded moderately in a specific use case
  • AI system is generating occasional incorrect outputs that are caught by existing quality checks
  • A model provider has changed pricing or terms in a way that requires adjustment
  • AI system monitoring has detected drift that has not yet impacted quality significantly
  • A component in the AI stack has a known vulnerability that is not actively exploited

Response requirements:

  • Logged and assigned to appropriate team member
  • Investigation within one business day
  • Resolution within one week
  • Included in next regular client report

Severity 4 — Low

Impact: Minor issue with AI system quality, performance, or operations that does not affect users noticeably.

Examples:

  • Slight decrease in a secondary performance metric
  • Cosmetic issues in AI-generated content
  • Minor version updates available for dependencies
  • Documentation inconsistencies
  • Optimization opportunities identified through monitoring

Response requirements:

  • Logged for tracking
  • Addressed during regular maintenance cycles
  • No immediate escalation required
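
The response targets above can be written down as data so that deadlines are computed rather than remembered. This is a minimal sketch: the `Severity` enum, the `RESPONSE_TARGETS` table, and the step names are hypothetical, and it only covers targets measured from detection. "One business day" is approximated as 24 calendar hours, and the post-incident review targets are left out because they are measured from resolution, not detection.

```python
from datetime import datetime, timedelta
from enum import IntEnum


class Severity(IntEnum):
    CRITICAL = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4


# Time allowed for each step, measured from detection (hypothetical step names).
RESPONSE_TARGETS = {
    Severity.CRITICAL: {"containment": timedelta(hours=1)},
    Severity.HIGH: {
        "escalation": timedelta(hours=4),
        "investigation_start": timedelta(hours=4),
        "mitigation_plan": timedelta(hours=24),
        "resolution": timedelta(hours=72),
    },
    Severity.MEDIUM: {
        # Simplification: one business day treated as 24 calendar hours.
        "investigation_start": timedelta(days=1),
        "resolution": timedelta(weeks=1),
    },
    Severity.LOW: {},  # handled in regular maintenance cycles, no fixed deadline
}


def deadlines(severity: Severity, detected_at: datetime) -> dict:
    """Return the absolute deadline for each response step."""
    return {
        step: detected_at + allowed
        for step, allowed in RESPONSE_TARGETS[severity].items()
    }


detected = datetime(2026, 3, 20, 9, 0)
for step, due in deadlines(Severity.HIGH, detected).items():
    print(step, due.isoformat())
```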

Incident Types

Beyond severity, classify incidents by type to enable pattern analysis and targeted improvements.

Output quality incidents: The AI produces incorrect, misleading, or substandard outputs.

  • Hallucination: AI generates factually incorrect information
  • Relevance failure: AI provides irrelevant responses to user queries
  • Quality degradation: Overall output quality has declined
  • Bias detection: AI outputs show patterns of bias

Safety incidents: The AI produces outputs that could cause harm.

  • Harmful content: AI generates content that could cause harm to users
  • Data leakage: AI reveals information it should not
  • Security exploitation: AI is manipulated through prompt injection or adversarial inputs
  • Regulatory violation: AI output violates applicable regulations

Availability incidents: The AI system is unavailable or significantly degraded.

  • System outage: AI service is completely unavailable
  • Performance degradation: AI service is available but significantly slower than normal
  • Capacity exhaustion: AI service cannot handle the request volume
  • Provider outage: Underlying AI provider is experiencing issues

Integration incidents: The AI system's interaction with other systems is failing.

  • Data pipeline failure: Input data is not flowing correctly
  • Output integration failure: AI outputs are not being consumed correctly by downstream systems
  • Authentication or authorization failure: Access to AI services is disrupted
  • API compatibility: Provider API changes break existing integrations

Governance incidents: The AI system violates governance policies.

  • Compliance violation: AI system violates regulatory or contractual requirements
  • Policy violation: AI system behavior violates internal policies
  • Monitoring gap: Discovery of unmonitored AI behavior
  • Documentation gap: Discovery that AI system documentation is incomplete or inaccurate
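
Pattern analysis only works if everyone uses the same labels, so it can help to encode the taxonomy above and reject anything outside it. The following is a sketch under that assumption; the identifier spellings and the `validate_type` helper are hypothetical.

```python
# Category -> allowed subtypes, mirroring the incident types listed above.
INCIDENT_TYPES = {
    "output_quality": {
        "hallucination", "relevance_failure", "quality_degradation", "bias_detection",
    },
    "safety": {
        "harmful_content", "data_leakage", "security_exploitation", "regulatory_violation",
    },
    "availability": {
        "system_outage", "performance_degradation", "capacity_exhaustion", "provider_outage",
    },
    "integration": {
        "data_pipeline_failure", "output_integration_failure",
        "auth_failure", "api_compatibility",
    },
    "governance": {
        "compliance_violation", "policy_violation", "monitoring_gap", "documentation_gap",
    },
}


def validate_type(category: str, subtype: str) -> str:
    """Return a normalized 'category/subtype' label or raise ValueError."""
    if category not in INCIDENT_TYPES:
        raise ValueError(f"unknown category: {category}")
    if subtype not in INCIDENT_TYPES[category]:
        raise ValueError(f"{subtype} is not a subtype of {category}")
    return f"{category}/{subtype}"


print(validate_type("safety", "data_leakage"))
```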

Incident Response Procedures

Detection

You cannot respond to incidents you do not detect. Build detection capabilities across multiple channels.

Automated monitoring: Implement monitoring that tracks AI system quality, performance, and behavior in real time.

  • Output quality metrics: Accuracy, relevance scores, hallucination detection
  • Performance metrics: Latency, throughput, error rates
  • Safety monitors: Content filtering results, data leakage detection, anomaly detection
  • Availability metrics: Uptime, response time, capacity utilization
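
A simple starting point for automated monitoring is a set of thresholds checked against a periodic metrics snapshot. The sketch below assumes the metrics are already collected somewhere; the metric names and limits are hypothetical and would need to be set per system. A missing metric is reported as a problem, since a gap in monitoring is itself worth knowing about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool


def breached(thresholds: list, snapshot: dict) -> list:
    """Return a human-readable message for every threshold the snapshot violates."""
    problems = []
    for t in thresholds:
        value = snapshot.get(t.metric)
        if value is None:
            problems.append(f"{t.metric}: no data")
        elif t.higher_is_better and value < t.limit:
            problems.append(f"{t.metric}: {value} below {t.limit}")
        elif not t.higher_is_better and value > t.limit:
            problems.append(f"{t.metric}: {value} above {t.limit}")
    return problems


# Hypothetical thresholds for one deployed system.
THRESHOLDS = [
    Threshold("accuracy", 0.95, higher_is_better=True),
    Threshold("p95_latency_ms", 2000, higher_is_better=False),
    Threshold("error_rate", 0.02, higher_is_better=False),
]

for message in breached(THRESHOLDS, {"accuracy": 0.91, "p95_latency_ms": 1500}):
    print(message)
```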

Human monitoring: Some AI failures are difficult to detect automatically. Implement human-in-the-loop monitoring.

  • Regular review of AI output samples by qualified team members
  • Client feedback channels for reporting AI quality issues
  • End-user feedback mechanisms (thumbs up/down, ratings)
  • Periodic quality audits comparing AI output against ground truth

External monitoring: Monitor external sources for information about issues that could affect your AI systems.

  • Provider status pages and incident notifications
  • Security vulnerability databases for AI frameworks and dependencies
  • Regulatory announcements that could affect AI compliance
  • Community reports of issues with models or tools you use

Triage

When an incident is detected, triage quickly.

Triage questions:

  • Is anyone being actively harmed? If yes, this is Severity 1.
  • Is the system producing significantly incorrect results? If yes, this is at least Severity 2.
  • Is the system available and functioning within normal parameters for most users? If no, this is at least Severity 2.
  • Is there a regulatory or contractual violation? If yes, this is at least Severity 2, possibly Severity 1.
  • Can the issue wait for normal business hours? If no, escalate immediately.

Triage should take no more than fifteen minutes. The goal is to classify severity and type quickly so that appropriate response procedures are activated. Detailed investigation happens after triage.
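
The triage questions can be expressed as a small decision function, which keeps classification consistent across whoever is on call. This sketch makes two assumptions beyond the questions themselves: a violation that triggers notification obligations is treated as Severity 1 (following the Severity 1 examples), and the split between Severity 3 and 4 depends on whether users are noticeably affected (following the impact descriptions). The field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TriageAnswers:
    active_harm: bool
    significantly_incorrect: bool
    normal_for_most_users: bool
    regulatory_or_contract_violation: bool
    notification_obligation: bool = False      # assumption: raises a violation to Sev 1
    users_noticeably_affected: bool = False    # assumption: separates Sev 3 from Sev 4
    can_wait_for_business_hours: bool = True


def triage(a: TriageAnswers) -> tuple:
    """Return (severity, escalate_now) from the triage answers."""
    if a.active_harm or (a.regulatory_or_contract_violation and a.notification_obligation):
        severity = 1
    elif (
        a.significantly_incorrect
        or not a.normal_for_most_users
        or a.regulatory_or_contract_violation
    ):
        severity = 2
    elif a.users_noticeably_affected:
        severity = 3
    else:
        severity = 4
    # Severity 1 always escalates; anything that cannot wait also escalates.
    escalate_now = severity == 1 or not a.can_wait_for_business_hours
    return severity, escalate_now


print(triage(TriageAnswers(
    active_harm=False,
    significantly_incorrect=True,
    normal_for_most_users=True,
    regulatory_or_contract_violation=False,
)))
```

The function only produces a starting classification. Severity can still be raised or lowered once the detailed investigation begins.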

Response

For Severity 1 incidents:

  • Activate the incident response team immediately
  • Assign an incident commander who owns the response
  • Establish a communication channel (dedicated Slack channel, bridge call)
  • Assess the blast radius: how many users, what data, what systems are affected
  • Implement immediate containment: take the system offline, switch to fallback, or activate safe mode
  • Notify the client's designated incident contact
  • Document every action taken with timestamps
  • Assess regulatory notification requirements
  • Begin root cause investigation in parallel with containment
  • Provide regular status updates to all stakeholders
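
Documenting every action with a timestamp is easier when the log is a simple append-only structure that can later be turned into the post-incident timeline. The `IncidentLog` class below is a hypothetical sketch, and the incident ID, roles, and actions in the example are invented for illustration. In practice this might live in a ticketing system or a shared channel rather than in code.

```python
from datetime import datetime, timezone
from typing import Optional


class IncidentLog:
    """Append-only record of response actions for one incident."""

    def __init__(self, incident_id: str):
        self.incident_id = incident_id
        self.entries = []

    def record(self, actor: str, action: str, at: Optional[datetime] = None) -> None:
        """Store an action; defaults to the current UTC time if none is given."""
        self.entries.append((at or datetime.now(timezone.utc), actor, action))

    def timeline(self) -> list:
        """Return entries as strings in chronological order."""
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return [f"{at.isoformat()} {actor}: {action}" for at, actor, action in ordered]


log = IncidentLog("INC-0001")  # hypothetical incident ID
log.record("incident commander", "switched system to safe mode",
           at=datetime(2026, 3, 20, 14, 5, tzinfo=timezone.utc))
log.record("on-call engineer", "confirmed data exposure in chat logs",
           at=datetime(2026, 3, 20, 13, 50, tzinfo=timezone.utc))
for line in log.timeline():
    print(line)
```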

For Severity 2 incidents:

  • Assign a lead investigator
  • Assess the scope and impact
  • Develop a mitigation plan
  • Communicate status and plan to the client
  • Implement the mitigation
  • Verify the fix
  • Document the incident and resolution

For Severity 3 and 4 incidents:

  • Log the incident with classification and details
  • Assign to an appropriate team member
  • Resolve during normal work cycles
  • Update the incident log with resolution details

Post-Incident Review

Every Severity 1 and 2 incident should have a post-incident review (also called a postmortem or retrospective).

The review should cover:

  • What happened (timeline of events)
  • Why it happened (root cause analysis)
  • What was the impact (users affected, duration, financial impact)
  • How was it detected (was detection timely? could it have been detected sooner?)
  • How was it responded to (was the response effective? what could be improved?)
  • What will be done to prevent recurrence (specific, actionable improvements)
  • Who is responsible for each preventive action, and what is the timeline

Post-incident reviews should be blameless. The goal is to improve systems and processes, not to assign blame. People who fear blame will hide incidents, which makes everything worse.

Building Your Incident Response Capability

Runbooks

Create runbooks for common incident scenarios. A runbook provides step-by-step instructions for responding to a specific type of incident.

Essential runbooks for AI agencies:

  • LLM provider outage response
  • Model quality degradation response
  • Data leakage detection and response
  • AI output safety incident response
  • Client communication during incidents
  • Rollback procedures for model updates
  • Failover procedures for multi-provider architectures

Each runbook should include:

  • Trigger conditions (when to use this runbook)
  • Step-by-step procedures
  • Contact information for relevant team members and providers
  • Communication templates for client notification
  • Escalation criteria and procedures
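
If runbooks are stored as structured documents, a short check can confirm that each one contains the five elements listed above before it is relied on during an incident. This sketch assumes a runbook is represented as a dictionary; the key names and the example runbook content are hypothetical.

```python
# Hypothetical keys for the five required runbook elements.
REQUIRED_SECTIONS = (
    "trigger_conditions",
    "procedures",
    "contacts",
    "communication_templates",
    "escalation",
)


def missing_sections(runbook: dict) -> list:
    """Return the required sections that are absent or empty in a runbook."""
    return [section for section in REQUIRED_SECTIONS if not runbook.get(section)]


draft = {
    "title": "LLM provider outage response",
    "trigger_conditions": ["Provider status page reports an outage"],
    "procedures": ["Confirm outage", "Switch to fallback", "Notify client"],
    "contacts": [],  # present but empty, so it is reported as missing
}
print(missing_sections(draft))
```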

On-Call Rotation

For agencies with production AI systems, establish an on-call rotation.

  • Define on-call coverage hours (24/7 for critical systems, business hours for less critical systems)
  • Ensure on-call engineers have access to all necessary systems and documentation
  • Define escalation procedures when the on-call engineer needs additional help
  • Compensate on-call time appropriately
  • Rotate on-call responsibilities to avoid burnout
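
A basic rotation can be generated rather than maintained by hand. The sketch below assumes weekly shifts assigned round-robin, which is only one of many reasonable schemes; the engineer names and the start date are placeholders.

```python
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(engineers: list, start: date, weeks: int) -> list:
    """Assign one engineer per week in round-robin order."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    people = cycle(engineers)
    return [(start + timedelta(weeks=week), next(people)) for week in range(weeks)]


# Placeholder names; each tuple is (week start date, engineer on call).
for week_start, engineer in weekly_rotation(["ana", "ben", "chi"], date(2026, 3, 23), 4):
    print(week_start.isoformat(), engineer)
```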

Incident Tracking

Maintain an incident database that tracks all incidents.

For each incident, record:

  • Incident ID
  • Date and time detected
  • Date and time resolved
  • Severity and type classification
  • Description
  • Root cause
  • Impact (users affected, duration, financial)
  • Response actions taken
  • Preventive actions identified
  • Status of preventive actions

Analyze incident data regularly:

  • Monthly: Review incident volume and severity trends
  • Quarterly: Analyze patterns across incident types and identify systemic issues
  • Annually: Comprehensive review of incident response effectiveness and capability gaps
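
The record fields and the monthly review can be sketched with a small data structure and two aggregations. This is an illustration only: the `Incident` class carries a subset of the fields listed above, the helper names are hypothetical, and the sample incidents are made up purely to exercise the code.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Incident:
    incident_id: str
    detected_at: datetime
    resolved_at: Optional[datetime]  # None while the incident is still open
    severity: int
    incident_type: str
    description: str = ""
    root_cause: str = ""


def monthly_severity_counts(incidents: list) -> Counter:
    """Count incidents per (year-month, severity) for trend review."""
    return Counter((i.detected_at.strftime("%Y-%m"), i.severity) for i in incidents)


def mean_hours_to_resolve(incidents: list) -> Optional[float]:
    """Average hours from detection to resolution, ignoring open incidents."""
    hours = [
        (i.resolved_at - i.detected_at).total_seconds() / 3600
        for i in incidents
        if i.resolved_at is not None
    ]
    return sum(hours) / len(hours) if hours else None


# Made-up sample data for illustration only.
sample = [
    Incident("INC-1", datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 15), 2, "availability/provider_outage"),
    Incident("INC-2", datetime(2026, 1, 20, 9), datetime(2026, 1, 21, 9), 3, "output_quality/hallucination"),
    Incident("INC-3", datetime(2026, 2, 2, 9), None, 3, "governance/monitoring_gap"),
]
print(monthly_severity_counts(sample))
print(mean_hours_to_resolve(sample))
```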

Your Next Step

Build your incident classification framework this week. Define your severity levels and incident types based on the frameworks described here, adapted to your specific AI services and client base. Then create runbooks for your three most likely incident scenarios—probably model quality degradation, provider outage, and data-related incidents.

Share your incident classification framework with your clients. Let them know how you categorize incidents, what response they can expect at each severity level, and how to report issues they detect. This transparency builds trust and ensures that when an incident does occur, everyone operates from the same playbook.

The agency that responds to AI incidents quickly, professionally, and transparently keeps clients even when things go wrong. The agency that fumbles its incident response loses clients even when the underlying AI system is good. Your incident response capability is as important as your AI development capability.
