Conversions Fell 19 Percent in Four Hours Before Anyone Noticed

A twenty-person AI agency in Seattle deployed a recommendation engine for a mid-market e-commerce client. Three weeks after launch, the model started returning wildly irrelevant results. Instead of recommending running shoes to fitness shoppers, it was recommending baby strollers to college students and power tools to people browsing jewelry. The client's conversion rate dropped nineteen percent in four hours before anyone noticed.

The client's VP of Engineering called the agency founder directly. The founder did not know who was on call. The engineer who built the model was on vacation in Portugal. The project manager had moved to a different account. The agency spent forty-five minutes figuring out who should be handling the situation before anyone started diagnosing the problem.

By the time they identified the root cause, a corrupted data feed from the client's product catalog system, the incident had been active for seven hours. The client's CEO was involved. The agency's reputation with the account was severely damaged, and the twelve-month contract renewal that was expected became a three-month probationary extension.

The technical fix took twenty minutes. The organizational failure took seven hours to navigate. That gap is what incident response procedures are designed to eliminate.

Why AI Agencies Need Formal Incident Response

Every agency will eventually face a production incident. Models degrade. APIs fail. Data pipelines break. Integrations drift. The question is not whether an incident will happen but whether your agency is prepared to handle it professionally when it does.

Without formal incident response, agencies default to chaos:

Nobody knows who is responsible for responding
Communication with the client is ad hoc and inconsistent
Technical diagnosis happens in parallel with political firefighting
The same incident gets investigated by multiple people who do not coordinate
Critical information gets lost in Slack threads that scroll out of view
The root cause is never properly identified, so similar incidents repeat

With formal incident response, the same incident unfolds differently:

An on-call engineer is alerted automatically and begins diagnosis within minutes
The client receives a structured status update within thirty minutes
A clear escalation path ensures the right people are involved at the right time
Communication follows a template that covers what happened, what is being done, and when the next update will arrive
A post-incident review identifies the root cause and produces preventive actions

The difference is not technical skill. It is operational preparedness.

Defining Incident Severity Levels

Not every issue is an emergency. Your team needs a shared framework for classifying incidents so that the response is proportional to the impact.

Severity 1 (Critical): Production system is down or severely degraded. Client's business operations are materially impacted. Revenue loss is occurring or data integrity is compromised.

Example: Model serving endpoint returns errors for all requests
Example: Data pipeline produces corrupted output that reaches the client's production environment
Example: Security breach exposing client data

Response: All-hands incident response. Client notified within thirty minutes. Status updates every sixty minutes until resolved. Post-incident review mandatory within 48 hours.

Severity 2 (High): Production system is degraded but functional. Workarounds exist but are not sustainable. Client experience is noticeably impacted.

Example: Model accuracy has dropped significantly but the system is still returning results
Example: API response times have tripled, causing timeouts for some users
Example: A scheduled data pipeline job failed and needs manual intervention

Response: On-call engineer engaged immediately. Client notified within two hours. Status updates every four hours. Post-incident review within one week.

Severity 3 (Medium): Non-critical functionality is impaired. No immediate business impact but the issue needs attention to prevent escalation.

Example: Monitoring dashboard is not updating but the underlying system is healthy
Example: A non-production environment is down, blocking development work
Example: A batch processing job completed with warnings but correct output

Response: On-call engineer assesses and schedules fix during normal business hours. Client notified if relevant. Post-incident review optional.

Severity 4 (Low): Minor issue with no immediate impact. Informational or cosmetic.

Example: A non-critical log file is generating excessive entries
Example: A development tool integration is broken
Example: Documentation is outdated

Response: Tracked in the issue tracker and prioritized normally. No urgent action required.
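
One way to keep these levels from drifting between documents and tools is to encode them as a small lookup table. The sketch below is a hypothetical encoding, not a prescribed format: the names SeverityPolicy, POLICIES, and review_is_mandatory are invented for illustration, and the timings are copied from the definitions above. Where the definitions give no number (for example, Severity 3 clients are notified only "if relevant"), the field is left as None.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """Response expectations for one severity level (hypothetical structure)."""
    label: str
    notify_client_within: Optional[timedelta]  # None = no fixed deadline
    status_update_every: Optional[timedelta]   # None = no recurring updates
    review_within: Optional[timedelta]         # None = review optional or not required


# Timings taken from the severity definitions in this article.
POLICIES = {
    1: SeverityPolicy("Critical", timedelta(minutes=30), timedelta(minutes=60), timedelta(hours=48)),
    2: SeverityPolicy("High", timedelta(hours=2), timedelta(hours=4), timedelta(weeks=1)),
    3: SeverityPolicy("Medium", None, None, None),
    4: SeverityPolicy("Low", None, None, None),
}


def review_is_mandatory(severity: int) -> bool:
    """A review deadline exists only for Severity 1 and Severity 2."""
    return POLICIES[severity].review_within is not None
```

A table like this can feed both the one-page escalation chart and any alerting configuration, so the two never disagree.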

Building the On-Call Rotation

If your agency manages production systems for clients, you need someone available to respond when things break. That means an on-call rotation.

Determine on-call coverage hours. For most agencies, business-hours-plus-buffer coverage (7 AM to 10 PM in the client's timezone) is sufficient. True 24/7 coverage is only necessary if your systems handle after-hours traffic that cannot tolerate downtime.

Rotate on-call responsibilities. Do not let one person be the permanent on-call engineer. That leads to burnout, resentment, and a single point of failure. Rotate weekly among qualified engineers.
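
A weekly rotation is simple enough to compute rather than maintain by hand. The following is a minimal sketch, assuming a fixed list of qualified engineers and a known start date for the rotation; the function name on_call_for and the example names are hypothetical, and a real schedule would also need to handle swaps and time off.

```python
from datetime import date


def on_call_for(day: date, engineers: list[str], rotation_start: date) -> str:
    """Return the engineer on call for the week containing `day`.

    Weeks are counted in 7-day blocks from `rotation_start`, and the list
    of engineers is cycled in order.
    """
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


# Example with made-up names; the rotation starts on a Monday.
team = ["ana", "ben", "chi"]
start = date(2024, 1, 1)
current = on_call_for(date(2024, 1, 10), team, start)  # second week of the rotation
```

Most paging tools offer their own scheduling, so a helper like this is mainly useful for a shared calendar or a quick "who is on call?" check.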

Define on-call expectations clearly:

Response time: Acknowledge the alert within fifteen minutes during coverage hours
Availability: Must have laptop and internet access, not stuck on a plane or at a concert
Authority: The on-call engineer can make emergency changes to production without waiting for approval from a lead
Escalation: If the on-call engineer cannot resolve the issue within thirty minutes, they escalate to the next person in the chain
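
The response-time and escalation expectations above amount to two timers. The sketch below shows one possible reading of them; it assumes both timers run from the moment the alert fires, which the expectations do not state explicitly, and the function name should_escalate is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_DEADLINE = timedelta(minutes=15)      # from "Response time" above
RESOLVE_DEADLINE = timedelta(minutes=30)  # from "Escalation" above


def should_escalate(
    alerted_at: datetime,
    acknowledged_at: Optional[datetime],
    resolved_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Decide whether to page the next person in the chain.

    Assumption: both deadlines are measured from `alerted_at`.
    """
    if resolved_at is not None:
        return False  # nothing left to escalate
    elapsed = now - alerted_at
    if acknowledged_at is None:
        return elapsed > ACK_DEADLINE  # nobody has picked up the alert
    return elapsed > RESOLVE_DEADLINE  # acknowledged but still unresolved
```

Paging tools such as those mentioned in this article typically implement acknowledgement timeouts themselves; the point of writing the rule down is that everyone agrees on what the numbers mean.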

Compensate on-call appropriately. Whether it is additional pay, comp time, or a reduced workload during on-call weeks, make sure the burden is acknowledged. On-call that is uncompensated quickly becomes on-call that is ignored.

Use alerting tools, not human monitoring. Set up automated alerts through PagerDuty, Opsgenie, or a similar tool. Alerts should be based on meaningful thresholds: error rates, latency spikes, model performance degradation, pipeline failures. Do not rely on someone watching a dashboard.
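
To make "meaningful thresholds" concrete, here is a small, tool-agnostic sketch of a threshold check. It does not call PagerDuty or Opsgenie; it only decides which metrics have crossed a limit, and the result would be handed to whatever alerting tool you use. The metric names and limits are placeholders chosen for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_worse: bool = True  # False for metrics like accuracy


# Hypothetical limits; tune these per client and per system.
THRESHOLDS = [
    Threshold("error_rate", 0.05),
    Threshold("p95_latency_ms", 1500.0),
    Threshold("model_accuracy", 0.80, higher_is_worse=False),
]


def breached(metrics: dict[str, float], thresholds=THRESHOLDS) -> list[str]:
    """Return the names of metrics that have crossed their limit.

    Metrics missing from `metrics` are skipped in this sketch; in practice
    a missing metric may itself be worth alerting on.
    """
    alerts = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None:
            continue
        crossed = value > t.limit if t.higher_is_worse else value < t.limit
        if crossed:
            alerts.append(t.metric)
    return alerts
```

Note that model accuracy is checked in the opposite direction from error rate and latency, which is easy to get wrong when every other alert is "value too high".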

The Incident Response Playbook

When an incident occurs, the responder should follow a structured playbook, not improvise.

Step One: Acknowledge and Assess (0-15 minutes)

The on-call engineer receives the alert and acknowledges it in the alerting system. This stops the escalation timer and lets the team know someone is on it.

Immediate assessment checklist:

What system is affected?
Which client or clients are impacted?
What is the severity level?
Is the issue still active or has it resolved on its own?
Are there any obvious causes visible in logs, monitoring, or recent deployments?

Step Two: Communicate (15-30 minutes)

Based on the severity assessment, notify the appropriate people.

For Severity 1:

Notify the account manager and delivery lead immediately
The account manager contacts the client within thirty minutes with an initial status update
Open a dedicated incident channel (Slack channel or equivalent) for real-time coordination

For Severity 2:

Notify the account manager and delivery lead
The account manager contacts the client within two hours
Track updates in the project's existing communication channel

Client communication template for the initial update:

"We have identified an issue affecting [specific system or feature]. Our team is actively investigating. Here is what we know so far: [brief description]. We are currently [what you are doing to diagnose or fix]. We will provide our next update by [specific time]. If you have any questions in the meantime, please reach out to [contact person]."

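Under pressure it is easy to send an update with a bracketed placeholder still in it. One way to guard against that is to fill the initial-update template programmatically. The sketch below uses Python's standard string.Template; the placeholder names ($system, $known, and so on) and the function render_initial_update are invented for this example, while the wording follows the template above.

```python
from string import Template

INITIAL_UPDATE = Template(
    "We have identified an issue affecting $system. "
    "Our team is actively investigating. "
    "Here is what we know so far: $known. "
    "We are currently $action. "
    "We will provide our next update by $next_update. "
    "If you have any questions in the meantime, please reach out to $contact."
)


def render_initial_update(**fields: str) -> str:
    """Fill the template; raises KeyError if any placeholder is left unfilled."""
    return INITIAL_UPDATE.substitute(**fields)


# Example values are made up for illustration.
message = render_initial_update(
    system="product recommendations",
    known="some shoppers are seeing irrelevant results",
    action="checking the product catalog data feed",
    next_update="3:00 PM Pacific",
    contact="your account manager",
)
```

Using substitute rather than safe_substitute is deliberate: a missing field stops the message from being produced instead of letting a raw placeholder reach the client.
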
What never to say:

"We are not sure what happened" (without also saying what you are doing to find out)
"This should not have happened" (the client already knows that)
"It is probably a minor issue" (if it were minor, you would not be calling)

Step Three: Diagnose and Resolve (Variable)

The on-call engineer works to identify the root cause and implement a fix. During this phase:

Prioritize restoration over root cause. If you can restore service by rolling back a deployment, reverting a data change, or switching to a fallback system, do that first. Then investigate the root cause in a stable environment.
Document everything in real time. Use the incident channel to log what you are investigating, what you have tried, and what you have found. This creates an audit trail for the post-incident review and helps anyone who joins the response mid-stream.
Escalate when needed. If you cannot resolve the issue within the expected timeframe for the severity level, escalate to additional team members, the technical lead, or external resources.
Coordinate with the client's team if necessary. Some incidents require action on the client's side. Communicate clearly about what you need from them and by when.

Step Four: Verify and Monitor (Post-fix)

After implementing a fix:

Verify that the system is functioning correctly
Monitor for recurrence for at least thirty minutes (longer for Severity 1)
Confirm with the client that the issue is resolved from their perspective
Remove any temporary workarounds that are no longer needed

Step Five: Communicate Resolution

Send a resolution notice to all stakeholders.

Resolution communication template:

"The issue affecting [specific system or feature] has been resolved as of [time]. The root cause was [brief, non-technical description]. We have implemented [description of fix]. We are monitoring the system and will alert you if there are any related issues. We will schedule a post-incident review to discuss preventive measures and will share the findings with your team."

Post-Incident Review

Every Severity 1 and Severity 2 incident should have a formal post-incident review: within 48 hours of resolution for Severity 1, and within one week for Severity 2.

The post-incident review covers:

Timeline: A minute-by-minute reconstruction of the incident from detection to resolution
Root cause: What specifically caused the incident? Not "human error" but the systemic condition that allowed the error to have impact.
Detection: How was the incident detected? Could it have been detected sooner?
Response: How effectively did the team respond? Were there delays or confusion in the response process?
Impact: What was the actual impact on the client and on your agency?
Preventive actions: What changes will prevent this type of incident from recurring?

Share the post-incident review with the client. For Severity 1 incidents, the client deserves to understand what happened and what you are doing to prevent it from happening again. A well-written post-incident review actually strengthens client trust because it demonstrates accountability and competence.

Common AI-Specific Incident Types and Response Strategies

Model performance degradation. The model's accuracy, precision, or recall has dropped below acceptable thresholds. Often caused by data drift, upstream data quality changes, or an undetected issue in a retraining pipeline.

Immediate response: If a fallback model exists, switch to it. If not, determine whether degraded performance is better than no service.
Investigation: Check the input data distribution against the training data distribution. Review recent retraining runs for anomalies.
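
Comparing the input distribution against the training distribution can start with a single numeric feature and a simple drift score. The sketch below computes the Population Stability Index (PSI), one common choice among several; it is an illustration rather than the method any particular team uses, and the function name and binning scheme are assumptions. A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right cutoff depends on the feature.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a training sample (`expected`) and live inputs (`actual`).

    Bins are equal-width over the range of `expected`; live values outside
    that range are counted in the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("expected sample has no spread")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp into the edge bins
            counts[idx] += 1
        eps = 1e-6  # avoids log(0) for empty bins
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data for illustration only.
training = [i / 100 for i in range(100)]      # spread over 0.00 to 0.99
live = [0.5 + i / 200 for i in range(100)]    # concentrated in the upper half
drift = population_stability_index(training, live)
```

A score like this is only a first signal. It says the inputs have moved, not why, so it pairs with the review of recent retraining runs and upstream data changes.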

Data pipeline failure. A scheduled or event-driven data pipeline has failed, potentially causing downstream effects on model training, serving, or client reporting.

Immediate response: Determine the blast radius. Is only this pipeline affected or are downstream systems impacted?
Investigation: Check for changes in source data format, infrastructure issues, or dependency failures.
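
Checking for changes in source data format can be partly automated with a lightweight validation step at the point where a feed enters the pipeline. The sketch below is a minimal example under stated assumptions: the field names and types in REQUIRED_FIELDS are a made-up product catalog schema, and the function invalid_records is hypothetical.

```python
# Hypothetical schema for a product catalog feed.
REQUIRED_FIELDS = {"product_id": str, "category": str, "price": float}


def invalid_records(records):
    """Return (index, reason) pairs for records that fail the schema."""
    problems = []
    for i, rec in enumerate(records):
        for name, expected_type in REQUIRED_FIELDS.items():
            if name not in rec:
                problems.append((i, f"missing field: {name}"))
            elif not isinstance(rec[name], expected_type):
                found = type(rec[name]).__name__
                problems.append((i, f"wrong type for {name}: {found}"))
    return problems


# Example feed: the second record has no category and a string price.
feed = [
    {"product_id": "A1", "category": "shoes", "price": 59.0},
    {"product_id": "A2", "price": "12.50"},
]
problems = invalid_records(feed)
```

Rejecting or quarantining a batch when the list of problems is non-empty keeps a malformed feed from reaching model training or serving, which narrows the blast radius before anyone has to investigate.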

API availability incident. The model serving API is returning errors or timing out.

Immediate response: Check infrastructure health (compute instances, load balancers, networking). If a recent deployment occurred, consider rollback.
Investigation: Review deployment logs, scaling configuration, and resource utilization.

Security incident. Unauthorized access to client data, model theft, or API key exposure.

Immediate response: Contain the breach. Revoke compromised credentials. Isolate affected systems.
Investigation: Determine the scope of the breach, what data was accessed, and how access was gained.
Notification: Security incidents often trigger contractual and legal notification requirements. Involve your legal counsel early.

Building Incident Response Readiness

Run tabletop exercises quarterly. Gather the team and walk through a hypothetical incident scenario. Who gets the alert? What is the first action? Who calls the client? How do we diagnose this type of problem? Tabletop exercises reveal gaps in your process without the pressure of a real incident.

Maintain a contact list. Every team member should have access to an up-to-date list of on-call engineers, account managers, client technical contacts, and escalation contacts. Store this somewhere accessible even if your primary communication tool is down.

Keep runbooks for common issues. For known failure modes, create step-by-step runbooks that any on-call engineer can follow. A runbook for "model serving latency spike" saves precious minutes during an incident when stress makes clear thinking harder.

Review and update the incident response plan semi-annually. As your team, tools, and client base evolve, your incident response plan needs to evolve with them.

Your Next Step

If you do not have incident response procedures today, start with the minimum viable process.

This week, define your severity levels and create a one-page escalation chart that maps each severity to a response timeline and notification list.

This month, establish an on-call rotation for your production systems and set up automated alerting for critical failure conditions.

This quarter, create response playbooks for your three most common incident types and run a tabletop exercise with the team.

The goal is not to prevent all incidents. That is impossible. The goal is to respond to incidents so professionally that your clients' trust in you actually increases because of how you handle adversity. That is the difference between an agency that survives production incidents and one that thrives through them.

Agency Script Editorial
