Locked Out of Your Cloud Account at 9am on a Tuesday

A 14-person AI agency in Phoenix woke up one Tuesday morning to find their primary cloud account locked. Someone had compromised the root credentials and changed the recovery email. Every client project, every training pipeline, every deployed model, and every internal tool was inaccessible. The team spent 72 hours working with the cloud provider's support team to regain access. During those 72 hours, three client projects missed deadlines, two production AI systems went down because they could not be maintained, and one client initiated their termination clause.

The total cost: $140,000 in lost revenue, client remediation, and emergency consulting fees to restore systems. The agency survived, but barely. The founder later said the scariest part was realizing they had no plan. Every decision during those 72 hours was improvised under extreme stress.

Disaster recovery planning is the work you do when things are calm so that you can act decisively when things are not. Most AI agencies skip it because disasters feel unlikely and the planning feels abstract. But for a digital-first business that depends on cloud infrastructure, remote teams, and interconnected systems, the question is not whether something will go wrong. It is when.

What Can Go Wrong (And Will)

AI agencies face disasters across several categories. Understanding the threat landscape helps you plan proportionally.

Infrastructure Failures

Cloud provider outage. Major cloud providers experience significant outages several times per year. Multi-hour outages are not uncommon, and multi-day outages, while rare, have occurred.
Account compromise. Stolen credentials, phishing attacks, or misconfigured access controls can lock you out of your own systems.
Data loss. Accidental deletion, storage failures, or ransomware can destroy client data, model artifacts, and internal systems.
Third-party service failure. Your time tracking, project management, communication, or payment tools going down disrupts operations.

People Failures

Key person sudden departure. What happens if your CTO has a medical emergency and is unreachable for a month? What if your lead engineer quits with no notice?
Founder incapacitation. If you are a founder-led agency and you cannot work for an extended period, who keeps the business running?
Team-wide illness. A severe flu outbreak or similar event can take out 30-40% of a small team simultaneously.

Business Failures

Major client sudden termination. Your largest client cancels with minimal notice, removing 30% of your revenue.
Cash flow crisis. A major client delays payment for 90+ days, creating a cash crunch that threatens payroll.
Legal action. A client sues over a project failure or an employee files a significant employment claim.

External Events

Natural disasters. If your team is concentrated in one geographic area, local disasters (hurricanes, earthquakes, wildfires) can disable your entire operation.
Regulatory changes. A new AI regulation invalidates your service model or creates compliance requirements you cannot quickly meet.
Cyber attack. Ransomware, data breach, or targeted attack against your agency or a client you are working with.

The Disaster Recovery Plan Framework

Your disaster recovery plan should cover three phases: Prevention (reducing the likelihood of disasters), Response (what to do when a disaster occurs), and Recovery (how to get back to normal operations).

Phase 1: Prevention

Infrastructure hardening:

Multi-factor authentication everywhere. Every cloud account, every SaaS tool, every email account. No exceptions. This single measure prevents the majority of account compromise scenarios.
Access control review. Monthly review of who has access to what. Remove access immediately when someone leaves the team. Limit administrative access to 2-3 people.
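Automated backups. All critical data backed up daily to a separate cloud provider, or at minimum to a separate account and region, so that a compromise of your primary account cannot reach the backups. Test restoring from backups quarterly. A backup you have never tested is not a backup.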
Infrastructure as code. Document your infrastructure setup so it can be recreated. If your cloud environment is destroyed, can you rebuild it from documentation? Infrastructure as code tools (Terraform, Pulumi) make this practical.
Secrets management. Use a dedicated secrets manager (HashiCorp Vault, AWS Secrets Manager, 1Password). No credentials stored in code, documents, or email.
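
The monthly access review above can be partly automated. The following is a minimal sketch in Python, assuming you can export your access grants into a simple list. The AccessGrant record, its field names, and the sample people and systems are hypothetical, not taken from any real tool's export format. It flags access held by people who have left and systems with more administrators than the 2-3 suggested above.

```python
from dataclasses import dataclass

MAX_ADMINS = 3  # upper end of the "2-3 people" guideline in the text


@dataclass
class AccessGrant:
    """One row of a hypothetical access export."""
    person: str
    system: str
    is_admin: bool
    still_on_team: bool


def review_access(grants):
    """Return (grants to revoke, systems with too many admins)."""
    # Anyone who has left the team should have no remaining access.
    to_revoke = [g for g in grants if not g.still_on_team]

    # Count current team members with admin rights, per system.
    admins_per_system = {}
    for g in grants:
        if g.is_admin and g.still_on_team:
            admins_per_system.setdefault(g.system, []).append(g.person)

    over_limit = {
        system: people
        for system, people in admins_per_system.items()
        if len(people) > MAX_ADMINS
    }
    return to_revoke, over_limit


# Illustrative data only.
grants = [
    AccessGrant("ana", "cloud-console", True, True),
    AccessGrant("ben", "cloud-console", True, True),
    AccessGrant("cy", "cloud-console", True, True),
    AccessGrant("dee", "cloud-console", True, True),
    AccessGrant("eli", "cloud-console", False, False),
]
to_revoke, over_limit = review_access(grants)
print([g.person for g in to_revoke])  # people whose access should be removed
print(over_limit)                     # systems exceeding the admin limit
```

A script like this does not replace the human review. It only makes sure the two mechanical checks are never skipped.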

People resilience:

Documentation of critical knowledge. Every critical system, process, and client relationship should be documented well enough that someone else can take over. The bus factor (the number of people who would have to be hit by a bus before a function stops working) should be at least 2 for every critical function.
Cross-training. At least two people should understand every critical system. Regular rotation of responsibilities builds this naturally.
Succession planning. Who runs the agency if the founder cannot? Even at a 10-person agency, this question needs an answer. Document it and share it with the designated person.

Financial resilience:

Cash reserves. Maintain 3-6 months of operating expenses in reserve. This buffer absorbs client loss, payment delays, and unexpected costs.
Revenue diversification. No single client should represent more than 30% of revenue. No single industry should represent more than 50%.
Insurance. Professional liability, cyber liability, and business interruption insurance. Review coverage annually with your insurance broker.
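
The cash reserve and client concentration thresholds above are simple arithmetic, so they are easy to check every month. Here is a minimal sketch in Python. The function names, the client names, and the dollar figures are made up for illustration. Only the 30% limit comes from the guideline above.

```python
def runway_months(cash_reserve, monthly_expenses):
    """Months of operating expenses covered by the reserve."""
    return cash_reserve / monthly_expenses


def clients_over_limit(revenue_by_client, limit=0.30):
    """Names of clients whose share of total revenue exceeds the limit."""
    total = sum(revenue_by_client.values())
    return sorted(
        name
        for name, revenue in revenue_by_client.items()
        if revenue / total > limit
    )


# Illustrative numbers only.
print(runway_months(240_000, 60_000))  # 4.0, inside the 3-6 month target

monthly_revenue = {"acme": 40_000, "globex": 35_000, "initech": 25_000}
print(clients_over_limit(monthly_revenue))  # ['acme', 'globex']
```

The same concentration function works for the 50% industry limit if you pass revenue grouped by industry and set limit=0.50.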

Phase 2: Response

When a disaster occurs, the first 4-24 hours determine whether it becomes a crisis or an inconvenience. Build response playbooks for your most likely scenarios.

The Incident Commander model:

Designate one person as the Incident Commander for any disaster scenario. This person:

Makes decisions during the crisis
Coordinates the response team
Communicates with stakeholders (clients, team, vendors)
Delegates specific response tasks
Determines when the crisis is over

Typically, the founder is the primary Incident Commander with the CTO or operations lead as the backup.

Response playbook template:

For each disaster scenario, document:

Detection: How will we know this has happened?
Assessment: What is the scope and severity? (15-minute assessment)
Communication: Who needs to know, in what order? (Clients, team, vendors, legal)
Containment: What immediate actions stop the bleeding?
Resolution: What steps restore normal operations?
Escalation: When do we bring in external help? (Lawyers, PR, specialized IT support)
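
If you keep playbooks in a repository alongside your other documentation, the six-part template above maps naturally onto a small data structure. This is one possible sketch in Python, not a prescribed format. The Playbook class and the missing_sections helper are hypothetical names. The check reports which sections of a draft are still empty, so an incomplete playbook is caught before the crisis rather than during it.

```python
from dataclasses import dataclass, fields


@dataclass
class Playbook:
    """One response playbook; the six sections mirror the template above."""
    scenario: str
    detection: str = ""
    assessment: str = ""
    communication: str = ""
    containment: str = ""
    resolution: str = ""
    escalation: str = ""


def missing_sections(playbook):
    """Return the names of sections that are still blank."""
    return [
        f.name
        for f in fields(playbook)
        if not getattr(playbook, f.name).strip()
    ]


# A half-written draft, for illustration.
draft = Playbook(
    scenario="Cloud account compromise",
    detection="Unauthorized access alerts, locked accounts, unexpected changes",
    assessment="What is affected? Can we still access backup accounts?",
)
print(missing_sections(draft))
# ['communication', 'containment', 'resolution', 'escalation']
```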

Sample Response Playbooks

Playbook: Cloud Account Compromise

Detection: Unauthorized access alerts, locked accounts, unexpected changes
Assessment: What is affected? Can we still access backup accounts?
Communication: Notify the team immediately via backup channel (phone tree if Slack is compromised). Notify affected clients within 4 hours.
Containment: Revoke all access tokens. Contact cloud provider security team. Activate backup infrastructure if available.
Resolution: Work with cloud provider to regain access. Audit all changes made by the unauthorized party. Restore from backups if data was modified.
Escalation: Engage cybersecurity incident response firm if the breach involves client data. Engage legal counsel if notification requirements are triggered.

Playbook: Major Client Sudden Termination

Detection: Client notification of contract termination
Assessment: What is the revenue impact? What is the timeline? What contractual obligations remain?
Communication: Inform leadership team immediately. Inform affected team members within 24 hours.
Containment: Review the contract for termination terms, transition requirements, and outstanding payments. Ensure all client deliverables and data are properly handled.
Resolution: Accelerate pipeline development to replace revenue. Assess whether team adjustments (redeployment or reduction) are needed. Conduct a post-mortem to understand why the client left.
Escalation: Engage legal counsel if termination terms are disputed.

Playbook: Ransomware Attack

Detection: Systems encrypted, ransom demand received
Assessment: What systems are affected? Are backups intact?
Communication: Do not communicate over potentially compromised channels. Use phone or personal devices. Notify legal counsel immediately. Notify affected clients per your incident response plan.
Containment: Isolate affected systems. Do not pay the ransom without consulting legal and cybersecurity experts. Verify backup integrity before beginning restoration.
Resolution: Rebuild affected systems from backups and infrastructure-as-code documentation. Investigate the attack vector and close it. Implement additional security measures.
Escalation: Engage cybersecurity incident response firm immediately. Notify law enforcement. Engage PR support if the incident becomes public.

Playbook: Key Person Sudden Absence

Detection: Person is unreachable for 24+ hours without a planned absence
Assessment: What projects are they critical to? What knowledge is undocumented?
Communication: Inform affected project teams. Contact clients on their projects to explain the situation (appropriately, without oversharing).
Containment: Identify the most urgent responsibilities and reassign them. Access their documentation and project notes.
Resolution: Distribute their workload across the team. Reprioritize projects if necessary. Begin recruiting a replacement if the absence will be extended.
Escalation: If the absence is indefinite, engage a contract recruiter for rapid backfill.

Phase 3: Recovery

After the immediate crisis is resolved, focus on returning to normal operations and preventing recurrence.

Post-incident review (within one week of resolution):

What happened? (Factual timeline)
How did we detect it?
How effective was our response?
What worked well?
What would we do differently?
What changes should we make to prevent recurrence?

Client relationship repair:

Proactively contact every affected client
Provide a clear explanation of what happened (appropriate level of detail)
Explain what you have done to prevent recurrence
Offer concrete remediation if they were impacted (timeline extensions, service credits)

System improvements:

Implement the preventive changes identified in the post-incident review
Update the response playbooks based on lessons learned
Conduct additional training if the incident revealed knowledge gaps

The Backup and Recovery Checklist

At minimum, your agency should back up and have recovery procedures for:

Client project data:

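Code repositories (GitHub and GitLab replicate hosted repositories, but that redundancy does not protect you from an account lockout or an accidental deletion, so keep an independent mirror and verify it)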
Trained models and model artifacts
Training and evaluation datasets
Project documentation and specifications
Client communication records

Business data:

Financial records and accounting data
Client contracts and legal documents
Employee records and HR documents
CRM data and pipeline information
Time tracking and project management data

Infrastructure:

Cloud infrastructure configuration (documented or in code)
DNS and domain configurations
TLS/SSL certificates and encryption keys
API keys and service credentials (in a secrets manager)
CI/CD pipeline configurations

For each backup, document:

Where it is stored (provider, region, account)
How often it runs (daily, weekly, continuous)
How long backups are retained
How to restore from backup (step-by-step procedure)
When it was last tested (target: quarterly)
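
The five items above amount to one inventory record per backup. As a rough sketch in Python, you could store those records as data and flag any backup whose restore test is overdue. The BackupRecord fields, the sample entries, and the choice to treat "quarterly" as 90 days are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

TEST_INTERVAL = timedelta(days=90)  # assumption: "quarterly" means 90 days


@dataclass
class BackupRecord:
    name: str
    location: str            # provider, region, account
    frequency: str           # daily, weekly, continuous
    retention_days: int
    restore_runbook: str     # where the step-by-step procedure lives
    last_tested: Optional[date]  # None means never tested


def overdue_for_test(records, today):
    """Names of backups never tested or not tested within the interval."""
    return [
        r.name
        for r in records
        if r.last_tested is None or today - r.last_tested > TEST_INTERVAL
    ]


# Illustrative inventory; every value here is made up.
records = [
    BackupRecord("model-artifacts", "provider-b / eu-west / backup-acct",
                 "daily", 90, "runbooks/models.txt", date(2025, 4, 15)),
    BackupRecord("crm-export", "provider-b / eu-west / backup-acct",
                 "weekly", 365, "runbooks/crm.txt", date(2025, 1, 10)),
    BackupRecord("dns-config", "provider-b / eu-west / backup-acct",
                 "weekly", 365, "runbooks/dns.txt", None),
]
print(overdue_for_test(records, today=date(2025, 6, 1)))
# ['crm-export', 'dns-config']
```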

Testing Your Disaster Recovery Plan

A plan you have never tested is a wish, not a plan. Test your DR plan through three mechanisms:

Tabletop exercises (quarterly). Gather the leadership team for a 60-minute session. Present a disaster scenario and walk through the response. "It is 3 AM and our cloud account has been compromised. What do we do?" Discuss the playbook steps, identify gaps, and update procedures.

Backup restoration tests (quarterly). Actually restore a system from backup and verify it works. Rotate which system you test each quarter. Document the time it takes and any issues encountered.
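
One way to "verify it works" is to compare checksums of the restored files against a manifest recorded when the backup was taken. The sketch below uses only the Python standard library. The function names are hypothetical, and the temporary directories stand in for your real source and restore locations. A matching checksum shows the bytes came back intact. It does not prove the restored system actually runs, so treat it as a first check rather than the whole test.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large model artifacts fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's relative path to its SHA-256 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest, restored_root):
    """Return (files missing from the restore, files whose content differs)."""
    restored = build_manifest(restored_root)
    missing = sorted(set(manifest) - set(restored))
    changed = sorted(
        name for name in manifest
        if name in restored and manifest[name] != restored[name]
    )
    return missing, changed


# Demo with throwaway directories standing in for source and restore target.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    Path(src, "model.bin").write_bytes(b"weights")
    Path(src, "notes.txt").write_text("v1")
    Path(src, "config.yaml").write_text("lr: 0.1")
    manifest = build_manifest(src)

    Path(dst, "model.bin").write_bytes(b"weights")  # restored intact
    Path(dst, "notes.txt").write_text("v2")         # restored but altered
    # config.yaml was not restored at all
    missing, changed = verify_restore(manifest, dst)

print(missing)  # ['config.yaml']
print(changed)  # ['notes.txt']
```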

Communication tests (semi-annually). Test your emergency communication chain. Can you reach every team member through your backup communication method (phone tree, personal email) within 30 minutes? Identify unreachable people and update contact information.
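
Scoring a communication test takes only a few lines. In this sketch, written in Python with made-up names and response times, you record how many minutes each person took to confirm through the backup channel, with None for anyone who never responded, and compare against the 30-minute target.

```python
def comms_test_report(response_minutes, limit=30):
    """Return (share of team reached in time, people needing follow-up)."""
    follow_up = sorted(
        person
        for person, minutes in response_minutes.items()
        if minutes is None or minutes > limit
    )
    reached = len(response_minutes) - len(follow_up)
    return reached / len(response_minutes), follow_up


# Illustrative results; None means the person never confirmed.
responses = {"ana": 5, "ben": 42, "cy": None, "dee": 12}
rate, follow_up = comms_test_report(responses)
print(rate)       # 0.5
print(follow_up)  # ['ben', 'cy']
```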

Building Organizational Resilience

Beyond specific disaster planning, build general resilience into your agency culture.

Normalize the conversation. Discuss risks and contingencies openly. Teams that talk about "what if" scenarios regularly handle actual crises better than teams that avoid the topic.

Empower decision-making. In a crisis, waiting for the founder to make every decision is a bottleneck. Build a culture where team leads can make reasonable decisions autonomously when the situation demands it.

Maintain a crisis communication plan. Who talks to clients? Who talks to the press (if applicable)? Who talks to the team? Having designated communicators prevents conflicting messages during a crisis.

Build relationships before you need them. Know your cloud provider's support escalation process before you need it. Have a cybersecurity firm's number before you are attacked. Have a lawyer's number before you are sued.

Your Next Step

This week, answer one question honestly: if you were unable to work for the next 30 days starting tomorrow, what would happen to your agency? Write down every critical function that only you can perform, every system that only you can access, and every relationship that only you manage. This exercise will reveal your most acute disaster recovery gaps. For each item on the list, identify one person who could take over with proper documentation and access, and schedule time this month to create that documentation and share that access. This single exercise, applied to the founder first and then to every critical role, is the foundation of organizational resilience.

Agency Script Editorial
