Provisioning for 2x Traffic When the Sale Brought 4x

An e-commerce company's recommendation engine served 45 million predictions per day during normal operations. During the annual sales event, traffic spiked to 180 million predictions per day, 4x normal. The team had provisioned for 2x normal, based on the previous year's event. The recommendation service hit capacity limits within the first hour of the sale, latency spiked from 80 milliseconds to 3.2 seconds, and the product team disabled recommendations entirely to keep the latency from affecting page load times. For the remainder of the 72-hour sale, the recommendation engine served zero predictions. Post-event analysis estimated that the missing recommendations cost $2.8 million in lost incremental revenue. The company had capacity planning for its web infrastructure (which scaled perfectly) but none for its AI infrastructure.

AI infrastructure capacity planning is a specialized discipline. AI workloads have unique characteristics — GPU dependencies, variable processing costs, bursty training patterns, and model-specific resource requirements — that make traditional capacity planning approaches insufficient.

Why AI Capacity Planning Is Different

GPU scarcity. Unlike CPU and memory, GPU capacity cannot be scaled instantly. Cloud GPU instances often have limited availability, especially for high-end models (A100, H100). During demand spikes, GPU instances may simply not be available regardless of budget.

Variable request costs. A web server handles HTTP requests with relatively uniform processing cost. An AI inference server handles requests with wildly variable costs — from milliseconds to minutes depending on input size and model complexity.

Dual workload pattern. AI infrastructure serves two fundamentally different workloads: training (bursty, high-GPU, fault-tolerant) and inference (continuous, latency-sensitive, high-availability). Capacity planning must address both.

Model evolution. As models grow in size and complexity, their resource requirements change. A model that fits in one GPU today may require four GPUs next quarter. Capacity planning must account for the model roadmap.

Capacity Planning Framework

Step 1: Demand Forecasting

Current demand baseline:

Measure the current demand across all dimensions:

Inference requests per second (average, p95, peak)
Training jobs per week (average GPU hours consumed)
Data processing volume (GB processed per day)
Storage growth rate (GB per month for models, data, logs)

Growth projections:

Project demand growth based on:

Business growth plans (new users, new products, new markets)
AI roadmap (new models planned, model size increases, new use cases)
Seasonal patterns (known traffic spikes, promotional events)
Historical growth rate (if no business plans available, extrapolate historical trends)

Create three scenarios: conservative (business as usual), moderate (planned growth), and aggressive (stretch goals). Plan for moderate but have contingency plans for aggressive.
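The three scenarios can be sketched as compound growth applied to a measured baseline. This is a minimal illustration, not a forecasting model: the monthly growth rates in SCENARIOS and the function names are hypothetical placeholders that would come from the business plan and AI roadmap.

```python
def project_demand(baseline_rps: float, monthly_growth: float, months: int) -> float:
    """Compound a baseline request rate forward by a fixed monthly growth rate."""
    return baseline_rps * (1 + monthly_growth) ** months


# Hypothetical monthly growth rates; replace with rates derived from real plans.
SCENARIOS = {"conservative": 0.02, "moderate": 0.05, "aggressive": 0.10}


def scenario_table(baseline_rps: float, horizons=(6, 12, 18, 24)) -> dict:
    """Projected requests per second for each scenario at each horizon (months)."""
    return {
        name: {m: project_demand(baseline_rps, rate, m) for m in horizons}
        for name, rate in SCENARIOS.items()
    }
```

Calling scenario_table(500.0) returns one row per scenario, which can feed directly into the gap analysis in Step 3.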

Step 2: Resource Requirement Mapping

For each AI workload, map the demand to specific resource requirements.

Inference workloads:

Requests per second to GPU instances (based on throughput per instance per model)
Latency SLA to minimum instances (cannot scale below the number needed for SLA compliance)
Availability SLA to redundancy (99.9 percent availability requires at least N+1 instances)
Peak demand to autoscaling headroom
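As a rough sketch of the inference mapping above, the instance count is the larger of what peak load needs and what the latency SLA needs, plus spare instances for redundancy. The function and parameter names are illustrative assumptions, and the per-instance throughput is assumed to come from benchmarking each model.

```python
import math


def inference_instances(peak_rps: float, rps_per_instance: float,
                        sla_min_instances: int = 1, redundancy: int = 1) -> int:
    """Instances for peak load, floored at the SLA minimum, plus N+1 style spares."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    for_load = math.ceil(peak_rps / rps_per_instance)
    return max(for_load, sla_min_instances) + redundancy
```

For example, a peak of 1,000 requests per second on instances benchmarked at 120 requests per second needs 9 instances for load and 10 with one spare.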

Training workloads:

Training frequency to GPU hours per month
Model size to GPU memory requirements
Training data size to storage requirements
Experiment volume to compute budget

Data processing workloads:

Data volume to compute instances and storage
Processing latency requirements to compute power
Pipeline frequency to scheduling capacity

Step 3: Capacity Gap Analysis

Compare projected requirements against current capacity to identify gaps.

Create a capacity timeline showing:

Current capacity for each resource type
Projected demand at 6, 12, 18, and 24 months
The point where demand exceeds capacity (capacity cliff)
The lead time needed to provision additional capacity

Identify the binding constraint — the resource that will run out first. This determines the urgency of the capacity planning effort. Common binding constraints:

GPU instances for inference (limited by cloud availability and budget)
GPU hours for training (limited by budget and instance availability)
Storage for data and models (usually the easiest to scale)
Network bandwidth for data-intensive workloads
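One way to sketch the gap analysis is to compute a capacity cliff month for each resource and treat the earliest one as the binding constraint. The code below assumes simple compound monthly growth and uses hypothetical names; a real analysis would substitute the scenario forecasts from Step 1.

```python
from typing import Optional


def cliff_month(current_demand: float, capacity: float,
                monthly_growth: float, horizon: int = 24) -> Optional[int]:
    """First month (1-based) in which projected demand exceeds capacity, else None."""
    demand = current_demand
    for month in range(1, horizon + 1):
        demand *= 1 + monthly_growth
        if demand > capacity:
            return month
    return None


def binding_constraint(resources: dict) -> Optional[str]:
    """Resource with the earliest cliff.

    resources maps a name to (current_demand, capacity, monthly_growth).
    Returns None when no resource hits its cliff within the horizon.
    """
    cliffs = {name: cliff_month(*args) for name, args in resources.items()}
    reachable = {name: m for name, m in cliffs.items() if m is not None}
    return min(reachable, key=reachable.get) if reachable else None
```

Subtracting the provisioning lead time from the cliff month gives the latest month in which an order can be placed.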

Step 4: Capacity Strategy

Scaling strategies:

Vertical scaling: Use more powerful instances (upgrade from T4 to A10G to A100). Quick to implement but has limits (you cannot upgrade beyond the most powerful available instance).

Horizontal scaling: Add more instances and distribute work across them. Requires load balancing and potentially model parallelism. Scales further than vertical but adds operational complexity.

Efficiency optimization: Reduce resource requirements per unit of work through model optimization (quantization, pruning, distillation), serving optimization (batching, caching), and infrastructure optimization (right-sizing, spot instances). This is often the highest-ROI strategy because it reduces cost while increasing capacity.

Reservation and commitment: For predictable base load, use reserved instances or committed use discounts to reduce cost by 30 to 60 percent. Use on-demand instances for variable demand above the base.
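The arithmetic behind the reservation strategy can be sketched as follows. The hourly price, discount, and the 730-hour month are illustrative assumptions, not provider quotes, and the function name is hypothetical.

```python
def monthly_cost(base_instances: int, burst_instance_hours: float,
                 on_demand_hourly: float, reserved_discount: float,
                 hours_per_month: float = 730.0) -> dict:
    """Compare all on-demand cost with a reserved base plus on-demand burst."""
    base_hours = base_instances * hours_per_month
    all_on_demand = (base_hours + burst_instance_hours) * on_demand_hourly
    # Only the always-on base load gets the reserved discount.
    blended = (base_hours * on_demand_hourly * (1 - reserved_discount)
               + burst_instance_hours * on_demand_hourly)
    return {"all_on_demand": all_on_demand, "blended": blended,
            "savings": all_on_demand - blended}
```

The savings scale with how much of the total load is genuinely steady, which is why the base load should be measured before committing.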

Multi-cloud or hybrid: For organizations hitting capacity limits with a single cloud provider, distributing across providers or using on-premises hardware provides additional capacity.

Step 5: Contingency Planning

Plan for scenarios that exceed your capacity projections.

Traffic spikes: What happens if inference demand doubles unexpectedly? Have autoscaling configured with sufficient headroom, or have a degradation strategy (serve cached predictions, reduce model complexity, shed low-priority traffic).

GPU shortage: What happens if cloud GPU instances are unavailable when you need to scale? Have a waitlist strategy, cross-region deployment capability, or fallback to CPU inference with optimized models.

Budget overrun: What happens if costs exceed projections? Have cost optimization levers ready to pull (reduce training frequency, apply aggressive model optimization, move to cheaper instance types).

Delivery Process

Phase 1: Assessment and Measurement (Weeks 1-4)

Instrument all AI workloads for resource consumption measurement
Baseline current demand and resource utilization
Interview stakeholders to understand business growth plans and AI roadmap
Collect historical data on traffic patterns and seasonal variations

Phase 2: Planning and Modeling (Weeks 5-8)

Build demand forecasting models
Map demand to resource requirements
Conduct capacity gap analysis
Develop the capacity strategy with three scenarios
Create the capacity timeline and investment plan
Develop contingency plans

Phase 3: Implementation (Weeks 9-14)

Implement autoscaling based on capacity plan parameters
Configure reserved instances for base load
Implement monitoring and alerting for capacity metrics
Build capacity dashboards showing current utilization and projected capacity cliffs
Implement cost tracking and budget alerting

Phase 4: Ongoing Management (Continuous)

Monthly capacity review against projections
Quarterly plan refresh with updated demand data
Annual comprehensive capacity planning cycle
Continuous cost optimization

Capacity Planning for Specific AI Workload Types

LLM Inference Capacity

LLM inference is typically the most expensive and most variable AI workload to plan for.

Key variables:

Tokens per request (input + output) — varies dramatically by use case (50 tokens for classification, 5,000+ for long-form generation)
Concurrent conversations — peak concurrent users times average tokens per second
Model size — determines minimum GPU memory and instances required
Latency target — tighter latency targets require more headroom and less batching

Capacity formula: Start with the peak tokens-per-second requirement. Divide by the throughput of a single serving instance (measured through benchmarking). Add 30 to 50 percent headroom for burst capacity. This gives the minimum number of serving instances.
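The capacity formula translates directly into a few lines of code. In this sketch the function name is hypothetical, the per-instance throughput is assumed to come from your own benchmarks, and the default headroom of 40 percent is simply the midpoint of the 30 to 50 percent range.

```python
import math


def llm_serving_instances(peak_tokens_per_sec: float,
                          instance_tokens_per_sec: float,
                          headroom: float = 0.4) -> int:
    """Minimum serving instances: peak token rate / instance throughput, plus headroom."""
    if instance_tokens_per_sec <= 0:
        raise ValueError("instance_tokens_per_sec must be positive")
    raw = peak_tokens_per_sec / instance_tokens_per_sec
    return math.ceil(raw * (1 + headroom))
```

With a peak of 20,000 tokens per second and instances benchmarked at 2,500 tokens per second, 30 percent headroom gives 11 instances and 50 percent gives 12.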

Training Capacity

Training capacity planning must account for scheduled retraining, ad hoc experimentation, and one-time training events.

Scheduled retraining: Map out the retraining schedule for every production model. Sum the GPU-hours required for all scheduled training in a given month. This is the predictable base load.

Experimentation: Data scientists need GPU access for experiments. Plan for 2 to 4 hours of GPU time per data scientist per day. This can use preemptible or spot instances since experiments can tolerate interruption.

One-time training events: New model development and major model architecture changes require bursts of GPU capacity. These are unpredictable but can be estimated based on the AI roadmap. Reserve cloud GPU quotas in advance for known large training events.
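The predictable part of the training budget can be estimated with simple arithmetic, sketched below. The function name and the 21 working days per month are assumptions, and one-time training events are left out because they have to be estimated separately from the roadmap.

```python
def monthly_training_gpu_hours(retraining_jobs, data_scientists: int,
                               gpu_hours_per_scientist_day: float = 3.0,
                               working_days: int = 21) -> dict:
    """Sum scheduled retraining and experimentation GPU-hours for one month.

    retraining_jobs is a list of (runs_per_month, gpu_hours_per_run) pairs.
    """
    scheduled = sum(runs * hours for runs, hours in retraining_jobs)
    experimentation = data_scientists * gpu_hours_per_scientist_day * working_days
    return {"scheduled": scheduled,
            "experimentation": experimentation,
            "total": scheduled + experimentation}
```

Keeping the two figures separate is useful because the experimentation share is the part that can run on preemptible or spot instances.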

Data Processing Capacity

Data processing capacity depends on data volume, processing complexity, and freshness requirements.

Batch processing: Map data volumes to processing time based on benchmarks. Plan for 2x the current volume to accommodate growth and handle catch-up processing after pipeline failures.

Stream processing: Map event rates to processing instance requirements. Plan for 3x the average event rate to handle peak periods and burst traffic.
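The 2x and 3x planning factors above can be applied as in this sketch. The function names and the benchmark inputs (GB per hour, events per second per instance) are hypothetical and would come from measuring your own pipelines.

```python
import math


def batch_window_hours(daily_gb: float, gb_per_hour: float,
                       volume_factor: float = 2.0) -> float:
    """Hours needed to process the planned batch volume (2x current by default)."""
    return daily_gb * volume_factor / gb_per_hour


def stream_instances(avg_events_per_sec: float, events_per_sec_per_instance: float,
                     peak_factor: float = 3.0) -> int:
    """Instances needed to absorb the planned peak (3x average by default)."""
    return math.ceil(avg_events_per_sec * peak_factor / events_per_sec_per_instance)
```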

Capacity Planning Tools and Dashboards

What to Build

Capacity dashboard. A single view showing current utilization and projected capacity for all AI resources:

GPU utilization by workload type (training, inference, experimentation)
Memory utilization by instance and model
Storage utilization and growth rate
Network bandwidth utilization
Queue depth for training jobs and inference requests

Forecast visualization. Charts showing projected demand versus current capacity with clearly marked capacity cliff dates — the dates when demand is projected to exceed capacity.

Cost projection. Based on the capacity plan, project infrastructure costs at 6, 12, 18, and 24 months. Show the cost impact of different scaling strategies (reserved instances vs. on-demand, spot instances vs. dedicated).

Alert system. Automated alerts for:

Utilization exceeding 80 percent of capacity (pre-capacity-cliff warning)
Scaling events that are slower than expected (autoscaling delays)
Reserved instance utilization below 50 percent (over-provisioning waste)
Budget consumption exceeding forecast by more than 10 percent
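The four alert conditions can be expressed as a small rule check. This is a sketch with hypothetical names and inputs; in practice the values would come from your monitoring and billing systems, and the expected scale-up time is an assumption you set per workload.

```python
def capacity_alerts(utilization: float, scale_up_seconds: float,
                    expected_scale_up_seconds: float, reserved_utilization: float,
                    budget_spent: float, budget_forecast: float) -> list:
    """Return the names of the alert conditions that currently fire."""
    alerts = []
    if utilization > 0.80:                            # pre-capacity-cliff warning
        alerts.append("utilization_high")
    if scale_up_seconds > expected_scale_up_seconds:  # autoscaling delay
        alerts.append("slow_scaling")
    if reserved_utilization < 0.50:                   # over-provisioning waste
        alerts.append("reserved_underused")
    if budget_spent > budget_forecast * 1.10:         # more than 10 percent over
        alerts.append("budget_overrun")
    return alerts
```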

Common Capacity Planning Mistakes

Mistake 1: Planning for average demand. Average demand is misleading because peaks drive capacity requirements. If average GPU utilization is 50 percent but peak utilization is 95 percent, you are already at the edge of capacity during peaks.
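A short sketch makes the gap between average and peak concrete. The function name is hypothetical, the input is assumed to be a list of utilization samples between 0 and 1, and the p95 uses the simple nearest-rank method.

```python
from statistics import fmean


def utilization_summary(samples: list) -> dict:
    """Average, nearest-rank p95, and peak of a list of utilization samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    n = len(ordered)
    # Nearest-rank p95 using integer arithmetic: rank = ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    return {"average": fmean(ordered), "p95": ordered[rank - 1], "peak": ordered[-1]}
```

Eighteen samples at 0.40 plus one each at 0.90 and 0.95 average out to about 0.45, which looks comfortable, while the peak shows almost no room left.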

Mistake 2: Ignoring lead times. Cloud GPU instances, especially high-end ones (A100, H100), may not be available instantly. Lead times of days to weeks are common during high-demand periods. Plan capacity increases well in advance of projected need.

Mistake 3: Not accounting for the AI roadmap. A capacity plan based only on current workloads will be obsolete as soon as the next model is deployed. Always incorporate the AI roadmap — planned model sizes, new use cases, expected user growth.

Mistake 4: Treating capacity planning as a one-time exercise. Capacity plans go stale within months as demand patterns change, new workloads are added, and growth rates shift. Capacity planning must be a continuous process with monthly reviews and quarterly plan updates.

Capacity Planning for LLM Workloads

LLM workloads introduce unique capacity planning challenges because their resource consumption is highly variable and depends on input and output token lengths.

Token-based capacity modeling. LLM capacity is not measured in requests per second — it is measured in tokens per second. A request with a 100-token input and a 50-token output consumes far fewer resources than a request with a 10,000-token input and a 2,000-token output. Model capacity based on token volume, not request count.
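The difference between counting requests and counting tokens is easy to see in a small sketch. The function name is hypothetical, and treating input and output tokens as equally costly is a simplifying assumption, since generation usually costs more per token than prompt processing.

```python
def tokens_per_second(request_mix) -> float:
    """Total token rate for a mix of traffic classes.

    request_mix is a list of (requests_per_sec, input_tokens, output_tokens).
    """
    return sum(rps * (inp + out) for rps, inp, out in request_mix)


# Same request rate, very different token load.
short_requests = tokens_per_second([(10, 100, 50)])
long_requests = tokens_per_second([(10, 10_000, 2_000)])
```

Both workloads run at 10 requests per second, but the long requests generate 120,000 tokens per second against 1,500 for the short ones, an 80x difference.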

Prompt caching effects. LLM APIs that support prompt caching (reusing the KV cache for common prompt prefixes) can dramatically increase effective capacity for applications with shared system prompts. Factor caching into capacity planning: with effective caching, the same infrastructure can handle 2x to 3x more requests, though the actual gain depends on how much of each prompt is shared.

Model switching capacity. Organizations running multiple LLMs (different sizes for different tasks) need capacity that accounts for the model mix. Routing 40 percent of traffic to a small model and 60 percent to a large model requires different capacity than routing all traffic to one model.
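The effect of the model mix on capacity can be sketched by sizing each model's pool from its share of the token traffic. The function name and the throughput figures in the example are hypothetical; real numbers would come from benchmarking each model.

```python
import math


def instances_for_mix(total_tokens_per_sec: float, mix: dict) -> dict:
    """Instances per model given its traffic share and per-instance throughput.

    mix maps a model name to (traffic_share, tokens_per_sec_per_instance).
    """
    return {
        name: math.ceil(total_tokens_per_sec * share / throughput)
        for name, (share, throughput) in mix.items()
    }
```

With 10,000 tokens per second split 40/60 between a small model at 4,500 tokens per second per instance and a large model at 1,100, the result is 1 small and 6 large instances, compared with 10 large instances if all traffic went to the large model.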

Capacity Planning Tools and Automation

Monitoring-driven capacity modeling. Use historical monitoring data (request rates, latencies, resource utilization) to build statistical models of capacity requirements. Tools like Prometheus with custom dashboards can visualize capacity trends and project future needs.
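As a minimal sketch of a statistical model built from monitoring data, the code below fits a straight line to monthly utilization samples and projects it forward. It assumes the samples have already been exported from your monitoring system, that growth is roughly linear, and that the function names are illustrative.

```python
from statistics import fmean


def fit_trend(months: list, utilization: list) -> tuple:
    """Ordinary least-squares line through (month, utilization) points."""
    if len(months) != len(utilization) or len(months) < 2:
        raise ValueError("need at least two paired samples")
    mx, my = fmean(months), fmean(utilization)
    sxx = sum((x - mx) ** 2 for x in months)
    if sxx == 0:
        raise ValueError("months must not all be equal")
    slope = sum((x - mx) * (y - my) for x, y in zip(months, utilization)) / sxx
    return slope, my - slope * mx


def project_utilization(months: list, utilization: list, future_month: float) -> float:
    """Utilization predicted by the fitted line at a future month."""
    slope, intercept = fit_trend(months, utilization)
    return slope * future_month + intercept
```

A linear fit is only a starting point; seasonal spikes and roadmap changes need to be layered on top of it.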

Automated capacity alerts. Set alerts when utilization approaches capacity limits — 70 percent for warning, 85 percent for critical. These alerts trigger capacity review discussions before users experience degradation.

Autoscaling as capacity management. Well-configured autoscaling is a form of dynamic capacity management. Instead of provisioning fixed capacity for peak demand, autoscaling adjusts capacity in real time based on actual demand. However, autoscaling has limits: it cannot provision resources that are not available (GPU shortages), and it cannot scale instantly (cold-start latency). Capacity planning ensures that autoscaling has the headroom it needs.

Building a Capacity Planning Practice

Capacity planning is not a one-time project — it is an ongoing discipline that must be embedded in the organization's operational rhythm.

Monthly capacity reviews. Conduct monthly reviews comparing actual demand against projections. When actual demand diverges from projections by more than 15 percent, update the capacity plan. Monthly reviews catch surprises early — before they become emergencies.
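The monthly review trigger reduces to a one-line divergence check, sketched here with a hypothetical function name and the 15 percent threshold from the text as the default.

```python
def needs_plan_update(actual: float, projected: float, threshold: float = 0.15) -> bool:
    """True when actual demand diverges from the projection by more than the threshold."""
    if projected <= 0:
        raise ValueError("projected must be positive")
    return abs(actual - projected) / projected > threshold
```

The check is symmetric on purpose: demand running well below projection is also a reason to revisit the plan, since it points to over-provisioning.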

Quarterly plan refresh. Every quarter, refresh the full capacity plan with updated demand data, revised business growth projections, and any new AI workloads that have been added or are planned. The quarterly refresh ensures the capacity plan stays aligned with business reality.

Capacity planning as a service. For agencies, capacity planning is an excellent recurring revenue engagement. The initial assessment and plan are project-based, but the ongoing monitoring, review, and plan updates create a natural monthly retainer. Clients value the peace of mind that comes from knowing someone is watching their capacity trajectory and will flag issues before they become outages.

Cross-functional alignment. Effective capacity planning requires input from multiple stakeholders — data scientists (model roadmap), product managers (user growth projections), engineering (infrastructure constraints), and finance (budget availability). Schedule quarterly alignment meetings that bring these stakeholders together to review capacity projections and agree on investment priorities.

Pricing Capacity Planning Engagements

Capacity assessment and planning: $15,000 to $40,000
Full capacity planning with implementation: $40,000 to $100,000
Ongoing capacity management: $5,000 to $15,000 per month

Capacity Planning for Rapid Growth Scenarios

Some organizations experience rapid AI adoption where inference volumes double or triple within months. Standard capacity planning assumes gradual growth and fails in these scenarios.

Early warning indicators. Monitor new model deployment frequency, new application onboarding rate, and user growth metrics. When these leading indicators accelerate beyond projections, trigger an off-cycle capacity review before the infrastructure hits its limit.

Burst capacity reserves. For organizations with unpredictable growth, maintain 30 to 50 percent spare capacity beyond projected needs. The cost of over-provisioning is far less than the cost of hitting capacity limits during a growth surge and losing users or revenue.

Your Next Step

This week: Check the GPU utilization of your clients' AI infrastructure. If average utilization is above 70 percent, they are at risk of capacity issues during peaks. If it is below 30 percent, they are likely overspending.

This month: Build a capacity planning template that covers inference, training, and data processing workloads. Include demand forecasting, resource mapping, and gap analysis.

This quarter: Deliver your first capacity planning engagement. Demonstrate the gap between projected demand and current capacity, and provide a concrete investment plan to close the gap.



Agency Script Editorial
