Provisioning for 2x Traffic When the Sale Brought 4x

Agency Script Editorial, Editorial Team

March 21, 2026 · 13 min read

An e-commerce company's recommendation engine served 45 million predictions per day during normal operations. During their annual sales event, traffic spiked to 180 million predictions per day, 4x normal. They had provisioned for 2x normal, based on the previous year's event. The recommendation service hit capacity limits within the first hour of the sale, latency spiked from 80 milliseconds to 3.2 seconds, and the product team disabled recommendations entirely to prevent the latency from affecting page load times. During the 72-hour sale event, the recommendation engine served zero predictions. Post-event analysis estimated that missing recommendations cost $2.8 million in lost incremental revenue. The company had capacity planning for their web infrastructure (which scaled perfectly) but no capacity planning for their AI infrastructure.

AI infrastructure capacity planning is a specialized discipline. AI workloads have unique characteristics (GPU dependencies, variable processing costs, bursty training patterns, and model-specific resource requirements) that make traditional capacity planning approaches insufficient.

Why AI Capacity Planning Is Different

GPU scarcity. Unlike CPU and memory, GPU capacity cannot be scaled instantly. Cloud GPU instances often have limited availability, especially for high-end models (A100, H100). During demand spikes, GPU instances may simply not be available regardless of budget.

Variable request costs. A web server handles HTTP requests with relatively uniform processing cost. An AI inference server handles requests with wildly variable costs โ€” from milliseconds to minutes depending on input size and model complexity.

Dual workload pattern. AI infrastructure serves two fundamentally different workloads: training (bursty, high-GPU, fault-tolerant) and inference (continuous, latency-sensitive, high-availability). Capacity planning must address both.

Model evolution. As models grow in size and complexity, their resource requirements change. A model that fits in one GPU today may require four GPUs next quarter. Capacity planning must account for the model roadmap.

Capacity Planning Framework

Step 1: Demand Forecasting

Current demand baseline:

Measure the current demand across all dimensions:

  • Inference requests per second (average, p95, peak)
  • Training jobs per week (average GPU hours consumed)
  • Data processing volume (GB processed per day)
  • Storage growth rate (GB per month for models, data, logs)

Growth projections:

Project demand growth based on:

  • Business growth plans (new users, new products, new markets)
  • AI roadmap (new models planned, model size increases, new use cases)
  • Seasonal patterns (known traffic spikes, promotional events)
  • Historical growth rate (if no business plans available, extrapolate historical trends)

Create three scenarios: Conservative (business as usual), moderate (planned growth), and aggressive (stretch goals). Plan for moderate but have contingency plans for aggressive.
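The three scenarios can be sketched as simple compound-growth projections. This is a minimal illustration, not a forecasting model: the monthly growth rates and the 500 requests-per-second baseline are hypothetical placeholders, and the horizons of 6, 12, 18, and 24 months mirror the capacity timeline used later in this framework.

```python
def project_demand(baseline: float, monthly_growth: float, months: int) -> float:
    """Compound a baseline demand figure forward at a fixed monthly growth rate."""
    return baseline * (1.0 + monthly_growth) ** months


# Hypothetical growth rates; replace with rates derived from the client's own plans.
SCENARIOS = {"conservative": 0.02, "moderate": 0.05, "aggressive": 0.10}


def scenario_table(baseline: float, horizons=(6, 12, 18, 24)) -> dict:
    """Return projected demand per scenario at each planning horizon (in months)."""
    return {
        name: {m: round(project_demand(baseline, rate, m), 1) for m in horizons}
        for name, rate in SCENARIOS.items()
    }


# Assumed baseline: 500 peak inference requests per second today.
table = scenario_table(baseline=500.0)
for name, row in table.items():
    print(name, row)
```

Plan purchases against the moderate row and use the aggressive row to size contingency plans.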

Step 2: Resource Requirement Mapping

For each AI workload, map the demand to specific resource requirements.

Inference workloads:

  • Requests per second to GPU instances (based on throughput per instance per model)
  • Latency SLA to minimum instances (cannot scale below the number needed for SLA compliance)
  • Availability SLA to redundancy (99.9 percent availability requires at least N+1 instances)
  • Peak demand to autoscaling headroom
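One way to combine the four inference mappings above into a single instance count is sketched below. The function name and parameters are illustrative assumptions; per-instance throughput should come from benchmarking the actual model, and the headroom fraction is a planning choice rather than a fixed rule.

```python
import math


def inference_instances(
    peak_rps: float,
    rps_per_instance: float,
    headroom: float,
    min_for_latency_sla: int = 1,
    redundancy: int = 1,
) -> int:
    """Instances needed for peak load plus headroom, floored by the latency SLA,
    with spare instances added for availability (N+1 when redundancy=1)."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    for_throughput = math.ceil(peak_rps * (1.0 + headroom) / rps_per_instance)
    return max(for_throughput, min_for_latency_sla) + redundancy


# Hypothetical numbers: 1,000 peak requests/s, 100 requests/s per GPU instance.
print(inference_instances(peak_rps=1000, rps_per_instance=100, headroom=0.5))
```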

Training workloads:

  • Training frequency to GPU hours per month
  • Model size to GPU memory requirements
  • Training data size to storage requirements
  • Experiment volume to compute budget

Data processing workloads:

  • Data volume to compute instances and storage
  • Processing latency requirements to compute power
  • Pipeline frequency to scheduling capacity

Step 3: Capacity Gap Analysis

Compare projected requirements against current capacity to identify gaps.

Create a capacity timeline showing:

  • Current capacity for each resource type
  • Projected demand at 6, 12, 18, and 24 months
  • The point where demand exceeds capacity (capacity cliff)
  • The lead time needed to provision additional capacity

Identify the binding constraint: the resource that will run out first. This determines the urgency of the capacity planning effort. Common binding constraints:

  • GPU instances for inference (limited by cloud availability and budget)
  • GPU hours for training (limited by budget and instance availability)
  • Storage for data and models (usually the easiest to scale)
  • Network bandwidth for data-intensive workloads
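The capacity cliff and binding constraint can be found with a few lines of arithmetic. The sketch below assumes steady compound growth per resource, which is a simplification, and the resource names and figures are hypothetical.

```python
from typing import Optional


def months_until_cliff(demand: float, capacity: float, monthly_growth: float,
                       horizon_months: int = 24) -> Optional[int]:
    """Return the first month where projected demand exceeds capacity, else None."""
    for month in range(horizon_months + 1):
        if demand * (1.0 + monthly_growth) ** month > capacity:
            return month
    return None


def provision_by(cliff_month: int, lead_time_months: int) -> int:
    """Latest month to start provisioning so new capacity lands before the cliff."""
    return max(0, cliff_month - lead_time_months)


def binding_constraint(resources: dict) -> Optional[str]:
    """resources maps name -> (demand, capacity, monthly_growth).
    Returns the resource whose cliff arrives first within the horizon."""
    cliffs = {name: months_until_cliff(*values) for name, values in resources.items()}
    reachable = {name: month for name, month in cliffs.items() if month is not None}
    return min(reachable, key=reachable.get) if reachable else None


# Hypothetical inventory: current demand, current capacity, monthly growth rate.
resources = {
    "inference_gpus": (100.0, 150.0, 0.10),
    "storage_tb": (100.0, 400.0, 0.05),
}
print(binding_constraint(resources))
```

Subtracting the provisioning lead time from the cliff month gives the date by which the purchase decision has to be made, which is usually the number stakeholders care about most.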

Step 4: Capacity Strategy

Scaling strategies:

Vertical scaling: Use more powerful instances (upgrade from T4 to A10G to A100). Quick to implement but has limits (you cannot upgrade beyond the most powerful available instance).

Horizontal scaling: Add more instances and distribute work across them. Requires load balancing and potentially model parallelism. Scales further than vertical but adds operational complexity.

Efficiency optimization: Reduce resource requirements per unit of work through model optimization (quantization, pruning, distillation), serving optimization (batching, caching), and infrastructure optimization (right-sizing, spot instances). This is often the highest-ROI strategy because it reduces cost while increasing capacity.

Reservation and commitment: For predictable base load, use reserved instances or committed use discounts to reduce cost by 30 to 60 percent. Use on-demand instances for variable demand above the base.
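A rough cost comparison makes the reservation trade-off concrete. The hourly price, instance counts, and the 730-hour month below are assumptions for illustration; actual discounts depend on the provider and commitment term.

```python
def monthly_cost(
    base_instances: int,
    burst_instance_hours: float,
    on_demand_hourly: float,
    reserved_discount: float,
    hours_in_month: int = 730,
) -> float:
    """Cost of running the base load on reserved instances and bursts on demand."""
    reserved_hourly = on_demand_hourly * (1.0 - reserved_discount)
    reserved_cost = base_instances * hours_in_month * reserved_hourly
    burst_cost = burst_instance_hours * on_demand_hourly
    return reserved_cost + burst_cost


# Hypothetical: 10 base instances at $2.00/hour, 1,000 burst instance-hours.
blended = monthly_cost(10, 1000, on_demand_hourly=2.0, reserved_discount=0.5)
all_on_demand = monthly_cost(10, 1000, on_demand_hourly=2.0, reserved_discount=0.0)
print(blended, all_on_demand)
```

The saving only holds if the base load really is predictable: a reserved instance that sits idle still costs the reserved rate.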

Multi-cloud or hybrid: For organizations hitting capacity limits with a single cloud provider, distributing across providers or using on-premises hardware provides additional capacity.

Step 5: Contingency Planning

Plan for scenarios that exceed your capacity projections.

Traffic spikes: What happens if inference demand doubles unexpectedly? Have autoscaling configured with sufficient headroom, or have a degradation strategy (serve cached predictions, reduce model complexity, shed low-priority traffic).
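A degradation strategy is easier to execute under pressure if the ladder is decided in advance. The sketch below is one possible policy; the mode names and utilization thresholds are hypothetical and should be tuned to the service in question.

```python
from enum import Enum


class Mode(Enum):
    FULL_MODEL = "full_model"
    CACHED = "serve_cached_predictions"
    REDUCED_MODEL = "reduced_model_complexity"
    SHED_LOW_PRIORITY = "shed_low_priority_traffic"


def choose_mode(utilization: float, p95_latency_ms: float, latency_slo_ms: float) -> Mode:
    """Pick a serving mode from current utilization (0 to 1) and p95 latency.
    Thresholds are illustrative assumptions, not recommendations."""
    over_slo = p95_latency_ms > latency_slo_ms
    if utilization >= 0.95 or (over_slo and utilization >= 0.85):
        return Mode.SHED_LOW_PRIORITY
    if utilization >= 0.85 or over_slo:
        return Mode.REDUCED_MODEL
    if utilization >= 0.70:
        return Mode.CACHED
    return Mode.FULL_MODEL


print(choose_mode(utilization=0.75, p95_latency_ms=80, latency_slo_ms=200))
```

In the opening example, a ladder like this would have given the product team options between full recommendations and none at all.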

GPU shortage: What happens if cloud GPU instances are unavailable when you need to scale? Have a waitlist strategy, cross-region deployment capability, or fallback to CPU inference with optimized models.

Budget overrun: What happens if costs exceed projections? Have cost optimization levers ready to pull (reduce training frequency, apply aggressive model optimization, move to cheaper instance types).

Delivery Process

Phase 1: Assessment and Measurement (Weeks 1-4)

  • Instrument all AI workloads for resource consumption measurement
  • Baseline current demand and resource utilization
  • Interview stakeholders to understand business growth plans and AI roadmap
  • Collect historical data on traffic patterns and seasonal variations

Phase 2: Planning and Modeling (Weeks 5-8)

  • Build demand forecasting models
  • Map demand to resource requirements
  • Conduct capacity gap analysis
  • Develop the capacity strategy with three scenarios
  • Create the capacity timeline and investment plan
  • Develop contingency plans

Phase 3: Implementation (Weeks 9-14)

  • Implement autoscaling based on capacity plan parameters
  • Configure reserved instances for base load
  • Implement monitoring and alerting for capacity metrics
  • Build capacity dashboards showing current utilization and projected capacity cliffs
  • Implement cost tracking and budget alerting

Phase 4: Ongoing Management (Continuous)

  • Monthly capacity review against projections
  • Quarterly plan refresh with updated demand data
  • Annual comprehensive capacity planning cycle
  • Continuous cost optimization

Capacity Planning for Specific AI Workload Types

LLM Inference Capacity

LLM inference is the most expensive and most variable AI workload to plan for.

Key variables:

  • Tokens per request (input + output): varies dramatically by use case (50 tokens for classification, 5,000+ for long-form generation)
  • Concurrent conversations: peak concurrent users times average tokens per second per conversation gives the peak tokens-per-second requirement
  • Model size: determines minimum GPU memory and instances required
  • Latency target: tighter latency targets require more headroom and less batching

Capacity formula: Start with the peak tokens-per-second requirement. Divide by the throughput of a single serving instance (measured through benchmarking). Add 30 to 50 percent headroom for burst capacity. This gives the minimum number of serving instances.
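The capacity formula translates directly into code. In this sketch the function names are illustrative, and the concurrency and throughput figures are hypothetical; the per-instance throughput must come from benchmarking the actual model and hardware.

```python
import math


def peak_tokens_per_second(peak_concurrent_users: int, tokens_per_s_per_user: float) -> float:
    """Peak token throughput implied by concurrent conversations."""
    return peak_concurrent_users * tokens_per_s_per_user


def llm_serving_instances(peak_tokens_per_s: float, tokens_per_s_per_instance: float,
                          headroom: float) -> int:
    """Minimum serving instances: peak demand plus burst headroom, divided by
    the benchmarked throughput of one instance, rounded up."""
    required = peak_tokens_per_s * (1.0 + headroom)
    return math.ceil(required / tokens_per_s_per_instance)


# Hypothetical: 400 concurrent users at 50 tokens/s each, 2,500 tokens/s per instance.
peak = peak_tokens_per_second(400, 50)
print(llm_serving_instances(peak, 2500, headroom=0.5))
```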

Training Capacity

Training capacity planning must account for scheduled retraining, ad hoc experimentation, and one-time training events.

Scheduled retraining: Map out the retraining schedule for every production model. Sum the GPU-hours required for all scheduled training in a given month. This is the predictable base load.

Experimentation: Data scientists need GPU access for experiments. Plan for 2 to 4 hours of GPU time per data scientist per day. This can use preemptible or spot instances since experiments can tolerate interruption.

One-time training events: New model development and major model architecture changes require bursts of GPU capacity. These are unpredictable but can be estimated based on the AI roadmap. Reserve cloud GPU quotas in advance for known large training events.
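The predictable part of the training budget can be tallied as follows. The job list, team size, and working days per month are hypothetical; one-time training events are left out because they need to be estimated separately from the roadmap.

```python
def monthly_training_gpu_hours(
    scheduled_jobs: list,
    data_scientists: int,
    gpu_hours_per_scientist_day: float,
    working_days: int = 21,
) -> dict:
    """Sum scheduled retraining and experimentation GPU-hours for one month.
    Each job is a dict with 'gpu_hours' per run and 'runs_per_month'."""
    scheduled = sum(job["gpu_hours"] * job["runs_per_month"] for job in scheduled_jobs)
    experimentation = data_scientists * gpu_hours_per_scientist_day * working_days
    return {
        "scheduled": scheduled,
        "experimentation": experimentation,
        "total": scheduled + experimentation,
    }


# Hypothetical retraining schedule and a team of five data scientists.
jobs = [
    {"gpu_hours": 40, "runs_per_month": 4},   # weekly retrain
    {"gpu_hours": 200, "runs_per_month": 1},  # monthly retrain
]
budget = monthly_training_gpu_hours(jobs, data_scientists=5,
                                    gpu_hours_per_scientist_day=3, working_days=20)
print(budget)
```

Keeping the two figures separate matters because the scheduled portion suits reserved capacity while the experimentation portion can run on spot or preemptible instances.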

Data Processing Capacity

Data processing capacity depends on data volume, processing complexity, and freshness requirements.

Batch processing: Map data volumes to processing time based on benchmarks. Plan for 2x the current volume to accommodate growth and handle catch-up processing after pipeline failures.

Stream processing: Map event rates to processing instance requirements. Plan for 3x the average event rate to handle peak periods and burst traffic.

Capacity Planning Tools and Dashboards

What to Build

Capacity dashboard. A single view showing current utilization and projected capacity for all AI resources:

  • GPU utilization by workload type (training, inference, experimentation)
  • Memory utilization by instance and model
  • Storage utilization and growth rate
  • Network bandwidth utilization
  • Queue depth for training jobs and inference requests

Forecast visualization. Charts showing projected demand versus current capacity with clearly marked capacity cliff dates: the dates when demand is projected to exceed capacity.

Cost projection. Based on the capacity plan, project infrastructure costs at 6, 12, 18, and 24 months. Show the cost impact of different scaling strategies (reserved instances vs. on-demand, spot instances vs. dedicated).

Alert system. Automated alerts for:

  • Utilization exceeding 80 percent of capacity (pre-capacity-cliff warning)
  • Scaling events that are slower than expected (autoscaling delays)
  • Reserved instance utilization below 50 percent (over-provisioning waste)
  • Budget consumption exceeding forecast by more than 10 percent
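The four alert conditions above can be expressed as a single check. This is a sketch of the logic only, with hypothetical alert names and inputs; in practice the same rules would live in whatever monitoring system the client already runs.

```python
def capacity_alerts(
    utilization: float,
    reserved_utilization: float,
    spend: float,
    forecast_spend: float,
    scale_up_seconds: float,
    expected_scale_up_seconds: float,
) -> list:
    """Return the names of the alert conditions that currently fire.
    Utilization values are fractions between 0 and 1."""
    alerts = []
    if utilization > 0.80:
        alerts.append("capacity_warning")
    if scale_up_seconds > expected_scale_up_seconds:
        alerts.append("slow_autoscaling")
    if reserved_utilization < 0.50:
        alerts.append("reserved_underused")
    if spend > forecast_spend * 1.10:
        alerts.append("budget_overrun")
    return alerts


print(capacity_alerts(0.85, 0.40, 12000, 10000, 300, 120))
```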

Common Capacity Planning Mistakes

Mistake 1: Planning for average demand. Average demand is misleading because peaks drive capacity requirements. If average GPU utilization is 50 percent but peak utilization is 95 percent, you are already at the edge of capacity during peaks.
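A quick summary of utilization samples shows how far the average can sit from the peak. The sketch uses a nearest-rank 95th percentile and made-up samples expressed in whole percent; real data would come from the monitoring system.

```python
from statistics import mean


def utilization_summary(samples_pct: list) -> dict:
    """Average, nearest-rank p95, and peak of utilization samples (in percent)."""
    ordered = sorted(samples_pct)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integer arithmetic.
    rank = max(1, -(-95 * len(ordered) // 100))
    return {"average": mean(ordered), "p95": ordered[rank - 1], "peak": ordered[-1]}


# Hypothetical hourly samples: mostly quiet, with two busy hours.
samples = [45] * 18 + [95, 95]
print(utilization_summary(samples))
```

With these samples the average looks comfortable while the p95 and peak show the service is already near its limit during busy hours.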

Mistake 2: Ignoring lead times. Cloud GPU instances, especially high-end ones (A100, H100), may not be available instantly. Lead times of days to weeks are common during high-demand periods. Plan capacity increases well in advance of projected need.

Mistake 3: Not accounting for the AI roadmap. A capacity plan based only on current workloads will be obsolete as soon as the next model is deployed. Always incorporate the AI roadmap: planned model sizes, new use cases, expected user growth.

Mistake 4: Treating capacity planning as a one-time exercise. Capacity plans go stale within months as demand patterns change, new workloads are added, and growth rates shift. Capacity planning must be a continuous process with monthly reviews and quarterly plan updates.

Capacity Planning for LLM Workloads

LLM workloads introduce unique capacity planning challenges because their resource consumption is highly variable and depends on input and output token lengths.

Token-based capacity modeling. LLM capacity is not measured in requests per second; it is measured in tokens per second. A request with a 100-token input and a 50-token output consumes far fewer resources than a request with a 10,000-token input and a 2,000-token output. Model capacity based on token volume, not request count.

Prompt caching effects. LLM APIs that support prompt caching (caching the KV cache for common prompt prefixes) can dramatically increase effective capacity for applications with shared system prompts. Factor caching into capacity planning: with effective caching, the same infrastructure can handle 2x to 3x more requests.

Model switching capacity. Organizations running multiple LLM models (different sizes for different tasks) need capacity that accounts for the model mix. Routing 40 percent of traffic to a small model and 60 percent to a large model requires different capacity than routing all traffic to one model.
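Model mix can be folded into a token-based capacity estimate by sizing each model's fleet against its share of peak token volume. The sketch below assumes the traffic shares apply to token volume and uses hypothetical per-instance throughput figures; it does not model prompt caching.

```python
import math


def instances_for_mix(peak_tokens_per_s: float, mix: dict, throughput: dict,
                      headroom: float) -> dict:
    """Instances per model for a routed traffic mix.
    mix maps model -> share of peak tokens (shares sum to 1);
    throughput maps model -> benchmarked tokens/s per serving instance."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return {
        model: math.ceil(peak_tokens_per_s * share * (1.0 + headroom) / throughput[model])
        for model, share in mix.items()
    }


# Hypothetical throughput: the small model serves 4x the tokens/s of the large one.
throughput = {"small": 4000, "large": 1000}
mixed = instances_for_mix(10000, {"small": 0.4, "large": 0.6}, throughput, headroom=0.5)
large_only = instances_for_mix(10000, {"large": 1.0}, throughput, headroom=0.5)
print(mixed, large_only)
```

Comparing the two results shows how much capacity a routing change frees up, which helps justify the engineering effort of adding a router.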

Capacity Planning Tools and Automation

Monitoring-driven capacity modeling. Use historical monitoring data (request rates, latencies, resource utilization) to build statistical models of capacity requirements. Tools like Prometheus with custom dashboards can visualize capacity trends and project future needs.

Automated capacity alerts. Set alerts when utilization approaches capacity limits: 70 percent for warning, 85 percent for critical. These alerts trigger capacity review discussions before users experience degradation.

Auto-scaling as capacity management. Well-configured auto-scaling is a form of dynamic capacity management. Instead of provisioning fixed capacity for peak demand, auto-scaling adjusts capacity in real-time based on actual demand. However, auto-scaling has limits: it cannot provision resources that are not available (GPU shortages) and it cannot scale instantly (cold-start latency). Capacity planning ensures that auto-scaling has the headroom it needs.

Building a Capacity Planning Practice

Capacity planning is not a one-time project; it is an ongoing discipline that must be embedded in the organization's operational rhythm.

Monthly capacity reviews. Conduct monthly reviews comparing actual demand against projections. When actual demand diverges from projections by more than 15 percent, update the capacity plan. Monthly reviews catch surprises early, before they become emergencies.
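The 15 percent rule for monthly reviews is simple enough to automate. A minimal sketch, with a hypothetical function name and example figures:

```python
def needs_plan_update(actual: float, projected: float, tolerance: float = 0.15) -> bool:
    """True when actual demand diverges from the projection by more than the
    tolerance, in either direction."""
    if projected <= 0:
        raise ValueError("projected demand must be positive")
    return abs(actual - projected) / projected > tolerance


# Hypothetical month: projected 100 peak requests/s, observed 120.
print(needs_plan_update(actual=120, projected=100))
```

The check is symmetric on purpose: demand running well below projection is also a reason to revisit the plan, since it usually means capacity is over-provisioned.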

Quarterly plan refresh. Every quarter, refresh the full capacity plan with updated demand data, revised business growth projections, and any new AI workloads that have been added or are planned. The quarterly refresh ensures the capacity plan stays aligned with business reality.

Capacity planning as a service. For agencies, capacity planning is an excellent recurring revenue engagement. The initial assessment and plan are project-based, but the ongoing monitoring, review, and plan updates create a natural monthly retainer. Clients value the peace of mind that comes from knowing someone is watching their capacity trajectory and will flag issues before they become outages.

Cross-functional alignment. Effective capacity planning requires input from multiple stakeholders: data scientists (model roadmap), product managers (user growth projections), engineering (infrastructure constraints), and finance (budget availability). Schedule quarterly alignment meetings that bring these stakeholders together to review capacity projections and agree on investment priorities.

Pricing Capacity Planning Engagements

  • Capacity assessment and planning: $15,000 to $40,000
  • Full capacity planning with implementation: $40,000 to $100,000
  • Ongoing capacity management: $5,000 to $15,000 per month

Capacity Planning for Rapid Growth Scenarios

Some organizations experience rapid AI adoption where inference volumes double or triple within months. Standard capacity planning assumes gradual growth and fails in these scenarios.

Early warning indicators. Monitor new model deployment frequency, new application onboarding rate, and user growth metrics. When these leading indicators accelerate beyond projections, trigger an off-cycle capacity review before the infrastructure hits its limit.

Burst capacity reserves. For organizations with unpredictable growth, maintain 30 to 50 percent spare capacity beyond projected needs. The cost of over-provisioning is far less than the cost of hitting capacity limits during a growth surge and losing users or revenue.

Your Next Step

This week: Check the GPU utilization of your clients' AI infrastructure. If average utilization is above 70 percent, they are at risk of capacity issues. If it is below 30 percent, they are overspending.

This month: Build a capacity planning template that covers inference, training, and data processing workloads. Include demand forecasting, resource mapping, and gap analysis.

This quarter: Deliver your first capacity planning engagement. Demonstrate the gap between projected demand and current capacity, and provide a concrete investment plan to close the gap.
