A growing AI agency in San Francisco was spending $47,000 per month on GPU cloud costs across three cloud providers. Their GPU utilization averaged 23%, meaning 77% of the GPU hours they were paying for sat idle. Training jobs ran on expensive on-demand A100 instances when spot instances would have worked. Inference workloads used the same instance types as training, burning through high-end GPUs for workloads that could run on cheaper hardware. Nobody had a clear picture of which client's workloads were consuming which resources. After a two-week infrastructure audit and restructuring, the agency reduced monthly GPU costs to $18,000, increased training throughput by 40% through better scheduling and hardware matching, and implemented per-client cost tracking that enabled accurate project profitability analysis for the first time.
GPU cluster management is the operational discipline of acquiring, allocating, scheduling, monitoring, and optimizing GPU resources for ML training and inference workloads. For AI agencies, GPU costs are often the second-largest expense after personnel, and poorly managed GPU infrastructure can turn profitable projects into money losers. This guide covers the full spectrum of GPU management decisions, from hardware selection to cost optimization to multi-tenant operations.
GPU Hardware Selection
Understanding GPU Tiers
Not all GPUs are created equal, and not all workloads need the same GPU. Matching the right GPU to each workload is the foundation of cost-efficient GPU management.
Training-class GPUs:
- NVIDIA A100 (40GB/80GB): The workhorse for large-scale training. Excellent for large language models, large vision models, and any training job that benefits from high memory bandwidth and large VRAM. Cost: $2.50-4.00/hour on-demand across major cloud providers.
- NVIDIA H100: Up to 2-3x faster than A100 for transformer training. Well suited to training models with billions of parameters, but the price is justified only for the largest training jobs. Cost: $4.50-8.00/hour on-demand.
- NVIDIA A10G: Good balance of price and performance for medium-scale training. Suitable for fine-tuning pre-trained models and training models with fewer than 1 billion parameters. Cost: $1.00-1.50/hour on-demand.
- NVIDIA L4: Optimized for inference but capable of training smaller models. Good for fine-tuning and training on a budget. Cost: $0.50-0.80/hour on-demand.
Inference-class GPUs:
- NVIDIA T4: The default choice for production inference. Excellent price-performance ratio for serving models. Cost: $0.30-0.50/hour on-demand.
- NVIDIA L4: Better inference throughput than T4 with comparable pricing. Replacing T4 as the default inference GPU on newer cloud instances.
- NVIDIA A10G: For inference workloads that need more VRAM or compute than T4/L4: large language models, high-resolution image models, or batched inference with large batch sizes.
Cost-saving purchasing options and older hardware:
- Spot/preemptible instances: 60-80% cheaper than on-demand. Use for training workloads that can be interrupted and resumed (checkpoint regularly). Not suitable for inference workloads that need continuous availability.
- Reserved instances: 30-60% cheaper than on-demand for 1-3 year commitments. Use for inference workloads with predictable, sustained demand.
- Previous-generation GPUs (V100, P100): 40-60% cheaper than current generation. Suitable for smaller training jobs and inference workloads that do not need cutting-edge performance.
Right-Sizing Workloads
The most common GPU waste is using overpowered GPUs for workloads that do not need them.
Workload-to-GPU matching guidelines:
- Fine-tuning a BERT model: A10G or even T4 with gradient accumulation. An A100 is overkill.
- Training a custom vision model on 100K images: A single A10G or A100 depending on model size and batch size requirements.
- Training a billion-parameter language model: Multi-GPU A100 or H100. This is where expensive hardware is justified.
- Serving a BERT-based classification model: T4 or L4. A10G only if you need to serve multiple models on the same GPU.
- Serving a large language model (7B+ parameters): A10G or A100 depending on model size and throughput requirements.
- Running inference on images: T4 for batch inference, L4 for real-time inference with low latency requirements.
How to right-size:
- Run the workload on the smallest reasonable GPU
- Monitor GPU utilization, memory usage, and inference latency
- If GPU utilization is below 50%, downgrade to a cheaper GPU
- If GPU utilization is above 90% or memory is near capacity, upgrade to a more powerful GPU
- Re-evaluate every time the workload changes (new model version, new data volume, new latency requirements)
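The right-sizing steps above can be sketched as a small decision function. This is a minimal illustration, not a definitive policy: the tier ordering in `GPU_TIERS`, the `WorkloadStats` fields, and the 90% memory ceiling are assumptions introduced for the example, while the 50% and 90% utilization thresholds come from the steps above.

```python
from dataclasses import dataclass

# Hypothetical ordering of GPU tiers from cheapest to most expensive,
# loosely based on the price ranges quoted earlier in this guide.
GPU_TIERS = ["T4", "L4", "A10G", "A100", "H100"]


@dataclass
class WorkloadStats:
    gpu: str                # current GPU type, must appear in GPU_TIERS
    avg_utilization: float  # average compute utilization, 0-100
    peak_memory_pct: float  # peak VRAM usage as a percentage, 0-100


def recommend_gpu(stats: WorkloadStats, memory_ceiling: float = 90.0) -> str:
    """Suggest a GPU tier using the utilization thresholds from the guide."""
    index = GPU_TIERS.index(stats.gpu)
    # Upgrade when compute is saturated or memory is near capacity.
    if stats.avg_utilization > 90 or stats.peak_memory_pct >= memory_ceiling:
        return GPU_TIERS[min(index + 1, len(GPU_TIERS) - 1)]
    # Downgrade when less than half of the GPU's compute is used.
    # A real check should also confirm the model fits in the smaller GPU's VRAM.
    if stats.avg_utilization < 50:
        return GPU_TIERS[max(index - 1, 0)]
    return stats.gpu


print(recommend_gpu(WorkloadStats(gpu="A100", avg_utilization=23, peak_memory_pct=40)))
```

Checking the upgrade condition first is a deliberate choice: a workload with low compute utilization but nearly full memory should not be moved to a smaller GPU.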
Cluster Architecture
Cloud-Based Clusters
Most agencies run GPU workloads in the cloud rather than managing physical hardware. The key architectural decisions are provider selection, instance management, and networking.
Multi-cloud strategy:
Running workloads across multiple cloud providers provides access to more GPU availability (specific GPU types are often sold out on individual providers), negotiating leverage, and resilience against provider-specific outages.
Practical multi-cloud for agencies:
- Use one primary provider for most workloads (minimize operational complexity)
- Use a secondary provider as overflow for peak demand or when specific GPU types are unavailable
- Use Kubernetes (EKS, GKE, AKS) as an abstraction layer to reduce provider lock-in
- Standardize on container-based workloads that run identically across providers
Kubernetes for GPU Workloads
Kubernetes has become the standard orchestration platform for GPU workloads, providing scheduling, scaling, and management capabilities that are essential for multi-tenant GPU clusters.
Kubernetes GPU configuration:
- NVIDIA GPU Operator: Automates the management of NVIDIA GPU drivers, container runtime, and device plugins on Kubernetes nodes. Install this first.
- GPU resource requests: Request GPUs in pod specifications through the nvidia.com/gpu resource limit. Kubernetes will schedule pods only on nodes with available GPUs.
- GPU sharing (MPS or time-slicing): Allow multiple pods to share a single GPU when workloads do not fully utilize GPU resources. MPS (Multi-Process Service) provides concurrent GPU access; time-slicing provides sequential access. Use MPS for inference workloads that underutilize the GPU.
- Node pools: Create separate node pools for different GPU types. Training workloads run on high-end GPU node pools; inference workloads run on cost-efficient GPU node pools.
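As an illustration of GPU resource requests and node pools, the sketch below builds a pod manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The `nvidia.com/gpu` resource name is the one exposed by the NVIDIA device plugin; the `gpu-pool` node label, the pod name, and the image are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int, node_pool: str) -> dict:
    """Build a pod manifest that requests whole GPUs on a given node pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Hypothetical node label; use whatever label your node pools carry.
            "nodeSelector": {"gpu-pool": node_pool},
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # GPUs are requested through the extended resource
                    # advertised by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest(
    name="bert-finetune",
    image="registry.example.com/train:latest",  # placeholder image
    gpus=1,
    node_pool="training-a10g",
)
print(json.dumps(manifest, indent=2))
```

Generating manifests from code like this keeps the GPU count and node pool as explicit parameters, which makes it easier to apply the same workload to a training pool or an inference pool.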
Namespace isolation for multi-tenant agencies:
- Create a Kubernetes namespace per client
- Apply resource quotas per namespace to prevent one client's workloads from consuming all GPU resources
- Use network policies to isolate traffic between client namespaces
- Implement RBAC (role-based access control) so team members can only access their assigned client namespaces
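A minimal sketch of the per-client objects described above, again as Python dictionaries that can be serialized to JSON for kubectl. The `client-<name>` naming scheme and object names are assumptions for illustration. Kubernetes quotas on extended resources such as GPUs use the `requests.` prefix, and the network policy shown allows ingress only from pods in the same namespace. RBAC bindings are omitted to keep the example short.

```python
import json


def client_namespace_objects(client: str, max_gpus: int) -> list:
    """Return a Namespace, a GPU ResourceQuota, and a same-namespace-only NetworkPolicy."""
    namespace = f"client-{client}"  # hypothetical naming convention
    return [
        {
            "apiVersion": "v1",
            "kind": "Namespace",
            "metadata": {"name": namespace, "labels": {"client": client}},
        },
        {
            "apiVersion": "v1",
            "kind": "ResourceQuota",
            "metadata": {"name": "gpu-quota", "namespace": namespace},
            # Caps the total number of GPUs that pods in this namespace can request.
            "spec": {"hard": {"requests.nvidia.com/gpu": str(max_gpus)}},
        },
        {
            "apiVersion": "networking.k8s.io/v1",
            "kind": "NetworkPolicy",
            "metadata": {"name": "same-namespace-only", "namespace": namespace},
            "spec": {
                "podSelector": {},  # applies to every pod in the namespace
                "policyTypes": ["Ingress"],
                # An empty podSelector under "from" matches pods in this namespace only.
                "ingress": [{"from": [{"podSelector": {}}]}],
            },
        },
    ]


for obj in client_namespace_objects("acme", max_gpus=4):
    print(json.dumps(obj, indent=2))
```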
Job Scheduling
GPU job scheduling determines which workloads run when and on which hardware. Good scheduling maximizes GPU utilization and minimizes wait times.
Scheduling strategies:
- Priority-based scheduling: Assign priority levels to workloads. Production inference gets the highest priority, client-facing training jobs get medium priority, internal experiments get low priority. High-priority jobs preempt low-priority jobs when resources are scarce.
- Fair-share scheduling: Allocate GPU resources proportionally across clients or teams. Each client is guaranteed a minimum share of GPU resources, with unused capacity redistributed to other clients.
- Bin-packing: Schedule multiple small workloads on the same GPU to maximize utilization. This is particularly effective for inference workloads that individually use less than 50% of a GPU.
- Gang scheduling: Schedule all pods of a distributed training job simultaneously. This prevents deadlocks where some pods are scheduled but the job cannot start because other pods are waiting for resources.
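To make the bin-packing idea concrete, here is a sketch using the first-fit-decreasing heuristic, one common approach that real schedulers may or may not use. Each workload's demand is expressed as a percentage of one GPU; the workload names and demand figures are made up for the example.

```python
def pack_workloads(demands: dict, capacity: int = 100) -> list:
    """Pack workloads onto GPUs with first-fit decreasing.

    demands maps a workload name to the percentage of one GPU it needs.
    Returns a list of GPUs, each a list of the workload names placed on it.
    """
    gpus = []  # workload names per GPU
    free = []  # remaining capacity per GPU, in percent
    # Place the largest workloads first; ties keep their input order.
    for name, demand in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        if demand > capacity:
            raise ValueError(f"{name} needs more than one GPU")
        for i, remaining in enumerate(free):
            if demand <= remaining:
                gpus[i].append(name)
                free[i] -= demand
                break
        else:
            # No existing GPU has room, so open a new one.
            gpus.append([name])
            free.append(capacity - demand)
    return gpus


# Hypothetical inference services and their GPU needs in percent.
services = {"sentiment": 40, "ner": 30, "ocr": 30, "ranker": 50, "tagger": 20}
for index, placed in enumerate(pack_workloads(services)):
    print(f"GPU {index}: {placed}")
```

In this example five services that would otherwise occupy five GPUs fit on two. The sketch considers compute share only; a real placement also has to respect GPU memory limits and latency isolation.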
Tools for GPU job scheduling:
- Volcano: A Kubernetes-native batch scheduling system designed for GPU workloads. Supports gang scheduling, fair-share scheduling, and priority-based preemption.
- Kueue: A newer Kubernetes-native job queueing system, developed as a Kubernetes SIG project with Google among its main contributors, that provides resource quotas, priority-based scheduling, and multi-tenant fairness.
- Run:ai: A commercial GPU scheduling platform that provides advanced scheduling, GPU sharing, and utilization optimization.
Cost Optimization
Spot Instance Strategy
Spot instances (AWS) or preemptible instances (GCP) provide GPU capacity at a 60-80% discount, with the caveat that the provider can reclaim instances on short notice (about two minutes on AWS and about 30 seconds on GCP).
Making spot instances work for training:
- Checkpoint frequently: Save model checkpoints every 15-30 minutes. When a spot instance is reclaimed, resume training from the most recent checkpoint.
- Use a spot-friendly training framework: PyTorch Lightning, Hugging Face Transformers, and other modern training frameworks support automatic checkpoint saving and resumption.
- Diversify instance types: Request multiple GPU instance types in your spot request. This increases the probability of getting capacity and reduces the frequency of interruptions.
- Mix spot and on-demand: Use spot for the majority of training compute (80-90%), keep coordination nodes on on-demand instances, and write checkpoints to durable storage that survives instance reclamation.
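The checkpoint-and-resume pattern above can be sketched with the standard library alone. This is a simplified stand-in for a real training loop: the `train` function, the `stop_at` parameter that simulates a spot reclaim, and the step-based checkpoint interval are all assumptions for illustration. Production code would checkpoint on a time interval, save real model and optimizer state, and write to durable storage.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically so a reclaim mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the latest checkpoint, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "samples_seen": 0}


def train(path: str, total_steps: int, checkpoint_every: int,
          stop_at: Optional[int] = None) -> dict:
    """Run a toy training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] >= stop_at:
            break  # simulates the instance being reclaimed
        state["samples_seen"] += 32  # placeholder for a real training step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    train(ckpt, total_steps=100, checkpoint_every=10, stop_at=37)  # interrupted run
    print("resuming from step", load_checkpoint(ckpt)["step"])
    print("finished at step", train(ckpt, total_steps=100, checkpoint_every=10)["step"])
```

The interrupted run loses only the steps since the last checkpoint, which is why the checkpoint interval sets the upper bound on wasted compute per interruption.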
Spot instances are NOT suitable for:
- Production inference (service availability is paramount)
- Short training jobs (the overhead of checkpointing and resumption outweighs the savings)
- Interactive development and debugging
Reserved Instance Planning
For predictable, sustained GPU usage, reserved instances provide significant savings.
How to plan reserved capacity:
- Analyze GPU usage over the past 3-6 months
- Identify the baseline: the minimum GPU usage that occurs consistently
- Purchase reserved capacity to cover the baseline
- Use on-demand or spot instances for usage above the baseline
Example:
- Agency uses an average of 20 GPUs continuously for inference across all clients
- Peak training demand adds 30 additional GPUs, but only 40% of the time
- Reserve 20 inference GPUs (30-40% savings on the stable baseline)
- Use spot instances for training demand (60-80% savings on the variable component)
- Total savings compared to all on-demand: 45-55%
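The arithmetic behind an example like this can be checked with a short calculation. The sketch below is an approximation under stated assumptions: costs are measured in on-demand GPU-hour units, and `burst_price_ratio` (a hypothetical parameter) expresses how much more a training GPU costs per hour than an inference GPU. The function name and parameters are illustrative.

```python
def blended_savings(baseline_gpus: float, burst_gpus: float, burst_fraction: float,
                    reserved_discount: float, spot_discount: float,
                    burst_price_ratio: float = 1.0) -> float:
    """Fraction saved versus running everything on-demand.

    The baseline runs all the time on reserved capacity; the burst runs
    burst_fraction of the time on spot capacity.
    """
    baseline_cost = baseline_gpus                                  # on-demand cost units
    burst_cost = burst_gpus * burst_fraction * burst_price_ratio   # on-demand cost units
    on_demand_total = baseline_cost + burst_cost
    optimized_total = (baseline_cost * (1 - reserved_discount)
                       + burst_cost * (1 - spot_discount))
    return 1 - optimized_total / on_demand_total


# Figures from the example: 20 reserved GPUs, 30 burst GPUs active 40% of the time.
low = blended_savings(20, 30, 0.4, reserved_discount=0.30, spot_discount=0.60)
high = blended_savings(20, 30, 0.4, reserved_discount=0.40, spot_discount=0.80)
print(f"savings range with equal GPU prices: {low:.1%} to {high:.1%}")
```

With equal per-GPU prices the result lands at roughly 41% to 55%. If the training GPUs cost more per hour than the inference GPUs, more of the bill falls under the larger spot discount and the blended savings rise.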
GPU Utilization Optimization
Monitor utilization metrics:
- GPU compute utilization: Percentage of time the GPU's compute units are active. Target: above 60% for training, above 40% for inference (inference is bursty by nature).
- GPU memory utilization: Percentage of GPU VRAM in use. Target: 70-90% for training (leave headroom for activation-memory spikes and allocator fragmentation), 50-80% for inference.
- GPU memory bandwidth utilization: For memory-bound workloads, this may be the real bottleneck even when compute utilization appears low.
Optimization techniques:
- Batch size optimization: Increase batch size until GPU memory is 80-90% utilized. Larger batches improve compute utilization by reducing the ratio of overhead to useful computation.
- Model parallelism: For models that do not fit on a single GPU, split the model across multiple GPUs. This is necessary for training and serving large language models.
- Data parallelism: For models that fit on a single GPU, replicate the model across multiple GPUs and split the training data. Training throughput scales close to linearly with the number of GPUs, limited by the communication overhead of synchronizing gradients.
- Mixed-precision training: Use FP16 or BF16 precision for training. This roughly halves activation memory and can increase throughput by 50-100% on GPUs with tensor cores (V100 and newer; BF16 requires Ampere-generation or newer hardware such as the A100 or A10G). Total memory savings are smaller than half because a full-precision copy of the weights is usually kept.
- Gradient accumulation: Simulate larger batch sizes without increasing memory usage by accumulating gradients over multiple forward and backward passes before performing a parameter update.
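To show why gradient accumulation is equivalent to a larger batch, the sketch below compares the gradient of a full batch with the gradient accumulated over micro-batches for a toy one-parameter model. The model, data, and function names are invented for illustration; training frameworks implement the same idea by deferring the optimizer step.

```python
def mse_gradient(w: float, batch: list) -> float:
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w: float, batch: list, micro_batch_size: int) -> float:
    """Accumulate micro-batch gradients into the full-batch gradient."""
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        # Weight each micro-batch by its share of the full batch so the
        # sum equals the mean gradient over all samples.
        total += mse_gradient(w, micro) * len(micro) / len(batch)
    return total


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # toy (x, y) pairs
w = 1.5
print("full batch gradient: ", mse_gradient(w, data))
print("accumulated gradient:", accumulated_gradient(w, data, micro_batch_size=2))
```

Only one micro-batch of activations needs to be held in memory at a time, which is where the memory saving comes from. The cost is more forward and backward passes per parameter update.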
Per-Client Cost Tracking
For agencies serving multiple clients, tracking GPU costs per client is essential for accurate project profitability analysis and fair billing.
Implementation:
- Tag all GPU resources (instances, pods, jobs) with the client identifier
- Use cloud provider cost allocation tags to attribute costs to specific clients
- For shared infrastructure (Kubernetes control plane, shared storage), allocate costs proportionally based on GPU usage
- Generate monthly cost reports per client showing training costs, inference costs, and storage costs
- Compare actual costs to project budgets and flag projects that are exceeding their GPU budget
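A minimal sketch of the allocation logic described above, assuming usage has already been exported as records tagged with a client identifier. The record fields (`client`, `gpu_hours`, `hourly_rate`), the client names, and the figures are hypothetical. Shared costs are split in proportion to each client's GPU hours.

```python
from collections import defaultdict


def client_cost_report(usage_records: list, shared_cost: float) -> dict:
    """Attribute direct GPU costs per client and split shared costs by GPU hours."""
    direct = defaultdict(float)
    gpu_hours = defaultdict(float)
    for record in usage_records:
        direct[record["client"]] += record["gpu_hours"] * record["hourly_rate"]
        gpu_hours[record["client"]] += record["gpu_hours"]

    total_hours = sum(gpu_hours.values())
    report = {}
    for client, direct_cost in direct.items():
        share = gpu_hours[client] / total_hours if total_hours else 0.0
        shared = shared_cost * share
        report[client] = {
            "direct": round(direct_cost, 2),
            "shared": round(shared, 2),
            "total": round(direct_cost + shared, 2),
        }
    return report


# Hypothetical tagged usage for one month.
records = [
    {"client": "acme", "gpu_hours": 100, "hourly_rate": 3.00},   # training
    {"client": "acme", "gpu_hours": 50, "hourly_rate": 0.50},    # inference
    {"client": "globex", "gpu_hours": 50, "hourly_rate": 1.00},  # fine-tuning
]
for client, costs in client_cost_report(records, shared_cost=400.0).items():
    print(client, costs)
```

Splitting shared costs by GPU hours is one reasonable convention; allocating by direct spend instead would shift more of the shared cost to clients on expensive hardware.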
Monitoring and Observability
GPU Metrics Collection
Essential GPU metrics:
- GPU utilization (compute)
- GPU memory usage (allocated and peak)
- GPU temperature
- GPU power consumption
- GPU clock speed (throttling detection)
- PCIe bandwidth utilization
- NVLink bandwidth utilization (for multi-GPU systems)
Collection tools:
- NVIDIA DCGM (Data Center GPU Manager): The standard tool for collecting GPU metrics. Integrates with Prometheus, Grafana, and cloud monitoring services.
- nvidia-smi: Command-line tool for real-time GPU monitoring. Useful for debugging but not suitable for production monitoring at scale.
- GPU Operator metrics: When using the NVIDIA GPU Operator on Kubernetes, GPU metrics are automatically exposed as Prometheus metrics.
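For quick debugging on a single node, nvidia-smi can emit machine-readable CSV. The sketch below parses that output into dictionaries; the helper names are illustrative, and `read_gpu_metrics` only works on a machine with NVIDIA drivers installed, so the example exercises the parser on a sample string instead. For production monitoring, DCGM with Prometheus remains the better fit.

```python
import csv
import io
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total",
                "temperature.gpu", "power.draw"]


def _to_number(value: str):
    """Convert a CSV cell to float, or None for values like [N/A]."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text: str) -> list:
    """Parse output of nvidia-smi --format=csv,noheader,nounits into dicts."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        values = [cell.strip() for cell in row]
        metrics = dict(zip(QUERY_FIELDS, (_to_number(v) for v in values)))
        metrics["index"] = int(values[0])
        rows.append(metrics)
    return rows


def read_gpu_metrics() -> list:
    """Query the local GPUs; requires nvidia-smi on the PATH."""
    command = ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}",
               "--format=csv,noheader,nounits"]
    output = subprocess.run(command, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(output)


# Sample text in the same shape nvidia-smi produces (values are made up).
sample = "0, 23, 4096, 15360, 54, 61.20\n1, 0, 3, 15360, 41, [N/A]\n"
for gpu in parse_gpu_csv(sample):
    print(gpu)
```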
Alerting Rules
Critical alerts (page on-call immediately):
- GPU utilization drops to 0% on an inference node (model serving may have crashed)
- GPU temperature exceeds 90°C (risk of hardware damage or throttling)
- GPU memory OOM (out-of-memory) errors on inference nodes
- Inference latency exceeds 5x the p95 baseline
Warning alerts (notify team during business hours):
- Average GPU utilization below 20% for more than 1 hour (resource waste)
- GPU utilization above 95% sustained for more than 30 minutes (may need more capacity)
- Spot instance interruption rate exceeds 20% in a 24-hour period
- Training job has not produced a checkpoint in more than 2 hours
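The alert rules above can be expressed as a plain evaluation function before they are translated into a monitoring system's rule syntax. The metric names in the `sample` dictionary are hypothetical and would need to be mapped to whatever your metrics pipeline exports; the thresholds are the ones listed above.

```python
def evaluate_alerts(sample: dict) -> list:
    """Return (severity, message) pairs for every rule the sample violates."""
    alerts = []
    is_inference = sample.get("role") == "inference"

    # Critical: page on-call immediately.
    if is_inference and sample.get("gpu_utilization", 100) == 0:
        alerts.append(("critical", "inference GPU at 0% utilization"))
    if sample.get("temperature_c", 0) > 90:
        alerts.append(("critical", "GPU temperature above 90 degrees C"))
    if is_inference and sample.get("oom_errors", 0) > 0:
        alerts.append(("critical", "GPU out-of-memory errors on inference node"))
    baseline = sample.get("p95_baseline_ms")
    if baseline and sample.get("latency_ms", 0) > 5 * baseline:
        alerts.append(("critical", "inference latency above 5x p95 baseline"))

    # Warning: notify the team during business hours.
    if sample.get("avg_utilization_1h", 100) < 20:
        alerts.append(("warning", "average utilization below 20% for 1 hour"))
    if sample.get("min_utilization_30m", 0) > 95:
        alerts.append(("warning", "utilization above 95% for 30 minutes"))
    if sample.get("spot_interruption_rate_24h", 0) > 0.20:
        alerts.append(("warning", "spot interruption rate above 20% in 24 hours"))
    if sample.get("minutes_since_checkpoint", 0) > 120:
        alerts.append(("warning", "no checkpoint in more than 2 hours"))
    return alerts


node = {"role": "inference", "gpu_utilization": 0, "temperature_c": 93}
for severity, message in evaluate_alerts(node):
    print(f"[{severity}] {message}")
```

Missing metrics default to values that do not trigger a rule, so a node that reports only a subset of metrics does not produce false alerts.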
Capacity Planning
Capacity planning process:
- Review current GPU usage patterns across all clients
- Identify upcoming demand changes (new clients onboarding, existing clients scaling, model retraining schedules)
- Forecast GPU demand for the next 1-3 months
- Compare forecast to current capacity and reserved instance commitments
- Adjust reserved instances, spot instance budgets, and autoscaling limits to match forecast
- Review and adjust monthly
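The comparison step in this process reduces to simple arithmetic, sketched below. The function and field names are hypothetical, and the forecast is a plain sum of expected changes rather than a statistical model.

```python
def capacity_gap(current_gpus: int, expected_changes: list,
                 reserved_gpus: int, autoscale_limit: int) -> dict:
    """Compare a simple GPU demand forecast with reserved capacity and scaling limits."""
    forecast = current_gpus + sum(expected_changes)
    return {
        "forecast_gpus": forecast,
        # Demand above the reservation must come from spot or on-demand capacity.
        "spot_or_on_demand_gpus": max(forecast - reserved_gpus, 0),
        # Reserved capacity that the forecast would leave idle.
        "unused_reserved_gpus": max(reserved_gpus - forecast, 0),
        # Demand the autoscaler would refuse unless the limit is raised.
        "gpus_over_autoscale_limit": max(forecast - autoscale_limit, 0),
    }


# Hypothetical quarter: one new client (+8 GPUs), one scaling up (+6), one winding down (-4).
print(capacity_gap(current_gpus=32, expected_changes=[8, 6, -4],
                   reserved_gpus=20, autoscale_limit=40))
```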
Security and Compliance
GPU Workload Security
Container security for GPU workloads:
- Use minimal base images with only the required CUDA and cuDNN libraries
- Scan container images for vulnerabilities before deployment
- Do not run GPU containers as root; use non-root users with only the necessary permissions
- Apply network policies to restrict GPU pod communication to only required services
Data security:
- Protect data in GPU memory with hardware-backed confidential computing where it is available (H100 supports it; confirm support for other GPUs, including A100, with your provider)
- Encrypt data in transit between GPU nodes using TLS
- For multi-tenant clusters, ensure GPU memory is cleared between workloads (use NVIDIA MIG or container isolation)
- Do not store client data on GPU instance local storage; use network-attached encrypted storage
Access control:
- Implement multi-factor authentication for accessing GPU infrastructure
- Use short-lived credentials for programmatic access
- Audit all access to GPU resources and configuration changes
- Implement least-privilege access: team members should only have access to the GPU resources they need
Compliance Considerations
For agencies serving regulated industries (healthcare, finance, government), GPU infrastructure must meet compliance requirements.
Common compliance requirements:
- Data residency: GPU workloads must run in specific geographic regions. Verify that GPU instance types are available in the required regions.
- Audit logging: All GPU resource provisioning, access, and configuration changes must be logged. Use cloud provider audit trails (CloudTrail, Cloud Audit Logs).
- Encryption: Data must be encrypted at rest and in transit. Use cloud-managed encryption keys or customer-managed keys depending on the requirement.
- Isolation: Client workloads must be isolated from each other. Use dedicated nodes, network isolation, and separate storage volumes per client.
Your Next Step
Run a GPU utilization audit this week. For every GPU instance your agency is running, record the average GPU utilization, memory utilization, and the workload type (training or inference). Identify every instance with average utilization below 30%. For each underutilized instance, determine whether the workload can be moved to a smaller GPU, consolidated with another workload through GPU sharing, or scheduled on spot instances. Implement the easiest three optimizations. Based on the patterns we see across agencies, this audit will identify at least 30% in GPU cost savings, which go directly to your bottom line. GPU cost optimization is not glamorous work, but it is the kind of operational discipline that turns a breakeven agency into a profitable one.