85 Milliseconds Was Costing This Ad Tech Firm $180K a Month

Agency Script Editorial

Editorial Team

March 21, 2026 · 14 min read

An ad-tech company had a real-time bidding model that needed to return predictions in under 10 milliseconds. Their initial deployment served predictions in 85 milliseconds, perfectly fine for a dashboard but eight times too slow for real-time bidding. They were losing $180,000 per month in missed bid opportunities because their model could not keep up. An AI agency optimized their inference pipeline through model quantization, ONNX Runtime compilation, batching optimization, and infrastructure right-sizing. The result: p99 latency dropped to 7 milliseconds. Throughput increased from 1,200 to 14,000 predictions per second on the same hardware. Monthly infrastructure cost decreased by 60 percent because they no longer needed the oversized GPU instances they had been using to compensate for inefficient serving. The optimization engagement cost $95,000 and generated $2.16 million in annual value from recovered revenue and reduced infrastructure costs.

Inference optimization is where the rubber meets the road for production AI. A model that takes two seconds to respond is useless for real-time applications. A model that costs $0.05 per prediction is uneconomical at scale. Your agency's ability to optimize inference is a critical differentiator.

The Inference Optimization Stack

Level 1: Model-Level Optimization

Optimizations applied to the model itself before deployment.

Quantization. Reduce the numerical precision of model weights and activations.

  • FP32 to FP16: Halves memory usage, nearly doubles throughput on GPU. Accuracy loss is typically less than 0.1 percent. This should be the default for all GPU inference.
  • FP16 to INT8: Halves memory again, significant throughput increase on hardware with INT8 support. Accuracy loss is typically 0.5 to 2 percent. Requires calibration with representative data.
  • INT4 and lower: Aggressive quantization for LLMs (GPTQ, AWQ, and the quantized GGML/GGUF file formats). Enables running large models on smaller hardware. Accuracy loss is typically 1 to 5 percent, but it enables deployment scenarios that would otherwise be impossible.
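To make the precision trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. The weights are made-up values and the function names are illustrative; production toolchains typically use per-channel scales and calibration data, which this sketch leaves out.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


weights = [0.82, -1.27, 0.003, 0.45, -0.66]  # made-up example weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The rounding error per weight is bounded by half the scale, so a tensor with a few large outliers gets a coarse scale and loses more precision on its small values. That is one reason calibration with representative data matters.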

Distillation. Train a smaller student model to mimic the behavior of the larger teacher model. The student model is cheaper and faster to serve while retaining most of the teacher's performance.

  • Typically 3x to 10x reduction in model size
  • 1 to 5 percent accuracy loss depending on the complexity gap
  • Best for high-volume inference where per-prediction cost is critical

Pruning. Remove redundant parameters from the model.

  • Unstructured pruning: Remove individual weights. Produces sparse models that require specialized hardware or software for acceleration.
  • Structured pruning: Remove entire neurons, channels, or attention heads. Produces dense models that accelerate on standard hardware. Typically 30 to 50 percent size reduction with under 2 percent accuracy loss.
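As a rough illustration of unstructured pruning, the sketch below zeroes out the smallest-magnitude weights in a flat list. The weights and the `magnitude_prune` helper are hypothetical; real pruning is usually applied per layer and followed by fine-tuning to recover accuracy.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


layer = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.6]  # made-up weights
sparse_layer = magnitude_prune(layer, sparsity=0.5)
print(sparse_layer)
```

The result has the same shape as the input, only with zeros scattered through it. That is why unstructured sparsity needs sparse-aware kernels or hardware before it translates into speed.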

Architecture optimization. Use efficient model architectures designed for inference.

  • MobileNet, EfficientNet for vision tasks
  • DistilBERT, TinyBERT, MiniLM for NLP tasks
  • Speculative decoding for LLM inference acceleration

Level 2: Runtime Optimization

Optimizations in how the model is executed.

Model compilation. Convert the model to an optimized format for the target hardware.

  • ONNX Runtime: Cross-platform inference engine. PyTorch or TensorFlow models are first exported to ONNX format (for example with the framework's ONNX exporter), and ONNX Runtime then applies graph optimizations at load time. Typically 1.5x to 3x speedup over native frameworks.
  • TensorRT: NVIDIA-specific optimization for GPU inference. Applies layer fusion, kernel selection, and precision calibration. Typically 2x to 5x speedup on NVIDIA GPUs.
  • TVM (Apache TVM): Hardware-agnostic compiler that optimizes models for any hardware target. Particularly useful for edge deployment.
  • vLLM: A serving engine rather than a compiler, but it fills the same role for LLMs. It uses PagedAttention for efficient memory management, which dramatically improves LLM serving throughput and reduces per-token cost.

Batching. Process multiple predictions simultaneously to utilize hardware efficiently.

  • Static batching: Collect a fixed number of requests and process them together. Simple but adds latency as requests wait for the batch to fill.
  • Dynamic batching: Collect requests over a short time window (1 to 10 milliseconds) and process whatever has accumulated. Balances throughput and latency.
  • Continuous batching (for LLMs): Process requests at the token level rather than the sequence level. Requests enter and exit the batch independently, dramatically improving throughput for variable-length LLM outputs.
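The time-window idea behind dynamic batching can be sketched as an offline simulation over request arrival times. This is an assumption-laden toy: `dynamic_batches` and its parameters are hypothetical names, and a real server would close batches with a timer rather than waiting for the next arrival.

```python
def dynamic_batches(arrivals_ms, window_ms=5.0, max_batch=8):
    """Group request arrival times (in ms) into batches.

    A batch opens when its first request arrives and closes once the
    time window has elapsed or the batch is full.
    """
    batches, current = [], []
    for t in sorted(arrivals_ms):
        window_expired = current and t - current[0] >= window_ms
        batch_full = len(current) >= max_batch
        if current and (window_expired or batch_full):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


arrivals = [0.0, 1.2, 3.9, 6.5, 7.0, 20.0]  # made-up arrival times
batches = dynamic_batches(arrivals, window_ms=5.0, max_batch=4)
print(batches)
```

Widening the window produces larger batches and better hardware utilization, but every request in a batch waits up to `window_ms` before inference starts, so the window has to fit inside the latency budget.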

Caching. Avoid recomputing results for repeated inputs.

  • Exact match caching: Cache prediction results for identical inputs. Effective for applications with high input repetition.
  • KV cache (for LLMs): Cache key-value pairs from attention layers for previously seen tokens. Essential for efficient LLM inference.
  • Embedding caching: Cache intermediate embeddings for frequently seen inputs.
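A minimal sketch of exact-match caching, assuming inputs are JSON-serializable feature dictionaries. `PredictionCache` and `fake_model` are hypothetical stand-ins, and the cache here is unbounded; a production version would need eviction and invalidation when the model changes.

```python
import hashlib
import json


class PredictionCache:
    """Exact-match cache keyed on a canonical hash of the input features."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, features):
        # sort_keys makes the key independent of dictionary ordering.
        payload = json.dumps(features, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def predict(self, features):
        key = self._key(features)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._predict_fn(features)
        self._store[key] = result
        return result


def fake_model(features):
    """Hypothetical stand-in for an expensive model call."""
    return sum(features.values()) / len(features)


cache = PredictionCache(fake_model)
first = cache.predict({"bid_floor": 0.5, "ctr": 0.02})
second = cache.predict({"ctr": 0.02, "bid_floor": 0.5})  # same input, different key order
print(cache.hits, cache.misses)
```

Tracking hits and misses from the start makes it easy to check whether input repetition is high enough for the cache to pay for itself.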

Level 3: Infrastructure Optimization

Optimizations in the serving infrastructure.

Hardware selection. Match the hardware to the workload.

  • CPU inference: For models under 100M parameters with moderate throughput requirements. Cost-effective and widely available. Best with ONNX Runtime or TVM optimization.
  • GPU inference: For larger models, high throughput requirements, or latency-critical applications. Use T4 or L4 for cost-effective inference, A10G for balanced performance, A100 or H100 for maximum throughput.
  • Specialized accelerators: AWS Inferentia, Google TPU, Intel Gaudi for specific workloads. Can provide better price-performance than GPUs for supported model architectures.

Autoscaling. Scale serving infrastructure based on demand.

  • Reactive autoscaling: Add or remove instances based on current load (CPU utilization, request queue depth, latency). Standard approach but reacts to load after it arrives.
  • Predictive autoscaling: Use historical traffic patterns to provision capacity before demand arrives. Reduces latency spikes during traffic ramps.
  • Scale-to-zero: For low-traffic models, scale down to zero instances during idle periods. Eliminates waste but adds cold-start latency for the first request.
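Reactive autoscaling usually comes down to a proportional rule: scale the replica count by the ratio of the observed metric to its target. The sketch below assumes a single metric and hypothetical bounds; it follows the same proportional idea as the Kubernetes Horizontal Pod Autoscaler, but it is not that implementation.

```python
import math


def desired_replicas(current_replicas, observed, target,
                     min_replicas=1, max_replicas=20):
    """Proportional scaling: replicas grow with observed / target, within bounds."""
    raw = math.ceil(current_replicas * observed / target)
    return max(min_replicas, min(max_replicas, raw))


# Metric here is average queue depth per replica (hypothetical numbers).
scale_up = desired_replicas(current_replicas=4, observed=90, target=60)
scale_down = desired_replicas(current_replicas=4, observed=15, target=60)
capped = desired_replicas(current_replicas=10, observed=300, target=60)
print(scale_up, scale_down, capped)
```

Because the rule only sees load that has already arrived, real policies typically add a stabilization window or cooldown to avoid flapping.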

Multi-model serving. Run multiple models on the same infrastructure.

  • Model multiplexing: Load multiple models into GPU memory and serve them from the same instance. Efficient for organizations with many small models.
  • Model swapping: Load models on demand, swapping them in and out of GPU memory. Useful for organizations with many models that are accessed infrequently.
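Model swapping can be approximated as a least-recently-used cache with a memory budget. The sketch below is illustrative only: `ModelPool`, the loader, and the model sizes are hypothetical, and it ignores load time, which is the main cost of swapping in practice.

```python
from collections import OrderedDict


class ModelPool:
    """Keeps models resident up to a memory budget, evicting the least recently used."""

    def __init__(self, budget_mb, loader):
        self.budget_mb = budget_mb
        self.loader = loader            # callable: name -> (model, size_mb)
        self.resident = OrderedDict()   # name -> (model, size_mb), oldest first

    def used_mb(self):
        return sum(size for _, size in self.resident.values())

    def get(self, name):
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return self.resident[name][0]
        model, size_mb = self.loader(name)
        # Evict least recently used models until the new one fits.
        while self.resident and self.used_mb() + size_mb > self.budget_mb:
            self.resident.popitem(last=False)
        self.resident[name] = (model, size_mb)
        return model


SIZES_MB = {"fraud": 400, "ranker": 600, "churn": 500}  # made-up sizes


def fake_loader(name):
    return f"<{name} model>", SIZES_MB[name]


pool = ModelPool(budget_mb=1100, loader=fake_loader)
for requested in ["fraud", "ranker", "fraud", "churn"]:
    pool.get(requested)
print(list(pool.resident), pool.used_mb())
```

In this run "ranker" is evicted rather than "fraud", because "fraud" was requested again just before "churn" needed room.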

Inference Optimization for Edge Deployment

Edge deployment, running models on devices at the network edge rather than in the cloud, introduces unique optimization requirements.

Model size constraints. Edge devices have limited memory (often 2-8 GB RAM). Models must be aggressively optimized through quantization (INT8 or INT4), pruning, and distillation to fit within device memory constraints. A model that runs comfortably on a cloud GPU with 80 GB memory may need to be reduced by 10x or more for edge deployment.

Compute constraints. Edge devices often lack GPUs and rely on CPU, NPU (Neural Processing Unit), or specialized AI accelerators. Optimize models for the specific hardware available on the target device. Use framework-specific optimizations: TensorFlow Lite for mobile devices, Core ML for Apple devices, ONNX Runtime for cross-platform deployment.

Network considerations. Edge inference eliminates network latency (no round-trip to a cloud server) but also eliminates the ability to update models instantly. Design model update mechanisms that can push new model versions to edge devices reliably, with rollback capability if the new model performs poorly.

Batch processing on edge. Cloud inference is usually driven by individual online requests, whereas edge inference often processes batches of data collected over time (sensor readings, camera frames, log files). In those cases, optimize for batch processing efficiency rather than single-request latency.

Common Inference Optimization Mistakes

Mistake 1: Optimizing before measuring. Teams apply quantization or compilation without first profiling to understand where time is actually spent. If preprocessing consumes 60 percent of total time, optimizing the model (the other 40 percent) has limited impact. Always profile first.
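The 60/40 split above is an instance of Amdahl's law: the overall speedup is limited by the fraction of time you do not optimize. A short sketch, assuming a hypothetical 4x speedup on whichever stage is optimized:

```python
def overall_speedup(optimized_fraction, stage_speedup):
    """Amdahl's law: end-to-end speedup when one stage gets faster."""
    remaining = 1.0 - optimized_fraction
    return 1.0 / (remaining + optimized_fraction / stage_speedup)


# Model inference is 40 percent of total time, preprocessing is 60 percent.
model_only = overall_speedup(optimized_fraction=0.40, stage_speedup=4.0)
preprocessing_only = overall_speedup(optimized_fraction=0.60, stage_speedup=4.0)
print(round(model_only, 2), round(preprocessing_only, 2))
```

Making the model 4x faster yields roughly a 1.43x end-to-end gain, while the same effort spent on preprocessing yields roughly 1.82x. Profiling tells you which of the two you are buying.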

Mistake 2: Over-optimizing and losing accuracy. Aggressive INT4 quantization or heavy pruning can degrade model accuracy significantly. A model that serves predictions 3x faster but is 10 percent less accurate may provide worse business outcomes. Always validate accuracy after optimization.

Mistake 3: Benchmarking on unrealistic inputs. Testing with uniform input sizes when production inputs vary dramatically. A text model that processes 50-token inputs in benchmarks but receives 5,000-token inputs in production will have very different real-world performance. Use production-representative input distributions for benchmarking.

Mistake 4: Ignoring cold start. Some optimization techniques add significant startup time: compiled engines may need to be built or loaded, and caches and batching queues need warm-up traffic before they reach steady-state performance. If the model is frequently cold-started (scale-to-zero, container restarts), the cold-start penalty may outweigh the steady-state optimization gains. Factor cold-start into the overall performance assessment.

Delivery Process

Phase 1: Profiling and Benchmarking (Weeks 1-2)

  • Profile the current inference pipeline to identify bottlenecks
  • Benchmark baseline performance (latency, throughput, cost per prediction)
  • Define optimization targets (target latency, target throughput, target cost)
  • Identify the highest-impact optimization opportunities

Phase 2: Model Optimization (Weeks 3-6)

  • Apply quantization and benchmark impact on accuracy and speed
  • Compile model with ONNX Runtime or TensorRT
  • Evaluate distillation if cost targets require smaller models
  • Benchmark optimized model against targets

Phase 3: Serving Optimization (Weeks 7-10)

  • Implement batching strategy
  • Configure caching
  • Optimize serving framework configuration
  • Implement autoscaling policies

Phase 4: Infrastructure Optimization (Weeks 11-14)

  • Right-size hardware based on optimized model requirements
  • Implement multi-model serving if applicable
  • Deploy monitoring for inference performance
  • Load test at 2x to 5x expected production volume
  • Deploy to production and validate performance

Optimization Case Studies by Model Type

Computer Vision Models

Computer vision models (image classification, object detection, segmentation) typically benefit most from:

  1. TensorRT compilation: 3x to 5x speedup on NVIDIA GPUs. Computer vision models have many convolution operations that TensorRT optimizes aggressively.
  2. INT8 quantization with calibration: 2x additional speedup with less than 1 percent accuracy loss for most vision tasks.
  3. Input resolution optimization: Many vision models are served with higher input resolution than necessary. Reducing from 1024x1024 to 512x512 can provide 4x speedup with minimal accuracy impact for many tasks.
  4. Batch inference: Vision models scale very well with batching. Processing 32 images in a batch is typically 10x to 15x more efficient than processing them individually.

NLP and Transformer Models

Transformer-based models (BERT, RoBERTa, DeBERTa) benefit from:

  1. ONNX Runtime with graph optimization: 1.5x to 3x speedup from operation fusion and graph optimizations.
  2. Dynamic sequence length padding: Instead of padding all inputs to the maximum length, pad to the longest input in each batch. This eliminates wasted computation on padding tokens.
  3. Distillation: DistilBERT retains about 97 percent of BERT's language-understanding performance with 40 percent fewer parameters and 60 percent faster inference.
  4. Quantization: INT8 quantization works well for most NLP tasks with less than 0.5 percent accuracy loss.
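The saving from dynamic sequence length padding is easy to estimate from a sample of input lengths. The sketch below counts the tokens the model would process under each strategy; the lengths, batch size, and `padded_tokens` helper are made up for illustration.

```python
def padded_tokens(lengths, batch_size, pad_to=None):
    """Total tokens processed when each batch is padded to a common width.

    With pad_to set, every batch is padded to that fixed length.
    Otherwise each batch is padded to its own longest sequence.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        width = pad_to if pad_to is not None else max(batch)
        total += width * len(batch)
    return total


lengths = [12, 18, 25, 40, 64, 70, 300, 512]  # made-up token counts
real_tokens = sum(lengths)
fixed = padded_tokens(lengths, batch_size=4, pad_to=512)
dynamic = padded_tokens(lengths, batch_size=4)
print(real_tokens, fixed, dynamic)
```

Here the lengths happen to be sorted, so short inputs share a batch. Grouping inputs of similar length in this way tends to reduce padding further, at the cost of reordering requests.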

Large Language Models

LLMs require specialized optimization techniques:

  1. vLLM or TensorRT-LLM: Purpose-built serving frameworks that implement PagedAttention, continuous batching, and tensor parallelism. 3x to 8x throughput improvement over naive serving.
  2. KV cache optimization: Efficient management of the key-value cache is critical for LLM performance. The vLLM authors report that PagedAttention cuts KV cache memory waste from roughly 60 to 80 percent in earlier serving systems to under 4 percent.
  3. Quantization (GPTQ, AWQ): INT4 quantization reduces weight memory by roughly 4x relative to FP16, enabling larger models on smaller hardware.
  4. Speculative decoding: Use a small draft model to generate candidate tokens verified by the large model. 2x to 3x latency reduction for generation tasks.
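To see why KV cache management dominates LLM memory planning, it helps to size the cache directly. For standard attention, each token stores one key and one value vector per layer and per KV head. The model dimensions below are hypothetical, chosen to resemble a 7B-class model served in FP16.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """KV cache size: 2 tensors (key and value) per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical configuration: 32 layers, 32 KV heads, head dimension 128, FP16 values.
per_token = kv_cache_bytes(32, 32, 128, num_tokens=1)
per_sequence = kv_cache_bytes(32, 32, 128, num_tokens=4096)
print(per_token / 2**20, "MiB per token")
print(per_sequence / 2**30, "GiB per 4096-token sequence")
```

Under these assumptions a single 4,096-token sequence occupies 2 GiB of cache, so reserving the maximum length for every request up front quickly exhausts GPU memory. That over-reservation is the waste paged allocation is designed to avoid.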

Profiling Deep Dive: Identifying Hidden Bottlenecks

The most impactful optimization opportunities are often not where you expect them. Here are common hidden bottlenecks that profiling reveals.

Data loading. If the GPU is waiting for data to arrive from CPU memory, the GPU is idle during data transfer. Profile the data transfer time between CPU and GPU. Solutions: pin memory, use async data loading, preprocess data on GPU.

Tokenization and preprocessing. For NLP models, tokenization can consume 10 to 30 percent of total latency. Solutions: use fast tokenizers (the Rust-based implementations in the Hugging Face tokenizers library), batch tokenization, or cache tokenized inputs.

Post-processing. Output decoding, formatting, and business logic after inference can add significant latency. Solutions: optimize post-processing code, move post-processing to a separate async step, or cache common outputs.

Network latency. If the model is called as a service, network round-trip time adds to every prediction. Solutions: co-locate the model with the calling application, use gRPC instead of REST (lower overhead), or batch requests to amortize network cost.

Memory allocation. Frequent memory allocation and deallocation on GPU can cause fragmentation and slow down inference. Solutions: use memory pools, pre-allocate buffers, and avoid dynamic shapes where possible.

Measuring Optimization Success

Track these metrics before and after each optimization to quantify the improvement:

Latency metrics:

  • P50, P95, P99 latency at production load levels
  • Cold start latency (first request after model loading)
  • Latency variance (standard deviation)
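Tail percentiles are simple to compute from raw latency samples. The sketch below uses the nearest-rank definition, which is one of several common conventions, on made-up measurements.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct percent at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 5.6, 5.2, 6.8, 19.7]  # made-up samples
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean = statistics.mean(latencies_ms)
print(p50, p95, p99, mean)
```

One slow request barely moves the median but sets the p99, which is why latency targets are stated as tail percentiles rather than averages. With only ten samples the p95 and p99 coincide; meaningful tail estimates need thousands of requests at production load.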

Throughput metrics:

  • Maximum requests per second at SLA latency
  • Throughput at various concurrency levels
  • Throughput per GPU (efficiency metric)

Cost metrics:

  • Cost per prediction
  • Monthly infrastructure cost at current and projected volumes
  • GPU utilization at production load
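Cost per prediction follows directly from instance price and sustained throughput. The sketch below plugs in the throughput figures from the ad-tech example at the top of this article; the $1.00 hourly instance price is a placeholder assumption, not a quoted rate.

```python
def cost_per_million(hourly_cost_usd, sustained_rps):
    """Infrastructure cost in USD per one million predictions."""
    predictions_per_hour = sustained_rps * 3600
    return hourly_cost_usd / predictions_per_hour * 1_000_000


HOURLY_COST_USD = 1.00  # placeholder instance price
before = cost_per_million(HOURLY_COST_USD, sustained_rps=1_200)
after = cost_per_million(HOURLY_COST_USD, sustained_rps=14_000)
print(round(before, 4), round(after, 4), round(before / after, 2))
```

On the same hardware, cost per prediction falls in proportion to throughput, here by a factor of about 11.7. This only holds if the instance is kept busy, which is why GPU utilization at production load belongs alongside the cost metrics.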

Quality metrics:

  • Accuracy on benchmark dataset (must not degrade)
  • Output distribution comparison (must remain consistent)

Building an Optimization Pipeline

Rather than treating inference optimization as a one-time project, build a repeatable optimization pipeline that can be applied to any model as it moves to production.

Standard optimization checklist. Every model going to production should pass through a standard optimization checklist: profile baseline performance, apply FP16 quantization, compile with ONNX Runtime or TensorRT, configure dynamic batching, implement caching for repeated inputs, right-size the serving hardware, and configure autoscaling. This checklist catches the low-hanging fruit that provides 2x to 5x improvement for most models with minimal risk.

Optimization regression testing. After applying any optimization, run the model through a comprehensive quality validation suite. Compare accuracy metrics, output distributions, and edge case behavior between the optimized and original models. Any optimization that degrades quality beyond acceptable thresholds should be reverted or adjusted.
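A quality gate of this kind can be as small as the sketch below, which compares an optimized model against the original on a labeled set. The labels, predictions, and the one-point threshold are hypothetical; a real suite would also compare output distributions and targeted edge cases.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction matches the reference."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def passes_gate(baseline_preds, optimized_preds, labels, max_drop=0.01):
    """Return (ok, accuracy_drop, agreement_with_baseline)."""
    drop = accuracy(baseline_preds, labels) - accuracy(optimized_preds, labels)
    agreement = accuracy(optimized_preds, baseline_preds)
    return drop <= max_drop, drop, agreement


labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]     # made-up ground truth
baseline = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # original model: 9 of 10 correct
optimized = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # optimized model: 8 of 10 correct
ok, drop, agreement = passes_gate(baseline, optimized, labels)
print(ok, round(drop, 3), agreement)
```

In this toy run the optimized model fails the gate with a ten-point accuracy drop. Reporting agreement with the baseline as well helps distinguish a uniformly noisier model from one that changed behavior on a specific slice of inputs.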

Performance monitoring after optimization. Production performance often differs from benchmark performance due to real-world traffic patterns, concurrent load, and input distribution differences. Monitor latency, throughput, and GPU utilization after deploying optimized models and be prepared to adjust optimization parameters based on production observations.

Continuous optimization. As serving frameworks, hardware, and model architectures evolve, new optimization opportunities emerge. Schedule quarterly optimization reviews for production models to evaluate whether newer techniques or frameworks could improve performance. A model optimized with last year's tools may achieve 30 percent better performance with this year's framework updates.

Pricing Inference Optimization Engagements

  • Inference performance audit: $10,000 to $25,000
  • Single model optimization: $30,000 to $80,000
  • Multi-model optimization and serving platform: $80,000 to $200,000
  • Ongoing performance optimization: $5,000 to $15,000 per month

Value-based pricing is compelling here. If optimization reduces monthly serving costs from $50,000 to $15,000, the annual savings of $420,000 easily justify a $100,000 engagement.
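The arithmetic behind that pitch is worth showing to clients explicitly. A small sketch using the figures from the paragraph above (the helper name is illustrative):

```python
def engagement_roi(monthly_before, monthly_after, engagement_fee):
    """Return (annual savings, payback period in months) for an engagement."""
    monthly_savings = monthly_before - monthly_after
    annual_savings = monthly_savings * 12
    payback_months = engagement_fee / monthly_savings
    return annual_savings, payback_months


annual_savings, payback_months = engagement_roi(50_000, 15_000, 100_000)
print(annual_savings, round(payback_months, 1))
```

At $35,000 saved per month, a $100,000 engagement pays for itself in just under three months, which is often a stronger framing than the annual figure alone.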

Your Next Step

This week: Profile the inference performance of every model your agency has deployed. Identify which models are using oversized hardware, which lack quantization, and which have unoptimized serving configurations.

This month: Build an inference optimization toolkit: scripts and procedures for quantization, ONNX conversion, TensorRT compilation, and performance benchmarking.

This quarter: Deliver your first inference optimization engagement. Start with the models where the gap between current performance and target performance is largest.
