Your BERT-based classification model achieves 94% accuracy: excellent. But it takes 120ms per inference on a GPU, which means $8,000/month in GPU costs to handle the client's production traffic. The client's budget for inference infrastructure is $2,000/month. You need to make the model 4x faster without dropping below 90% accuracy. Model compression makes this possible.
Model compression encompasses techniques that reduce model size, inference time, and computational requirements while preserving as much accuracy as possible. For AI agencies delivering production systems, compression is often the difference between a model that performs well in development and a model that is economically viable in production.
Compression Techniques
Quantization
Reduce the numerical precision of model weights and activations: from 32-bit floating point (FP32) to 16-bit (FP16), 8-bit integers (INT8), or even 4-bit.
Post-training quantization: Apply quantization to a trained model without retraining. The simplest approach: convert FP32 weights to INT8, typically achieving 2-4x speed improvement with less than 1% accuracy loss.
Quantization-aware training (QAT): Simulate quantization during training so the model learns to be robust to reduced precision. Produces better accuracy than post-training quantization, especially at aggressive quantization levels (INT4).
Dynamic vs. static quantization: Dynamic quantization determines quantization parameters at runtime, while static quantization calibrates parameters using a representative dataset. Static is faster at inference; dynamic is simpler to implement.
When to use: Quantization should be your first compression technique. It provides significant speedup with minimal accuracy loss and requires minimal effort to implement.
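At its core, INT8 quantization maps FP32 values onto a small set of integer levels via a scale factor. A minimal NumPy sketch of symmetric post-training quantization follows; it is illustrative only, as production toolchains (PyTorch, TensorFlow Lite, and the like) add per-channel scales, calibration, and fused INT8 kernels:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: one FP32 scale, zero-point 0."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).clip(-127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes // q.nbytes)  # 4: INT8 storage is 4x smaller than FP32
# Round-trip error is bounded by half a quantization step
print(bool(np.abs(w - w_hat).max() <= 0.5 * scale + 1e-9))  # True
```

The 4x memory reduction is exact; the speedup in practice depends on the hardware having efficient INT8 kernels, which is why latency should always be measured rather than assumed.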
Knowledge Distillation
Train a smaller "student" model to mimic the behavior of a larger "teacher" model.
Process: The teacher model (your large, accurate model) generates soft predictions on the training data. The student model (a smaller architecture) is trained to match these soft predictions rather than the hard labels. The soft predictions contain richer information than hard labels: they encode the teacher's uncertainty and inter-class relationships.
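The soft-target objective can be sketched in NumPy as follows. The temperature T, mixing weight alpha, and toy logits are illustrative choices, not values from the text; both distributions are softened by T, and the student minimizes a weighted mix of soft-target cross-entropy against the teacher and ordinary cross-entropy against the hard labels:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Mix soft-target cross-entropy (vs. teacher) with hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)            # softened teacher targets
    log_p_student = np.log(softmax(student_logits, T))
    # T^2 keeps the soft-loss gradient scale comparable across temperatures
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    log_p = np.log(softmax(student_logits))           # T = 1 for the hard loss
    hard = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard

# toy batch: 2 examples, 3 classes
teacher_logits = np.array([[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]])
student_logits = np.array([[2.0, 0.5, 0.1], [0.3, 2.5, 0.4]])
labels = np.array([0, 1])
loss = distillation_loss(student_logits, teacher_logits, labels)
print(loss > 0.0)  # True: a positive scalar the student is trained to minimize
```

A higher temperature spreads the teacher's probability mass across classes, which is exactly what exposes the inter-class relationships that hard labels discard.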
Architecture selection: The student model can be a smaller version of the same architecture (fewer layers, fewer hidden units) or a completely different architecture (distilling a transformer into a simpler model).
Typical results: Distillation can produce models 3-10x smaller with 1-3% accuracy loss, depending on the task and the student architecture.
When to use: When you need significant size or speed improvements and are willing to invest in training a new model. Distillation is especially effective when the teacher model is much larger than needed for the task.
Pruning
Remove unnecessary weights, neurons, or layers from the model.
Weight pruning (unstructured): Set individual weights to zero based on magnitude; small weights are assumed to be less important. Unstructured pruning can remove 50-90% of weights with minimal accuracy loss, but the sparse weight matrices may not translate to actual speedup without specialized hardware or libraries.
Structured pruning: Remove entire neurons, channels, or attention heads. Structured pruning produces genuinely smaller models that run faster on standard hardware, but typically achieves less aggressive compression than unstructured pruning.
Iterative pruning: Prune gradually: remove a small percentage of weights, fine-tune the model to recover accuracy, then prune more. Iterative pruning achieves better accuracy than one-shot pruning at the same compression level.
When to use: When you need to reduce model size and can afford the engineering effort of pruning and fine-tuning. Structured pruning is preferred when you need actual inference speedup on commodity hardware.
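Unstructured magnitude pruning can be sketched in a few lines. This uses a single global threshold for simplicity; real pipelines typically prune per layer and fine-tune between rounds, as in the iterative approach above:

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w), axis=None)[k]   # k-th smallest magnitude
    return np.where(np.abs(w) < threshold, 0.0, w)

rng = np.random.default_rng(1)
w = rng.normal(size=(128, 128))
pruned = magnitude_prune(w, 0.8)

achieved = float((pruned == 0).mean())
print(round(achieved, 2))  # 0.8: four out of five weights are now zero
```

Note that the pruned matrix is the same shape as the original; without sparse kernels or hardware support, those zeros save no inference time, which is the caveat raised under unstructured pruning.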
Architecture Optimization
Use more efficient model architectures from the start rather than compressing a large model after training.
Efficient architectures: MobileNet, EfficientNet, and SqueezeNet for computer vision. DistilBERT, TinyBERT, and ALBERT for NLP. These architectures are designed for efficiency and often provide excellent accuracy-speed trade-offs.
Neural Architecture Search (NAS): Automated search for the optimal architecture given accuracy and efficiency constraints. NAS can discover architectures that are more efficient than hand-designed alternatives, but the search process is computationally expensive.
When to use: When starting a new project, choose an efficient architecture from the start rather than compressing a large model later. Compression makes the most sense when you already have a trained large model.
TensorRT and Runtime Optimization
Optimize model execution for specific hardware using inference optimization tools.
TensorRT (NVIDIA): Optimizes model execution for NVIDIA GPUs through operator fusion, precision calibration, and kernel auto-tuning. Can provide 2-5x speedup without any model architecture changes.
ONNX Runtime: Cross-platform inference optimization that works across CPU and GPU. Provides optimization through graph optimization, quantization, and hardware-specific execution providers.
OpenVINO (Intel): Optimizes inference for Intel CPUs and accelerators. Useful for edge deployment on Intel hardware.
When to use: Always. Runtime optimization should be applied to every production model regardless of other compression techniques used. It provides "free" speedup from better execution, not model changes.
Compression Strategy
Evaluating Trade-Offs
Accuracy vs. speed: Every compression technique trades some accuracy for speed. Define the minimum acceptable accuracy before compressing, and stop compressing when you reach it.
Compression pipeline: Techniques can be combined for cumulative benefit. A typical pipeline: start with the full model, apply quantization (2-4x speedup), apply distillation if more compression is needed (additional 2-5x), then apply runtime optimization (additional 1.5-3x).
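The cumulative effect is multiplicative. Using illustrative mid-range factors for each stage (not measurements) against the 120ms baseline from the introduction:

```python
# Stage speedups compound: each divides the remaining latency
baseline_ms = 120.0
stages = {"quantization": 3.0, "distillation": 2.0, "runtime optimization": 1.5}

latency_ms = baseline_ms
for stage, speedup in stages.items():
    latency_ms /= speedup
    print(f"after {stage}: {latency_ms:.1f} ms")

total = baseline_ms / latency_ms
print(f"total speedup: {total:.0f}x")  # comfortably past the 4x target
```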
Task-specific tolerance: Some tasks tolerate accuracy loss better than others. A recommendation system that drops from 88% to 85% relevance may not affect user experience. A medical diagnostic model that drops from 96% to 93% may cross a clinical safety threshold.
Evaluation Methodology
Same test set: Evaluate compressed models on the same test set as the original model. Report accuracy delta alongside speedup.
Latency profiling: Measure actual inference latency on the target hardware, not theoretical FLOPs reduction. Some compression techniques provide theoretical speedup that does not materialize in practice due to memory access patterns and hardware constraints.
Production-like conditions: Measure performance under production-like conditions: concurrent requests, realistic input distributions, and production hardware. Batch size significantly affects the benefit of different compression techniques.
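A minimal harness for the latency measurement described above reports wall-clock percentiles on the target hardware rather than theoretical FLOPs. `model_fn` is a stand-in for whatever inference callable you are profiling; here it is a dummy fixed-cost computation:

```python
import time
import statistics

def profile_latency(model_fn, inputs, warmup=10, runs=100):
    """Return (p50, p95) wall-clock latency in milliseconds."""
    for _ in range(warmup):               # warm caches/JIT compilers before timing
        model_fn(inputs)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        model_fn(inputs)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p50 = statistics.median(samples)
    p95 = samples[int(len(samples) * 0.95) - 1]
    return p50, p95

# dummy "model": fixed-cost computation standing in for real inference
p50, p95 = profile_latency(lambda n: sum(i * i for i in range(n)), 10_000)
print(p50 <= p95)  # True; report tail latency alongside the median, not a mean
```

Run the same harness before and after each compression step, on the production hardware, so the quoted speedup reflects what the client will actually pay for.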
Client Delivery
Cost Modeling
Present compression as a business decision. "The uncompressed model costs $8,000/month in GPU compute. After quantization and distillation, the compressed model achieves 91% accuracy (versus 94% original) at $1,800/month in compute โ a 78% cost reduction with a 3-point accuracy trade-off."
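The arithmetic behind that pitch, spelled out:

```python
original_cost, compressed_cost = 8000, 1800    # monthly GPU compute, in dollars
reduction = (original_cost - compressed_cost) / original_cost
print(f"cost reduction: {reduction:.1%}")      # rounds to the 78% quoted above

accuracy_delta = 94 - 91
print(f"accuracy trade-off: {accuracy_delta} points")
```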
Compression as a Service
Offer model compression as a distinct service offering for clients with existing models that are too expensive to serve at scale. This is a focused, well-defined engagement with clear deliverables and measurable outcomes.
Model compression is essential engineering for production AI. Development environments are forgiving: generous compute, no cost pressure, no latency requirements. Production is demanding: every millisecond of latency matters, every dollar of compute costs money, and models must serve thousands of requests efficiently. Master compression techniques, and your agency can deliver models that are not just accurate but economically viable in production.