A 350M-Parameter Model and a $14K Monthly GPU Bill

Agency Script Editorial

Editorial Team

March 20, 2026 · 12 min read
Tags: model distillation, model compression, edge deployment, inference optimization

Knowledge Distillation for Deploying Smaller Models: The Agency Efficiency Guide

A retail analytics agency in Boston built a product recommendation engine for a major grocery chain. The model was a deep neural network with 350 million parameters that delivered excellent recommendations, with click-through rates 23% higher than the previous rule-based system. One problem: inference cost. The model ran on GPU instances costing $14,000 per month to serve 8 million daily recommendation requests. The grocery chain's IT budget allocated $4,000 per month for the recommendation infrastructure. The model was too expensive to deploy.

The agency used knowledge distillation to train a smaller "student" model (12 million parameters) that learned to mimic the large model's behavior. The student model captured 91% of the teacher model's recommendation quality while running on CPU instances at $1,800 per month. The grocery chain deployed it immediately. The agency saved a deal that was about to die over infrastructure costs.

Knowledge distillation is one of the most practical and underutilized techniques in the agency toolkit. It bridges the gap between the large, accurate models you build during development and the small, efficient models your clients can actually afford to run in production. If you deploy models to production, especially at the edge or at scale, you need distillation in your toolkit.

What Knowledge Distillation Actually Does

Knowledge distillation transfers the learned behavior of a large, complex model (the "teacher") to a smaller, simpler model (the "student"). The student learns not from the raw training data directly, but from the teacher's outputs: its predictions, its uncertainty, and its intermediate representations.

Why this works better than just training a small model directly:

When you train a small model on raw training data, the labels are hard: "this is category A" or "this is not fraud." The small model must learn the same patterns from scratch with fewer parameters.

When you train a small model on the teacher's outputs, the labels are soft: "this is 80% category A, 15% category B, and 5% category C." These soft labels contain much more information:

  • They reveal relationships between categories (category A is more similar to category B than to category C)
  • They indicate confidence (this example is clearly category A vs. this example could be A or B)
  • They encode the complex patterns the teacher learned, in a form that is easier for a smaller model to absorb

The result: A distilled student model is typically 3-20x smaller than the teacher while retaining 90-97% of the teacher's performance. That is a dramatic improvement in the efficiency-accuracy tradeoff.
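
One rough way to see why soft labels carry more signal is to compare the entropy of a hard label with that of the soft label quoted above. This is only an illustrative sketch: `entropy_bits` is a helper name chosen here, and entropy is a proxy for "information per example", not a formal guarantee about how well a student will train.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

hard_label = [1.0, 0.0, 0.0]     # "this is category A"
soft_label = [0.80, 0.15, 0.05]  # the teacher output used in the text

hard_bits = entropy_bits(hard_label)
soft_bits = entropy_bits(soft_label)

print(f"hard label entropy: {hard_bits:.3f} bits")
print(f"soft label entropy: {soft_bits:.3f} bits")

# The ordering of the non-top classes is also visible only in the soft label:
# category B (0.15) is ranked closer to A than category C (0.05).
```

The hard label has zero entropy, so every example says the same kind of thing: one class, full confidence. The soft label spreads mass across classes, which is where the similarity and confidence signals come from.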

When to Use Distillation in Agency Work

Scenario 1: Reducing Inference Costs

The most common scenario. Your model is accurate but too expensive to serve at the client's traffic volume.

Example: A text classification model using a 110M-parameter BERT model processes 500,000 documents per day. Inference costs $8,000 per month on GPU instances. A distilled model using a 4-layer BERT (22M parameters) achieves 94% of the accuracy on CPU instances costing $800 per month.

Scenario 2: Edge Deployment

The client needs the model to run on devices without internet connectivity: mobile phones, IoT sensors, factory floor cameras, vehicles.

Example: A quality inspection model for a manufacturing client needs to run on an NVIDIA Jetson Nano at the production line. The original model (ResNet-152) does not fit in the device's 4GB memory. A distilled MobileNet student model fits in 500MB and processes images at 30 frames per second.

Scenario 3: Latency Requirements

The application requires sub-10ms predictions, and the large model takes 50ms.

Example: A real-time bidding system for an ad tech client needs to score 100,000 bid requests per second with sub-5ms latency. The full model takes 20ms. A distilled model takes 3ms.

Scenario 4: Distilling LLM Capabilities

You are using GPT-4 or Claude for a specific task during development, but the per-prediction cost makes it uneconomical for production use.

Example: You used GPT-4 to classify customer support tickets during development: perfect accuracy, $0.03 per ticket. At 50,000 tickets per month, that is $1,500 in API costs. You use GPT-4's predictions as training data for a fine-tuned DistilBERT model that costs $0.0001 per ticket, or $5 per month.

The Distillation Process

Step 1: Train (or Select) the Teacher Model

The teacher model is your best model: the one you would deploy if cost and latency were not constraints. It should be thoroughly validated and represent the quality ceiling.

Teacher model guidelines:

  • Use the largest, most accurate model you can train on available data
  • Validate thoroughly, because the student will learn the teacher's mistakes as well as its knowledge
  • Keep the teacher's training data available, because you will need it for distillation

Step 2: Generate Soft Labels

Run the teacher model on the training set (or a large unlabeled dataset) and capture its predictions as probability distributions, not hard labels.

For classification tasks: Save the full softmax probability vector, not just the top class. The probabilities for non-top classes contain valuable information about inter-class similarities.

Temperature scaling: The teacher's softmax output is usually very peaked, for example 99% probability on one class and 1% spread across the rest. Temperature scaling "softens" this distribution, exposing more of the teacher's learned structure.

Apply a temperature parameter T > 1 to the softmax:

  • T = 1: Standard softmax (peaked distribution)
  • T = 2-5: Moderate softening (common range for distillation)
  • T = 10-20: Very soft distribution (reveals fine-grained class relationships)

The right temperature depends on the task. Start with T = 3 and experiment. Higher temperatures are better when the teacher has learned rich inter-class relationships. Lower temperatures are better when the correct class is clearly dominant.
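
To make the effect of T concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The function name `softmax_with_temperature` and the example logits are assumptions for illustration; in practice you would apply the same division by T to your teacher's real logits before the softmax.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T (T > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for a three-class problem.
teacher_logits = [6.0, 2.0, 1.0]

for t in (1.0, 3.0, 10.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(f"T={t:>4}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As T rises, probability mass moves from the top class to the others while the ranking of the classes stays the same. That is the "softening" the student benefits from.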

For regression tasks: Use the teacher's predictions directly as training targets for the student.

For embedding/representation tasks: Capture the teacher's intermediate representations (hidden layer outputs) in addition to final predictions. The student can learn to mimic these representations, capturing deeper knowledge.

Step 3: Design the Student Architecture

The student model should be:

  • Smaller than the teacher, typically with 3-20x fewer parameters
  • Appropriate for the deployment target: fits in the device's memory and meets the latency requirement
  • Compatible with the deployment framework: ONNX-compatible for cross-platform deployment, TensorRT-compatible for NVIDIA hardware
  • Of the same general family, if possible: distilling a large transformer teacher into a small transformer student works better than distilling into a completely different architecture
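
A quick back-of-the-envelope check for the "fits in the device's memory" criterion is to estimate weight size from parameter count and numeric precision. The sketch below is an approximation under stated assumptions: `overhead_factor` is a hypothetical multiplier for activations and runtime buffers, and the 500 MB budget is only an example. Measure on the real device before committing to an architecture.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_size_mb(num_params, dtype="fp32"):
    """Approximate size of the model weights in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

def fits_budget(num_params, budget_mb, dtype="fp32", overhead_factor=1.5):
    """Rough check: weights plus an assumed runtime overhead versus a memory budget."""
    return weight_size_mb(num_params, dtype) * overhead_factor <= budget_mb

# Parameter counts for the NLP models discussed in this guide.
candidates = {"BERT-base": 110_000_000, "DistilBERT": 66_000_000, "TinyBERT": 14_500_000}

for name, params in candidates.items():
    size = weight_size_mb(params)
    verdict = "fits" if fits_budget(params, budget_mb=500) else "too large"
    print(f"{name}: about {size:.0f} MB in fp32, {verdict} in a 500 MB budget")
```

Running the same check with `dtype="fp16"` or `dtype="int8"` shows how much room quantization could buy before you shrink the architecture further.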

Common student architecture choices:

For NLP:

  • DistilBERT (6 layers, 66M params) as student for BERT-base (12 layers, 110M params)
  • TinyBERT (4 layers, 14.5M params) for more aggressive compression
  • A simple LSTM or CNN text classifier for maximum efficiency

For computer vision:

  • MobileNet or EfficientNet-B0 as student for ResNet-152 or EfficientNet-B7
  • SqueezeNet for extreme edge deployment constraints

For tabular ML:

  • A shallow gradient-boosted model (100 trees, depth 3) as student for a deep ensemble (1000 trees, depth 8)
  • A single neural network as student for an ensemble of diverse models

Step 4: Train the Student

The student model trains on a combination of two loss functions:

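Distillation loss: How well does the student match the teacher's soft predictions? Measured as KL divergence between the teacher and student probability distributions (both computed with the same temperature T). Because gradients from the softened targets shrink roughly as 1/T^2, this term is commonly multiplied by T^2 so that its weight relative to the hard label loss stays stable when you change T.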

Hard label loss: How well does the student predict the correct hard labels? Measured with standard cross-entropy against the ground truth labels.

The combined loss: Total Loss = alpha × Distillation Loss + (1 - alpha) × Hard Label Loss

Alpha controls the balance:

  • alpha = 1.0: Student learns only from the teacher (pure distillation)
  • alpha = 0.0: Student learns only from hard labels (standard training, no distillation)
  • alpha = 0.5-0.7: Typical sweet spot, where the student learns primarily from the teacher but ground truth keeps it honest
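
The combined loss can be sketched for a single example in plain Python as follows. This is an illustration under assumptions, not a production training loop: the function names and logits are hypothetical, and the T^2 multiplier on the distillation term is a widely used convention added here to keep the two terms on a comparable scale. A real project would compute the same quantities on batches in its deep learning framework.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def combined_loss(student_logits, teacher_logits, true_index,
                  temperature=3.0, alpha=0.6):
    """Return (total, distillation_term, hard_label_term) for one example."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Assumption: scale by T^2 so the soft term does not shrink as T grows.
    distill = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    # Hard label cross-entropy uses the student's ordinary (T = 1) probabilities.
    hard = -math.log(softmax(student_logits)[true_index])
    total = alpha * distill + (1 - alpha) * hard
    return total, distill, hard

total, distill, hard = combined_loss(
    student_logits=[1.0, 0.8, 0.2],   # hypothetical student output
    teacher_logits=[4.0, 1.5, 0.5],   # hypothetical teacher output
    true_index=0,
)
print(f"total={total:.4f} distillation={distill:.4f} hard={hard:.4f}")
```

Returning the two terms separately makes it easy to log them side by side, which helps when you need to diagnose whether the student is failing to match the teacher or failing on the ground truth.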

Training tips:

  • Use a lower learning rate than standard training (the soft labels are already informative, so the student does not need to learn as aggressively)
  • Train for more epochs than standard training (the student needs more exposure to compensate for fewer parameters)
  • Use data augmentation to increase the effective training set size
  • Monitor both the distillation loss and the hard label loss separately to diagnose training issues

Step 5: Evaluate the Student

Compare the student against:

  1. The teacher model (quality ceiling)
  2. A same-sized model trained without distillation (to measure the benefit of distillation)
  3. The deployment requirements (latency, memory, cost)

Key metrics:

  • Accuracy retention: student accuracy / teacher accuracy (target: 90-97%)
  • Speedup: teacher inference time / student inference time (target: 3-20x)
  • Size reduction: teacher model size / student model size (target: 3-20x)
  • Cost reduction: teacher deployment cost / student deployment cost
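
These ratios are simple enough to compute in a small helper, which keeps the teacher, the student, and the non-distilled baseline in one report. The sketch below uses hypothetical names (`ModelStats`, `distillation_report`) and made-up numbers purely for illustration; substitute your own measurements.

```python
from dataclasses import dataclass

@dataclass
class ModelStats:
    accuracy: float      # fraction correct on the held-out test set
    latency_ms: float    # inference time per request
    size_mb: float       # model size on disk or in memory
    monthly_cost: float  # serving cost in dollars per month

def distillation_report(teacher, student, baseline_accuracy, retention_target=0.90):
    """Compare a distilled student with its teacher and a same-sized baseline."""
    retention = student.accuracy / teacher.accuracy
    return {
        "accuracy_retention": retention,
        "speedup": teacher.latency_ms / student.latency_ms,
        "size_reduction": teacher.size_mb / student.size_mb,
        "cost_reduction": teacher.monthly_cost / student.monthly_cost,
        # Gain over a same-sized model trained without distillation.
        "distillation_gain": student.accuracy - baseline_accuracy,
        "meets_retention_target": retention >= retention_target,
    }

# Illustrative numbers only, not measured results.
teacher = ModelStats(accuracy=0.92, latency_ms=50.0, size_mb=440.0, monthly_cost=8000.0)
student = ModelStats(accuracy=0.874, latency_ms=5.0, size_mb=88.0, monthly_cost=800.0)

for metric, value in distillation_report(teacher, student, baseline_accuracy=0.85).items():
    print(f"{metric}: {value}")
```

The `distillation_gain` entry covers the second comparison in the list above: if it is close to zero, the distillation step is not earning its overhead.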

If the student retains less than 90% of the teacher's accuracy, consider:

  • Increasing the student size (add more layers or parameters)
  • Adjusting the temperature parameter
  • Using intermediate representation matching (not just final layer distillation)
  • Generating more training data with the teacher
  • Using a more gradual distillation approach (distill to a medium model first, then from medium to small)

Advanced Distillation Techniques

Multi-Layer Distillation

Instead of only matching the teacher's final predictions, also match intermediate layer representations. The student's hidden layers learn to mimic the teacher's hidden layers, capturing deeper structural knowledge.

This requires mapping student layers to teacher layers (since the student has fewer layers). Common strategies:

  • Map every student layer to the corresponding teacher layer at the same relative position
  • Use a projection layer to match dimension differences between student and teacher representations
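
Both strategies can be sketched in a few lines. The helper names below (`map_layers`, `project`, `mse`) are hypothetical, and the projection weights are fixed by hand for illustration; in a real setup the projection would be a learned linear layer trained alongside the student.

```python
def map_layers(num_student_layers, num_teacher_layers):
    """Map each student layer (1-based) to the teacher layer at the same relative depth."""
    if not 0 < num_student_layers <= num_teacher_layers:
        raise ValueError("student must have between 1 and num_teacher_layers layers")
    return {
        s: (s * num_teacher_layers + num_student_layers - 1) // num_student_layers
        for s in range(1, num_student_layers + 1)
    }

def project(vector, weights):
    """Linear projection; weights holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# A 4-layer student matched against a 12-layer teacher.
print(map_layers(4, 12))

# Toy hidden states: the student is 2-dimensional, the teacher 3-dimensional.
student_hidden = [1.0, 2.0]
teacher_hidden = [1.0, 2.0, 2.5]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical weights

hidden_loss = mse(project(student_hidden, projection), teacher_hidden)
print(f"hidden-state loss: {hidden_loss:.4f}")
```

The hidden-state loss for each mapped pair is then added to the output-level distillation loss, usually with its own weighting factor.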

Progressive Distillation

For very large teacher-to-student size ratios, distill in stages:

  1. Distill the large teacher into a medium assistant teacher
  2. Distill the medium teacher into the small target student

Each step has a modest compression ratio, which works better than one large compression step.
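
One simple way to choose the size of the intermediate model is to space the stages geometrically, so that every step has the same compression ratio. This spacing rule is an assumption made for illustration, not a requirement of progressive distillation, and `staged_sizes` is a hypothetical helper. The example reuses the 350M-to-12M case from the opening story.

```python
def staged_sizes(teacher_params, student_params, stages=2):
    """Parameter counts for each stage, spaced so every step compresses by the same ratio."""
    if stages < 1 or student_params >= teacher_params:
        raise ValueError("need at least one stage and a student smaller than the teacher")
    step_ratio = (teacher_params / student_params) ** (1 / stages)
    sizes = [round(teacher_params / step_ratio ** k) for k in range(stages + 1)]
    sizes[-1] = student_params  # pin the final size exactly
    return sizes

sizes = staged_sizes(350_000_000, 12_000_000, stages=2)
for bigger, smaller in zip(sizes, sizes[1:]):
    print(f"{bigger:,} -> {smaller:,} parameters ({bigger / smaller:.1f}x compression)")
```

With two stages, each step compresses by a little over 5x instead of one jump of roughly 29x. The suggested intermediate size is a starting point; pick the nearest architecture you can actually build and deploy.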

Self-Distillation

Train a model, then use it as a teacher to train an identical architecture from scratch. Surprisingly, the self-distilled model often performs better than the original, because the soft labels smooth out noise in the training data.

Data-Free Distillation

When the original training data is unavailable (due to privacy, licensing, or practical reasons), generate synthetic data using the teacher model and train the student on that synthetic data. The teacher generates examples and their predictions; the student learns from both.

Common Distillation Mistakes

Mistake 1: Distilling a teacher that has not been validated. The student inherits the teacher's errors. If the teacher has systematic biases or failure modes, the student will learn those too. Validate the teacher thoroughly before using it to generate training signals.

Mistake 2: Using too small a student. Aggressive compression (100x smaller) almost always degrades quality unacceptably. Start conservative (5-10x smaller) and compress further only if the accuracy-latency tradeoff demands it.

Mistake 3: Not tuning the temperature. The default temperature of 1.0 leaves the teacher's output sharply peaked, so the soft labels behave almost like hard labels and much of the distillation benefit is lost. Experiment with temperatures between 2 and 10 to find the sweet spot.

Mistake 4: Skipping the direct training comparison. Always train a same-sized student without distillation to measure the actual benefit of the distillation process. If the distilled student performs only marginally better than the directly trained model, the distillation overhead may not be justified.

Pricing Distillation Work

Distillation is usually priced as a component of a larger model deployment project, although standalone engagements do come up:

  • Distillation as part of deployment optimization: $10,000 - $25,000 additional on top of the base model development cost
  • Standalone distillation project (optimizing an existing model for a new deployment target): $20,000 - $50,000
  • LLM-to-specialized-model distillation: $30,000 - $60,000

Frame the value in terms of operational savings: "The distilled model costs $1,800 per month to serve instead of $14,000. That is about $146,000 in annual savings. The one-time distillation cost of $25,000 pays for itself in about two months."
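
The arithmetic behind that pitch is worth keeping in a reusable form so you can rerun it with each client's numbers. The sketch below uses a hypothetical `payback` helper and the figures from the grocery chain example.

```python
def payback(teacher_monthly, student_monthly, one_time_cost):
    """Savings and payback period for a distillation project (all values in dollars)."""
    monthly_savings = teacher_monthly - student_monthly
    if monthly_savings <= 0:
        raise ValueError("the student must be cheaper to serve than the teacher")
    return {
        "monthly_savings": monthly_savings,
        "annual_savings": monthly_savings * 12,
        "payback_months": one_time_cost / monthly_savings,
    }

result = payback(teacher_monthly=14_000, student_monthly=1_800, one_time_cost=25_000)
print(f"Monthly savings: ${result['monthly_savings']:,}")
print(f"Annual savings:  ${result['annual_savings']:,}")
print(f"Payback period:  {result['payback_months']:.1f} months")
```

Leading with the payback period in months tends to land better with budget holders than the raw project fee.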

Your Next Step

Look at your most expensive deployed model, the one with the highest monthly inference cost. Calculate the annual serving cost. Then estimate: if the model were 10x smaller and ran on CPU instead of GPU, what would the serving cost be? The difference between those two numbers is the value of distillation. If the savings exceed $50,000 annually, a distillation project is easily justified. Start with a simple experiment: train a student model at 10% of the teacher's parameter count using the teacher's soft predictions. Measure the accuracy-cost tradeoff. Most of the time, you will be surprised at how much quality the student retains.
