Model distillation is the practice of training a small model to reproduce the behavior of a larger, more capable one. The large model is the teacher. The small model is the student. Instead of learning only from hard labels in your dataset, the student learns from the teacher's full output, which carries far more information than a single correct answer. The result, when it works, is a model that runs five to twenty times cheaper and faster while keeping most of the teacher's quality.
This matters because the most accurate models are usually the most expensive to run. A model that costs cents per request is fine for a demo and ruinous at a million requests a day. Distillation is the bridge: you pay the teacher's cost once to generate training signal, then serve the student forever at a fraction of the price.
This guide covers the full picture: what distillation actually is at the mechanism level, the main variants, how to run a project end to end, where it breaks, and how to decide whether it is the right tool for your problem. If you are brand new to the concept, start with What Is Model Distillation: A Beginner's Guide and come back here.
What Distillation Actually Does
The core idea is that a trained model knows more than its final answer reveals. When a classifier predicts "cat," it also assigns probabilities to "dog," "fox," and "lynx." Those secondary probabilities — the soft targets — encode how the teacher sees the relationships between classes. A confident "cat at 0.97, dog at 0.02" teaches something different from "cat at 0.55, dog at 0.40."
Soft Targets and Temperature
Distillation trains the student to match the teacher's probability distribution, not just the top label. A temperature parameter divides the logits before the softmax to soften the distribution and expose more of this structure; the same temperature is applied to both the teacher and the student when the soft term is computed. Higher temperature spreads probability mass across classes and surfaces the teacher's "dark knowledge." The student's loss combines two terms: how well it matches the teacher's soft distribution, and how well it matches the true hard labels.
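A minimal sketch of the two-term loss described above, in plain Python so the arithmetic is visible. The logits, the temperature of 2.0, and the weighting `alpha` are illustrative assumptions, not recommended values; the `temperature ** 2` scaling on the soft term is the convention commonly used to keep the two terms on a comparable scale.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term."""
    # Soft term: teacher and student are softened with the same temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_term = (temperature ** 2) * kl_divergence(teacher_soft, student_soft)
    # Hard term: ordinary cross-entropy against the true label at temperature 1.
    hard_term = -math.log(softmax(student_logits)[true_index])
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Illustrative logits for the classes cat, dog, fox, lynx (made-up numbers).
teacher_logits = [4.0, 1.0, 0.5, 0.2]
student_logits = [2.5, 1.5, 0.3, 0.1]

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
loss = distillation_loss(student_logits, teacher_logits, true_index=0)
```

Comparing `sharp` and `soft` shows the effect of temperature: the top class loses probability mass and the secondary classes gain it, which is the structure the student is asked to match.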
Why This Beats Training From Scratch
A small model trained directly on hard labels has to discover decision boundaries from limited data. A distilled student gets a continuous, information-rich signal at every example. It is the difference between being told the answer and being shown the reasoning behind every answer. That extra signal is why a distilled student often beats an identical architecture trained on the same data without a teacher.
The Main Types of Distillation
Distillation is a family, not a single technique. The variant you choose depends on what access you have to the teacher and what you are optimizing for.
- Response-based distillation. The student matches the teacher's final output layer. Simplest and most common. Works for classification, ranking, and generation.
- Feature-based distillation. The student matches the teacher's intermediate layer activations, not just the output. Useful when the teacher's internal representations carry value the output alone loses.
- Relation-based distillation. The student matches the relationships between examples — how the teacher positions inputs relative to each other rather than absolute outputs.
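Relation-based matching is the least intuitive of the three, so here is one possible sketch of it: compare the normalized pairwise-distance matrices of teacher and student embeddings for the same batch. The function names and the toy embeddings are assumptions for illustration, and this is only one of several ways to define a relational loss. Because only distances are compared, the teacher and student embeddings may have different dimensions.

```python
import math

def pairwise_distances(embeddings):
    """Euclidean distance between every pair of embeddings in a batch."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def normalize(matrix):
    """Divide by the mean off-diagonal distance so scales are comparable.

    Assumes the batch contains at least two distinct embeddings.
    """
    n = len(matrix)
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diagonal) / len(off_diagonal)
    return [[value / mean for value in row] for row in matrix]

def relation_loss(teacher_embeddings, student_embeddings):
    """Mean squared difference between normalized distance matrices."""
    t = normalize(pairwise_distances(teacher_embeddings))
    s = normalize(pairwise_distances(student_embeddings))
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Toy batch of three inputs: 3-d teacher embeddings, 2-d student embeddings.
teacher_emb = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [6.0, 8.0, 0.0]]
same_geometry = [[0.0, 0.0], [0.3, 0.4], [0.6, 0.8]]   # scaled copy of the layout
other_geometry = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # different layout

loss_same = relation_loss(teacher_emb, same_geometry)
loss_other = relation_loss(teacher_emb, other_geometry)
```

A student that preserves the teacher's relative layout gets a loss near zero even though its embedding space is smaller and scaled differently, while a student with a different layout is penalized.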
For large language models, a fourth pattern dominates in practice: sequence-level distillation, where the teacher generates completions for a set of prompts and the student is fine-tuned on those generations. This is the approach most teams reach for today because it needs only API access to the teacher.
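The data-collection half of sequence-level distillation can be sketched as below. `teacher_generate` is a hypothetical callable standing in for whatever API client you use, `fake_teacher` is a stub so the example runs offline, and the prompt/completion JSONL layout is an assumed convention rather than a required format.

```python
import json

def build_distillation_set(prompts, teacher_generate):
    """Pair each unique prompt with the teacher's completion."""
    records = []
    seen = set()
    for prompt in prompts:
        if prompt in seen:  # skip exact duplicate prompts
            continue
        seen.add(prompt)
        completion = teacher_generate(prompt).strip()
        if completion:  # drop prompts where the teacher returned nothing
            records.append({"prompt": prompt, "completion": completion})
    return records

def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def fake_teacher(prompt):
    """Hypothetical stand-in for a real teacher API call."""
    return "" if "empty" in prompt else "LABEL: " + prompt.upper()

prompts = ["Refund not received", "Refund not received",
           "empty ticket", "Cannot log in"]
records = build_distillation_set(prompts, fake_teacher)
jsonl = to_jsonl(records)
```

The resulting lines are what the student would then be fine-tuned on with ordinary supervised training.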
How to Run a Distillation Project
A real project moves through five stages. The step-by-step approach covers the mechanics in depth; here is the shape.
- Define the task narrowly. Distillation transfers a capability, not general intelligence. "Classify support tickets into 12 categories" distills well. "Be as smart as the teacher" does not.
- Build a representative prompt set. Collect inputs that match your real production distribution. Coverage of edge cases here directly determines student quality.
- Generate teacher outputs. Run the teacher across your prompt set and capture its responses, and where possible its probabilities or logits.
- Train the student. Fine-tune a smaller base model on the teacher's outputs using a distillation loss or straight supervised fine-tuning for sequence-level setups.
- Evaluate against the teacher. Measure the quality gap on a held-out set, broken down by the slices you care about rather than as a single aggregate accuracy figure. Watch the cases where the student and teacher disagree.
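The evaluation stage can be sketched as a small comparison function. The record layout (`input`, `gold`, `teacher`, `student`) and the four held-out examples are assumptions made up for illustration.

```python
def compare_to_teacher(examples):
    """Summarize how the student tracks the teacher on held-out examples."""
    n = len(examples)
    teacher_acc = sum(e["teacher"] == e["gold"] for e in examples) / n
    student_acc = sum(e["student"] == e["gold"] for e in examples) / n
    disagreements = [e for e in examples if e["student"] != e["teacher"]]
    return {
        "teacher_accuracy": teacher_acc,
        "student_accuracy": student_acc,
        "quality_gap": teacher_acc - student_acc,
        "agreement_rate": 1 - len(disagreements) / n,
        "disagreements": disagreements,  # inspect these by hand
    }

held_out = [
    {"input": "Charged twice", "gold": "billing", "teacher": "billing", "student": "billing"},
    {"input": "Password reset", "gold": "login", "teacher": "login", "student": "login"},
    {"input": "Parcel is late", "gold": "shipping", "teacher": "shipping", "student": "billing"},
    {"input": "Locked out", "gold": "login", "teacher": "billing", "student": "login"},
]
report = compare_to_teacher(held_out)
```

In this toy set the student and teacher have the same accuracy yet agree on only half the examples, which is why the disagreement list is worth reading even when the quality gap looks like zero.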
When Distillation Pays Off
Distillation is an optimization for scale and latency. It earns its keep under specific conditions.
Strong Candidates
- High request volume where per-call inference cost dominates.
- Latency-sensitive applications where a large model is too slow.
- On-device or edge deployment where the teacher physically cannot fit.
- A narrow, well-defined task rather than open-ended general use.
Poor Candidates
- Low-volume internal tools where the teacher's cost is already negligible.
- Tasks that genuinely require the teacher's full breadth of capability.
- Situations where the teacher itself is not reliably correct — distilling a flawed teacher just produces a cheaper version of the same flaws.
If you are weighing this against simpler options, our best practices guide explains why prompt optimization or a smaller off-the-shelf model often beats a distillation project.
Common Failure Modes
Most failed distillation projects fail for predictable reasons.
- Distribution mismatch. The training prompts do not match production traffic, so the student is sharp on the wrong inputs.
- Teacher errors baked in. The student faithfully reproduces the teacher's mistakes and hallucinations.
- Over-compression. The student is too small to hold the capability, no matter how good the signal.
- Evaluating on the wrong metric. Aggregate accuracy looks fine while a critical subpopulation degrades badly.
Our roundup of 7 common mistakes walks through each of these with the corrective practice.
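The wrong-metric failure is easy to check for with a per-slice breakdown. The sketch below assumes each evaluation record carries a slice field (here a hypothetical `language` key); the data is invented to show the pattern.

```python
from collections import defaultdict

def accuracy_by_slice(examples, slice_key):
    """Accuracy computed separately for each value of slice_key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for e in examples:
        group = e[slice_key]
        totals[group] += 1
        correct[group] += int(e["prediction"] == e["gold"])
    return {group: correct[group] / totals[group] for group in totals}

# Invented student predictions: 8 English tickets, 2 German tickets.
examples = (
    [{"language": "en", "gold": "billing", "prediction": "billing"}] * 8
    + [{"language": "de", "gold": "billing", "prediction": "login"}] * 2
)

overall = sum(e["prediction"] == e["gold"] for e in examples) / len(examples)
by_slice = accuracy_by_slice(examples, "language")
```

Here `overall` looks acceptable while the `de` slice fails completely, which is the situation the aggregate number hides.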
Measuring Success
Define success before you start. The right comparison is rarely "is the student as good as the teacher," because it almost never fully is. The right question is "does the student clear the quality bar this task requires, at a cost and latency that justify the project?" Track three numbers: the quality gap to the teacher on a held-out set, the per-request cost reduction, and the latency improvement. A student at 96 percent of teacher quality, 8 times cheaper, is usually a win. A student at 99 percent that is only 1.5 times cheaper usually is not worth the engineering.
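The three numbers can be computed with a few lines of arithmetic. Every figure below (quality scores, per-request costs, latencies, request volume) is an invented example chosen to reproduce the 96 percent, 8 times cheaper case, not a measurement.

```python
def distillation_report(teacher_quality, student_quality,
                        teacher_cost, student_cost,
                        teacher_latency_ms, student_latency_ms):
    """Ratios to track when comparing a student against its teacher."""
    return {
        "quality_retained": student_quality / teacher_quality,
        "cost_reduction": teacher_cost / student_cost,
        "latency_speedup": teacher_latency_ms / student_latency_ms,
    }

def daily_savings(requests_per_day, teacher_cost, student_cost):
    """Serving cost saved per day by switching to the student."""
    return requests_per_day * (teacher_cost - student_cost)

# Illustrative inputs: costs in dollars per request, latency in milliseconds.
summary = distillation_report(
    teacher_quality=0.90, student_quality=0.864,
    teacher_cost=0.02, student_cost=0.0025,
    teacher_latency_ms=1200, student_latency_ms=200,
)
saved = daily_savings(1_000_000, teacher_cost=0.02, student_cost=0.0025)
```

With these assumed inputs the student retains about 96 percent of teacher quality at roughly one eighth of the cost, and the saving at a million requests a day is about 17,500 dollars, which then has to be weighed against the one-time engineering and teacher-generation cost.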
Frequently Asked Questions
Is model distillation the same as fine-tuning?
No, but they overlap. Fine-tuning adapts a model using labeled data. Distillation is a specific kind of training where the labels come from a teacher model's outputs rather than human annotation. Sequence-level distillation is essentially fine-tuning on teacher-generated data.
Does the student always lose accuracy?
Almost always, slightly. The goal is to lose a small, acceptable amount of quality in exchange for a large gain in cost and speed. A well-run project keeps the gap small enough that it does not matter for the task at hand.
Can I distill a model I only access through an API?
Yes, using sequence-level distillation. You generate teacher outputs through the API, then fine-tune your student on those outputs. You lose access to the teacher's full probability distribution, but the approach works well in practice and is the most common method for large language models.
How much data do I need?
It depends on task complexity, but distillation is generally more data-efficient than training from scratch because the teacher provides a richer signal per example. Narrow tasks can work with a few thousand well-chosen examples; broad tasks need far more.
Is distillation legal with commercial models?
Check the provider's terms of service. Some prohibit using their model's outputs to train competing models. This is a real constraint, not a technicality, and it has been the subject of public disputes.
Key Takeaways
- Distillation trains a small student model to mimic a large teacher, transferring capability at a fraction of the serving cost.
- The student learns from the teacher's outputs, ideally the full probability distribution rather than just hard labels, which is why it often beats training the same small model from scratch.
- It pays off for high-volume, latency-sensitive, narrow tasks — and rarely for low-volume general ones.
- Most failures come from distribution mismatch, baked-in teacher errors, or over-compression.
- Define your quality bar and cost target before you start, and evaluate the student against the teacher on held-out data.