Model distillation is one of those terms that sounds more exotic than it is, which is exactly why so much nonsense has accumulated around it. People hear "distillation" and imagine some alchemical process that magically shrinks a giant model into a tiny one with no cost. Others hear it and assume it is just compression by another name. Both groups are wrong, and the gap between the myth and the mechanics matters when you are deciding whether distillation belongs in your stack.
The accurate version is unglamorous. Distillation is a training technique where a smaller "student" model learns to imitate the behavior of a larger "teacher" model, usually by matching the teacher's output distributions rather than just the hard labels. That is it. Everything interesting and everything misleading flows from that one idea. This article walks through the myths that cause the most expensive mistakes and replaces each with what is actually true.
If you want the foundational walkthrough first, start with The Complete Guide to What Is Model Distillation and then come back here to inoculate yourself against the bad takes.
Myth: Distillation Is Just Compression
The most common confusion is treating distillation as interchangeable with quantization or pruning. They are not the same category of thing.
Quantization changes how weights are stored, dropping precision from 16-bit to 8-bit or 4-bit. Pruning removes weights or whole structures from an existing network. Both operate on a fixed model. Distillation, by contrast, trains a new model from scratch or from a checkpoint, using the teacher as a supervisor.
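To make "both operate on a fixed model" concrete, here is a minimal sketch in plain Python. The weight values and the helper names (`quantize_int8`, `prune_by_magnitude`) are made up for illustration, and real toolkits use more refined schemes. The point is only that both functions take existing weights and return altered weights, with no training loop and no teacher involved.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: store small integers plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in quantized]


def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Illustrative weights, not taken from any real model.
weights = [0.42, -1.27, 0.03, 0.88, -0.05, 0.61]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)            # close to the originals, not identical
pruned = prune_by_magnitude(weights, 0.5)  # the three smallest weights become 0.0

print(q, scale)
print(pruned)
```

Neither function can change the number of layers or the shape of the network; they only rewrite values that already exist.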
Why the distinction matters
The practical consequence is that distillation can change a model's architecture entirely. You can distill a 70-billion-parameter dense model into a 7-billion-parameter model with a different layer count, a different attention scheme, even a different tokenizer when the student trains on the teacher's text rather than its logits (matching logits token by token requires a shared vocabulary). Quantization can never do that. When someone promises you a "distilled and quantized" model, those are two separate operations stacked on top of each other, and you should evaluate each on its own.
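A quick back-of-the-envelope calculation shows why the two operations are worth evaluating separately. This sketch counts weight storage only (no activations, no KV cache) and uses 1 GB = 10^9 bytes; the function name `weight_memory_gb` is just a label for this example.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


teacher_16bit = weight_memory_gb(70_000_000_000, 16)  # 140.0 GB
student_16bit = weight_memory_gb(7_000_000_000, 16)   # 14.0 GB, from distillation
student_4bit = weight_memory_gb(7_000_000_000, 4)     # 3.5 GB, from quantizing the student

print(teacher_16bit, student_16bit, student_4bit)
```

The drop from 140 GB to 14 GB comes from distillation; the further drop to 3.5 GB comes from quantization. Each step has its own quality cost, which is why each deserves its own measurement.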
Myth: The Student Always Matches the Teacher
Marketing copy loves to imply that a distilled model "retains 95 percent of the teacher's performance." Sometimes it does. Often it does not, and the number depends heavily on the task distribution.
Distillation transfers what the teacher demonstrates on the training data you feed it. If your distillation corpus is narrow, the student becomes excellent at that narrow slice and falls off a cliff outside it. The student is not a shrunk copy of the teacher's full capability; it is a copy of the teacher's behavior on the examples you showed it.
The failure mode nobody warns you about
A team distills a strong general model into a small one using mostly customer-support transcripts. The student aces support. Then someone routes a coding question through it and the output is garbage, because no code ever appeared in the distillation set. The teacher could have answered. The student never learned that part. This is why coverage of your distillation data is the single highest-leverage variable, a point we hammer in 7 Common Mistakes with What Is Model Distillation.
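One cheap way to catch this before training is to compare the mix of topics in the distillation set against the mix in real traffic. The sketch below assumes you already have a category label for each example (how you get those labels is up to you), and the `min_share` threshold is an arbitrary placeholder rather than a recommended value.

```python
from collections import Counter


def coverage_gaps(distill_labels, traffic_labels, min_share=0.01):
    """Return traffic categories that are missing or rare in the distillation set.

    The result maps each under-covered category to its share of the
    distillation set (0.0 means it never appears there).
    """
    distill_counts = Counter(distill_labels)
    total = sum(distill_counts.values())
    gaps = {}
    for category in Counter(traffic_labels):
        share = distill_counts[category] / total if total else 0.0
        if share < min_share:
            gaps[category] = share
    return gaps


# Hypothetical labels mirroring the support-transcript scenario above.
distill_labels = ["support"] * 980 + ["billing"] * 20
traffic_labels = ["support"] * 70 + ["billing"] * 20 + ["coding"] * 10

gaps = coverage_gaps(distill_labels, traffic_labels)
print(gaps)  # {'coding': 0.0}
```

Anything that shows up in the result is a category the student will be asked about in production but never saw the teacher handle.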
Myth: You Need the Teacher's Internal Logits
Classic distillation matches "soft labels," the teacher's full probability distribution over tokens, because those carry more information than a single chosen answer. This led to a persistent belief that you cannot distill from a closed API model that only returns text.
That belief is outdated. Sequence-level or "black-box" distillation works fine: you generate completions from the teacher, then fine-tune the student on those completions as if they were ground truth. You lose the soft-label signal but gain access to any model you can query. Much of the distillation happening in practice today is this black-box variety, precisely because many of the strongest teachers are behind APIs.
- White-box distillation: needs the teacher's logits (and its weights or hidden states if you match internal features), gives a richer signal, only works on open teachers.
- Black-box distillation: needs only generated text, works on any teacher you can query, carries less signal per example.
- Hybrid approaches: add reasoning traces or rationales to the student's training targets.
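The difference between the white-box and black-box signals is easiest to see on a single next-token prediction. The sketch below is a plain-Python illustration with made-up logits, not a training recipe: the white-box loss is a KL divergence over the teacher's full distribution, softened by a temperature, while the black-box loss is ordinary cross-entropy against the one token the teacher actually emitted. Implementations of the soft-label loss often multiply it by the temperature squared; that scaling is left out here for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """White-box signal: KL(teacher || student) over the whole vocabulary."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hard_target_loss(teacher_token_id, student_logits):
    """Black-box signal: cross-entropy against the single token the teacher chose."""
    q = softmax(student_logits)
    return -math.log(q[teacher_token_id])


# Made-up logits over a three-token vocabulary.
teacher_logits = [4.0, 3.5, 0.1]  # tokens 0 and 1 are both plausible
student_logits = [2.0, 2.0, 2.0]  # the student has no preference yet

print(soft_label_loss(teacher_logits, student_logits))
print(hard_target_loss(0, student_logits))
```

The soft-label loss tells the student that token 1 was nearly as good as token 0. The hard-target loss only says "token 0", which is the information you give up when the teacher returns text alone.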
Myth: Distillation Removes the Need for Evaluation
Because the teacher is "smart," people assume the student inherits the teacher's reliability for free. The opposite is true: distillation introduces new ways to fail silently, so you need more evaluation, not less.
The student can confidently reproduce the teacher's mistakes, amplify subtle biases in the generated data, or degrade on edge cases that never showed up in sampling. None of this surfaces unless you measure it against a held-out set the student never trained on. Treating the teacher's reputation as a substitute for your own test suite is how distilled models embarrass teams in production.
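A held-out check does not need to be elaborate to be useful. The sketch below assumes you have gold answers for a set of examples the student never trained on, plus the teacher's and student's answers for the same examples; the labels are toy values. Alongside accuracy, it counts the cases where the student repeats a teacher answer that is wrong, which is exactly the silent failure described above.

```python
def evaluate_student(student_preds, teacher_preds, gold):
    """Compare student and teacher against gold labels on a held-out set."""
    n = len(gold)
    rows = list(zip(student_preds, teacher_preds, gold))
    return {
        "student_accuracy": sum(s == g for s, _, g in rows) / n,
        "teacher_accuracy": sum(t == g for _, t, g in rows) / n,
        # How often the student says the same thing as the teacher.
        "agreement": sum(s == t for s, t, _ in rows) / n,
        # Teacher was wrong and the student copied the same wrong answer.
        "inherited_errors": sum(s == t and t != g for s, t, g in rows),
    }


# Toy held-out set with five examples.
gold = ["a", "b", "c", "d", "e"]
teacher_preds = ["a", "b", "c", "x", "e"]  # teacher misses the fourth example
student_preds = ["a", "b", "z", "x", "e"]  # student misses the third and copies the fourth

report = evaluate_student(student_preds, teacher_preds, gold)
print(report)
```

High agreement with the teacher is not the same as high accuracy: here the student agrees with the teacher on four of five examples but is correct on only three.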
Myth: It Is Only for Shrinking Models
Size reduction is the famous use case, but it is not the only one. Distillation is fundamentally a knowledge-transfer technique, and that opens several less obvious plays.
Specialization
You can distill a generalist teacher into a student of the same size, or even a larger one, that is tuned for a single domain by curating the distillation data around that domain. The goal here is not speed but focus.
Stability and cost control
Distilling a frequently changing API model into a model you own gives you a fixed, version-controlled artifact. You stop paying per token and stop getting surprised by upstream behavior changes. That governance benefit often outweighs the raw latency win.
Reasoning transfer
Newer work distills not just answers but step-by-step reasoning traces, so a small student learns to "think" in a structured way it would be unlikely to discover on its own. If this interests you, the trajectory is covered in The Future of What Is Model Distillation.
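In data terms, reasoning transfer mostly comes down to what you put in the student's training target. The sketch below shows one possible layout; the "Reasoning:" and "Answer:" markers and the `build_example` helper are illustrative choices for this article, not a standard format.

```python
def build_example(prompt, reasoning_steps, answer, include_reasoning=True):
    """Build one training record, optionally with the teacher's reasoning trace."""
    if include_reasoning:
        steps = "\n".join(f"{i}. {step}" for i, step in enumerate(reasoning_steps, 1))
        target = f"Reasoning:\n{steps}\nAnswer: {answer}"
    else:
        target = f"Answer: {answer}"
    return {"prompt": prompt, "target": target}


# A made-up teacher output, split into steps and a final answer.
prompt = "A plan costs 12 per month. What does a year cost?"
steps = ["A year has 12 months.", "12 months times 12 per month is 144."]

with_trace = build_example(prompt, steps, "144")
answer_only = build_example(prompt, steps, "144", include_reasoning=False)

print(with_trace["target"])
print(answer_only["target"])
```

Training on the first form asks the student to reproduce the intermediate steps as well as the result; training on the second asks only for the result.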
Myth: There Is One Right Way to Do It
People want a single recipe. Distillation is a family of methods with real trade-offs, and choosing wrong wastes weeks.
- Response-based: match final outputs. Simplest, most robust, the default starting point.
- Feature-based: match intermediate hidden states. More powerful, more fragile, requires architectural compatibility.
- Relation-based: match relationships between examples. Niche but useful for retrieval and embedding tasks.
The right choice depends on whether you have white-box access, how aligned the architectures are, and how much engineering time you can spend. For a structured way to make that call, see A Framework for What Is Model Distillation.
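The three families differ mainly in what gets compared between teacher and student. The sketch below reduces each to a toy loss in plain Python with made-up vectors. The function names and the learned `projection` matrix are assumptions for illustration, and real implementations work on batched tensors inside a training framework.

```python
import math


def response_loss(teacher_probs, student_probs):
    """Response-based: cross-entropy of student outputs against teacher outputs."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0)


def feature_loss(teacher_hidden, student_hidden, projection):
    """Feature-based: MSE between a teacher hidden state and a projected student one.

    `projection` has one row per teacher dimension and one column per student
    dimension, which is how a narrower student can be compared to a wider teacher.
    """
    projected = [sum(w * h for w, h in zip(row, student_hidden)) for row in projection]
    return sum((t - p) ** 2 for t, p in zip(teacher_hidden, projected)) / len(teacher_hidden)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def relation_loss(teacher_embs, student_embs):
    """Relation-based: MSE between the two pairwise cosine-similarity matrices."""
    n = len(teacher_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = cosine(teacher_embs[i], teacher_embs[j]) - cosine(student_embs[i], student_embs[j])
            total += diff ** 2
    return total / (n * n)


# Teacher hidden size 3, student hidden size 2, bridged by a 3x2 projection.
print(feature_loss([1.0, 2.0, 3.0], [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))  # 0.0

# Different embedding sizes, same geometry between examples.
print(relation_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]))  # 0.0
```

Note what each function needs as input: the response loss needs only outputs, the feature loss needs hidden states plus a projection whenever the dimensions differ, and the relation loss needs embeddings for several examples at once but does not care whether teacher and student use the same embedding size.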
Frequently Asked Questions
Is model distillation the same as fine-tuning?
They overlap but are not identical. Fine-tuning adapts a model to new data using ground-truth labels. Distillation specifically uses a teacher model's outputs as the supervision signal. In black-box distillation the line blurs, because you are essentially fine-tuning a student on teacher-generated data, but the intent and data source distinguish them.
Can I distill a closed-source model like a frontier API model?
Yes, through black-box distillation, as long as the provider's terms of service permit it. You query the teacher, collect its outputs, and train your student on those. Always check the license and usage terms first, because many providers explicitly restrict using their outputs to train competing models.
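The collection step can be sketched in a few lines. In the example below, `query_teacher` stands in for whatever client call your provider offers; it is a placeholder, not a real API, and `fake_teacher` exists only so the sketch runs on its own. The prompt/completion JSONL layout is one common convention, so check what your training tooling expects.

```python
import json


def collect_distillation_set(prompts, query_teacher):
    """Query the teacher once per unique prompt and keep non-empty completions."""
    records = []
    seen = set()
    for prompt in prompts:
        if prompt in seen:
            continue  # do not pay for the same prompt twice
        seen.add(prompt)
        completion = query_teacher(prompt)
        if not completion or not completion.strip():
            continue  # drop empty responses
        records.append({"prompt": prompt, "completion": completion.strip()})
    return records


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_teacher(prompt):
    """Stand-in for a real API call (hypothetical behavior)."""
    if "order" in prompt:
        return ""  # simulate an empty response
    return f"Here is how to handle: {prompt}"


prompts = ["Reset my password", "Reset my password", "Cancel my order", "Update my address"]
records = collect_distillation_set(prompts, fake_teacher)
print(to_jsonl(records))
```

The resulting file is what you would fine-tune the student on, after confirming that the provider's terms allow that use.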
Will a distilled model be cheaper to run?
Usually, if the student is smaller. A smaller model means less memory, faster inference, and lower serving cost. But if you distilled into a same-size model for specialization, your serving cost stays the same. Cheaper inference is a common outcome, not a guaranteed property of distillation itself.
How much data do I need to distill a model?
It depends on how broad the target capability is. As a rough guide, narrow tasks can work with a few thousand high-quality teacher examples, while general-purpose students often need hundreds of thousands to millions. The quality and coverage of the data matter far more than raw volume.
Does distillation lose the teacher's knowledge?
Some loss is normal. The student learns only what the teacher demonstrated on your data, so anything outside that distribution is not transferred. Careful data curation can close much of the gap on the tasks you care about, but expecting a perfect copy of a much larger model is unrealistic.
Key Takeaways
- Distillation is a training technique that produces a new model, not a modification of an existing model's weights like quantization or pruning.
- The student copies the teacher's behavior on your data, not the teacher's full capability.
- Black-box distillation lets you learn from closed API teachers using only their text outputs.
- Distilled models need more evaluation, not less, because they can fail silently.
- Beyond shrinking models, distillation enables specialization, cost stability, and reasoning transfer.
- There is no single correct method; response-based, feature-based, and relation-based approaches each have trade-offs.