Distillation looks simple from a distance: generate teacher outputs, fine-tune a small model, ship. In practice, the gap between a student that holds up and one that quietly degrades comes down to a short list of avoidable errors, and the same ones recur from project to project. This article names seven of them and, for each, explains why it happens, what it costs you, and how to correct it.
These are not exotic edge cases. They are the bread-and-butter failures that turn a promising project into a model that tests fine and breaks in production. If you are running your first project, read these alongside the step-by-step how-to so you can design them out from the start.
Mistake 1: Training on the Wrong Distribution
This is the most common and most damaging mistake. You build your prompt set from whatever data was convenient — a public dataset, synthetic examples, last year's logs — and it does not match what production actually sees.
Why it happens. Real production data is messy and sometimes restricted, so teams reach for easier substitutes.
What it costs. The student becomes sharp on inputs that never occur and dull on the ones that do. It tests well and fails live.
The fix. Build your prompt set from real production traffic, matched to the actual distribution. If 40 percent of live requests are one category, your training set should reflect that. Synthetic data is a supplement, never the foundation.
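One way to check the match is to compare category shares between a sample of production traffic and the training prompts. The sketch below is a minimal illustration, assuming every prompt already carries a category label. The function names and the 5-point tolerance are hypothetical choices made for this example, not part of any library.

```python
from collections import Counter

def category_shares(categories):
    """Return each category's share of the total, e.g. {"billing": 0.4}."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}

def distribution_gaps(production, training, tolerance=0.05):
    """List categories whose training share differs from production
    by more than `tolerance` (an assumed threshold, tune it per project)."""
    prod = category_shares(production)
    train = category_shares(training)
    gaps = {}
    for cat in set(prod) | set(train):
        diff = train.get(cat, 0.0) - prod.get(cat, 0.0)
        if abs(diff) > tolerance:
            gaps[cat] = round(diff, 3)
    return gaps

# Hypothetical labels: production is 40% billing, training only 10%.
production = ["billing"] * 40 + ["search"] * 60
training = ["billing"] * 10 + ["search"] * 90
gaps = distribution_gaps(production, training)
```

A negative gap means the category is under-represented in the training set relative to production; a positive gap means it is over-represented.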
Mistake 2: Distilling the Teacher's Errors
The teacher is not perfect. If you copy its outputs blindly, the student faithfully reproduces every hallucination and mistake the teacher made — now at scale and at speed.
Why it happens. Filtering teacher outputs is tedious, so teams skip it and trust the teacher more than they should.
What it costs. Your student inherits the teacher's flaws in a cheaper package. The student can only be as reliable as the teacher outputs you kept, so every wrong answer left in the training set lowers that ceiling.
The fix. Filter aggressively. For tasks with a checkable answer, verify teacher outputs and drop the wrong ones. For open-ended tasks, sample and review. A smaller clean dataset usually beats a larger noisy one.
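For checkable tasks, the filtering step can be as small as the following sketch. It assumes each example is a dict with hypothetical `teacher_output` and `reference` fields, and that a normalized exact match is an adequate check. Most real tasks will need a task-specific verifier in place of `exact_match`.

```python
def exact_match(example):
    # Assumed checker: ignore surrounding whitespace and case, then compare.
    got = example["teacher_output"].strip().lower()
    want = example["reference"].strip().lower()
    return got == want

def filter_teacher_outputs(examples, is_correct=exact_match):
    """Split examples into those that pass the check and those that fail it."""
    kept, dropped = [], []
    for example in examples:
        (kept if is_correct(example) else dropped).append(example)
    return kept, dropped

# Hypothetical teacher outputs with known reference answers.
examples = [
    {"prompt": "2 + 2", "teacher_output": "4", "reference": "4"},
    {"prompt": "3 * 3", "teacher_output": "6", "reference": "9"},
    {"prompt": "capital of France", "teacher_output": " Paris ", "reference": "paris"},
]
kept, dropped = filter_teacher_outputs(examples)
```

Keeping the dropped examples around, rather than discarding them, gives you a direct measure of the teacher's error rate on your prompts.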
Mistake 3: Making the Student Too Small
Chasing maximum cost savings, teams pick a student so small that it simply lacks the capacity to hold the capability. No amount of clean data or clever loss design fixes insufficient capacity.
Why it happens. The cost savings scale with smallness, so there is constant pressure to shrink.
What it costs. You burn a full project cycle to learn the student was never big enough, then have to redo it.
The fix. Start with the smallest base you genuinely believe could work, not the smallest that exists. Train it, and only shrink further if it clears the bar with room to spare. Sizing is empirical — test two candidates and compare.
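The selection rule described above can be written down explicitly. This is a sketch under stated assumptions: each candidate has already been trained and scored on the same evaluation set, and the `params`, `score`, `bar`, and `margin` names are hypothetical.

```python
def pick_student(candidates, bar, margin=0.02):
    """Return the smallest candidate whose score clears the bar with room
    to spare, or None if no candidate qualifies."""
    passing = [c for c in candidates if c["score"] >= bar + margin]
    if not passing:
        return None
    return min(passing, key=lambda c: c["params"])

# Hypothetical results for two trained candidates.
candidates = [
    {"name": "small", "params": 1_000_000_000, "score": 0.90},
    {"name": "medium", "params": 3_000_000_000, "score": 0.95},
]
choice = pick_student(candidates, bar=0.92)
```

A `None` result is the signal to try a larger base rather than to keep tuning the small one.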
Mistake 4: Judging Only the Aggregate Score
The student hits 96 percent overall, everyone celebrates, and nobody notices it is failing completely on one critical slice while acing the easy majority.
Why it happens. A single headline number is easy to report and easy to trust.
What it costs. A business-critical category degrades silently. You discover it through customer complaints, not your metrics.
The fix. Evaluate by slice. Break results down by category, customer tier, input type — whatever segments matter for your application — and set a bar for each critical slice independently. The best practices guide covers slice-based evaluation in depth.
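A minimal sketch of slice-based evaluation follows, assuming each evaluation record carries a hypothetical `slice` label and a boolean `correct` flag. The example data is invented to mirror the 96 percent scenario above.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Compute accuracy separately for each slice label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record["slice"]] += 1
        hits[record["slice"]] += int(record["correct"])
    return {name: hits[name] / totals[name] for name in totals}

def failing_slices(records, bars):
    """Return slices that miss their own bar. A slice with no records
    is reported with accuracy None, since it was never measured."""
    accuracy = accuracy_by_slice(records)
    return {
        name: accuracy.get(name)
        for name, bar in bars.items()
        if accuracy.get(name) is None or accuracy[name] < bar
    }

# Invented data: 96 easy requests all correct, 4 refund requests all wrong.
records = (
    [{"slice": "easy", "correct": True}] * 96
    + [{"slice": "refunds", "correct": False}] * 4
)
overall = sum(r["correct"] for r in records) / len(records)
failures = failing_slices(records, {"easy": 0.9, "refunds": 0.9})
```

Here the overall score is 0.96 while the `refunds` slice scores 0.0, which is exactly the failure a headline number hides.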
Mistake 5: Expecting General Intelligence From a Narrow Student
Someone distills a student for one task, then is shocked when it cannot do a related task it was never trained on. Distillation transfers a specific capability, not the teacher's whole mind.
Why it happens. The student feels like a "mini teacher," so people assume it inherited general ability.
What it costs. Scope creep and disappointment. The student gets pushed onto tasks it was never built for and fails them.
The fix. Treat the student as a specialist. Define its job precisely and keep it there. If you need a second capability, that is a second distillation, or a second slice in your training data. The beginner's guide explains why narrowness is the point.
Mistake 6: Skipping the Shadow Deployment
The student passes offline evaluation, so the team flips production to it overnight. Then real traffic surfaces problems the offline set never contained.
Why it happens. Offline evaluation looks good and shadow deployment takes extra engineering.
What it costs. A live regression that hits real users before you catch it.
The fix. Run the student in shadow mode first — serve teacher results while logging student results on live traffic — then roll out gradually with a fallback to the teacher for low-confidence inputs. Offline evaluation predicts production; it does not replace watching production.
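The two stages can be sketched as two small routing functions. This is an illustration only: `teacher` and `student` are hypothetical callables, the student is assumed to return an `(answer, confidence)` pair, and the 0.8 threshold is an arbitrary placeholder.

```python
def shadow_serve(prompt, teacher, student, log):
    """Shadow mode: the user always receives the teacher's answer,
    while the student's answer is only logged for later comparison."""
    teacher_answer = teacher(prompt)
    student_answer, confidence = student(prompt)
    log.append({
        "prompt": prompt,
        "teacher": teacher_answer,
        "student": student_answer,
        "confidence": confidence,
        "agree": student_answer == teacher_answer,
    })
    return teacher_answer

def serve_with_fallback(prompt, teacher, student, min_confidence=0.8):
    """Rollout stage: serve the student unless its confidence is low,
    in which case fall back to the teacher."""
    student_answer, confidence = student(prompt)
    if confidence < min_confidence:
        return teacher(prompt), "teacher"
    return student_answer, "student"

# Toy stand-ins for real models.
def toy_teacher(prompt):
    return prompt.upper()

def toy_student(prompt):
    if len(prompt) < 5:
        return prompt.upper(), 0.9
    return prompt.lower(), 0.3

shadow_log = []
shadow_serve("hi", toy_teacher, toy_student, shadow_log)
shadow_serve("longer", toy_teacher, toy_student, shadow_log)
```

The disagreements recorded in `shadow_log` are the first place to look before moving any traffic to the student.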
Mistake 7: Treating Distillation as One and Done
A team ships a great student, declares victory, and never touches it again. Six months later, production traffic has drifted and the teacher has improved, and the stale student is now well below the teacher it was meant to track.
Why it happens. Distillation is framed as a project with an end, not a pipeline that needs maintenance.
What it costs. Slow, invisible quality decay that nobody is watching for.
The fix. Monitor for distribution drift and schedule periodic re-distillation. When live traffic moves away from your training set, or the teacher gets a meaningful upgrade, refresh the student. Build the pipeline so re-running it is cheap.
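One simple drift signal is the total variation distance between the category mix of the training set and that of recent live traffic. The sketch below uses that measure as an example; it is one option among several, and the function names and the 0.1 threshold are assumptions for illustration.

```python
from collections import Counter

def total_variation(reference, live):
    """Total variation distance between two category distributions:
    0.0 means identical mixes, 1.0 means no overlap at all."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    ref, cur = Counter(reference), Counter(live)
    n_ref, n_cur = len(reference), len(live)
    categories = set(ref) | set(cur)
    return 0.5 * sum(abs(ref[c] / n_ref - cur[c] / n_cur) for c in categories)

def needs_redistillation(reference, live, threshold=0.1, teacher_upgraded=False):
    """Trigger a refresh on meaningful drift or on a teacher upgrade."""
    return teacher_upgraded or total_variation(reference, live) > threshold

# Hypothetical category labels for training data and recent traffic.
training_mix = ["a"] * 50 + ["b"] * 50
live_mix = ["a"] * 80 + ["b"] * 20
drift = total_variation(training_mix, live_mix)
```

Category mix is a coarse proxy. It will not catch drift inside a category, so it complements rather than replaces monitoring the student's live quality.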
A Pattern Behind the Mistakes
Step back and most of these errors share a root cause: trusting a number that looks good instead of interrogating what it hides. The wrong distribution looks fine until production. Baked-in teacher errors look fine because the teacher is "the smart model." The aggregate score looks fine while a critical slice burns. The stale student looks fine because nobody is watching it. In every case, a comforting headline metric masked a problem one level down.
The corrective mindset is to distrust good news until you have looked underneath it. When your student tests well, your first instinct should be to find where it fails — check the slices, inspect the disagreements with the teacher, sample the teacher outputs you trained on. The teams that avoid these seven mistakes are not smarter; they are more suspicious of their own metrics, and they build that suspicion into their process rather than relying on remembering it in the moment.
Frequently Asked Questions
Which of these mistakes is the most expensive?
Training on the wrong distribution, by a wide margin. It invalidates everything downstream — your student is optimized for inputs that do not occur — and it is invisible until production. Get the data distribution right before worrying about anything else.
How do I know if I am distilling teacher errors?
Sample teacher outputs and check them against ground truth or expert judgment. If the teacher's error rate on your prompts is meaningful, you are passing those errors to the student unless you filter. For checkable tasks, automate the verification.
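Because the check runs on a sample, the measured error rate comes with uncertainty. A minimal sketch, assuming a simple random sample of reviewed outputs, is to report a Wilson score interval alongside the point estimate. The function name is hypothetical, and `z=1.96` corresponds to a roughly 95 percent interval.

```python
import math

def error_rate_interval(errors, sample_size, z=1.96):
    """Wilson score interval for an error rate estimated from a reviewed sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    denom = 1 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    # Clamp to the valid probability range to absorb rounding error.
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical review: 10 wrong teacher outputs found in a sample of 200.
low, high = error_rate_interval(10, 200)
```

If even the lower bound is above the error rate you can tolerate in the student, filtering is not optional.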
Can a great loss function fix bad data?
No. This is the hardest lesson. Loss design and hyperparameters are second-order. If your data has the wrong distribution or noisy teacher outputs, no training trick recovers it. Fix the data first.
Is shadow deployment really necessary for small projects?
For anything user-facing, yes. It is cheap insurance against regressions your offline set missed. For a purely internal, low-stakes tool, you can be more relaxed, but even then a gradual rollout costs little.
How often should I re-distill?
It depends on how fast your traffic drifts. Monitor the student's live quality and re-distill when it slips below your bar or when the teacher gets a significant upgrade. Some tasks need it quarterly; stable ones can go much longer.
Key Takeaways
- The deadliest mistake is training on a distribution that does not match production — fix data first.
- Filter teacher outputs to avoid baking the teacher's errors into the student.
- Size the student to the task; a student that is too small cannot be rescued by good data.
- Evaluate by slice, not aggregate, and ship behind a shadow deployment and fallback.
- Treat distillation as a maintained pipeline, not a one-time project, and re-distill as traffic drifts.