Seven Ways a Distilled Model Quietly Falls Apart

Distillation looks simple from a distance — generate teacher outputs, fine-tune a small model, ship. In practice, the gap between a student that holds up and one that quietly degrades comes down to a short list of avoidable errors. After enough projects, the same mistakes recur. This article names seven of them, explains why each happens, what it costs you, and the corrective practice.

These are not exotic edge cases. They are the bread-and-butter failures that turn a promising project into a model that tests fine and breaks in production. If you are running your first project, read these alongside the step-by-step how-to so you can design them out from the start.

Mistake 1: Training on the Wrong Distribution

This is the most common and most damaging mistake. You build your prompt set from whatever data was convenient — a public dataset, synthetic examples, last year's logs — and it does not match what production actually sees.

Why it happens. Real production data is messy and sometimes restricted, so teams reach for easier substitutes.

What it costs. The student becomes sharp on inputs that never occur and dull on the ones that do. It tests well and fails live.

The fix. Build your prompt set from real production traffic, matched to the actual distribution. If 40 percent of live requests are one category, your training set should reflect that. Synthetic data is a supplement, never the foundation.
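As a rough illustration of matching the training mix to live traffic, the sketch below resamples a labeled prompt pool so each category appears in proportion to its production share. It assumes every prompt already carries a category label; `build_prompt_set`, the `(category, prompt)` pool format, and the share values are hypothetical and not taken from any specific tool.

```python
import random
from collections import Counter

def build_prompt_set(pool, production_share, size, seed=0):
    """Sample about `size` prompts so category counts follow `production_share`.

    pool: list of (category, prompt) pairs (hypothetical format).
    production_share: dict mapping category -> fraction of live traffic.
    Per-category targets are rounded, so the total can differ slightly from `size`.
    """
    rng = random.Random(seed)
    by_category = {}
    for category, prompt in pool:
        by_category.setdefault(category, []).append(prompt)

    sampled = []
    for category, share in production_share.items():
        target = round(size * share)
        candidates = by_category.get(category, [])
        if len(candidates) < target:
            # Fail loudly instead of silently under-representing a category.
            raise ValueError(f"need {target} prompts for {category!r}, have {len(candidates)}")
        sampled.extend((category, p) for p in rng.sample(candidates, target))
    rng.shuffle(sampled)
    return sampled

# Made-up pool and traffic shares, for illustration only.
pool = [(c, f"{c} prompt {i}") for c in ("billing", "returns", "other") for i in range(100)]
share = {"billing": 0.4, "returns": 0.3, "other": 0.3}
prompt_set = build_prompt_set(pool, share, size=20)
counts = Counter(category for category, _ in prompt_set)
```

Raising an error when a category is short is a deliberate choice here: a gap in real data is something to fix at the source, not to paper over with whatever is available.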

Mistake 2: Distilling the Teacher's Errors

The teacher is not perfect. If you copy its outputs blindly, the student faithfully reproduces every hallucination and mistake the teacher made — now at scale and at speed.

Why it happens. Filtering teacher outputs is tedious, so teams skip it and trust the teacher more than they should.

What it costs. Your student inherits the teacher's flaws in a cheaper package. It can be no more reliable than the teacher outputs you kept, so every wrong answer left in the training set lowers its ceiling.

The fix. Filter aggressively. For tasks with a checkable answer, verify teacher outputs and drop the wrong ones. For open-ended tasks, sample and review. A smaller clean dataset beats a larger noisy one every time.
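For a task with a checkable answer, the filtering step can be as small as the sketch below. It assumes a hypothetical example schema with `prompt` and `teacher_output` keys and a task-specific checker you supply yourself; the arithmetic task is a toy stand-in for whatever verification your domain allows.

```python
def filter_teacher_outputs(examples, is_correct):
    """Split examples into those that pass the check and those that fail it.

    examples: list of dicts with "prompt" and "teacher_output" keys (hypothetical schema).
    is_correct: callable(example) -> bool, written for your task.
    """
    kept, dropped = [], []
    for example in examples:
        (kept if is_correct(example) else dropped).append(example)
    return kept, dropped

def arithmetic_check(example):
    """Toy verifier: recompute the answer to 'a op b' and compare."""
    a, op, b = example["prompt"].split()
    expected = {"+": int(a) + int(b), "-": int(a) - int(b), "*": int(a) * int(b)}[op]
    return example["teacher_output"].strip() == str(expected)

examples = [
    {"prompt": "2 + 3", "teacher_output": "5"},
    {"prompt": "7 * 6", "teacher_output": "42"},
    {"prompt": "10 - 4", "teacher_output": "7"},  # a teacher error that should be dropped
]

kept, dropped = filter_teacher_outputs(examples, arithmetic_check)
teacher_error_rate = len(dropped) / len(examples)
```

Keeping the dropped examples, not just discarding them, gives you the teacher's error rate on your own prompts for free.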

Mistake 3: Making the Student Too Small

Chasing maximum cost savings, teams pick a student so small it simply lacks the capacity to hold the capability. No amount of clean data or clever loss design fixes insufficient capacity.

Why it happens. The cost savings scale with smallness, so there is constant pressure to shrink.

What it costs. You burn a full project cycle to learn the student was never big enough, then have to redo it.

The fix. Start with the smallest base you genuinely believe could work, not the smallest that exists. Train it, and only shrink further if it clears the bar with room to spare. Sizing is empirical — test two candidates and compare.
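The "clears the bar with room to spare" rule can be written down as a simple selection step. The sketch below is one possible reading, not a prescribed procedure: the candidate names, parameter counts, scores, bar, and margin are all invented for illustration.

```python
def pick_student(candidates, bar, margin):
    """Return the smallest candidate whose score clears `bar` by at least `margin`.

    candidates: list of (name, params_in_millions, eval_score) tuples.
    Returns None when no candidate clears the bar with room to spare.
    """
    passing = [c for c in candidates if c[2] >= bar + margin]
    return min(passing, key=lambda c: c[1]) if passing else None

# Hypothetical results from training each candidate on the same data.
candidates = [
    ("student-small", 125, 0.88),
    ("student-medium", 350, 0.93),
    ("student-large", 1300, 0.95),
]

choice = pick_student(candidates, bar=0.90, margin=0.02)
```

With these made-up numbers the medium candidate is selected: the small one misses the bar, and the large one clears it but costs more than needed.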

Mistake 4: Judging Only the Aggregate Score

The student hits 96 percent overall, everyone celebrates, and nobody notices it is failing completely on one critical slice while acing the easy majority.

Why it happens. A single headline number is easy to report and easy to trust.

What it costs. A business-critical category degrades silently. You discover it through customer complaints, not your metrics.

The fix. Evaluate by slice. Break results down by category, customer tier, input type — whatever segments matter for your application — and set a bar for each critical slice independently. The best practices guide covers slice-based evaluation in depth.
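To make the idea concrete, the sketch below computes accuracy per slice and flags any slice under its own bar. The slice names, counts, and bars are invented so that the aggregate lands on 96 percent while one slice fails badly; the function names are illustrative only.

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """results: list of (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, ok in results:
        totals[slice_name] += 1
        correct[slice_name] += int(ok)
    return {name: correct[name] / totals[name] for name in totals}

def failing_slices(slice_accuracy, bars):
    """Return the slices whose accuracy is below their individual bar."""
    return sorted(name for name, bar in bars.items()
                  if slice_accuracy.get(name, 0.0) < bar)

# Made-up evaluation results: an easy majority slice and a small critical one.
results = (
    [("faq", True)] * 94 + [("faq", False)] * 1
    + [("refund_dispute", True)] * 2 + [("refund_dispute", False)] * 3
)

overall = sum(ok for _, ok in results) / len(results)  # 96 correct out of 100
per_slice = accuracy_by_slice(results)
below_bar = failing_slices(per_slice, {"faq": 0.95, "refund_dispute": 0.90})
```

Here the headline number is 0.96, yet `refund_dispute` sits at 0.4 and is the only slice reported as failing.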

Mistake 5: Expecting General Intelligence From a Narrow Student

Someone distills a student for one task, then is shocked when it cannot do a related task it was never trained on. Distillation transfers a specific capability, not the teacher's whole mind.

Why it happens. The student feels like a "mini teacher," so people assume it inherited general ability.

What it costs. Scope creep and disappointment. The student gets pushed onto tasks it was never built for and fails them.

The fix. Treat the student as a specialist. Define its job precisely and keep it there. If you need a second capability, that is a second distillation, or a second slice in your training data. The beginner's guide explains why narrowness is the point.

Mistake 6: Skipping the Shadow Deployment

The student passes offline evaluation, so the team flips production to it overnight. Then real traffic surfaces problems the offline set never contained.

Why it happens. Offline evaluation looks good and shadow deployment takes extra engineering.

What it costs. A live regression that hits real users before you catch it.

The fix. Run the student in shadow mode first — serve teacher results while logging student results on live traffic — then roll out gradually with a fallback to the teacher for low-confidence inputs. Offline evaluation predicts production; it does not replace watching production.
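The two stages described above might look like the following sketch. It assumes the student returns a confidence score alongside its answer, which is an assumption about your serving stack, not a given; `teacher`, `student`, and the threshold are toy stand-ins.

```python
def shadow_serve(request, teacher, student, log):
    """Shadow mode: the user gets the teacher's answer; the student's is only logged."""
    teacher_answer = teacher(request)
    student_answer, confidence = student(request)
    log.append({
        "request": request,
        "teacher": teacher_answer,
        "student": student_answer,
        "agree": teacher_answer == student_answer,
        "confidence": confidence,
    })
    return teacher_answer

def serve_with_fallback(request, teacher, student, min_confidence):
    """Gradual rollout: serve the student unless its confidence is too low."""
    student_answer, confidence = student(request)
    if confidence < min_confidence:
        return teacher(request), "teacher"
    return student_answer, "student"

# Toy models standing in for real endpoints.
def teacher(request):
    return request.upper()

def student(request):
    # Confident and correct on short inputs, unsure and wrong on long ones.
    if len(request) <= 10:
        return request.upper(), 0.9
    return request, 0.3

shadow_log = []
served = shadow_serve("refund", teacher, student, shadow_log)
answer, source = serve_with_fallback("a much longer request", teacher, student, min_confidence=0.5)
```

The shadow log is the useful artifact: disagreement rates on live traffic, broken down by slice, tell you whether the rollout is safe to start and where to set the confidence threshold.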

Mistake 7: Treating Distillation as One and Done

A team ships a great student, declares victory, and never touches it again. Six months later, production traffic has drifted and the teacher has improved, and the stale student is now well below the teacher it was meant to track.

Why it happens. Distillation is framed as a project with an end, not a pipeline that needs maintenance.

What it costs. Slow, invisible quality decay that nobody is watching for.

The fix. Monitor for distribution drift and schedule periodic re-distillation. When live traffic moves away from your training set, or the teacher gets a meaningful upgrade, refresh the student. Build the pipeline so re-running it is cheap.
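One simple way to watch for drift is to compare the category mix of the training set with a recent window of live traffic. The sketch below uses total variation distance as the measure, which is a choice made for this example and not the only option; the labels, counts, and threshold are hypothetical.

```python
from collections import Counter

def category_shares(labels):
    """Convert a list of category labels into a dict of fractions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def total_variation_distance(p, q):
    """Half the sum of absolute differences between two share dicts (0 = identical, 1 = disjoint)."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Made-up label counts: a new category has appeared in live traffic.
training = ["billing"] * 40 + ["returns"] * 30 + ["other"] * 30
live = ["billing"] * 20 + ["returns"] * 30 + ["other"] * 20 + ["subscriptions"] * 30

drift = total_variation_distance(category_shares(training), category_shares(live))

DRIFT_THRESHOLD = 0.15  # hypothetical value; tune it against your own history
needs_refresh = drift > DRIFT_THRESHOLD
```

A check like this only sees shifts in the category mix. It will not catch drift inside a category, so it complements live quality monitoring instead of replacing it.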

A Pattern Behind the Mistakes

Step back and most of these errors share a root cause: trusting a number that looks good instead of interrogating what it hides. The wrong distribution looks fine until production. Baked-in teacher errors look fine because the teacher is "the smart model." The aggregate score looks fine while a critical slice burns. The stale student looks fine because nobody is watching it. In every case, a comforting headline metric masked a problem one level down.

The corrective mindset is to distrust good news until you have looked underneath it. When your student tests well, your first instinct should be to find where it fails — check the slices, inspect the disagreements with the teacher, sample the teacher outputs you trained on. The teams that avoid these seven mistakes are not smarter; they are more suspicious of their own metrics, and they build that suspicion into their process rather than relying on remembering it in the moment.

Frequently Asked Questions

Which of these mistakes is the most expensive?

Training on the wrong distribution, by a wide margin. It invalidates everything downstream — your student is optimized for inputs that do not occur — and it is invisible until production. Get the data distribution right before worrying about anything else.

How do I know if I am distilling teacher errors?

Sample teacher outputs and check them against ground truth or expert judgment. If the teacher's error rate on your prompts is meaningful, you are passing those errors to the student unless you filter. For checkable tasks, automate the verification.

Can a great loss function fix bad data?

No. This is the hardest lesson. Loss design and hyperparameters are second-order. If your data has the wrong distribution or noisy teacher outputs, no training trick recovers it. Fix the data first.

Is shadow deployment really necessary for small projects?

For anything user-facing, yes. It is cheap insurance against regressions your offline set missed. For a purely internal, low-stakes tool, you can be more relaxed, but even then a gradual rollout costs little.

How often should I re-distill?

It depends on how fast your traffic drifts. Monitor the student's live quality and re-distill when it slips below your bar or when the teacher gets a significant upgrade. Some tasks need it quarterly; stable ones can go much longer.

Key Takeaways

The deadliest mistake is training on a distribution that does not match production — fix data first.
Filter teacher outputs to avoid baking the teacher's errors into the student.
Size the student to the task; too small cannot be rescued by good data.
Evaluate by slice, not aggregate, and ship behind a shadow deployment and fallback.
Treat distillation as a maintained pipeline, not a one-time project, and re-distill as traffic drifts.

Agency Script Editorial
