Abstract explanations of distillation only go so far. What makes the concept click is seeing it applied to specific problems, with the detail of why it worked in one case and failed in another. This article walks through concrete use cases across different domains. For each, we describe the setup, what made distillation a fit or a mismatch, and the lesson.
A note on what follows: these are representative scenarios drawn from common patterns, illustrating typical trade-offs rather than any single company's confidential results. The point is the shape of the decision, not a leaderboard. If you want the underlying mechanics first, start with the complete guide.
Use Case 1: Support Ticket Triage
A company routes incoming support tickets to the right team using a large language model. The model is accurate but expensive at high volume, and triage runs on every single ticket.
Why It Worked
The task was narrow and well-defined: classify a ticket into one of a fixed set of queues. There were years of historical tickets with known correct routing, so the production distribution was easy to match. The teacher's outputs were checkable against actual routing outcomes, so filtering was straightforward. A small student preserved nearly all the routing accuracy at a fraction of the per-ticket cost, which mattered because the volume was enormous.
The lesson: high volume plus a narrow, checkable task is the ideal distillation profile.
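As a rough illustration of the filtering step in this use case, the sketch below keeps a teacher label only when it agrees with the queue the ticket was historically routed to. The `Ticket` fields, the queue names, and `filter_teacher_labels` are hypothetical names introduced for illustration. A real pipeline would probably also need to handle tickets whose historical routing was itself wrong.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    text: str
    teacher_queue: str     # queue predicted by the large teacher model
    historical_queue: str  # queue the ticket was actually routed to (assumed known)


def filter_teacher_labels(tickets):
    """Keep (text, label) pairs only where the teacher matches known routing."""
    return [
        (t.text, t.teacher_queue)
        for t in tickets
        if t.teacher_queue == t.historical_queue
    ]


# Toy data for illustration; a real set would come from ticket history.
tickets = [
    Ticket("Card was charged twice", "billing", "billing"),
    Ticket("App crashes on login", "technical", "technical"),
    Ticket("Cancel my plan", "billing", "accounts"),  # teacher disagrees: dropped
]

training_pairs = filter_teacher_labels(tickets)
print(f"kept {len(training_pairs)} of {len(tickets)} teacher labels")
```

The student is then trained only on `training_pairs`, so it never learns from a teacher label that contradicts the recorded routing.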
Use Case 2: On-Device Translation
A mobile app needs to translate text offline, with no network call. The strongest translation models are far too large to ship on a phone.
Why It Worked, With a Caveat
Here distillation was not an optimization — it was the only option, because the teacher physically cannot run on the device. The student was distilled from a large translation model down to something that fits in a phone's memory budget. Quality dropped noticeably compared to the teacher, but the alternative was no offline translation at all, so the trade was easy.
The caveat: the team had to accept a real quality gap. On-device distillation often means compressing harder than you would like, and you live with the result. The framework article covers how to reason about that compression ceiling.
Use Case 3: Search Result Ranking
A search system uses a heavy model to re-rank candidate results for relevance. Re-ranking is latency-critical — it sits directly in the user's path — and a slow model degrades the whole experience.
Why It Worked
The driver here was latency, not just cost. The large re-ranker produced excellent rankings but was too slow to run inline. A distilled student ran fast enough to stay in the request path while preserving most of the ranking quality. Because ranking quality could be measured against click and relevance signals, the team could evaluate the student precisely.
The lesson: distillation is as much a latency tool as a cost tool. When a great model is too slow to use inline, a fast student can make it deployable.
There is a subtlety worth noting in ranking distillation specifically. The team did not need the student to reproduce the teacher's exact relevance scores — only its ordering of results. That relaxation made the distillation easier than a pure regression task, because the student could be slightly off on absolute scores as long as it ranked the right documents above the wrong ones. Recognizing what your task actually requires — exact outputs versus correct ordering versus correct top choice — lets you set a looser, more achievable target and often a smaller student.
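One common way to express "match the ordering, not the scores" is a pairwise objective. The sketch below is an assumption about how such a target could be written, not the method any particular team used. The function names and the toy scores are hypothetical. It computes a pairwise logistic loss over document pairs the teacher orders strictly, plus a simple agreement metric for evaluation.

```python
import math
from itertools import combinations


def _ordered_pairs(teacher_scores):
    """Yield (preferred, other) index pairs where the teacher strictly prefers one."""
    for i, j in combinations(range(len(teacher_scores)), 2):
        if teacher_scores[i] > teacher_scores[j]:
            yield i, j
        elif teacher_scores[j] > teacher_scores[i]:
            yield j, i


def pairwise_logistic_loss(teacher_scores, student_scores):
    """Mean of log(1 + exp(-margin)), where margin is the student's score gap
    in the direction the teacher prefers. Absolute score values do not matter."""
    losses = [
        math.log1p(math.exp(-(student_scores[hi] - student_scores[lo])))
        for hi, lo in _ordered_pairs(teacher_scores)
    ]
    return sum(losses) / len(losses) if losses else 0.0


def pairwise_agreement(teacher_scores, student_scores):
    """Fraction of teacher-ordered pairs that the student orders the same way."""
    pairs = list(_ordered_pairs(teacher_scores))
    if not pairs:
        return 1.0
    agree = sum(student_scores[hi] > student_scores[lo] for hi, lo in pairs)
    return agree / len(pairs)


teacher = [0.9, 0.5, 0.1]
same_order = [2.0, 1.0, 0.0]   # different scale, same ranking
reversed_order = [0.1, 0.5, 0.9]

print(pairwise_agreement(teacher, same_order))      # 1.0
print(pairwise_agreement(teacher, reversed_order))  # 0.0
```

Note that `same_order` gets perfect agreement even though none of its scores equal the teacher's, which is exactly the relaxation described above.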
Use Case 4: Content Moderation at Scale
A platform moderates user content with a large model. The cost per item is small, but at a volume of billions of items it adds up.
Where It Got Tricky
This one is instructive because it nearly failed. The aggregate accuracy of the student looked excellent, but evaluation by slice revealed it was much weaker on rare, high-severity categories — exactly the ones that matter most. The training distribution had too few examples of the dangerous edge cases.
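To make the aggregate-versus-slice gap concrete, here is a minimal sketch of per-slice accuracy. The slice names, labels, and counts are invented for illustration and are not results from any real system.

```python
from collections import defaultdict


def accuracy_by_slice(examples):
    """examples: iterable of (slice_name, predicted_label, true_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for slice_name, predicted, actual in examples:
        total[slice_name] += 1
        correct[slice_name] += int(predicted == actual)
    return {name: correct[name] / total[name] for name in total}


# Hypothetical evaluation set: many benign items, a few severe ones.
examples = (
    [("benign", "allow", "allow")] * 96
    + [("severe", "remove", "remove")] * 1
    + [("severe", "allow", "remove")] * 3   # student misses most severe items
)

aggregate = sum(p == a for _, p, a in examples) / len(examples)
per_slice = accuracy_by_slice(examples)

print(f"aggregate: {aggregate:.2f}")  # 0.97
for name, acc in per_slice.items():
    print(f"{name}: {acc:.2f}")       # benign: 1.00, severe: 0.25
```

In this toy set the student scores 97% overall while getting only one in four severe items right, which is the failure pattern described above.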
The Fix
The team oversampled the rare high-severity categories in the training set and added a teacher fallback for low-confidence moderation decisions. After that, the student held up on the slices that mattered. This is a textbook case of why you evaluate by slice rather than average.
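Both parts of the fix can be sketched in a few lines. Everything here is an assumption made for illustration: the repeat factor, the 0.9 confidence threshold, and the stand-in `toy_student` and `toy_teacher` functions are hypothetical, and in practice the threshold would be tuned on held-out data for each slice.

```python
def oversample(examples, is_rare, factor):
    """Repeat rare examples `factor` times so they carry more training weight."""
    out = []
    for ex in examples:
        out.extend([ex] * (factor if is_rare(ex) else 1))
    return out


def moderate(item, student, teacher, threshold=0.9):
    """Use the student when it is confident, otherwise defer to the teacher."""
    label, confidence = student(item)
    if confidence >= threshold:
        return label, "student"
    return teacher(item), "teacher"


# Stand-in models for illustration only.
def toy_student(item):
    return ("allow", 0.98) if "hello" in item else ("allow", 0.55)


def toy_teacher(item):
    return "remove" if "threat" in item else "allow"


train = [("hi there", "allow"), ("a threat", "remove")]
balanced = oversample(train, is_rare=lambda ex: ex[1] == "remove", factor=5)
print(len(balanced))  # 6

print(moderate("hello world", toy_student, toy_teacher))  # ('allow', 'student')
print(moderate("a threat", toy_student, toy_teacher))     # ('remove', 'teacher')
```

The fallback trades some of the cost savings for safety: every item routed to the teacher is paid for at the teacher's price, so the threshold directly controls how much of the saving survives.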
Use Case 5: A Case Where Distillation Was the Wrong Call
Not every example is a success. A small team wanted to distill a large model for an internal research-assistant tool used a few dozen times a day by a handful of analysts.
Why It Did Not Pay Off
The teacher's total cost at that volume was already trivial. The task was also broad — open-ended research questions — which meant the student would need to be large to preserve quality, eroding the savings. The engineering effort to build and maintain a distillation pipeline dwarfed any cost it could recover.
The right answer was to do nothing: keep calling the teacher directly. The lesson is that cost-driven distillation is justified by scale and narrowness. Without both, simpler options win. Our best practices article makes the case for always running a "do nothing" baseline first.
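A back-of-the-envelope calculation shows why volume dominates this decision. All figures below (per-request costs, build cost, maintenance cost, request volumes) are made-up placeholders chosen only to show the arithmetic, and the function names are hypothetical.

```python
def monthly_savings(requests_per_day, teacher_cost, student_cost, days=30):
    """Inference cost saved per month by serving the student instead of the teacher."""
    return requests_per_day * days * (teacher_cost - student_cost)


def months_to_break_even(build_cost, monthly_maintenance, savings_per_month):
    """Months until savings repay the pipeline, or None if they never do."""
    net = savings_per_month - monthly_maintenance
    if net <= 0:
        return None
    return build_cost / net


# Placeholder numbers, not measurements.
TEACHER_COST = 0.02    # cost per request
STUDENT_COST = 0.002   # cost per request
BUILD_COST = 40_000    # one-off engineering cost
MAINTENANCE = 500      # per month

low_volume = monthly_savings(50, TEACHER_COST, STUDENT_COST)
high_volume = monthly_savings(5_000_000, TEACHER_COST, STUDENT_COST)

print(months_to_break_even(BUILD_COST, MAINTENANCE, low_volume))   # None
print(months_to_break_even(BUILD_COST, MAINTENANCE, high_volume))  # well under a month
```

With these placeholder numbers, a few dozen requests a day save less per month than the pipeline costs to maintain, so the project never breaks even, while the same pipeline at millions of requests a day pays for itself almost immediately.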
Use Case 6: Structured Extraction From Documents
A company extracts structured fields — dates, amounts, party names — from semi-structured documents. A large model handled the messiness well but cost too much to run on every document at their throughput.
Why It Worked
Structured extraction is a near-ideal distillation target. The output is a defined schema, so correctness is mostly checkable: a date is right or wrong, an amount matches or does not. That made teacher filtering almost fully automatable. The team verified extracted fields against known-good records, dropped the teacher's errors, and trained the student on clean, schema-conformant outputs. Because the task was so well-bounded, a small student preserved high field-level accuracy.
The lesson: tasks with a checkable, structured output are the easiest to distill well, because the same structure that defines correctness also automates your filtering and evaluation.
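The automated filtering described in this use case might look something like the sketch below. The three-field schema, the document IDs, and the helper names are assumptions made for illustration. The sketch keeps a teacher output only if it parses against the schema and matches a known-good record.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

REQUIRED_FIELDS = ("date", "amount", "party")  # hypothetical schema


def is_schema_conformant(record):
    """True if the record has exactly the required fields and each one parses."""
    if set(record) != set(REQUIRED_FIELDS):
        return False
    try:
        date.fromisoformat(record["date"])
        amount = Decimal(record["amount"])
    except (ValueError, TypeError, InvalidOperation):
        return False
    party = record["party"]
    return amount.is_finite() and isinstance(party, str) and bool(party.strip())


def matches_reference(record, reference):
    """Compare field by field, ignoring amount formatting and party-name case."""
    return (
        record["date"] == reference["date"]
        and Decimal(record["amount"]) == Decimal(reference["amount"])
        and record["party"].strip().lower() == reference["party"].strip().lower()
    )


def filter_extractions(teacher_outputs, known_good):
    """Keep teacher outputs that are conformant and agree with a known-good record."""
    kept = {}
    for doc_id, record in teacher_outputs.items():
        reference = known_good.get(doc_id)
        if reference is None or not is_schema_conformant(record):
            continue
        if matches_reference(record, reference):
            kept[doc_id] = record
    return kept


teacher_outputs = {
    "doc-1": {"date": "2024-03-01", "amount": "1250.00", "party": "Acme Corp"},
    "doc-2": {"date": "2024-02-30", "amount": "90.00", "party": "Globex"},    # invalid date
    "doc-3": {"date": "2024-04-15", "amount": "300.00", "party": "Initech"},  # wrong amount
}
known_good = {
    "doc-1": {"date": "2024-03-01", "amount": "1250", "party": "acme corp"},
    "doc-3": {"date": "2024-04-15", "amount": "330.00", "party": "Initech"},
}

clean = filter_extractions(teacher_outputs, known_good)
print(sorted(clean))  # ['doc-1']
```

The same two checks can be reused at evaluation time to score the student field by field, which is the double duty the lesson above refers to.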
What the Examples Have in Common
Pull these together and a pattern emerges. The wins shared three traits: high volume or strict latency needs, a narrow and well-defined task, and a way to measure and filter teacher quality. The struggles came from broad tasks, low volume, or distributions that missed the critical edge cases. When you are evaluating your own problem, score it against those traits before committing.
Notice too that several of the wins relaxed what "match the teacher" meant. The ranking case needed only correct ordering; the extraction case needed only correct field values in a fixed schema; the triage case needed only the right queue. None of them required the student to reproduce the teacher's full output verbatim. Identifying the minimal thing your task actually needs is a recurring lever: it lowers the bar the student must clear and, with it, the size and cost of the student you can get away with.
Frequently Asked Questions
What kind of task distills best?
A narrow, well-defined task with high request volume and a measurable output. Classification, routing, ranking, and structured extraction distill cleanly because you can match the distribution and check the teacher's outputs.
When is distillation a mismatch?
Low-volume tasks where the teacher is already cheap, broad open-ended tasks that need the teacher's full capability, and any case where you cannot build a training set that matches production. In those situations, simpler alternatives usually win.
Why did the moderation example almost fail?
The training distribution underrepresented rare, high-severity categories, so the student was weak exactly where it mattered most while looking strong in aggregate. Oversampling the rare cases and adding a teacher fallback fixed it.
Is on-device distillation different from cost-driven distillation?
In motivation, yes. On-device distillation is often forced — the teacher cannot run on the hardware at all — so you accept a larger quality gap than you would for a pure cost optimization. The technique is the same; the tolerance for quality loss is different.
Can I combine distillation with a fallback to the teacher?
Yes, and it is a strong pattern. Serve the cheap student for confident cases and route low-confidence or high-stakes inputs back to the teacher. The content moderation example used exactly this hybrid.
Key Takeaways
- The best distillation use cases are narrow, high-volume, and measurable: support triage, ranking, structured extraction.
- On-device deployment often forces distillation because the teacher cannot fit the hardware, at the cost of a larger quality gap.
- Latency-critical inline tasks like search re-ranking benefit as much from the student's speed as from its lower cost.
- Slice-based evaluation catches failures on rare but critical categories that aggregate accuracy hides.
- Low-volume or broad tasks are often better served by doing nothing and calling the teacher directly.