Scope and Ship a Distillation Project Without Skipping Ahead

This is a tool, not an essay. Use it as a working checklist when you scope, build, and ship a distillation project. Each item has a one-line justification so you know why it is on the list, not just that it is. Work through the phases in order; skipping ahead is how projects fail.

If any item is unfamiliar, the linked articles go deeper. For the full conceptual grounding, start with the complete guide. For the build mechanics, see the step-by-step how-to.

Phase 1: Scoping and Justification

Do this before you write any code. Most wasted projects are wasted here.

[ ] The task is narrow and well-defined. Distillation transfers a specific capability; broad tasks need a large student that erodes the savings.
[ ] The volume or latency justifies the project. High request volume or strict latency is what makes distillation pay back; low-volume tasks rarely do.
[ ] You ran the "do nothing" baseline. A smaller off-the-shelf model might already clear the bar with zero engineering.
[ ] You tried prompt optimization on the small model. Better instructions or few-shot examples sometimes close the gap for free.
[ ] You considered quantizing the teacher. Quantization can cut cost without training a new model at all.
[ ] You estimated the payback period. If the engineering cost outweighs the projected savings, do not distill.

If a cheaper option clears your targets, stop here. That is a win, not a failure.
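
The payback estimate from the last checklist item can be sketched in a few lines. This is a rough model under stated assumptions, not a costing method from the linked guides: the function name, the per-request cost inputs, and every number in the example are hypothetical, and it ignores anything that changes over time, such as traffic growth or price changes.

```python
def payback_months(engineering_cost, monthly_requests,
                   teacher_cost_per_request, student_cost_per_request,
                   monthly_maintenance_cost=0.0):
    """Months until serving savings repay the one-off engineering cost.

    Returns None when the student never pays back, meaning monthly
    savings are zero or negative once maintenance is included.
    """
    monthly_savings = (
        monthly_requests * (teacher_cost_per_request - student_cost_per_request)
        - monthly_maintenance_cost
    )
    if monthly_savings <= 0:
        return None
    return engineering_cost / monthly_savings


# Hypothetical inputs: replace with your own measured costs and volume.
months = payback_months(
    engineering_cost=60_000,
    monthly_requests=5_000_000,
    teacher_cost_per_request=0.004,
    student_cost_per_request=0.0005,
    monthly_maintenance_cost=2_500,
)
print(f"Payback in about {months:.1f} months")
```

With these made-up inputs the project pays back in roughly four months. Rerun it with a tenth of the volume and the function returns None, which is the low-volume case the checklist warns about.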

Phase 2: Defining Success

Set the bar before you have a model so you cannot rationalize a weak result.

[ ] You wrote down a measurable quality bar. A metric and threshold you commit to before training.
[ ] You set per-slice bars for critical segments. Aggregate accuracy hides failures on rare but important categories.
[ ] You set explicit cost and latency targets. The whole point is cost and speed; quantify what success means.
[ ] You defined an acceptable quality gap to the teacher. The student will lose a little quality; decide how much is acceptable now.
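
One way to make these bars hard to rationalize away later is to commit them as data before training starts. The sketch below is an illustration only: the class, field names, slice names, and thresholds are all hypothetical, and it assumes a single higher-is-better metric.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SuccessCriteria:
    """Targets written down before any student model exists."""
    metric: str
    min_overall: float
    max_gap_to_teacher: float
    max_p95_latency_ms: float
    max_cost_per_1k_requests: float
    min_per_slice: dict = field(default_factory=dict)

    def failures(self, *, overall, teacher_overall, per_slice,
                 p95_latency_ms, cost_per_1k):
        """Return a list of human-readable reasons the result misses the bar."""
        problems = []
        if overall < self.min_overall:
            problems.append(f"overall {self.metric} below {self.min_overall}")
        if teacher_overall - overall > self.max_gap_to_teacher:
            problems.append("gap to teacher exceeds the agreed limit")
        for name, bar in self.min_per_slice.items():
            score = per_slice.get(name)
            # A slice with no measurement counts as a failure, not a pass.
            if score is None or score < bar:
                problems.append(f"slice '{name}' below {bar}")
        if p95_latency_ms > self.max_p95_latency_ms:
            problems.append("p95 latency above target")
        if cost_per_1k > self.max_cost_per_1k_requests:
            problems.append("cost per 1k requests above target")
        return problems


# Hypothetical bars and results for illustration.
criteria = SuccessCriteria(
    metric="accuracy",
    min_overall=0.92,
    max_gap_to_teacher=0.03,
    max_p95_latency_ms=150,
    max_cost_per_1k_requests=0.40,
    min_per_slice={"refund_requests": 0.95, "legal_escalations": 0.97},
)
problems = criteria.failures(
    overall=0.93,
    teacher_overall=0.95,
    per_slice={"refund_requests": 0.96, "legal_escalations": 0.91},
    p95_latency_ms=120,
    cost_per_1k=0.35,
)
print(problems)
```

In this example the aggregate score passes while one critical slice fails, which is exactly the situation per-slice bars exist to catch.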

Phase 3: Data Preparation

This phase determines the outcome. Spend the bulk of your effort here.

[ ] Prompts come from real production traffic. Synthetic data is a supplement, never the foundation.
[ ] The training distribution matches live traffic. If a category is 30 percent of traffic, it is roughly 30 percent of your data.
[ ] Rare critical categories are adequately represented. Oversample them if needed; they are where students silently fail.
[ ] Edge cases are covered. The student is weakest exactly where your prompts are sparse.
[ ] A held-out test set is reserved. Drawn from the same distribution, never seen during training.
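
The distribution, oversampling, and held-out items above can be sketched as one sampling step. This is a minimal illustration, assuming logged prompts are already deduplicated and labeled with a category; the function name, its parameters, and the category names are hypothetical.

```python
import random
from collections import defaultdict


def build_splits(logged_prompts, train_size, heldout_fraction=0.1,
                 min_per_category=None, seed=0):
    """Split (prompt, category) pairs into training and held-out sets.

    Training quotas follow each category's share of traffic, raised to a
    floor for rare critical categories. Held-out prompts are set aside
    first, so they never appear in training.
    """
    rng = random.Random(seed)
    min_per_category = min_per_category or {}
    by_category = defaultdict(list)
    for prompt, category in logged_prompts:
        by_category[category].append(prompt)

    total = len(logged_prompts)
    train, heldout = [], []
    for category, prompts in by_category.items():
        rng.shuffle(prompts)
        n_heldout = max(1, int(len(prompts) * heldout_fraction))
        heldout += [(p, category) for p in prompts[:n_heldout]]
        pool = prompts[n_heldout:]
        # Proportional quota, lifted to the floor for critical categories.
        quota = round(train_size * len(prompts) / total)
        quota = max(quota, min_per_category.get(category, 0))
        # If the pool is smaller than the quota, take what exists
        # rather than duplicating prompts.
        train += [(p, category) for p in pool[:quota]]
    return train, heldout


# Hypothetical traffic: 90% billing, 9% shipping, 1% fraud.
logs = (
    [(f"billing-{i}", "billing") for i in range(900)]
    + [(f"shipping-{i}", "shipping") for i in range(90)]
    + [(f"fraud-{i}", "fraud") for i in range(10)]
)
train, heldout = build_splits(logs, train_size=200, min_per_category={"fraud": 8})
```

Without the floor, fraud would get two training prompts out of 200. The floor lifts it to eight, which deliberately breaks strict proportionality for the category you cannot afford to miss.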

Phase 4: Teacher Output Generation

The student cannot exceed the quality of the outputs you train it on.

[ ] Teacher outputs are captured fully. Including probabilities or logprobs if the API exposes them.
[ ] Outputs are filtered before training. Verify checkable outputs and drop the wrong ones; sample-review open-ended ones.
[ ] You accepted a smaller, cleaner dataset. A small, clean dataset usually beats a large, noisy one.
[ ] You checked the teacher's terms of service. Some providers prohibit training competing models on their outputs.

That last item is a real constraint in 2026, not a formality. Verify it before you generate anything.
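
The filtering item can be sketched as follows. This assumes a classification-style task where the teacher returns JSON with a label; the record fields, the label set, and the function names are hypothetical. Note that the check here only verifies that an output is well-formed and in range. Where you have known answers or executable tests, check against those instead.

```python
import json
import random

ALLOWED_LABELS = {"billing", "shipping", "fraud"}  # hypothetical label set


def is_valid_label_output(raw_output):
    """Checkable case: output must be JSON like {"label": "<allowed label>"}."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    label = parsed.get("label")
    return isinstance(label, str) and label in ALLOWED_LABELS


def filter_teacher_outputs(records, review_sample_size=50, seed=0):
    """Drop invalid checkable outputs; queue a sample of open-ended ones.

    Returns (kept records, number dropped, records for human review).
    """
    kept, open_ended, dropped = [], [], 0
    for record in records:
        if record["checkable"]:
            if is_valid_label_output(record["teacher_output"]):
                kept.append(record)
            else:
                dropped += 1
        else:
            # Open-ended outputs cannot be verified automatically, so they
            # are kept for now and sampled for manual review.
            kept.append(record)
            open_ended.append(record)
    rng = random.Random(seed)
    sample_size = min(review_sample_size, len(open_ended))
    review_queue = rng.sample(open_ended, sample_size)
    return kept, dropped, review_queue


records = [
    {"prompt": "a", "teacher_output": '{"label": "billing"}', "checkable": True},
    {"prompt": "b", "teacher_output": '{"label": "weather"}', "checkable": True},
    {"prompt": "c", "teacher_output": "not json", "checkable": True},
    {"prompt": "d", "teacher_output": "A free-text summary.", "checkable": False},
]
kept, dropped, review_queue = filter_teacher_outputs(records, review_sample_size=1)
```

Track the dropped count over time. A sudden rise usually points to a prompt or teacher change rather than to bad luck.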

Phase 5: Training the Student

The most mechanical phase. Resist over-engineering it.

[ ] You tested at least two student sizes. The smallest plausible one and one step larger; size empirically, not by vibes.
[ ] You started with default hyperparameters. Tune only if the data is solid and the metrics demand it; hyperparameters are second-order.
[ ] You watched for overfitting. If held-out quality stalls while train quality climbs, stop early.
[ ] You kept checkpoints. So you can roll back to the best one.
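
The overfitting and checkpoint items combine into a simple patience rule: stop once held-out quality has not improved for a few checkpoints, then roll back to the best one. The sketch below is independent of any training framework; the function name, the patience value, and the scores are hypothetical, and it assumes a higher-is-better metric.

```python
def pick_best_checkpoint(heldout_scores, patience=2, min_delta=0.0):
    """Return (best_index, stop_index) for a sequence of held-out scores.

    heldout_scores has one score per saved checkpoint, in training order.
    Training stops at the first checkpoint that is `patience` steps past
    the best one without an improvement larger than `min_delta`.
    """
    best_index, best_score = 0, heldout_scores[0]
    for i, score in enumerate(heldout_scores[1:], start=1):
        if score > best_score + min_delta:
            best_index, best_score = i, score
        elif i - best_index >= patience:
            return best_index, i
    return best_index, len(heldout_scores) - 1


# Hypothetical held-out scores: quality peaks at checkpoint 2, then slips.
scores = [0.81, 0.86, 0.88, 0.879, 0.875, 0.87]
best, stopped = pick_best_checkpoint(scores, patience=2)
print(f"Stop at checkpoint {stopped}, roll back to checkpoint {best}")
```

The rule only works if every checkpoint is kept until the run ends, which is why keeping checkpoints is on the list.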

Phase 6: Evaluation

The step that prevents bad ships.

[ ] You measured the gap to the teacher on held-out data. Not just an absolute score in isolation.
[ ] You evaluated by slice, not average. Each critical segment against its own bar.
[ ] You inspected disagreements directly. Where student and teacher differ tells you more than the aggregate.
[ ] You checked the production-critical slices explicitly. The segments the business cannot afford to get wrong.

The common mistakes article exists because teams skip this phase. Do not.
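
A minimal per-slice evaluation might look like the sketch below. It assumes exact-match scoring against a gold label, which suits classification-style tasks but not open-ended generation; the field names, slice names, and bars are hypothetical.

```python
from collections import defaultdict


def evaluate_by_slice(examples, slice_bars):
    """Score student and teacher per slice and collect their disagreements.

    Each example is a dict with "slice", "gold", "teacher", and "student".
    Returns (report keyed by slice, list of disagreeing examples).
    """
    stats = defaultdict(lambda: {"n": 0, "student": 0, "teacher": 0})
    disagreements = []
    for ex in examples:
        s = stats[ex["slice"]]
        s["n"] += 1
        s["student"] += ex["student"] == ex["gold"]
        s["teacher"] += ex["teacher"] == ex["gold"]
        if ex["student"] != ex["teacher"]:
            disagreements.append(ex)

    report = {}
    for name, s in stats.items():
        student_acc = s["student"] / s["n"]
        teacher_acc = s["teacher"] / s["n"]
        report[name] = {
            "student_acc": student_acc,
            "gap_to_teacher": teacher_acc - student_acc,
            # A slice without an explicit bar passes by default here.
            "passes": student_acc >= slice_bars.get(name, 0.0),
        }
    return report, disagreements


# Hypothetical held-out results: the student is fine on the common slice
# and misses half of the rare one.
examples = (
    [{"slice": "common", "gold": "x", "teacher": "x", "student": "x"}] * 4
    + [
        {"slice": "rare", "gold": "y", "teacher": "y", "student": "y"},
        {"slice": "rare", "gold": "y", "teacher": "y", "student": "x"},
    ]
)
report, disagreements = evaluate_by_slice(examples, {"common": 0.9, "rare": 0.9})
```

Overall accuracy in this example is five out of six, which could look acceptable on a dashboard, while the rare slice sits at 50 percent. Read the disagreement list by hand rather than only counting it.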

Phase 7: Rollout and Maintenance

Shipping is the start, not the end.

[ ] You ran the student in shadow mode. Logging its outputs on live traffic while users still get the teacher.
[ ] You rolled out gradually. 5 percent, then 25 percent, then full — not all at once.
[ ] You kept a fallback to the teacher. Route low-confidence or high-stakes inputs back to the teacher.
[ ] You set up drift monitoring. When live traffic moves away from your training set, quality erodes.
[ ] You scheduled re-distillation. Refresh when the student slips or the teacher gets a meaningful upgrade.
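
The gradual rollout and fallback items can be sketched with deterministic bucketing. This is one common approach, not a prescribed one: the function names and the confidence threshold are hypothetical, and how you obtain a confidence score from the student is left to your serving stack.

```python
import hashlib


def rollout_bucket(request_id):
    """Map a request ID to a stable bucket in the range 0 to 99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def route(request_id, rollout_percent, high_stakes=False):
    """Pick which model serves a request at the current rollout stage."""
    if high_stakes:
        # High-stakes inputs always go to the teacher.
        return "teacher"
    if rollout_bucket(request_id) < rollout_percent:
        return "student"
    return "teacher"


def needs_fallback(student_confidence, min_confidence=0.8):
    """True when a student answer should be re-run on the teacher."""
    return student_confidence < min_confidence


# The same request keeps its bucket as the rollout widens from 5 to 25 to 100.
for percent in (5, 25, 100):
    print(percent, route("request-123", percent))
```

Because the bucket depends only on the request ID, any request served by the student at 5 percent is still served by the student at 25 percent, which keeps comparisons between stages clean.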

The best practices article expands on why distillation is a maintained pipeline rather than a one-time project.
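
One simple drift signal is the distance between the category mix you trained on and the mix you see live. The sketch below uses total variation distance, which is my choice of metric for illustration rather than one named in this checklist. It assumes both inputs are non-empty lists of category labels, and the alert threshold is hypothetical.

```python
from collections import Counter


def category_shares(categories):
    """Convert a list of category labels into a share-per-category dict."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(train_categories, live_categories):
    """Half the summed absolute difference between the two category mixes.

    0.0 means identical mixes; 1.0 means no overlap at all.
    """
    p = category_shares(train_categories)
    q = category_shares(live_categories)
    all_categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in all_categories)


# Hypothetical mixes: a new "returns" category appears in live traffic.
train_mix = ["billing"] * 70 + ["shipping"] * 30
live_mix = ["billing"] * 50 + ["shipping"] * 30 + ["returns"] * 20

drift = total_variation_distance(train_mix, live_mix)
DRIFT_ALERT_THRESHOLD = 0.1  # hypothetical; tune against your own history
if drift > DRIFT_ALERT_THRESHOLD:
    print(f"Drift {drift:.2f} exceeds threshold; review the training set")
```

A category-mix check will not catch drift inside a category, so pair it with the live quality monitoring described above rather than relying on it alone.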

How to Use This Checklist

Do not treat the boxes as a formality to tick on the way to training. Each phase gates the next. If you cannot honestly check every box in Phase 1, you have not justified the project and you should not move to Phase 2. If Phase 3's data boxes are not all checked, no amount of training will produce a reliable student, so stop and fix the data. The value of a checklist is that it forces you to confront the uncomfortable gaps before they become expensive, rather than discovering them in production.

Run the checklist as a living document across the project, not once at the start. Revisit Phase 6 every time you retrain, and revisit Phase 7 continuously once you ship. A printed copy on the wall, or a shared doc the whole team checks off together, turns these items from individual good intentions into a shared standard that survives deadline pressure. The single most common way teams misuse it is skipping straight to Phase 5 because training feels like the "real" work — resist that, because the real work is in Phases 1 through 4.

Frequently Asked Questions

Can I skip Phase 1 if I already know I want to distill?

No. Phase 1 is where the most expensive mistakes are prevented. Even if you are confident, running the cheaper baselines either confirms the decision or saves you a project. It costs little and protects a lot.

Which phase deserves the most time?

Phase 3, data preparation, by a wide margin. The student learns exactly what you show it, so distribution matching and coverage drive the outcome more than anything in training. Budget your effort accordingly.

Do I really need shadow deployment for an internal tool?

For low-stakes internal tools you can relax it, but a gradual rollout still costs almost nothing. For anything user-facing, shadow mode is essential insurance against regressions your offline set missed.

Why is checking terms of service on the checklist?

Because some commercial model providers prohibit using their outputs to train competing models, and violating that has real consequences. It is a five-minute check that prevents a serious problem, so it earns its place.

How often do I repeat Phase 7's maintenance items?

Continuously for monitoring, and on-demand for re-distillation. Watch the student's live quality against your bar and re-distill when it slips or when production traffic drifts. Some tasks need it quarterly; stable ones go longer.

Key Takeaways

Scope and justify before coding — run cheaper baselines and confirm the volume justifies the project.
Set measurable quality, cost, and per-slice bars before you have a model.
Spend most of your effort on data: real production prompts, matched distribution, covered edge cases.
Filter teacher outputs, size the student empirically, and evaluate by slice rather than average.
Ship gradually behind a shadow and a fallback, then monitor for drift and re-distill on a schedule.

Agency Script Editorial
