eazyware
Engineering·May 27, 2024·11 min read

Model distillation: making small models think like big ones

Distillation trains smaller students on teacher outputs. When it works, a 7B distilled model can match a 70B on narrow tasks at 10x the speed.

Kushal R.
Engineering lead

Model distillation, training a smaller student model to mimic a larger teacher, is a cost optimization many production teams quietly rely on. A distilled model that runs at 10% of the teacher's cost and retains 80% of its quality is usually a better production deployment than the teacher itself. This post covers when distillation pays off, the patterns that work, and the pitfalls that trap teams attempting it for the first time.

Figure: Distillation flow, teacher to student.
Teacher (big; GPT-4, Claude, 70B+) generates labels → Dataset (50K-500K examples; input → teacher output) → Student (small; Llama 7B, Mistral 7B) is fine-tuned on the dataset and shipped.

When distillation works
· Narrow task domain (classification, extraction, specific generation style)
· Sufficient data (50K+ examples typical, more is better)
· Teacher agrees with ground truth (distill errors if teacher is wrong)
· Result: 10x speed, 90-95% of teacher quality on the distilled task
The teacher generates high-quality outputs, and the student is trained on those responses. A task-specific distilled model often outperforms general-purpose small models on that task.

Why distill?

Frontier models are capable but expensive. GPT-4o costs $2.50 per million input tokens, and Claude Sonnet 4.5 is priced similarly. At volume, these costs dominate.

Small models are cheap but less capable. GPT-4o-mini at $0.15 per million input tokens is roughly 17x cheaper than GPT-4o but does worse on complex tasks.

Distillation bridges the gap. Take a frontier model's outputs on your specific task; train a small model to match them. You get small-model economics with task-specific performance approaching the frontier model.

For narrow production tasks (customer classification, specific extraction patterns, bounded chat workflows), distilled models often match frontier-model quality at a fraction of the cost.

Distillation patterns

Response distillation. Capture teacher responses on production queries (or on a representative synthetic dataset), then fine-tune the student on (query, teacher response) pairs. This is the simplest approach and works well for most classification and extraction tasks.
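As a rough sketch of what the (query, teacher response) pairs can look like on disk: the snippet below labels each query with a teacher and writes one JSON object per line. The `prompt`/`completion` field names and the `stub_teacher` function are assumptions for illustration; a real pipeline would call the actual teacher model and use whatever schema the fine-tuning tool expects.

```python
import json
from typing import Callable, Iterable

def build_pairs(queries: Iterable[str], teacher: Callable[[str], str]) -> list[dict]:
    """Label every query with the teacher and return fine-tuning records."""
    return [{"prompt": q, "completion": teacher(q)} for q in queries]

def to_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def stub_teacher(query: str) -> str:
    # Hypothetical stand-in for a real teacher call (API or local model).
    return "billing" if "invoice" in query.lower() else "other"

pairs = build_pairs(
    ["Where is my invoice?", "How do I reset my password?"],
    stub_teacher,
)
print(to_jsonl(pairs))
```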

Reasoning distillation. Capture the teacher's chain-of-thought reasoning, not just its final answer, and train the student to produce similar reasoning. This improves the student on complex reasoning tasks where the final answer alone is an insufficient training signal.
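One possible way to lay out a reasoning-distillation target is to put the teacher's reasoning first and the answer last, so the final answer can still be parsed out for scoring. The `Reasoning:` and `Answer:` markers here are an assumed convention, not something the approach requires.

```python
def format_reasoning_example(query: str, reasoning: str, answer: str) -> dict:
    """Build a training record whose target contains reasoning, then the answer."""
    target = f"Reasoning: {reasoning.strip()}\nAnswer: {answer.strip()}"
    return {"prompt": query, "completion": target}

def extract_answer(completion: str):
    """Recover the final answer from a completion, or None if the marker is missing."""
    _, found, tail = completion.rpartition("Answer:")
    return tail.strip() if found else None

example = format_reasoning_example(
    "Is 91 prime?",
    "91 = 7 * 13, so it has divisors other than 1 and itself.",
    "no",
)
print(example["completion"])
print(extract_answer(example["completion"]))
```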

Logit distillation. Train the student to match the teacher's output probability distribution, not just its final tokens. This carries the richest signal in principle but requires access to the teacher's logits, which most commercial APIs don't expose. It is feasible with open-source teachers.
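To make "match the teacher's distribution" concrete, here is a minimal pure-Python sketch of a temperature-softened KL divergence between teacher and student logits for a single token position. It assumes both models share a vocabulary, and the multiplication by the squared temperature follows a common convention rather than anything stated in this post. A real training loop would use a tensor library and average over positions.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

A student whose logits rank the tokens like the teacher's gets a smaller loss than one that ranks them in the opposite order, and identical logits give a loss of zero.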

Preference distillation. Use the teacher to rank student outputs, and use the rankings as the training signal. This is useful when the task has multiple valid outputs and exact matching is too strict.
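A teacher ranking can be turned into training signal by expanding it into (chosen, rejected) pairs, which is the input shape many preference-based training methods consume. The sketch below assumes the ranking arrives as a best-first list; the record field names are illustrative.

```python
from itertools import combinations

def ranking_to_pairs(prompt: str, ranked_outputs: list[str]) -> list[dict]:
    """Expand a best-first ranking into every (better, worse) pair."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_outputs, 2)
    ]

pref_pairs = ranking_to_pairs(
    "Summarize the ticket in one sentence.",
    ["summary A", "summary B", "summary C"],  # teacher's order, best first
)
for pair in pref_pairs:
    print(pair["chosen"], ">", pair["rejected"])
```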

Implementation

Data collection. Sample representative queries from production (or synthesize via the teacher). Generate responses with the teacher. Aim for 5K-50K examples depending on task complexity.
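A minimal sketch of the sampling step, assuming production queries are available as a list of strings: strip whitespace, drop exact duplicates, and draw a fixed-size random sample with a seed so the dataset can be reproduced. Stratifying by task type or time window may matter in practice and is not shown here.

```python
import random

def sample_queries(logged_queries: list[str], max_examples: int, seed: int = 0) -> list[str]:
    """Deduplicate logged queries and sample at most max_examples of them."""
    cleaned = (q.strip() for q in logged_queries)
    unique = list(dict.fromkeys(q for q in cleaned if q))  # keeps first-seen order
    if len(unique) <= max_examples:
        return unique
    return random.Random(seed).sample(unique, max_examples)

logs = ["reset password", "reset password ", "where is my invoice", "cancel plan", ""]
sampled = sample_queries(logs, max_examples=2)
print(sampled)
```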

Student model. Start with a small open-source base (Llama 3 8B, Mistral 7B, Phi-3). Fine-tune with LoRA for efficiency. See LoRA post.

Training duration. Most distillation tasks converge in 1-3 epochs. Training longer tends to overfit the student to the teacher's idiosyncrasies.
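One common way to decide where to stop (an assumption here, not something this post prescribes) is to track loss on a held-out split after each epoch and keep the checkpoint from just before it starts rising. The loss values below are made up for illustration.

```python
def best_epoch(val_losses: list[float], patience: int = 1) -> int:
    """Return the 1-indexed epoch with the lowest validation loss,
    stopping once the loss has failed to improve for `patience` epochs."""
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break
    return best_idx + 1

# Hypothetical per-epoch validation losses.
losses = [0.90, 0.60, 0.55, 0.58, 0.70]
print(best_epoch(losses))
```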

Eval. Test the student on held-out examples and compare it to the teacher on the same task. A typical result is a student that reaches 80-95% of teacher quality at 10-20% of the cost.
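For a classification-style task, "percent of teacher quality" can be computed as the ratio of student accuracy to teacher accuracy on the same held-out labels. This sketch assumes exact-match scoring against ground truth; generation tasks would need a different metric. The labels and predictions are invented.

```python
def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that exactly match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def relative_quality(student_preds, teacher_preds, labels) -> float:
    """Student accuracy as a fraction of teacher accuracy."""
    teacher_acc = accuracy(teacher_preds, labels)
    if teacher_acc == 0:
        raise ValueError("teacher accuracy is zero; ratio is undefined")
    return accuracy(student_preds, labels) / teacher_acc

labels = ["a", "b", "a", "c", "b"]
teacher_preds = ["a", "b", "a", "c", "b"]
student_preds = ["a", "b", "a", "c", "a"]
print(relative_quality(student_preds, teacher_preds, labels))
```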

Scope matters

Narrow tasks distill better than broad ones. 'Classify emails into these 5 categories' distills cleanly; 'be a general-purpose assistant' distills poorly.

If your production system has multiple task types, distill separately per task. In most cases, three single-task students (one for classification, one for extraction, one for summarization) beat one multitask distilled model.
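Serving several single-task students can be as simple as a registry keyed by task name. The sketch below uses two toy functions as stand-ins for distilled models; the task names and the `run_task` helper are hypothetical.

```python
from typing import Callable

def classify_email(text: str) -> str:
    # Hypothetical stand-in for a distilled classification model.
    return "billing" if "invoice" in text.lower() else "other"

def extract_order_id(text: str) -> str:
    # Hypothetical stand-in for a distilled extraction model.
    digits = "".join(ch for ch in text if ch.isdigit())
    return digits or "none"

# One student per task, looked up by task name.
STUDENTS: dict[str, Callable[[str], str]] = {
    "classify": classify_email,
    "extract": extract_order_id,
}

def run_task(task: str, text: str) -> str:
    """Dispatch the input to the student distilled for this task."""
    if task not in STUDENTS:
        raise ValueError(f"no distilled model registered for task {task!r}")
    return STUDENTS[task](text)

print(run_task("classify", "Invoice 4521 is wrong"))
print(run_task("extract", "Invoice 4521 is wrong"))
```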

Pitfalls

Distilling the teacher's mistakes. If the teacher hallucinates 5% of the time, the student will too, or worse. Filter training data carefully and remove obvious teacher errors.
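One possible filter, offered as a sketch rather than the method used here: sample the teacher several times per query and keep only the examples where its answers mostly agree, on the assumption that inconsistent answers are more likely to be wrong. The 0.8 threshold and the sample data are illustrative, and this does not catch errors the teacher makes consistently.

```python
from collections import Counter

def filter_by_agreement(samples_per_query: dict[str, list[str]],
                        min_agreement: float = 0.8) -> list[dict]:
    """Keep queries where the teacher's most common answer reaches min_agreement."""
    kept = []
    for query, answers in samples_per_query.items():
        if not answers:
            continue
        top_answer, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= min_agreement:
            kept.append({"prompt": query, "completion": top_answer})
    return kept

teacher_samples = {
    "Where is my invoice?": ["billing"] * 5,
    "My thing is broken": ["billing", "other", "billing", "refund", "other"],
}
clean = filter_by_agreement(teacher_samples)
print(clean)
```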

Distribution shift. A student distilled on Q1 production data and deployed in Q3, after user patterns have shifted, will lose quality. Plan for periodic re-distillation as traffic evolves.
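A lightweight way to notice that traffic has moved is to compare the distribution of predicted labels between the training window and the current window. The sketch below uses total variation distance; the choice of metric, and any threshold for triggering re-distillation, are assumptions to tune per system.

```python
from collections import Counter

def label_distribution(labels: list[str]) -> dict[str, float]:
    """Relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0 means identical, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Invented label counts for two time windows.
q1_labels = ["billing"] * 6 + ["support"] * 4
q3_labels = ["billing"] * 3 + ["support"] * 4 + ["refund"] * 3

drift = total_variation(label_distribution(q1_labels), label_distribution(q3_labels))
print(round(drift, 3))
```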

Over-relying on distillation. The student matches the teacher on queries that resemble the training set but degrades on novel ones. It hasn't absorbed the teacher's full capabilities, only what the training set captured. Keep a fallback to the teacher for uncertain cases.
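The fallback can be sketched as a confidence gate: serve the student's answer when it is confident, and otherwise pay for a teacher call. This assumes the student exposes some confidence score (for example, the probability of its top label); the 0.9 threshold and both stub models are hypothetical.

```python
def answer_with_fallback(query, student, teacher, min_confidence=0.9):
    """Return (answer, source), using the teacher when the student is unsure."""
    label, confidence = student(query)
    if confidence >= min_confidence:
        return label, "student"
    return teacher(query), "teacher"

def stub_student(query):
    # Hypothetical student returning (label, confidence).
    if "invoice" in query.lower():
        return "billing", 0.97
    return "other", 0.55

def stub_teacher(query):
    # Hypothetical teacher call.
    return "account_access"

print(answer_with_fallback("Where is my invoice?", stub_student, stub_teacher))
print(answer_with_fallback("I can't log in", stub_student, stub_teacher))
```

Logging how often the fallback fires also gives an early signal of the distribution shift described above.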

Cost-benefit decision

Distillation is usually worthwhile when: inference cost exceeds $10K/month AND the task is narrow enough to distill well AND training + maintenance overhead is affordable.

Below the cost threshold, just pay for the teacher. The engineering effort of distillation, fine-tuning, and maintenance is fixed; it only amortizes over significant volume.
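The amortization argument can be put into a rough break-even calculation. In this sketch the $10K/month teacher bill and the 10% student cost come from the figures in this post, while the one-off build cost and monthly maintenance are invented placeholders to replace with your own estimates.

```python
def months_to_break_even(teacher_monthly_cost: float,
                         student_cost_fraction: float,
                         build_cost: float,
                         monthly_maintenance: float):
    """Months until inference savings repay the one-off build cost.
    Returns None if the student never pays for itself."""
    monthly_savings = (
        teacher_monthly_cost * (1 - student_cost_fraction) - monthly_maintenance
    )
    if monthly_savings <= 0:
        return None
    return build_cost / monthly_savings

# build_cost and monthly_maintenance are illustrative assumptions.
months = months_to_break_even(
    teacher_monthly_cost=10_000,
    student_cost_fraction=0.10,
    build_cost=30_000,
    monthly_maintenance=2_000,
)
print(months)
print(months_to_break_even(1_000, 0.10, 30_000, 2_000))
```

With a small teacher bill, the maintenance overhead alone can exceed the savings, which is the case the function reports as `None`.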

See fine-tuning post for the broader decision framework. Distillation is a specific type of fine-tuning with well-defined ROI math.

Some commercial model providers prohibit using their outputs to train competing models, so check the terms of service before collecting teacher data. OpenAI and Anthropic have historically had such restrictions for commercial use. Open-weight teacher models (Llama family, Mistral) are often more permissive, but their licenses vary, so read those too.

Read next
When to fine-tune (and when RAG is fine)
LoRA and adapters: fine-tune at 1% the cost
Small models are back — and that changes the economics
Tags: distillation, small models, fine-tuning