Model Distillation Planning
Plan teacher-to-student model compression, from BERT→DistilBERT to Llama 70B→8B Minitron: compression ratio, accuracy retention, and token budget.
Why Distillation Planning Matters
Why: Distillation reduces model size for deployment. Logit distillation is the simplest method; feature and attention distillation add quality. A higher temperature softens the soft targets.
How: Student params = Teacher params / ratio r. Accuracy retained ≈ 100 − 12·λ·log₁₀(r). Training tokens ≈ base tokens × (1 + 0.4·log₁₀(r)). A minimal code sketch of these formulas follows the quick facts below.
- ✓ Logit most common
- ✓ T=2–4 typical
- ✓ DistilBERT 97%
- ✓ ~1.7× BERT→DistilBERT (40% smaller)
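As a quick sanity check, the planning formulas above can be reproduced in a few lines of Python. This is a minimal sketch of the page's empirical estimates, not a general model of distillation: the function name plan_distillation is made up for the example, and λ (lam) defaults to 1.0 only for illustration and should be calibrated per task.

```python
# Minimal sketch of the planning formulas quoted above (empirical, not exact).
# The constant 12 and the 0.4 log-scaling come from this page; lam (λ) is a
# task/method/temperature-dependent factor you must calibrate yourself.
import math

def plan_distillation(teacher_params_b: float, ratio: float,
                      base_tokens_b: float, lam: float = 1.0) -> dict:
    """Estimate student size, accuracy retention, and training-token budget."""
    student_params_b = teacher_params_b / ratio
    accuracy_retained_pct = 100 - 12 * lam * math.log10(ratio)
    training_tokens_b = base_tokens_b * (1 + 0.4 * math.log10(ratio))
    return {
        "student_params_B": round(student_params_b, 2),
        "accuracy_retained_pct": round(accuracy_retained_pct, 1),
        "training_tokens_B": round(training_tokens_b, 1),
    }

# Example: Llama 70B -> 8B is roughly an 8.75x compression ratio.
print(plan_distillation(teacher_params_b=70, ratio=8.75, base_tokens_b=1.0))
# -> {'student_params_B': 8.0, 'accuracy_retained_pct': 88.7, 'training_tokens_B': 1.4}
```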
Teacher→Student Compression: Size, Accuracy, Training Tokens
[Interactive calculator: example presets, inputs, and charts for Teacher vs Student (Parameters B) and Accuracy vs Compression Ratio]
AI & ML Facts
Hinton et al. 2015 introduced knowledge distillation: "dark knowledge" in softmax probabilities
– Hinton 2015
DistilBERT (Sanh 2019) is 40% smaller, 60% faster, retains 97% of BERT performance
– DistilBERT
Minitron (Muralidharan 2024) distills Llama 70B→8B with minimal quality loss
– Minitron
Temperature T softens logits: p_i ∝ exp(z_i/T). Higher T = softer distribution, more information
– Theory
Key Takeaways
- Knowledge distillation transfers knowledge from a large teacher to a smaller student via soft targets
- Logit distillation (Hinton 2015) is most common; feature and attention distillation add intermediate losses
- Higher temperature softens logits; T=2–4 is typical for better dark knowledge transfer
- DistilBERT achieves 97% of BERT performance at 40% smaller size; Minitron compresses Llama 70B→8B
- Training tokens scale with compression; expect 1–10B+ tokens for LLM distillation
How It Works
1. Logit Distillation
Student learns from teacher softmax outputs (soft targets) at temperature T. KL divergence loss between teacher and student logits.
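To make the loss concrete, here is a hedged PyTorch sketch of a Hinton-style logit distillation objective: a temperature-scaled KL term on soft targets plus ordinary cross-entropy on hard labels. The function name, the alpha mixing weight, and the default T=2 are illustrative assumptions, not a specific library's API.

```python
# Sketch of logit (soft-target) distillation, following the Hinton 2015 recipe.
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, labels,
                            T: float = 2.0, alpha: float = 0.5):
    # Soft-target term: KL divergence between temperature-softened distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: standard cross-entropy against ground-truth classes.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: batch of 4 examples, 10 classes, random tensors.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(logit_distillation_loss(student, teacher, labels).item())
```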
2. Feature Distillation
Match intermediate layer representations. MSE or cosine loss between teacher and student hidden states.
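Below is a rough sketch of a feature-matching loss of this kind, assuming the common case where the student hidden size is smaller than the teacher's, so a learned linear projection maps student states into the teacher's space before the MSE. The class name and the 384/768 dimensions are illustrative, not taken from a specific model.

```python
# Sketch of feature (hidden-state) distillation with a learned projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projects student hidden states into the teacher's hidden size.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden: torch.Tensor,
                teacher_hidden: torch.Tensor) -> torch.Tensor:
        # MSE between projected student states and frozen teacher states.
        return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())

# Toy usage: batch 2, sequence length 8, student dim 384, teacher dim 768.
loss_fn = FeatureDistillLoss(student_dim=384, teacher_dim=768)
loss = loss_fn(torch.randn(2, 8, 384), torch.randn(2, 8, 768))
print(loss.item())
```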
3. Attention Distillation
Transfer attention patterns. Student attention maps are trained to mimic teacher attention.
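One simple way to do this is an MSE between student and teacher attention probabilities under a layer mapping, sketched below. The uniform every-k-th-layer mapping and the assumption that head counts match are simplifications; methods such as TinyBERT define explicit layer mappings.

```python
# Sketch of attention-map distillation: MSE between student and teacher
# attention probabilities, pairing each student layer with a uniformly
# spaced teacher layer (a simplification; see note above).
import torch
import torch.nn.functional as F

def attention_distill_loss(student_attn, teacher_attn):
    """Each argument is a list of [batch, heads, seq, seq] attention tensors."""
    step = max(1, len(teacher_attn) // len(student_attn))
    losses = [
        F.mse_loss(s_layer, teacher_attn[i * step].detach())
        for i, s_layer in enumerate(student_attn)
    ]
    return torch.stack(losses).mean()

# Toy usage: 2 student layers vs 4 teacher layers, 4 heads, sequence length 8.
s_attn = [torch.softmax(torch.randn(1, 4, 8, 8), dim=-1) for _ in range(2)]
t_attn = [torch.softmax(torch.randn(1, 4, 8, 8), dim=-1) for _ in range(4)]
print(attention_distill_loss(s_attn, t_attn).item())
```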
4. Temperature
Higher T produces softer probabilities, revealing "dark knowledge" (e.g., "2" vs "7" similarity). T=2–4 is typical.
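The toy snippet below makes the effect visible: the same made-up three-class logits are softened at T=1, 2, and 4, and the probability mass assigned to the runner-up class (the dark knowledge) grows as T increases.

```python
# Demonstrates how temperature softens a softmax distribution.
import torch

logits = torch.tensor([6.0, 3.5, 0.5])  # made-up logits for three classes

for T in (1, 2, 4):
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}:", [round(p, 3) for p in probs.tolist()])
# T=1: [0.921, 0.076, 0.004]  -- winner dominates, little dark knowledge
# T=4: [0.559, 0.299, 0.141]  -- relative similarity of classes becomes visible
```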
5. Training Data
Often same data as teacher. For LLMs, 1–10B+ tokens. DistilBERT used the same corpus as BERT.
Expert Tips
Start with logit distillation
Simplest and most effective. Add feature/attention loss if quality plateaus.
Temperature 2–4
Softer targets transfer more dark knowledge. T=1 is the ordinary softmax; hard labels correspond to T→0.
Moderate compression first
3–5× is safe. 10×+ requires more data and tuning.
Combine with quantization
Distill first, then quantize for maximum compression.
Distillation Methods
| Method | What is transferred | Complexity | Typical use |
|---|---|---|---|
| Logit | Softmax outputs | Low | Classification, generation |
| Feature | Hidden states | Medium | BERT-style, embeddings |
| Attention | Attention maps | High | Transformer alignment |
Frequently Asked Questions
What is knowledge distillation?
Training a smaller student model to mimic a larger teacher. The student learns from soft targets (teacher probabilities) rather than hard labels, capturing "dark knowledge" about class similarities.
Logit vs feature vs attention distillation?
Logit: match output probabilities (simplest). Feature: match hidden layer representations. Attention: match attention maps. Logit is most common; add others for quality gains.
What temperature to use?
T=2–4 is typical. Higher T = softer probabilities and more dark knowledge; T=1 is the ordinary softmax, and T→0 approaches hard labels. Start with T=2.
How much data for distillation?
Often same as teacher. DistilBERT used the BERT corpus. For LLMs: 1–10B+ tokens. More compression → more data helps.
Distillation vs quantization?
Distillation reduces parameters (smaller architecture). Quantization reduces precision (same params, fewer bits). Use both for max compression.
Can I distill GPT-4?
OpenAI likely does this internally (e.g., GPT-4o-mini). For open models: Llama 70B→8B (Minitron) and BERT→DistilBERT are proven.
Accuracy retention formula?
Empirical: accuracy ≈ 100 − 12·λ·log₁₀(r). λ depends on task, method, and temperature. Higher compression → more loss.
How accurate is this calculator?
Size and ratio are exact. Accuracy and tokens are empirical estimates; actual values depend on architecture, data, and tuning.
Disclaimer: This calculator provides estimates for educational and planning purposes. Size and ratio are exact. Accuracy retention and training tokens are empirical approximations; actual values depend on model architecture, distillation method, data quality, hyperparameters, and task. For production, validate with your specific setup. References: Hinton et al. 2015, Sanh 2019 DistilBERT, Muralidharan 2024 Minitron, Zhang 2024 TinyLlama.
Related Calculators
Model Quantization Tradeoff Calculator
Compare GPTQ, AWQ, and GGUF quantization methods. Calculate memory savings, speed gains, and accuracy tradeoffs.
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
Context Window Scaling Cost Calculator
Analyze quadratic attention scaling costs. Compare standard vs Flash Attention memory and throughput at different context lengths.
Gradient Accumulation Steps Calculator
Calculate accumulation steps to achieve target effective batch size on limited GPU memory. Based on DeepSpeed ZeRO research.
Inference Throughput & Latency Calculator
Estimate tokens/sec, time-to-first-token, and inter-token latency for LLM serving on various GPU configurations.
KV Cache Size Estimator
Calculate KV cache memory for LLM inference with MHA, MQA, and GQA attention types. Based on PagedAttention research.