🧪 Model Distillation Planning

Plan teacher-to-student model compression, from BERT→DistilBERT to Llama 70B→8B Minitron: compression ratio, accuracy retention, and training-token budget.

Concept Fundamentals
  • Method: Teacher → Student (knowledge transfer)
  • Temperature: softmax scaling T (soft label smoothing)
  • Loss function: KL divergence (distribution matching)
  • Paper: Hinton et al. 2015 (model compression)

Why Distillation Planning Matters

Why: Distillation reduces model size for deployment. Logit distillation is simplest; feature and attention add quality. Higher temperature softens targets.

How: Student params = Teacher / ratio. Accuracy ≈ 100 - λ·log₁₀(r)·12. Training tokens scale with base × (1 + 0.4·log₁₀(r)).
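
These formulas are easy to sanity-check in a few lines of Python. The sketch below is not the calculator's actual code: the quality factor lam and the base token budget base_tokens are assumptions (values around 0.72 and 5e9 reproduce the 91.9% and ~6.88B figures in the worked example further down).

    import math

    def distillation_plan(teacher_params, ratio, lam=0.72, base_tokens=5e9):
        # lam and base_tokens are assumed, task-dependent constants, not values
        # published by this page; calibrate them against your own distillation runs.
        student_params = teacher_params / ratio
        accuracy = 100 - lam * math.log10(ratio) * 12          # empirical retention, %
        tokens = base_tokens * (1 + 0.4 * math.log10(ratio))   # training-token estimate
        return student_params, accuracy, tokens

    p, acc, tok = distillation_plan(70e9, 8.75)
    print(f"{p/1e9:.2f}B params, {acc:.1f}% retention, ~{tok/1e9:.2f}B tokens")
    # 8.00B params, 91.9% retention, ~6.88B tokens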

  • โ—Logit most common
  • โ—T=2โ€“4 typical
  • โ—DistilBERT 97%
  • โ—5ร— BERTโ†’DistilBERT
🧪 Knowledge Distillation

Teacher→Student Compression: Size, Accuracy, Training Tokens



Inputs
  • Teacher parameters (B): e.g., 70 for 70B
  • Compression ratio: e.g., 8.75 for 70B→8B
  • Temperature: T=2–4 typical
  • Student parameters (B): optional, used to derive the ratio

Calculated results (70B teacher, 8.75× compression)
  • Student Params: 8.00B
  • Compression: 8.75×
  • Accuracy: 91.9%
  • Training Tokens: ~6.88B
  • Teacher Size: 130.39 GB
  • Student Size: 14.90 GB
  • Saved: 115.48 GB

[Charts: Teacher vs Student (Parameters B); Accuracy vs Compression Ratio]

1. Student parameters
P_{student} = \frac{P_{teacher}}{r} = \frac{70 \times 10^9}{8.75} = 8{,}000{,}000{,}000 \approx 8.00\text{B}
2. Accuracy retention (empirical)
\text{Accuracy} \approx 100 - \lambda \cdot \log_{10}(r) \cdot 12 \approx 91.9\%
3. Training tokens estimate
T \approx T_{base} \cdot (1 + 0.4 \cdot \log_{10}(r)) \approx 6.88\text{B tokens}
4. Size comparison
\text{Teacher: } 130.39\text{ GB} \quad \text{Student: } 14.90\text{ GB} \quad \Delta = 115.48\text{ GB}
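
The GB figures are consistent with FP16/BF16 weights (2 bytes per parameter) reported in binary gigabytes; that storage assumption is inferred from the numbers, not stated by the calculator.

    BYTES_PER_PARAM = 2          # assumed FP16/BF16 checkpoint
    GIB = 1024 ** 3
    teacher_gb = 70e9 * BYTES_PER_PARAM / GIB    # ~130.39
    student_gb = 8e9 * BYTES_PER_PARAM / GIB     # ~14.90
    print(round(teacher_gb, 2), round(student_gb, 2), round(teacher_gb - student_gb, 2))
    # 130.39 14.9 115.48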


📋 Key Takeaways

  • Knowledge distillation transfers knowledge from a large teacher to a smaller student via soft targets
  • Logit distillation (Hinton 2015) is most common; feature and attention distillation add intermediate losses
  • Higher temperature softens logits; T=2–4 is typical for better dark knowledge transfer
  • DistilBERT achieves 97% of BERT performance while being 40% smaller; Minitron compresses Llama 70B→8B
  • Training tokens scale with compression; expect 1–10B+ tokens for LLM distillation

💡 Did You Know

📜 Hinton et al. 2015 introduced knowledge distillation: "dark knowledge" in softmax probabilities
📏 DistilBERT (Sanh 2019) is 40% smaller, 60% faster, and retains 97% of BERT performance
🦙 Minitron (Muralidharan 2024) distills Llama 70B→8B with minimal quality loss
🤖 GPT-4o-mini is likely a distilled version of GPT-4: smaller, faster, cheaper
🌡️ Temperature T softens logits: p_i ∝ exp(z_i/T). Higher T = softer, more info
📊 Feature distillation matches intermediate layers; attention distillation matches attention maps
⚡ TinyLlama 1.1B was trained on 3T tokens; distillation can reduce data needs
🔗 Distillation + quantization + pruning = a full compression pipeline for deployment

📖 How It Works

1. Logit Distillation

The student learns from the teacher's softmax outputs (soft targets) at temperature T, with a KL divergence loss between the softened teacher and student output distributions.
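
A minimal PyTorch sketch of this loss, assuming the common Hinton-style mix of a softened KL term and a hard-label cross-entropy term (alpha and T are hyperparameters, not values prescribed here; teacher logits are assumed to be computed under torch.no_grad()):

    import torch
    import torch.nn.functional as F

    def logit_distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft-target term: KL divergence between T-softened distributions, scaled by T^2
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-target term: ordinary cross-entropy against ground-truth labels
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard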

2. Feature Distillation

Match intermediate layer representations. MSE or cosine loss between teacher and student hidden states.
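
One common way to wire this up (a sketch; the learned linear projection is an assumption used to reconcile the student's smaller hidden size with the teacher's):

    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureDistillationLoss(nn.Module):
        def __init__(self, d_student, d_teacher):
            super().__init__()
            # Project student states up to the teacher's width before comparing
            self.proj = nn.Linear(d_student, d_teacher)

        def forward(self, student_hidden, teacher_hidden):
            # MSE between projected student states and frozen teacher states
            return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())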

3. Attention Distillation

Transfer attention patterns. Student attention maps are trained to mimic teacher attention.
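
In its simplest form this is another MSE term (a sketch; it assumes the student and teacher attention maps already share head count and sequence length):

    import torch.nn.functional as F

    def attention_distillation_loss(student_attn, teacher_attn):
        # student_attn / teacher_attn: (batch, heads, seq_len, seq_len) attention probabilities
        return F.mse_loss(student_attn, teacher_attn.detach())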

4. Temperature

Higher T produces softer probabilities and reveals "dark knowledge" (e.g., "2" vs "7" similarity). T=2–4 typical.
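
A quick numeric illustration of the softening effect, with made-up logits:

    import numpy as np

    def softmax_with_temperature(logits, T=1.0):
        z = np.asarray(logits, dtype=float) / T
        z -= z.max()                     # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    logits = [5.0, 2.0, 0.5]
    print(softmax_with_temperature(logits, T=1.0))   # sharp: ~[0.94, 0.05, 0.01]
    print(softmax_with_temperature(logits, T=4.0))   # soft:  ~[0.56, 0.26, 0.18]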

5. Training Data

Often the same data as the teacher. For LLMs, 1–10B+ tokens. DistilBERT used the same corpus as BERT.

🎯 Expert Tips

Start with logit distillation

Simplest and most effective. Add feature/attention loss if quality plateaus.

Temperature 2–4

Softer targets transfer more dark knowledge. T=1 is the plain, unsoftened softmax.

Moderate compression first

3–5× is safe. 10×+ requires more data and tuning.

Combine with quantization

Distill first, then quantize for maximum compression.
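
One way to chain the two steps, sketched with PyTorch's dynamic INT8 quantization (the tiny student model here is a hypothetical placeholder; a real student would be a transformer checkpoint):

    import torch
    import torch.nn as nn

    # Stand-in for an already-distilled student model
    student = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

    # Dynamic quantization: Linear weights stored in INT8, activations kept in FP32
    quantized = torch.quantization.quantize_dynamic(student, {nn.Linear}, dtype=torch.qint8)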

โš–๏ธ Distillation Methods

Method      What is transferred   Complexity   Typical use
Logit       Softmax outputs       Low          Classification, generation
Feature     Hidden states         Medium       BERT-style, embeddings
Attention   Attention maps        High         Transformer alignment

โ“ Frequently Asked Questions

What is knowledge distillation?

Training a smaller student model to mimic a larger teacher. The student learns from soft targets (teacher probabilities) rather than hard labels, capturing "dark knowledge" about class similarities.

Logit vs feature vs attention distillation?

Logit: match output probabilities (simplest). Feature: match hidden layer representations. Attention: match attention maps. Logit is most common; add others for quality gains.

What temperature to use?

T=2–4 typical. Higher T = softer probabilities, more dark knowledge. T=1 is the standard, unsoftened softmax. Start with T=2.

How much data for distillation?

Often the same as the teacher. DistilBERT used the BERT corpus. For LLMs: 1–10B+ tokens. More compression → more data helps.

Distillation vs quantization?

Distillation reduces parameters (smaller architecture). Quantization reduces precision (same params, fewer bits). Use both for max compression.

Can I distill GPT-4?

OpenAI likely does this internally (e.g., GPT-4o-mini). For open models: Llama 70B→8B (Minitron) and BERT→DistilBERT are proven.

Accuracy retention formula?

Empirical: accuracy ≈ 100 - λ·log₁₀(r)·12. λ depends on task, method, and temperature. Higher compression → more loss.

How accurate is this calculator?

Size and ratio are exact. Accuracy and tokens are empirical estimates; actual values depend on architecture, data, and tuning.

📊 Distillation by the Numbers

  • BERT→DistilBERT: 5× compression
  • Llama 70B→8B: 8.75× compression
  • DistilBERT accuracy retention: 97%
  • Hinton paper: 2015

โš ๏ธ Disclaimer: This calculator provides estimates for educational and planning purposes. Size and ratio are exact. Accuracy retention and training tokens are empirical approximations โ€” actual values depend on model architecture, distillation method, data quality, hyperparameters, and task. For production, validate with your specific setup. References: Hinton et al. 2015, Sanh 2019 DistilBERT, Muralidharan 2024 Minitron, Zhang 2024 TinyLlama.
