Model Distillation Planning
Plan teacher-to-student model compression, from BERT→DistilBERT to Llama 70B→8B Minitron: compression ratio, accuracy retention, and token budget.
Why Distillation Planning Matters
Why: Distillation reduces model size for deployment. Logit distillation is the simplest method; feature and attention distillation add quality. A higher temperature softens the soft targets.
How: Student params = Teacher params / ratio r. Accuracy retained ≈ 100 − 12·λ·log₁₀(r). Training tokens ≈ base tokens × (1 + 0.4·log₁₀(r)). A minimal code sketch of these formulas follows the quick facts below.
- ✓ Logit most common
- ✓ T=2–4 typical
- ✓ DistilBERT 97%
- ✓ ~1.7× BERT→DistilBERT (40% smaller)
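As a quick sanity check, the planning formulas above can be reproduced in a few lines of Python. This is a minimal sketch of the page's empirical estimates, not a general model of distillation: the function name plan_distillation is made up for the example, and λ (lam) defaults to 1.0 only for illustration and should be calibrated per task.

```python
# Minimal sketch of the planning formulas quoted above (empirical, not exact).
# The constant 12 and the 0.4 log-scaling come from this page; lam (λ) is a
# task/method/temperature-dependent factor you must calibrate yourself.
import math

def plan_distillation(teacher_params_b: float, ratio: float,
                      base_tokens_b: float, lam: float = 1.0) -> dict:
    """Estimate student size, accuracy retention, and training-token budget."""
    student_params_b = teacher_params_b / ratio
    accuracy_retained_pct = 100 - 12 * lam * math.log10(ratio)
    training_tokens_b = base_tokens_b * (1 + 0.4 * math.log10(ratio))
    return {
        "student_params_B": round(student_params_b, 2),
        "accuracy_retained_pct": round(accuracy_retained_pct, 1),
        "training_tokens_B": round(training_tokens_b, 1),
    }

# Example: Llama 70B -> 8B is roughly an 8.75x compression ratio.
print(plan_distillation(teacher_params_b=70, ratio=8.75, base_tokens_b=1.0))
# -> {'student_params_B': 8.0, 'accuracy_retained_pct': 88.7, 'training_tokens_B': 1.4}
```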
Teacher→Student Compression: Size, Accuracy, Training Tokens
[Interactive calculator: example presets, inputs, and charts for Teacher vs Student (Parameters B) and Accuracy vs Compression Ratio]
AI & ML Facts
Hinton et al. 2015 introduced knowledge distillation: "dark knowledge" in softmax probabilities
– Hinton 2015
DistilBERT (Sanh 2019) is 40% smaller, 60% faster, retains 97% of BERT performance
– DistilBERT
Minitron (Muralidharan 2024) distills Llama 70B→8B with minimal quality loss
– Minitron
Temperature T softens logits: p_i ∝ exp(z_i/T). Higher T = softer distribution, more information
– Theory
Key Takeaways
- Knowledge distillation transfers knowledge from a large teacher to a smaller student via soft targets
- Logit distillation (Hinton 2015) is most common; feature and attention distillation add intermediate losses
- Higher temperature softens logits; T=2–4 is typical for better dark knowledge transfer
- DistilBERT achieves 97% of BERT performance at 40% smaller size; Minitron compresses Llama 70B→8B
- Training tokens scale with compression; expect 1–10B+ tokens for LLM distillation
How It Works
1. Logit Distillation
Student learns from teacher softmax outputs (soft targets) at temperature T. KL divergence loss between teacher and student logits.
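To make the loss concrete, here is a hedged PyTorch sketch of a Hinton-style logit distillation objective: a temperature-scaled KL term on soft targets plus ordinary cross-entropy on hard labels. The function name, the alpha mixing weight, and the default T=2 are illustrative assumptions, not a specific library's API.

```python
# Sketch of logit (soft-target) distillation, following the Hinton 2015 recipe.
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, labels,
                            T: float = 2.0, alpha: float = 0.5):
    # Soft-target term: KL divergence between temperature-softened distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: standard cross-entropy against ground-truth classes.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: batch of 4 examples, 10 classes, random tensors.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(logit_distillation_loss(student, teacher, labels).item())
```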
2. Feature Distillation
Match intermediate layer representations. MSE or cosine loss between teacher and student hidden states.
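Below is a rough sketch of a feature-matching loss of this kind, assuming the common case where the student hidden size is smaller than the teacher's, so a learned linear projection maps student states into the teacher's space before the MSE. The class name and the 384/768 dimensions are illustrative, not taken from a specific model.

```python
# Sketch of feature (hidden-state) distillation with a learned projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projects student hidden states into the teacher's hidden size.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden: torch.Tensor,
                teacher_hidden: torch.Tensor) -> torch.Tensor:
        # MSE between projected student states and frozen teacher states.
        return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())

# Toy usage: batch 2, sequence length 8, student dim 384, teacher dim 768.
loss_fn = FeatureDistillLoss(student_dim=384, teacher_dim=768)
loss = loss_fn(torch.randn(2, 8, 384), torch.randn(2, 8, 768))
print(loss.item())
```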
3. Attention Distillation
Transfer attention patterns. Student attention maps are trained to mimic teacher attention.
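One simple way to do this is an MSE between student and teacher attention probabilities under a layer mapping, sketched below. The uniform every-k-th-layer mapping and the assumption that head counts match are simplifications; methods such as TinyBERT define explicit layer mappings.

```python
# Sketch of attention-map distillation: MSE between student and teacher
# attention probabilities, pairing each student layer with a uniformly
# spaced teacher layer (a simplification; see note above).
import torch
import torch.nn.functional as F

def attention_distill_loss(student_attn, teacher_attn):
    """Each argument is a list of [batch, heads, seq, seq] attention tensors."""
    step = max(1, len(teacher_attn) // len(student_attn))
    losses = [
        F.mse_loss(s_layer, teacher_attn[i * step].detach())
        for i, s_layer in enumerate(student_attn)
    ]
    return torch.stack(losses).mean()

# Toy usage: 2 student layers vs 4 teacher layers, 4 heads, sequence length 8.
s_attn = [torch.softmax(torch.randn(1, 4, 8, 8), dim=-1) for _ in range(2)]
t_attn = [torch.softmax(torch.randn(1, 4, 8, 8), dim=-1) for _ in range(4)]
print(attention_distill_loss(s_attn, t_attn).item())
```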
4. Temperature
Higher T produces softer probabilities, revealing "dark knowledge" (e.g., "2" vs "7" similarity). T=2–4 is typical.
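The toy snippet below makes the effect visible: the same made-up three-class logits are softened at T=1, 2, and 4, and the probability mass assigned to the runner-up class (the dark knowledge) grows as T increases.

```python
# Demonstrates how temperature softens a softmax distribution.
import torch

logits = torch.tensor([6.0, 3.5, 0.5])  # made-up logits for three classes

for T in (1, 2, 4):
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}:", [round(p, 3) for p in probs.tolist()])
# T=1: [0.921, 0.076, 0.004]  -- winner dominates, little dark knowledge
# T=4: [0.559, 0.299, 0.141]  -- relative similarity of classes becomes visible
```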
5. Training Data
Often same data as teacher. For LLMs, 1–10B+ tokens. DistilBERT used the same corpus as BERT.
Expert Tips
Start with logit distillation
Simplest and most effective. Add feature/attention loss if quality plateaus.
Temperature 2–4
Softer targets transfer more dark knowledge. T=1 is the ordinary softmax; hard labels correspond to T→0.
Moderate compression first
3–5× is safe. 10×+ requires more data and tuning.
Combine with quantization
Distill first, then quantize for maximum compression.
Distillation Methods
| Method | What is transferred | Complexity | Typical use |
|---|---|---|---|
| Logit | Softmax outputs | Low | Classification, generation |
| Feature | Hidden states | Medium | BERT-style, embeddings |
| Attention | Attention maps | High | Transformer alignment |
Frequently Asked Questions
What is knowledge distillation?
Training a smaller student model to mimic a larger teacher. The student learns from soft targets (teacher probabilities) rather than hard labels, capturing "dark knowledge" about class similarities.
Logit vs feature vs attention distillation?
Logit: match output probabilities (simplest). Feature: match hidden layer representations. Attention: match attention maps. Logit is most common; add others for quality gains.
What temperature to use?
T=2–4 is typical. Higher T = softer probabilities and more dark knowledge; T=1 is the ordinary softmax, and T→0 approaches hard labels. Start with T=2.
How much data for distillation?
Often same as teacher. DistilBERT used the BERT corpus. For LLMs: 1–10B+ tokens. More compression → more data helps.
Distillation vs quantization?
Distillation reduces parameters (smaller architecture). Quantization reduces precision (same params, fewer bits). Use both for max compression.
Can I distill GPT-4?
OpenAI likely does this internally (e.g., GPT-4o-mini). For open models: Llama 70B→8B (Minitron) and BERT→DistilBERT are proven.
Accuracy retention formula?
Empirical: accuracy ≈ 100 − 12·λ·log₁₀(r). λ depends on task, method, and temperature. Higher compression → more loss.
How accurate is this calculator?
Size and ratio are exact. Accuracy and tokens are empirical estimates; actual values depend on architecture, data, and tuning.
Disclaimer: This calculator provides estimates for educational and planning purposes. Size and ratio are exact. Accuracy retention and training tokens are empirical approximations; actual values depend on model architecture, distillation method, data quality, hyperparameters, and task. For production, validate with your specific setup. References: Hinton et al. 2015, Sanh 2019 DistilBERT, Muralidharan 2024 Minitron, Zhang 2024 TinyLlama.
Related Calculators
Model Quantization Tradeoff Calculator
Compare GPTQ, AWQ, and GGUF quantization methods. Calculate memory savings, speed gains, and accuracy tradeoffs.
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
Context Window Scaling Cost Calculator
Analyze quadratic attention scaling costs. Compare standard vs Flash Attention memory and throughput at different context lengths.
Gradient Accumulation Steps Calculator
Calculate accumulation steps to achieve target effective batch size on limited GPU memory. Based on DeepSpeed ZeRO research.
Inference Throughput & Latency Calculator
Estimate tokens/sec, time-to-first-token, and inter-token latency for LLM serving on various GPU configurations.
KV Cache Size Estimator
Calculate KV cache memory for LLM inference with MHA, MQA, and GQA attention types. Based on PagedAttention research.