Chinchilla Scaling Laws
Compute-optimal model size and training tokens. DeepMind 2022: D=20P, C=6PD. Smaller models trained on more data beat larger undertrained models.
Why This ML Metric Matters
Why: For a given compute budget, training smaller models on more tokens yields better performance than undertraining larger models.
How: The calculator applies D=20P (optimal tokens per parameter) and C=6PD (compute formula) to derive optimal model size from FLOPs or GPU hours.
D=20P: More Data Beats Bigger Models
DeepMind's Chinchilla (2022) showed that for a given compute budget, smaller models trained on more tokens outperform larger undertrained models. Find your compute-optimal size here.
Quick Examples – Click to Load
Inputs
Scaling Law Curve (Loss vs Compute)
Optimal Frontier (Params vs Tokens)
AI & ML Facts
Chinchilla 2022: 70B model trained on 1.4T tokens outperformed Gopher 280B trained on 300B
– Hoffmann et al.
D=20P: optimal tokens = 20 × parameters. C=6PD: compute = 6 × P × D
– Chinchilla
Sardana 2024: inference-adjusted scaling may favor smaller models trained longer
– Beyond Chinchilla
Kaplan 2020: loss follows a power law, L ∼ 1/P^α + 1/D^β
– OpenAI scaling
Chinchilla "API sheet" (March 2026)
DeepMind Chinchilla (2022): for a fixed training FLOP budget, train smaller models on more tokens; optimal tokens ≈ 20× parameters (D = 20P). Total training compute ≈ 6PD FLOPs (forward + backward heuristic). Frontier labs often exceed 20× in practice for multimodal or continued pre-training; treat this as a baseline, not a contractual SLA.
| Formula | Role |
|---|---|
| D = 20P | Target training tokens D for parameter count P |
| C ≈ 6PD | Order-of-magnitude training FLOPs |
| Inference trade-offs | See Sardana et al.: smaller models can win at deployment even if the training-optimal size differs |
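For concreteness, here is a minimal Python sketch of the two formulas above. The 20 tokens-per-parameter ratio and the 6 FLOPs-per-parameter-per-token constant are the Chinchilla heuristics, not exact values, and the 7B example is purely illustrative.

```python
def chinchilla_tokens(params: float) -> float:
    """Chinchilla-optimal training tokens: D = 20 * P."""
    return 20.0 * params

def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute: C ~ 6 * P * D FLOPs."""
    return 6.0 * params * tokens

# Illustrative example: a 7B-parameter model
p = 7e9
d = chinchilla_tokens(p)   # 1.4e11 tokens (140B)
c = training_flops(p, d)   # ~5.9e21 FLOPs
```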
Key Takeaways
- D = 20P: For compute-optimal training, use ~20 tokens per parameter (Chinchilla 2022)
- Undertrained models waste compute: larger models trained on fewer tokens underperform smaller, well-trained models
- Chinchilla 70B beat Gopher 280B with 4× fewer parameters by training on 4× more data
- Inference considerations: for high-inference workloads, train smaller + longer (Sardana et al. 2024)
How It Works
1. Chinchilla Scaling Law
Hoffmann et al. 2022 found that for a given compute budget C, the optimal model uses D ≈ 20P tokens. This implies C = 6PD = 120P², so P_opt = √(C/120).
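A quick numeric check of that derivation (a sketch; the 1e24 FLOP budget is an arbitrary example):

```python
import math

C = 1e24                      # example compute budget in FLOPs
p_opt = math.sqrt(C / 120)    # ~9.1e10 parameters (~91B)
d_opt = 20 * p_opt            # ~1.8e12 tokens (~1.8T)

# 6 * P * D recovers the original budget (up to float rounding)
assert abs(6 * p_opt * d_opt - C) / C < 1e-9
```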
2. Compute Budget
Enter FLOPs directly, or let the calculator derive them from GPU count × wall-clock hours × per-GPU throughput × utilization. We use 40% utilization as a typical figure for large-scale training.
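A minimal sketch of that conversion, assuming the 40% utilization default from the text; the GPU count, duration, and the 1000 TFLOP/s peak throughput in the example are illustrative, not properties of any particular hardware.

```python
def budget_flops(gpus: int, hours: float, peak_tflops: float,
                 utilization: float = 0.4) -> float:
    """Approximate training FLOPs from a GPU allocation.

    gpus         number of GPUs
    hours        wall-clock training hours
    peak_tflops  per-GPU peak throughput in TFLOP/s (depends on GPU and precision)
    utilization  fraction of peak actually sustained (0.4 is a planning default)
    """
    return gpus * hours * 3600 * peak_tflops * 1e12 * utilization

# Illustrative: 256 GPUs for 30 days at an assumed 1000 TFLOP/s peak
c = budget_flops(gpus=256, hours=30 * 24, peak_tflops=1000)  # ~2.7e23 FLOPs
```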
3. Mode 2: Model โ Tokens
Given a model size P, Chinchilla-optimal tokens are D = 20P. For high-inference workloads, Sardana et al. suggest training smaller models for longer.
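One rough way to see the Sardana-style argument is to count inference FLOPs (~2 FLOPs per parameter per token, forward pass only) alongside training FLOPs. The sketch below is simplified FLOP accounting, not the paper's full optimization, and the 10T lifetime inference tokens and model sizes are illustrative assumptions; whether the smaller model actually matches quality is exactly what Sardana et al. analyze.

```python
def lifetime_flops(params: float, train_tokens: float,
                   inference_tokens: float) -> float:
    """Training (~6*P*D) plus inference (~2*P per token) compute."""
    return 6 * params * train_tokens + 2 * params * inference_tokens

# Two models at the same training budget (6*P*D is equal),
# each serving an assumed 10T inference tokens over its lifetime:
big   = lifetime_flops(7e9,   280e9, 10e12)  # ~1.2e22 train + 1.4e23 inference ~ 1.5e23
small = lifetime_flops(3.5e9, 560e9, 10e12)  # same training cost, ~7e22 inference ~ 8.2e22
```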
4. Loss Estimation
We use a simplified Kaplan-style power law: L ≈ A/P^α + B/D^β. The exact coefficients vary by model family and data quality.
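As a sketch, here is the parametric form with coefficients approximately equal to those fitted in Hoffmann et al. 2022 (approach 3); treat them as illustrative, since fitted values differ across data mixes, tokenizers, and architectures.

```python
def chinchilla_loss(params: float, tokens: float) -> float:
    """Parametric loss L(P, D) = E + A/P^alpha + B/D^beta.

    Coefficients roughly follow the Hoffmann et al. 2022 fit;
    real values depend on data quality and model family.
    """
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / params**alpha + B / tokens**beta

# Chinchilla-scale example: 70B parameters, 1.4T tokens
loss = chinchilla_loss(70e9, 1.4e12)   # ~1.94 nats/token under this fit
```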
Expert Tips
Stick to D=20P for pre-training
Unless you have evidence otherwise, the Chinchilla ratio is a strong default for compute-optimal training.
Inference-heavy? Train smaller
If inference cost dominates, consider 50% of the parameters with 2× the tokens, per Sardana et al. 2024.
Data quality matters
Chinchilla assumes high-quality data. Low-quality data may need different scaling.
Validate with small runs
Run 1% scale experiments to validate scaling assumptions before full training.
This Calculator vs. Other Tools
| Feature | This Calculator | Manual | Spreadsheet | Papers Only |
|---|---|---|---|---|
| D=20P Chinchilla formula | ✓ | ⚠️ | ⚠️ | ✓ |
| Compute → Model Size | ✓ | ✗ | ⚠️ | ✗ |
| Model → Optimal Tokens | ✓ | ⚠️ | ⚠️ | ✗ |
| GPU-hours to FLOPs | ✓ | ✗ | ⚠️ | ✗ |
| Inference-adjusted (Sardana) | ✓ | ✗ | ✗ | ⚠️ |
| Scaling law charts | ✓ | ✗ | ✗ | ✗ |
| Example presets | ✓ | ✗ | ✗ | ✗ |
| Copy & share | ✓ | ✗ | ✗ | ✗ |
Frequently Asked Questions
What is the Chinchilla scaling law?
Chinchilla (Hoffmann et al. 2022) found that compute-optimal training uses D ≈ 20P tokens, i.e. about 20 tokens per parameter. Larger models trained on fewer tokens are undertrained and underperform smaller, well-trained ones.
Why did Chinchilla 70B beat Gopher 280B?
Chinchilla used more than 4× the training data (1.4T vs 300B tokens) with 4× fewer parameters, at a comparable training compute budget. Better data scaling beat raw parameter count.
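A quick sanity check with the C ≈ 6PD heuristic shows the two training budgets land in the same ballpark, which is the point of the comparison:

```python
gopher     = 6 * 280e9 * 300e9    # ~5.0e23 FLOPs
chinchilla = 6 * 70e9  * 1.4e12   # ~5.9e23 FLOPs (same order of magnitude)
```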
What is the C=6PD formula?
C = 6 × P × D estimates the total FLOPs to train a transformer: each parameter contributes roughly 6 FLOPs per training token, about 2 in the forward pass and 4 in the backward pass.
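A small worked example of that split; the 1B-parameter model and 20B-token run are purely illustrative.

```python
params = 1e9             # hypothetical 1B-parameter model
tokens = 20e9            # 20B training tokens (the 20x Chinchilla ratio)

forward  = 2 * params * tokens   # ~2 FLOPs per parameter per token
backward = 4 * params * tokens   # ~4 FLOPs per parameter per token
total    = forward + backward    # = 6*P*D ~ 1.2e20 FLOPs
```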
When should I use inference-adjusted scaling?
When inference requests greatly exceed training tokens (e.g., 1B+ requests for a 7B model). Sardana et al. 2024 suggest training smaller models for longer.
Does D=20P apply to code or images?
Chinchilla was trained on text. Code and images may have different optimal ratios; use as a starting point and validate.
How do I convert GPU-hours to FLOPs?
FLOPs ≈ GPU-hours × 3600 × per-GPU throughput (in FLOP/s) × utilization, where GPU-hours = number of GPUs × wall-clock hours. We use 40% utilization by default.
What about Llama 3 using 15T tokens for 70B?
Llama 3 70B was trained on ~15T tokens, about 214 tokens per parameter (roughly 10× the Chinchilla-optimal 20). Meta chose a data-heavy strategy; performance gains from extra data can outweigh strict Chinchilla-optimal training.
Where do the loss estimates come from?
We use a simplified Kaplan-style power law. Real loss depends on architecture, data quality, and training setup.
⚠️ Disclaimer: This calculator provides estimates based on Chinchilla scaling laws for educational and planning purposes. Actual optimal ratios depend on architecture, data quality, and use case. GPU throughput and utilization are approximations. For production decisions, validate with small-scale experiments and consult the cited papers.
Related Calculators
LLM Training Cost Estimator
Estimate LLM training costs using the C=6PD formula. Calculate GPU hours, total FLOPs, and dollar costs based on Chinchilla scaling laws.
GPU VRAM / Memory Requirements Calculator
Calculate GPU memory requirements for training and inference. Compare FP32, FP16, BF16, INT8, and INT4 precision formats.
Token Count & LLM API Cost Calculator
Compare token costs across OpenAI, Anthropic, Google, and Mistral. Calculate input vs output token pricing for any LLM API.
LoRA / QLoRA Fine-Tuning Parameter Calculator
Calculate trainable parameters, memory savings, and adapter sizes for LoRA and QLoRA fine-tuning of large language models.
Neural Network FLOPs Calculator
Calculate floating-point operations for neural network layers: Linear, Conv2D, Attention, LayerNorm, and Embedding.
Neural Network Parameter Counter
Count total parameters for neural network architectures. Supports Linear, Conv2D, Embedding, LayerNorm, and MultiHeadAttention layers.