
Chinchilla Scaling Laws

Compute-optimal model size and training tokens. DeepMind 2022: D=20P, C=6PD. Smaller models trained on more data beat larger undertrained models.

Concept Fundamentals
C = 6PD
Compute Law
6 × params × tokens
D* ∝ C^0.5
Optimal Scaling
Compute-optimal training
~20 per parameter
Token Ratio
Data-to-model balance
Hoffmann et al. 2022
Paper
Chinchilla scaling laws

Why This ML Metric Matters

Why: For a given compute budget, training smaller models on more tokens yields better performance than undertraining larger models.

How: The calculator applies D=20P (optimal tokens per parameter) and C=6PD (compute formula) to derive optimal model size from FLOPs or GPU hours.

🦙
CHINCHILLA CHANGED LLM TRAINING

D=20P: More Data Beats Bigger Models

DeepMind's Chinchilla (2022) showed that for a given compute budget, smaller models trained on more tokens outperform larger undertrained models. Find your compute-optimal size here.

📊 Quick Examples - Click to Load

Inputs

Compute input: e.g., 3e24 for 3×10²⁴ FLOPs

Example output (compute-optimal):
Optimal Params: 158.1B
Optimal Tokens: 3.16T
Tokens/Param: 20
Compute Needed: 3.00 YFLOP (3×10²⁴ FLOPs)
Est. Loss (rel): 1.2226

Charts: Scaling Law Curve (Loss vs. Compute) and Optimal Frontier (Params vs. Tokens)


🤖 AI & ML Facts

🦙

Chinchilla 2022: a 70B model trained on 1.4T tokens outperformed the 280B Gopher trained on 300B tokens

- Hoffmann et al.

📐

D=20P: optimal tokens = 20 × parameters. C=6PD: compute = 6 × P × D

- Chinchilla

⚡

Sardana 2024: inference-adjusted scaling may favor smaller models trained longer

- Beyond Chinchilla

📊

Kaplan 2020: loss follows a power law, L ≈ A/P^α + B/D^β

- OpenAI scaling

Chinchilla "API sheet" (March 2026)

DeepMind Chinchilla (2022): for a fixed training FLOP budget, train smaller models on more tokens; optimal tokens ≈ 20× parameters (D = 20P). Total training compute ≈ 6PD FLOPs (forward + backward heuristic). Frontier labs often exceed 20× in practice for multimodal or continued pre-training, so treat the ratio as a baseline, not a contractual SLA.

Formula | Role
D = 20P | Target training tokens D for parameter count P
C ≈ 6PD | Order-of-magnitude training FLOPs
Inference trade-offs | See Sardana et al.: smaller models can win at deployment even if the training-optimal size differs
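
For concreteness, here is a minimal Python sketch of the two heuristics above (the function names are illustrative, not from any published codebase):

    def chinchilla_tokens(params: float) -> float:
        """Compute-optimal training tokens for a given parameter count (D = 20P)."""
        return 20.0 * params

    def training_flops(params: float, tokens: float) -> float:
        """Order-of-magnitude training compute, C ~ 6 * P * D."""
        return 6.0 * params * tokens

    # Example: a 7B-parameter model.
    p = 7e9
    d = chinchilla_tokens(p)      # 1.4e11 tokens (140B)
    c = training_flops(p, d)      # ~5.9e21 FLOPs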

📋 Key Takeaways

• D = 20P: For compute-optimal training, use ~20 tokens per parameter (Chinchilla 2022)
• Undertrained models waste compute: larger models trained on fewer tokens underperform smaller, well-trained models
• Chinchilla 70B beat Gopher 280B with 4× fewer parameters by training on 4× more data
• Inference considerations: for high-inference workloads, train smaller + longer (Sardana et al. 2024)

💡 Did You Know

🦙 Chinchilla 70B beat Gopher 280B with 4× fewer parameters by training on 4× more data
📊 DeepMind tested 400+ model configurations to derive the D=20P scaling law
📏 Llama 3 70B was trained on 15T tokens, about 214 tokens per parameter, roughly 10× the Chinchilla ratio (a data-heavy strategy)
📅 Chinchilla was published in 2022 and changed how the industry thinks about LLM training
⚡ The C=6PD formula: each parameter-token pair requires ~6 FLOPs (forward + backward)
🔬 Kaplan et al. 2020 first showed power-law scaling; Chinchilla refined the optimal data ratio
🌐 howmanyparams.com tracks parameter counts for 100+ LLMs and vision models

📖 How It Works

1. Chinchilla Scaling Law

Hoffmann et al. 2022 found that for a given compute budget C, the optimal model uses D ≈ 20P tokens. This implies C = 6PD = 120P², so P_opt = √(C/120).
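
A minimal sketch of that inversion, using the same D = 20P and C = 6PD constants (the helper name is illustrative):

    import math

    def optimal_size(flops: float) -> tuple[float, float]:
        """Return (optimal parameters, optimal tokens) for a training FLOP budget."""
        p_opt = math.sqrt(flops / 120.0)   # C = 120 * P^2  =>  P = sqrt(C / 120)
        d_opt = 20.0 * p_opt
        return p_opt, d_opt

    p, d = optimal_size(3e24)   # the 3e24 FLOP example shown earlier
    print(f"{p / 1e9:.1f}B params, {d / 1e12:.2f}T tokens")   # ~158.1B params, ~3.16T tokens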

2. Compute Budget

Enter FLOPs directly, or estimate them from GPU time: wall-clock hours × GPU count × per-GPU throughput × utilization. We assume 40% utilization as typical for large-scale training.
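
A rough sketch of that conversion; the 312 TFLOP/s peak (roughly A100-class BF16) is an illustrative assumption, and the 40% utilization default is the one described above:

    def estimated_flops(hours: float, gpu_count: int,
                        peak_tflops: float = 312.0,
                        utilization: float = 0.40) -> float:
        """Training FLOPs from wall-clock hours, GPU count, per-GPU peak, and utilization."""
        return hours * gpu_count * 3600.0 * peak_tflops * 1e12 * utilization

    flops = estimated_flops(hours=720, gpu_count=1024)   # one month on 1,024 A100-class GPUs
    print(f"{flops:.2e} FLOPs")                          # ~3.3e23 FLOPs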

3. Mode 2: Model → Tokens

Given a model size P, Chinchilla-optimal tokens are D = 20P. For high-inference workloads, Sardana et al. suggest training smaller models for longer.

4. Loss Estimation

We use a simplified Kaplan-style power law: L ≈ A/P^α + B/D^β. The exact coefficients vary by model family and data quality.
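
As an illustration, here is the same functional form with the parametric fit commonly cited from Hoffmann et al. 2022 (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); the calculator's own coefficients may differ, so treat the output as a relative rather than absolute loss:

    def estimated_loss(params: float, tokens: float,
                       E: float = 1.69, A: float = 406.4, B: float = 410.7,
                       alpha: float = 0.34, beta: float = 0.28) -> float:
        """L(P, D) = E + A / P^alpha + B / D^beta."""
        return E + A / params ** alpha + B / tokens ** beta

    print(estimated_loss(70e9, 1.4e12))   # ~1.94 with these coefficients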

🎯 Expert Tips

Stick to D=20P for pre-training

Unless you have evidence otherwise, the Chinchilla ratio is a strong default for compute-optimal pre-training.

Inference-heavy? Train smaller

If inference cost dominates, consider roughly 50% of the parameters with 2× the tokens, per Sardana et al. 2024.
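
A quick check on that trade under the C = 6PD heuristic: halving P while doubling D leaves training compute unchanged (6 × 0.5P × 2D = 6PD), while inference compute per token, roughly 2 FLOPs per parameter, drops by about half.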

Data quality matters

Chinchilla assumes high-quality data. Low-quality data may need different scaling.

Validate with small runs

Run 1% scale experiments to validate scaling assumptions before full training.

โš–๏ธ This Calculator vs. Other Tools

FeatureThis CalculatorManualSpreadsheetPapers Only
D=20P Chinchilla formulaโœ…โš ๏ธโš ๏ธโœ…
Compute โ†’ Model Sizeโœ…โŒโš ๏ธโŒ
Model โ†’ Optimal Tokensโœ…โš ๏ธโš ๏ธโŒ
GPU-hours to FLOPsโœ…โŒโš ๏ธโŒ
Inference-adjusted (Sardana)โœ…โŒโŒโš ๏ธ
Scaling law chartsโœ…โŒโŒโŒ
Example presetsโœ…โŒโŒโŒ
Copy & shareโœ…โŒโŒโŒ

โ“ Frequently Asked Questions

What is the Chinchilla scaling law?

Chinchilla (Hoffmann et al. 2022) found that compute-optimal training uses D ≈ 20P tokens, i.e., about 20 tokens per parameter. Larger models trained on fewer tokens are undertrained and underperform.

Why did Chinchilla 70B beat Gopher 280B?

Chinchilla used 4× more training data (1.4T vs. 300B tokens) with 4× fewer parameters. Better data scaling beat raw parameter count.

What is the C=6PD formula?

C = 6 × P × D estimates total FLOPs for dense transformer training: roughly 6 FLOPs per parameter per token, about 2 in the forward pass and 4 in the backward pass.
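
Worked example with the figures cited above: Chinchilla comes to roughly 6 × 70×10⁹ × 1.4×10¹² ≈ 5.9×10²³ FLOPs and Gopher to roughly 6 × 280×10⁹ × 300×10⁹ ≈ 5.0×10²³ FLOPs, so two models of very different sizes were trained with comparable compute.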

When should I use inference-adjusted scaling?

When lifetime inference volume rivals or exceeds the training token count (e.g., billions of requests served by a 7B model), Sardana et al. 2024 suggest training smaller models for longer.

Does D=20P apply to code or images?

The Chinchilla ratio was derived from largely text-based training data. Code and images may have different optimal ratios; use 20:1 as a starting point and validate.

How do I convert GPU-hours to FLOPs?

FLOPs ≈ hours × GPU count × per-GPU throughput (FLOP/s; multiply TFLOP/s by 10¹²) × utilization × 3600. We assume 40% utilization by default.
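
For example, 1,000,000 aggregate GPU-hours on A100-class hardware (≈312 TFLOP/s peak, an illustrative figure) at 40% utilization comes to about 10⁶ × 3600 × 3.12×10¹⁴ × 0.4 ≈ 4.5×10²³ FLOPs.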

What about Llama 3 using 15T tokens for 70B?

Llama 3 70B was trained on about 15T tokens, roughly 214 tokens per parameter (15×10¹² / 70×10⁹), or about 10× the Chinchilla 20:1 ratio. Meta chose a data-heavy strategy; performance gains from extra data can outweigh strict Chinchilla-optimal sizing.

Where do the loss estimates come from?

We use a simplified Kaplan-style power law. Real loss depends on architecture, data quality, and training setup.

📊 Chinchilla by the Numbers

D=20P
Optimal Ratio
70B
Chinchilla beat 280B Gopher
400+
Models Tested
2022
Chinchilla Published

โš ๏ธ Disclaimer: This calculator provides estimates based on Chinchilla scaling laws for educational and planning purposes. Actual optimal ratios depend on architecture, data quality, and use case. GPU throughput and utilization are approximations. For production decisions, validate with small-scale experiments and consult the cited papers.
