
Chinchilla Scaling Laws

Compute-optimal model size and training tokens. DeepMind 2022: D=20P, C=6PD. Smaller models trained on more data beat larger undertrained models.

Concept Fundamentals
C = 6PD
Compute Law
6 × params × tokens
D* ∝ C^0.5
Optimal Scaling
Compute-optimal training
~20 per parameter
Token Ratio
Data-to-model balance
Hoffmann et al. 2022
Paper
Chinchilla scaling laws

Why This ML Metric Matters

Why: For a given compute budget, training smaller models on more tokens yields better performance than undertraining larger models.

How: The calculator applies D=20P (optimal tokens per parameter) and C=6PD (compute formula) to derive optimal model size from FLOPs or GPU hours.

🦙
CHINCHILLA CHANGED LLM TRAINING

D=20P: More Data Beats Bigger Models

DeepMind's Chinchilla (2022) showed that for a given compute budget, smaller models trained on more tokens outperform larger undertrained models. Find your compute-optimal size here.

📊 Quick Examples - Click to Load

Inputs

Compute input: e.g., 3e24 for 3×10²⁴ FLOPs

Example output (compute-optimal):
Optimal Params: 158.1B
Optimal Tokens: 3.16T
Tokens/Param: 20
Compute Needed: 3.00 YFLOP (3×10²⁴ FLOPs)
Est. Loss (rel): 1.2226

Charts: Scaling Law Curve (Loss vs. Compute) and Optimal Frontier (Params vs. Tokens)


🤖 AI & ML Facts

🦙

Chinchilla 2022: a 70B model trained on 1.4T tokens outperformed the 280B Gopher trained on 300B tokens

- Hoffmann et al.

📐

D=20P: optimal tokens = 20 × parameters. C=6PD: compute = 6 × P × D

- Chinchilla

⚡

Sardana 2024: inference-adjusted scaling may favor smaller models trained longer

- Beyond Chinchilla

📊

Kaplan 2020: loss follows a power law, L ≈ A/P^α + B/D^β

- OpenAI scaling

Chinchilla "API sheet" (March 2026)

DeepMind Chinchilla (2022): for a fixed training FLOP budget, train smaller models on more tokens; optimal tokens ≈ 20× parameters (D = 20P). Total training compute ≈ 6PD FLOPs (forward + backward heuristic). Frontier labs often exceed 20× in practice for multimodal or continued pre-training, so treat the ratio as a baseline, not a contractual SLA.

Formula | Role
D = 20P | Target training tokens D for parameter count P
C ≈ 6PD | Order-of-magnitude training FLOPs
Inference trade-offs | See Sardana et al.: smaller models can win at deployment even if the training-optimal size differs
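
For concreteness, here is a minimal Python sketch of the two heuristics above (the function names are illustrative, not from any published codebase):

    def chinchilla_tokens(params: float) -> float:
        """Compute-optimal training tokens for a given parameter count (D = 20P)."""
        return 20.0 * params

    def training_flops(params: float, tokens: float) -> float:
        """Order-of-magnitude training compute, C ~ 6 * P * D."""
        return 6.0 * params * tokens

    # Example: a 7B-parameter model.
    p = 7e9
    d = chinchilla_tokens(p)      # 1.4e11 tokens (140B)
    c = training_flops(p, d)      # ~5.9e21 FLOPs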

📋 Key Takeaways

• D = 20P: For compute-optimal training, use ~20 tokens per parameter (Chinchilla 2022)
• Undertrained models waste compute: larger models trained on fewer tokens underperform smaller, well-trained models
• Chinchilla 70B beat Gopher 280B with 4× fewer parameters by training on 4× more data
• Inference considerations: for high-inference workloads, train smaller + longer (Sardana et al. 2024)

💡 Did You Know

🦙 Chinchilla 70B beat Gopher 280B with 4× fewer parameters by training on 4× more data
📊 DeepMind tested 400+ model configurations to derive the D=20P scaling law
📏 Llama 3 70B was trained on 15T tokens, about 214 tokens per parameter, roughly 10× the Chinchilla ratio (a data-heavy strategy)
📅 Chinchilla was published in 2022 and changed how the industry thinks about LLM training
⚡ The C=6PD formula: each parameter-token pair requires ~6 FLOPs (forward + backward)
🔬 Kaplan et al. 2020 first showed power-law scaling; Chinchilla refined the optimal data ratio
🌐 howmanyparams.com tracks parameter counts for 100+ LLMs and vision models

📖 How It Works

1. Chinchilla Scaling Law

Hoffmann et al. 2022 found that for a given compute budget C, the optimal model uses D ≈ 20P tokens. This implies C = 6PD = 120P², so P_opt = √(C/120).
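
A minimal sketch of that inversion, using the same D = 20P and C = 6PD constants (the helper name is illustrative):

    import math

    def optimal_size(flops: float) -> tuple[float, float]:
        """Return (optimal parameters, optimal tokens) for a training FLOP budget."""
        p_opt = math.sqrt(flops / 120.0)   # C = 120 * P^2  =>  P = sqrt(C / 120)
        d_opt = 20.0 * p_opt
        return p_opt, d_opt

    p, d = optimal_size(3e24)   # the 3e24 FLOP example shown earlier
    print(f"{p / 1e9:.1f}B params, {d / 1e12:.2f}T tokens")   # ~158.1B params, ~3.16T tokens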

2. Compute Budget

Enter FLOPs directly, or estimate them from GPU time: wall-clock hours × GPU count × per-GPU throughput × utilization. We assume 40% utilization as typical for large-scale training.
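
A rough sketch of that conversion; the 312 TFLOP/s peak (roughly A100-class BF16) is an illustrative assumption, and the 40% utilization default is the one described above:

    def estimated_flops(hours: float, gpu_count: int,
                        peak_tflops: float = 312.0,
                        utilization: float = 0.40) -> float:
        """Training FLOPs from wall-clock hours, GPU count, per-GPU peak, and utilization."""
        return hours * gpu_count * 3600.0 * peak_tflops * 1e12 * utilization

    flops = estimated_flops(hours=720, gpu_count=1024)   # one month on 1,024 A100-class GPUs
    print(f"{flops:.2e} FLOPs")                          # ~3.3e23 FLOPs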

3. Mode 2: Model → Tokens

Given a model size P, Chinchilla-optimal tokens are D = 20P. For high-inference workloads, Sardana et al. suggest training smaller models for longer.

4. Loss Estimation

We use a simplified Kaplan-style power law: L ≈ A/P^α + B/D^β. The exact coefficients vary by model family and data quality.
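
As an illustration, here is the same functional form with the parametric fit commonly cited from Hoffmann et al. 2022 (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); the calculator's own coefficients may differ, so treat the output as a relative rather than absolute loss:

    def estimated_loss(params: float, tokens: float,
                       E: float = 1.69, A: float = 406.4, B: float = 410.7,
                       alpha: float = 0.34, beta: float = 0.28) -> float:
        """L(P, D) = E + A / P^alpha + B / D^beta."""
        return E + A / params ** alpha + B / tokens ** beta

    print(estimated_loss(70e9, 1.4e12))   # ~1.94 with these coefficients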

🎯 Expert Tips

Stick to D=20P for pre-training

Unless you have evidence otherwise, the Chinchilla ratio is a strong default for compute-optimal pre-training.

Inference-heavy? Train smaller

If inference cost dominates, consider roughly 50% of the parameters with 2× the tokens, per Sardana et al. 2024.
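
A quick check on that trade under the C = 6PD heuristic: halving P while doubling D leaves training compute unchanged (6 × 0.5P × 2D = 6PD), while inference compute per token, roughly 2 FLOPs per parameter, drops by about half.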

Data quality matters

Chinchilla assumes high-quality data. Low-quality data may need different scaling.

Validate with small runs

Run 1% scale experiments to validate scaling assumptions before full training.

โš–๏ธ This Calculator vs. Other Tools

FeatureThis CalculatorManualSpreadsheetPapers Only
D=20P Chinchilla formulaโœ…โš ๏ธโš ๏ธโœ…
Compute โ†’ Model Sizeโœ…โŒโš ๏ธโŒ
Model โ†’ Optimal Tokensโœ…โš ๏ธโš ๏ธโŒ
GPU-hours to FLOPsโœ…โŒโš ๏ธโŒ
Inference-adjusted (Sardana)โœ…โŒโŒโš ๏ธ
Scaling law chartsโœ…โŒโŒโŒ
Example presetsโœ…โŒโŒโŒ
Copy & shareโœ…โŒโŒโŒ

โ“ Frequently Asked Questions

What is the Chinchilla scaling law?

Chinchilla (Hoffmann et al. 2022) found that compute-optimal training uses D ≈ 20P tokens, i.e., about 20 tokens per parameter. Larger models trained on fewer tokens are undertrained and underperform.

Why did Chinchilla 70B beat Gopher 280B?

Chinchilla used 4× more training data (1.4T vs. 300B tokens) with 4× fewer parameters. Better data scaling beat raw parameter count.

What is the C=6PD formula?

C = 6 × P × D estimates total FLOPs for dense transformer training: roughly 6 FLOPs per parameter per token, about 2 in the forward pass and 4 in the backward pass.
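
Worked example with the figures cited above: Chinchilla comes to roughly 6 × 70×10⁹ × 1.4×10¹² ≈ 5.9×10²³ FLOPs and Gopher to roughly 6 × 280×10⁹ × 300×10⁹ ≈ 5.0×10²³ FLOPs, so two models of very different sizes were trained with comparable compute.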

When should I use inference-adjusted scaling?

When lifetime inference volume rivals or exceeds the training token count (e.g., billions of requests served by a 7B model), Sardana et al. 2024 suggest training smaller models for longer.

Does D=20P apply to code or images?

The Chinchilla ratio was derived from largely text-based training data. Code and images may have different optimal ratios; use 20:1 as a starting point and validate.

How do I convert GPU-hours to FLOPs?

FLOPs ≈ hours × GPU count × per-GPU throughput (FLOP/s; multiply TFLOP/s by 10¹²) × utilization × 3600. We assume 40% utilization by default.
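
For example, 1,000,000 aggregate GPU-hours on A100-class hardware (≈312 TFLOP/s peak, an illustrative figure) at 40% utilization comes to about 10⁶ × 3600 × 3.12×10¹⁴ × 0.4 ≈ 4.5×10²³ FLOPs.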

What about Llama 3 using 15T tokens for 70B?

Llama 3 70B was trained on about 15T tokens, roughly 214 tokens per parameter (15×10¹² / 70×10⁹), or about 10× the Chinchilla 20:1 ratio. Meta chose a data-heavy strategy; performance gains from extra data can outweigh strict Chinchilla-optimal sizing.

Where do the loss estimates come from?

We use a simplified Kaplan-style power law. Real loss depends on architecture, data quality, and training setup.

📊 Chinchilla by the Numbers

D=20P
Optimal Ratio
70B
Chinchilla beat 280B Gopher
400+
Models Tested
2022
Chinchilla Published

โš ๏ธ Disclaimer: This calculator provides estimates based on Chinchilla scaling laws for educational and planning purposes. Actual optimal ratios depend on architecture, data quality, and use case. GPU throughput and utilization are approximations. For production decisions, validate with small-scale experiments and consult the cited papers.
