
Training Data Size Estimation

Estimate optimal training data for pre-training (Chinchilla D=20P), fine-tuning (10K–100K examples), and instruction-tuning (LIMA 1K–50K). Based on Hoffmann et al. 2022, Zhou et al. 2023, the Llama 2 report, and Sardana et al. 2024.

Concept Fundamentals

  • Chinchilla: D* ≈ 20 × P (optimal tokens-to-parameters ratio)
  • Scaling Law: C = 6PD (approximate training compute formula)
  • Data Quality: quality > quantity (curation matters)
  • Application: optimal dataset sizing for training data planning

Why This ML Metric Matters

Why: Data requirements vary by task: pre-training needs Chinchilla-optimal tokens; fine-tuning needs quality examples; instruction-tuning can succeed with fewer curated samples.

How: Pre-training: D=20P. Fine-tuning: 10K–100K SFT examples. Instruction-tuning: 1K–50K high-quality (LIMA).

Covers Chinchilla scaling laws (D≈20P) and fine-tuning data requirements:
  • Chinchilla D=20P
  • Fine-tuning 10K–100K
  • LIMA 1K–50K
  • Quality > quantity
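The two formulas above can be sketched in Python. This is a minimal illustration; the function names are mine, not from the cited papers:

```python
def chinchilla_tokens(params: float, ratio: float = 20.0) -> float:
    """Compute-optimal training tokens: D ≈ ratio × P (Hoffmann et al. 2022)."""
    return ratio * params

def training_compute_flops(params: float, tokens: float) -> float:
    """Approximate training compute: C ≈ 6 · P · D."""
    return 6.0 * params * tokens

params = 7e9                                    # a 7B-parameter model
tokens = chinchilla_tokens(params)              # 1.4e11, i.e. ~140B tokens
flops = training_compute_flops(params, tokens)  # ≈ 5.88e21 FLOPs
```

Note that both relationships are rules of thumb fit to text pre-training runs, not exact laws.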

How Much Training Data Do You Need?

Chinchilla D=20P for pre-training. LIMA 1K–50K for instruction-tuning. Llama 2 10K–100K for SFT. Estimate optimal data requirements here.

📊 Quick Examples

Inputs

  • Model parameters, in billions (e.g., 7 for a 7B model)
  • Optional compute budget in FLOPs (e.g., 1e24) for advanced planning

Example output (7B model, pre-training)

  • Optimal Tokens: 140.0B
  • Tokens/Param: 20.0
  • Recommendation: Chinchilla-optimal: 140.0B tokens (D=20P). For inference-heavy workloads, consider 2× tokens with 50% params (Sardana et al. 2024).

Data Scaling Law Curve (D=20P)

Data-to-Parameter Ratio Comparison

⚠️ For educational and informational purposes only. Verify with a qualified professional.

🤖 AI & ML Facts

  • 🦙 LIMA achieved strong results with only ~1,000 curated instruction examples (Zhou et al. 2023)
  • 📊 Chinchilla 70B used 1.4T tokens (D=20P) and beat Gopher 280B trained on 300B tokens (Hoffmann et al. 2022)
  • 📐 Llama 2 used 27K–100K SFT examples; quality filtering was critical
  • Beyond Chinchilla: for high-inference workloads, train 50% params on 2× tokens (Sardana et al. 2024)

📋 Key Takeaways

  • Pre-training: Chinchilla D=20P — use ~20 tokens per parameter for compute-optimal training
  • Fine-tuning: 10K–100K examples (Llama 2 SFT); quality matters more than raw count
  • Instruction-tuning: LIMA showed 1K–50K high-quality examples can achieve strong alignment
  • Inference-heavy: Sardana et al. 2024 suggest training smaller models for 2× longer

💡 Did You Know

  • 🔬 Code pre-training may need different ratios; validate with small runs
  • 📅 LIMA (2023) challenged the "more data is better" assumption for alignment
  • 🌐 Data quality and diversity often matter more than dataset size for fine-tuning

📖 How It Works

1. Pre-training (Chinchilla)

Hoffmann et al. 2022 found that compute-optimal training uses roughly 20 tokens per parameter (D ≈ 20P). For a 7B-parameter model, that's ~140B tokens.

2. Fine-tuning (SFT)

Llama 2 and similar models use 10K–100K supervised examples. Focus on diversity and quality over raw count.

3. Instruction-tuning (LIMA)

Zhou et al. 2023 showed 1K–50K high-quality examples suffice for strong instruction-following. Less can be more when data is curated.

4. Beyond Chinchilla

Sardana et al. 2024: for inference-heavy workloads, train smaller models for longer (e.g., 50% params, 2× tokens).
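Going the other way, from a compute budget to a parameter/token split, means solving C = 6PD with D = 20P, which gives P = sqrt(C / 120). A sketch, where the inference-heavy variant is a simplified illustration of the Sardana et al. idea (halve params, double tokens) rather than the paper's actual optimization:

```python
import math

def optimal_split(compute_flops: float, ratio: float = 20.0) -> tuple[float, float]:
    """Solve C = 6·P·D with D = ratio·P, giving P = sqrt(C / (6·ratio))."""
    params = math.sqrt(compute_flops / (6.0 * ratio))
    return params, ratio * params

def inference_heavy_split(compute_flops: float) -> tuple[float, float]:
    """Illustrative 'Beyond Chinchilla' shift: half the parameters on
    twice the tokens. Training compute is unchanged, since
    6·(P/2)·(2D) = 6·P·D, but each inference request is cheaper."""
    params, tokens = optimal_split(compute_flops)
    return 0.5 * params, 2.0 * tokens
```

For a 1e24 FLOP budget, `optimal_split` returns roughly a 91B-parameter model trained on ~1.8T tokens.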

🎯 Expert Tips

Stick to D=20P for pre-training

Chinchilla ratio is the default; validate with small-scale runs before full training.

Quality over quantity

For fine-tuning and instruction-tuning, curate diverse, high-quality examples.

LIMA-style for alignment

Start with 1K–5K examples; scale up only if needed. Less can be more.

Modality matters

Code and images may need different ratios; use Chinchilla as a starting point.

⚖️ This Calculator vs. Other Tools

Compared with manual calculation, spreadsheets, or reading the papers alone, this calculator covers all of the following in one place: Chinchilla D=20P, fine-tuning 10K–100K, LIMA 1K–50K, Beyond Chinchilla adjustments, data scaling charts, example presets, and copy & share.

❓ Frequently Asked Questions

How much data do I need for pre-training?

Chinchilla (Hoffmann et al. 2022) recommends D ≈ 20P tokens. For 7B params, that's ~140B tokens.

What about fine-tuning?

Llama 2 and similar models use 10K–100K SFT examples. Quality and diversity matter more than count.

What did LIMA find?

LIMA (Zhou et al. 2023) showed ~1K high-quality instruction examples can achieve strong alignment. Less can be more.

When should I use Beyond Chinchilla?

For inference-heavy workloads (many more inference requests than training tokens), train smaller models for 2× longer (Sardana et al. 2024).

Does D=20P apply to code?

Chinchilla was trained on text. Code may have different optimal ratios; use as a starting point and validate.

How many tokens per example for fine-tuning?

Typically 200–1000 tokens per example; we use ~512 as an average for estimates.
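The example-to-token conversion this answer describes can be sketched as follows; the 512-token average is the page's stated assumption, and real datasets vary:

```python
def examples_to_tokens(n_examples: int, tokens_per_example: int = 512) -> int:
    """Rough fine-tuning token count; 512 tokens/example is the average
    this page assumes (real examples run ~200-1000 tokens)."""
    return n_examples * tokens_per_example

low = examples_to_tokens(10_000)     # 5,120,000 tokens for a small SFT set
high = examples_to_tokens(100_000)   # 51,200,000 tokens for a large one
```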

What about instruction-tuning vs fine-tuning?

Instruction-tuning (LIMA-style) uses 1K–50K examples for alignment. SFT fine-tuning typically uses 10K–100K.

Where do these numbers come from?

Chinchilla (arxiv.org/abs/2203.15556), LIMA (arxiv.org/abs/2305.11206), Llama 2 (arxiv.org/abs/2307.09288), Beyond Chinchilla (arxiv.org/abs/2401.00448).

📊 Training Data by the Numbers

  • Chinchilla Ratio: D=20P
  • LIMA Examples: 1K–50K
  • SFT Examples: 10K–100K
  • Key Papers: 2022–2024

⚠️ Disclaimer: This calculator provides estimates based on Chinchilla, LIMA, and Llama 2 research for educational and planning purposes. Actual data requirements depend on architecture, data quality, and use case. Validate with small-scale experiments and consult the cited papers for production decisions.
