Training Data Size Estimation
Estimate optimal training data for pre-training (Chinchilla D=20P), fine-tuning (10K–100K examples), and instruction-tuning (LIMA 1K–50K). Hoffmann 2022, Sardana 2024, Zhou 2023, Llama 2.
Why This ML Metric Matters
Why: Data requirements vary by task. Pre-training needs Chinchilla-optimal token counts; fine-tuning needs quality supervised examples; instruction-tuning can succeed with far fewer curated samples.
How: Pre-training: D = 20P tokens. Fine-tuning: 10K–100K SFT examples. Instruction-tuning: 1K–50K high-quality examples (LIMA).
- Chinchilla D=20P
- Fine-tuning 10K–100K
- LIMA 1K–50K
- Quality > quantity
How Much Training Data Do You Need?
Chinchilla D=20P for pre-training. LIMA 1K–50K for instruction-tuning. Llama 2 10K–100K for SFT. Estimate optimal data requirements here.
📊 Quick Examples — Click to Load
Inputs
Data Scaling Law Curve (D=20P)
Data-to-Parameter Ratio Comparison
⚠️ For educational and informational purposes only. Verify with a qualified professional.
🤖 AI & ML Facts
LIMA achieved strong results with only ~1,000 curated instruction examples
— LIMA
Chinchilla 70B used 1.4T tokens (D=20P) and beat Gopher 280B trained on 300B tokens
— Chinchilla
Llama 2 used 27K–100K SFT examples; quality filtering was critical
— Llama 2
Beyond Chinchilla: for high-inference workloads, train 50% params on 2× tokens
— Sardana 2024
📋 Key Takeaways
- Pre-training: Chinchilla D=20P — use ~20 tokens per parameter for compute-optimal training
- Fine-tuning: 10K–100K examples (Llama 2 SFT); quality matters more than raw count
- Instruction-tuning: LIMA showed 1K–50K high-quality examples can achieve strong alignment
- Inference-heavy: Sardana et al. 2024 suggest training smaller models for 2× longer
💡 Did You Know
📖 How It Works
1. Pre-training (Chinchilla)
Hoffmann et al. 2022 found that compute-optimal training uses D ≈ 20P tokens, i.e. roughly 20 tokens per parameter. For a 7B-parameter model, that's ~140B tokens.
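The Chinchilla rule of thumb above can be sketched in a few lines. This is a minimal illustration of D ≈ 20P, not code from the paper; `chinchilla_tokens` is a hypothetical helper name.

```python
def chinchilla_tokens(params: float, ratio: float = 20.0) -> float:
    """Compute-optimal training tokens for a model with `params` parameters,
    using the Chinchilla heuristic D = ratio * P (ratio ~= 20)."""
    return ratio * params

# 7B-parameter model -> ~140B tokens
print(f"{chinchilla_tokens(7e9) / 1e9:.0f}B tokens")  # prints "140B tokens"
```

Adjust `ratio` if small-scale runs on your data suggest a different optimum; 20 is the text-domain default, not a universal constant.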
2. Fine-tuning (SFT)
Llama 2 and similar models use 10K–100K supervised examples. Focus on diversity and quality over raw count.
3. Instruction-tuning (LIMA)
Zhou et al. 2023 showed 1K–50K high-quality examples suffice for strong instruction-following. Less can be more when data is curated.
4. Beyond Chinchilla
Sardana et al. 2024: for inference-heavy workloads, train smaller models for longer (e.g., 50% params, 2× tokens).
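The "smaller model, more tokens" trade-off can be sketched as below. The 50%/2× figures follow the illustrative rule stated on this page, not a formula from Sardana et al.; `inference_adjusted_plan` is a hypothetical helper name.

```python
def inference_adjusted_plan(params: float, shrink: float = 0.5, stretch: float = 2.0):
    """Return (smaller_params, longer_tokens) relative to a Chinchilla baseline.

    shrink:  fraction of the original parameter count to keep (page's example: 0.5)
    stretch: multiplier on the Chinchilla-optimal token count (page's example: 2.0)
    """
    base_tokens = 20.0 * params  # Chinchilla baseline D = 20P
    return params * shrink, base_tokens * stretch

small_p, more_d = inference_adjusted_plan(7e9)
print(f"{small_p / 1e9:.1f}B params on {more_d / 1e9:.0f}B tokens")  # 3.5B params on 280B tokens
```

The smaller model costs less per inference request, which is why this plan can win when lifetime inference compute dwarfs training compute.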
🎯 Expert Tips
Stick to D=20P for pre-training
Chinchilla ratio is the default; validate with small-scale runs before full training.
Quality over quantity
For fine-tuning and instruction-tuning, curate diverse, high-quality examples.
LIMA-style for alignment
Start with 1K–5K examples; scale up only if needed. Less can be more.
Modality matters
Code and images may need different ratios; use Chinchilla as a starting point.
⚖️ This Calculator vs. Other Tools
| Feature | This Calculator | Manual | Spreadsheet | Papers Only |
|---|---|---|---|---|
| Chinchilla D=20P | ✅ | ⚠️ | ⚠️ | ✅ |
| Fine-tuning 10K–100K | ✅ | ⚠️ | ⚠️ | ✅ |
| LIMA 1K–50K | ✅ | ❌ | ❌ | ⚠️ |
| Beyond Chinchilla | ✅ | ❌ | ❌ | ⚠️ |
| Data scaling charts | ✅ | ❌ | ❌ | ❌ |
| Example presets | ✅ | ❌ | ❌ | ❌ |
| Copy & share | ✅ | ❌ | ❌ | ❌ |
❓ Frequently Asked Questions
How much data do I need for pre-training?
Chinchilla (Hoffmann et al. 2022) recommends D ≈ 20P tokens. For 7B params, that's ~140B tokens.
What about fine-tuning?
Llama 2 and similar models use 10K–100K SFT examples. Quality and diversity matter more than count.
What did LIMA find?
LIMA (Zhou et al. 2023) showed ~1K high-quality instruction examples can achieve strong alignment. Less can be more.
When should I use Beyond Chinchilla?
For inference-heavy workloads (many more inference requests than training tokens), train smaller models for 2× longer (Sardana et al. 2024).
Does D=20P apply to code?
Chinchilla was trained on text. Code may have different optimal ratios; use as a starting point and validate.
How many tokens per example for fine-tuning?
Typically 200–1000 tokens per example; we use ~512 as an average for estimates.
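Given that per-example average, converting an example count into a token budget is a single multiplication. A minimal sketch, assuming the ~512-token average used by this calculator; `sft_token_budget` is a hypothetical helper name.

```python
def sft_token_budget(n_examples: int, tokens_per_example: int = 512) -> int:
    """Rough total token count for an SFT dataset of n_examples,
    assuming a fixed average length per example."""
    return n_examples * tokens_per_example

# 50K examples at ~512 tokens each -> ~25.6M tokens
print(f"{sft_token_budget(50_000) / 1e6:.1f}M tokens")  # prints "25.6M tokens"
```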
What about instruction-tuning vs fine-tuning?
Instruction-tuning (LIMA-style) uses 1K–50K examples for alignment. SFT fine-tuning typically uses 10K–100K.
Where do these numbers come from?
Chinchilla (arxiv.org/abs/2203.15556), LIMA (arxiv.org/abs/2305.11206), Llama 2 (arxiv.org/abs/2307.09288), Beyond Chinchilla (arxiv.org/abs/2401.00448).
📊 Training Data by the Numbers
📚 Official Sources
⚠️ Disclaimer: This calculator provides estimates based on Chinchilla, LIMA, and Llama 2 research for educational and planning purposes. Actual data requirements depend on architecture, data quality, and use case. Validate with small-scale experiments and consult the cited papers for production decisions.