Training Data Size Estimation
Estimate optimal training data for pre-training (Chinchilla D = 20P), fine-tuning (10K–100K examples), and instruction-tuning (LIMA, 1K–50K). Based on Hoffmann et al. 2022, Zhou et al. 2023, Llama 2, and Sardana et al. 2024.
Why This ML Metric Matters
Why: Data requirements vary by task: pre-training needs a Chinchilla-optimal token budget, fine-tuning needs quality supervised examples, and instruction-tuning can succeed with far fewer curated samples.
How: Pre-training: D = 20P tokens. Fine-tuning: 10K–100K SFT examples. Instruction-tuning: 1K–50K high-quality examples (LIMA); see the code sketch after the list below.
- Chinchilla D = 20P
- Fine-tuning 10K–100K
- LIMA 1K–50K
- Quality > quantity
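As a quick illustration, the headline pre-training rule in Python (a minimal sketch; the function name `chinchilla_tokens` is our own):

```python
def chinchilla_tokens(params: float) -> float:
    """Compute-optimal token budget per Hoffmann et al. 2022: D = 20 * P."""
    return 20 * params

print(f"{chinchilla_tokens(7e9):.3g} tokens")  # 1.4e+11 tokens (~140B) for a 7B model
```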
How Much Training Data Do You Need?
Chinchilla D=20P for pre-training. LIMA 1K–50K for instruction-tuning. Llama 2 10K–100K for SFT. Estimate optimal data requirements here.
Charts: data scaling law curve (D = 20P) and data-to-parameter ratio comparison.
🤖 AI & ML Facts
LIMA achieved strong results with only ~1,000 curated instruction examples
— LIMA
Chinchilla 70B used 1.4T tokens (D=20P) and beat Gopher 280B trained on 300B tokens
— Chinchilla
Llama 2 used ~27K curated SFT annotations; quality filtering was critical
— Llama 2
Beyond Chinchilla: for inference-heavy workloads, train ~50% of the parameters on ~2× the tokens
— Sardana 2024
📋 Key Takeaways
- Pre-training: Chinchilla D=20P — use ~20 tokens per parameter for compute-optimal training
- Fine-tuning: 10K–100K examples (Llama 2 SFT); quality matters more than raw count
- Instruction-tuning: LIMA showed 1K–50K high-quality examples can achieve strong alignment
- Inference-heavy: Sardana et al. 2024 suggest training smaller models for ~2× longer
📖 How It Works
1. Pre-training (Chinchilla)
Hoffmann et al. 2022 found that compute-optimal training uses roughly 20 tokens per parameter, i.e. D ≈ 20P. For 7B params, that's ~140B tokens (see the code sketch after these steps).
2. Fine-tuning (SFT)
Llama 2 and similar models use 10K–100K supervised examples. Focus on diversity and quality over raw count.
3. Instruction-tuning (LIMA)
Zhou et al. 2023 showed that 1K–50K high-quality examples suffice for strong instruction-following; LIMA itself used only ~1,000. Less can be more when data is curated.
4. Beyond Chinchilla
Sardana et al. 2024: for inference-heavy workloads, train smaller models for longer (e.g., 50% params, 2× tokens).
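A minimal Python sketch of these heuristics (all names here are our own; the example-count ranges follow the papers above, and the inference-heavy branch hard-codes the illustrative "50% params, 2× tokens" example rather than a fitted scaling law):

```python
def pretrain_budget(params: float, inference_heavy: bool = False) -> tuple[float, float]:
    """Return (model_params, training_tokens) for compute-optimal pre-training.

    Default: Chinchilla (Hoffmann et al. 2022), D = 20 * P.
    inference_heavy: shift toward a smaller model trained longer, using the
    illustrative "50% params, 2x tokens" example (Sardana et al. 2024).
    """
    if inference_heavy:
        small = 0.5 * params            # half the parameters...
        return small, 2 * 20 * small    # ...trained on twice their Chinchilla budget
    return params, 20 * params

# Example-count heuristics for the post-training stages (rough ranges).
EXAMPLE_BUDGETS = {
    "sft": (10_000, 100_000),          # Llama 2-style supervised fine-tuning
    "instruction": (1_000, 50_000),    # LIMA-style instruction-tuning
}

p, d = pretrain_budget(7e9)
print(f"{p:.3g} params -> {d:.3g} tokens")  # 7e+09 params -> 1.4e+11 tokens
p, d = pretrain_budget(7e9, inference_heavy=True)
print(f"{p:.3g} params -> {d:.3g} tokens")  # 3.5e+09 params -> 1.4e+11 tokens
```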
🎯 Expert Tips
Stick to D=20P for pre-training
The Chinchilla ratio is a sensible default; validate with small-scale runs before committing to full training.
Quality over quantity
For fine-tuning and instruction-tuning, curate diverse, high-quality examples.
LIMA-style for alignment
Start with 1K–5K examples; scale up only if needed. Less can be more.
Modality matters
Code and images may need different ratios; use Chinchilla as a starting point.
⚖️ This Calculator vs. Other Tools
| Feature | This Calculator | Manual | Spreadsheet | Papers Only |
|---|---|---|---|---|
| Chinchilla D=20P | ✅ | ⚠️ | ⚠️ | ✅ |
| Fine-tuning 10K–100K | ✅ | ⚠️ | ⚠️ | ✅ |
| LIMA 1K–50K | ✅ | ❌ | ❌ | ⚠️ |
| Beyond Chinchilla | ✅ | ❌ | ❌ | ⚠️ |
| Data scaling charts | ✅ | ❌ | ❌ | ❌ |
| Example presets | ✅ | ❌ | ❌ | ❌ |
| Copy & share | ✅ | ❌ | ❌ | ❌ |
❓ Frequently Asked Questions
How much data do I need for pre-training?
Chinchilla (Hoffmann et al. 2022) recommends D ≈ 20P tokens. For 7B params, that's ~140B tokens.
What about fine-tuning?
Llama 2 and similar models use 10K–100K SFT examples. Quality and diversity matter more than count.
What did LIMA find?
LIMA (Zhou et al. 2023) showed ~1K high-quality instruction examples can achieve strong alignment. Less can be more.
When should I use Beyond Chinchilla?
For inference-heavy workloads (many more inference requests than training tokens), train smaller models for 2× longer (Sardana et al. 2024).
Does D=20P apply to code?
Chinchilla was trained on text. Code may have different optimal ratios; use as a starting point and validate.
How many tokens per example for fine-tuning?
Typically 200–1000 tokens per example; we use ~512 as an average for estimates.
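As a worked example under that assumption (a sketch; the 512-token average is this calculator's default, not a published constant):

```python
AVG_TOKENS_PER_EXAMPLE = 512  # calculator default, within the typical 200-1000 range

def sft_token_budget(n_examples: int) -> int:
    """Rough total-token estimate for a fine-tuning dataset."""
    return n_examples * AVG_TOKENS_PER_EXAMPLE

print(f"{sft_token_budget(50_000):,}")  # 25,600,000 tokens (~25.6M) for 50K examples
```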
What about instruction-tuning vs fine-tuning?
Instruction-tuning (LIMA-style) uses 1K–50K examples for alignment. SFT fine-tuning typically uses 10K–100K.
Where do these numbers come from?
Chinchilla (arxiv.org/abs/2203.15556), LIMA (arxiv.org/abs/2305.11206), Llama 2 (arxiv.org/abs/2307.09288), Beyond Chinchilla (arxiv.org/abs/2401.00448).
⚠️ Disclaimer: This calculator provides estimates based on Chinchilla, LIMA, and Llama 2 research for educational and planning purposes. Actual data requirements depend on architecture, data quality, and use case. Validate with small-scale experiments and consult the cited papers for production decisions.
Related Calculators
Cross-Validation Sample Size Calculator
Calculate minimum sample sizes for reliable k-fold cross-validation with stratification and class imbalance.
Embedding Dimension Calculator
Determine optimal embedding dimensions for LLMs, RAG, classification, and search. Balance memory vs expressiveness.
RAG Optimizer Calculator
Calculate chunk sizes, vector store memory, and token budgets for retrieval-augmented generation pipelines.
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
AI Fairness & Bias Calculator
Calculate demographic parity, equalized odds, equal opportunity, and disparate impact ratio. Based on IBM AIF360 and Microsoft Fairlearn.
Attention Head Configuration Calculator
Configure MHA, MQA, and GQA attention. Calculate head counts, dimensions, KV cache savings, and memory per attention type.