
Training Data Size Estimation

Estimate optimal training data for pre-training (Chinchilla D=20P), fine-tuning (10K–100K examples), and instruction-tuning (LIMA 1K–50K). Based on Hoffmann et al. 2022, Zhou et al. 2023, the Llama 2 report, and Sardana et al. 2024.

Concept Fundamentals

  • Chinchilla: D* ≈ 20 × P (optimal tokens-to-parameters ratio)
  • Scaling Law: C = 6PD (approximate training compute formula)
  • Data Quality: quality > quantity (curation matters)
  • Application: optimal dataset sizing for training data planning

Why This ML Metric Matters

Why: Data requirements vary by task: pre-training needs Chinchilla-optimal tokens; fine-tuning needs quality examples; instruction-tuning can succeed with fewer curated samples.

How: Pre-training: D=20P. Fine-tuning: 10K–100K SFT examples. Instruction-tuning: 1K–50K high-quality (LIMA).

Covers Chinchilla scaling laws (D≈20P) and fine-tuning data requirements:
  • Chinchilla D=20P
  • Fine-tuning 10K–100K
  • LIMA 1K–50K
  • Quality > quantity
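The two formulas above can be sketched in Python. This is a minimal illustration; the function names are mine, not from the cited papers:

```python
def chinchilla_tokens(params: float, ratio: float = 20.0) -> float:
    """Compute-optimal training tokens: D ≈ ratio × P (Hoffmann et al. 2022)."""
    return ratio * params

def training_compute_flops(params: float, tokens: float) -> float:
    """Approximate training compute: C ≈ 6 · P · D."""
    return 6.0 * params * tokens

params = 7e9                                    # a 7B-parameter model
tokens = chinchilla_tokens(params)              # 1.4e11, i.e. ~140B tokens
flops = training_compute_flops(params, tokens)  # ≈ 5.88e21 FLOPs
```

Note that both relationships are rules of thumb fit to text pre-training runs, not exact laws.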

How Much Training Data Do You Need?

Chinchilla D=20P for pre-training. LIMA 1K–50K for instruction-tuning. Llama 2 10K–100K for SFT. Estimate optimal data requirements here.

📊 Quick Examples

Inputs

  • Model parameters, in billions (e.g., 7 for a 7B model)
  • Optional compute budget in FLOPs (e.g., 1e24) for advanced planning

Example output (7B model, pre-training)

  • Optimal Tokens: 140.0B
  • Tokens/Param: 20.0
  • Recommendation: Chinchilla-optimal: 140.0B tokens (D=20P). For inference-heavy workloads, consider 2× tokens with 50% params (Sardana et al. 2024).

Data Scaling Law Curve (D=20P)

Data-to-Parameter Ratio Comparison

⚠️ For educational and informational purposes only. Verify with a qualified professional.

🤖 AI & ML Facts

  • 🦙 LIMA achieved strong results with only ~1,000 curated instruction examples (Zhou et al. 2023)
  • 📊 Chinchilla 70B used 1.4T tokens (D=20P) and beat Gopher 280B trained on 300B tokens (Hoffmann et al. 2022)
  • 📐 Llama 2 used 27K–100K SFT examples; quality filtering was critical
  • Beyond Chinchilla: for high-inference workloads, train 50% params on 2× tokens (Sardana et al. 2024)

📋 Key Takeaways

  • Pre-training: Chinchilla D=20P — use ~20 tokens per parameter for compute-optimal training
  • Fine-tuning: 10K–100K examples (Llama 2 SFT); quality matters more than raw count
  • Instruction-tuning: LIMA showed 1K–50K high-quality examples can achieve strong alignment
  • Inference-heavy: Sardana et al. 2024 suggest training smaller models for 2× longer

💡 Did You Know

  • 🔬 Code pre-training may need different ratios; validate with small runs
  • 📅 LIMA (2023) challenged the "more data is better" assumption for alignment
  • 🌐 Data quality and diversity often matter more than dataset size for fine-tuning

📖 How It Works

1. Pre-training (Chinchilla)

Hoffmann et al. 2022 found that compute-optimal training uses roughly 20 tokens per parameter (D ≈ 20P). For a 7B-parameter model, that's ~140B tokens.

2. Fine-tuning (SFT)

Llama 2 and similar models use 10K–100K supervised examples. Focus on diversity and quality over raw count.

3. Instruction-tuning (LIMA)

Zhou et al. 2023 showed 1K–50K high-quality examples suffice for strong instruction-following. Less can be more when data is curated.

4. Beyond Chinchilla

Sardana et al. 2024: for inference-heavy workloads, train smaller models for longer (e.g., 50% params, 2× tokens).
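Going the other way, from a compute budget to a parameter/token split, means solving C = 6PD with D = 20P, which gives P = sqrt(C / 120). A sketch, where the inference-heavy variant is a simplified illustration of the Sardana et al. idea (halve params, double tokens) rather than the paper's actual optimization:

```python
import math

def optimal_split(compute_flops: float, ratio: float = 20.0) -> tuple[float, float]:
    """Solve C = 6·P·D with D = ratio·P, giving P = sqrt(C / (6·ratio))."""
    params = math.sqrt(compute_flops / (6.0 * ratio))
    return params, ratio * params

def inference_heavy_split(compute_flops: float) -> tuple[float, float]:
    """Illustrative 'Beyond Chinchilla' shift: half the parameters on
    twice the tokens. Training compute is unchanged, since
    6·(P/2)·(2D) = 6·P·D, but each inference request is cheaper."""
    params, tokens = optimal_split(compute_flops)
    return 0.5 * params, 2.0 * tokens
```

For a 1e24 FLOP budget, `optimal_split` returns roughly a 91B-parameter model trained on ~1.8T tokens.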

🎯 Expert Tips

Stick to D=20P for pre-training

Chinchilla ratio is the default; validate with small-scale runs before full training.

Quality over quantity

For fine-tuning and instruction-tuning, curate diverse, high-quality examples.

LIMA-style for alignment

Start with 1K–5K examples; scale up only if needed. Less can be more.

Modality matters

Code and images may need different ratios; use Chinchilla as a starting point.

⚖️ This Calculator vs. Other Tools

Compared with manual calculation, spreadsheets, or reading the papers alone, this calculator covers all of the following in one place: Chinchilla D=20P, fine-tuning 10K–100K, LIMA 1K–50K, Beyond Chinchilla adjustments, data scaling charts, example presets, and copy & share.

❓ Frequently Asked Questions

How much data do I need for pre-training?

Chinchilla (Hoffmann et al. 2022) recommends D ≈ 20P tokens. For 7B params, that's ~140B tokens.

What about fine-tuning?

Llama 2 and similar models use 10K–100K SFT examples. Quality and diversity matter more than count.

What did LIMA find?

LIMA (Zhou et al. 2023) showed ~1K high-quality instruction examples can achieve strong alignment. Less can be more.

When should I use Beyond Chinchilla?

For inference-heavy workloads (many more inference requests than training tokens), train smaller models for 2× longer (Sardana et al. 2024).

Does D=20P apply to code?

Chinchilla was trained on text. Code may have different optimal ratios; use as a starting point and validate.

How many tokens per example for fine-tuning?

Typically 200–1000 tokens per example; we use ~512 as an average for estimates.
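The example-to-token conversion this answer describes can be sketched as follows; the 512-token average is the page's stated assumption, and real datasets vary:

```python
def examples_to_tokens(n_examples: int, tokens_per_example: int = 512) -> int:
    """Rough fine-tuning token count; 512 tokens/example is the average
    this page assumes (real examples run ~200-1000 tokens)."""
    return n_examples * tokens_per_example

low = examples_to_tokens(10_000)     # 5,120,000 tokens for a small SFT set
high = examples_to_tokens(100_000)   # 51,200,000 tokens for a large one
```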

What about instruction-tuning vs fine-tuning?

Instruction-tuning (LIMA-style) uses 1K–50K examples for alignment. SFT fine-tuning typically uses 10K–100K.

Where do these numbers come from?

Chinchilla (arxiv.org/abs/2203.15556), LIMA (arxiv.org/abs/2305.11206), Llama 2 (arxiv.org/abs/2307.09288), Beyond Chinchilla (arxiv.org/abs/2401.00448).

📊 Training Data by the Numbers

  • Chinchilla Ratio: D=20P
  • LIMA Examples: 1K–50K
  • SFT Examples: 10K–100K
  • Key Papers: 2022–2024

⚠️ Disclaimer: This calculator provides estimates based on Chinchilla, LIMA, and Llama 2 research for educational and planning purposes. Actual data requirements depend on architecture, data quality, and use case. Validate with small-scale experiments and consult the cited papers for production decisions.
