Batch Size & Learning Rate Scaling
Calculate optimal learning rates with the linear and sqrt scaling rules (Goyal 2017). Supports warmup, cosine, and linear schedules, with ImageNet, BERT, LLM, and ViT presets.
Why This ML Metric Matters
Why: When you increase batch size, you need to scale LR to maintain convergence. Linear for SGD, sqrt for Adam/AdamW.
How: Linear: η_new = η_base × (B_new/B_base). Sqrt: η_new = η_base × √(B_new/B_base). Warmup ramps LR from 0 to peak.
- Linear for SGD
- Sqrt for Adam
- 5–10% warmup typical
- Cosine standard for LLMs
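For example (illustrative numbers): scaling from batch 256 at LR 0.01 to batch 1024 gives 0.01 × (1024/256) = 0.04 under the linear rule and 0.01 × √(1024/256) = 0.02 under the sqrt rule.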
Batch Size & Learning Rate Calculator
Linear and sqrt scaling rules (Goyal 2017). Warmup, cosine, linear schedules. ImageNet, BERT, LLM, ViT presets.
📊 Quick Examples
Interactive inputs and two charts: LR Schedule Over Training Steps and Batch–LR Relationship Curve.
🤖 AI & ML Facts
ImageNet ResNet-50: Goyal et al. scaled the batch from 256 to 8192 using linear LR scaling with a gradual warmup over the first 5 epochs
— Goyal 2017
LLMs like Llama use cosine decay with warmup — typically 2–5K warmup steps for 70B models
— Llama
Smith 2018 super-convergence: one-cycle LR can train 10× faster with higher peak LR
— Smith 2018
Adam/AdamW: sqrt scaling often works better than linear for large batch sizes
— Best practice
📋 Key Takeaways
- Linear scaling: LR ∝ batch size — double batch → double LR (Goyal 2017, SGD)
- Sqrt scaling: LR ∝ √batch — use for Adam/AdamW, more conservative
- Warmup prevents instability when scaling to large batches — typically 5–10% of total steps
- Cosine annealing smoothly decays LR to a minimum — standard for LLM pre-training
- Too high LR → divergence; too low → slow convergence — tune with small runs first
📖 How It Works
1. Linear Scaling (Goyal 2017)
When you increase the batch size k×, you take k× fewer optimizer steps per epoch; increasing the LR k× keeps total progress per epoch roughly the same. Works well for SGD.
2. Sqrt Scaling
For Adam/AdamW, the gradient noise (standard deviation) shrinks as 1/√batch, so sqrt scaling, LR ∝ √(B_new/B_base), is often more stable than linear.
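A minimal Python sketch of both rules (scale_lr is a hypothetical helper name, not a library API):

```python
import math

def scale_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Scale a base learning rate when the batch size changes.

    "linear": eta_new = eta_base * (B_new / B_base)       (SGD, Goyal 2017)
    "sqrt":   eta_new = eta_base * sqrt(B_new / B_base)   (Adam / AdamW)
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")

# Batch 256 -> 8192, matching the comparison table below
print(scale_lr(0.1, 256, 8192, "linear"))   # 3.2    (32x the base LR)
print(scale_lr(0.001, 256, 8192, "sqrt"))   # ~0.0057 (5.66x the base LR)
```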
3. Warmup
Start with low LR and linearly ramp to peak over warmup steps. Prevents instability when scaling to large batches.
4. Cosine Annealing
After warmup, decay LR following a cosine curve to a minimum. Smooth decay helps final convergence. Standard in LLM training.
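A minimal sketch of linear warmup followed by cosine decay, with step counts and LR values chosen only for illustration (warmup_cosine_lr is a hypothetical helper, not a library API):

```python
import math

def warmup_cosine_lr(step: int, total_steps: int, peak_lr: float,
                     warmup_steps: int, min_lr: float = 0.0) -> float:
    """LR at a given step: linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup phase
        return peak_lr * step / max(1, warmup_steps)
    # Progress through the decay phase, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Example: 10k total steps, 500 warmup steps, peak 3e-4, min 3e-5 (10% of peak)
for s in (0, 250, 500, 5000, 10_000):
    print(s, warmup_cosine_lr(s, 10_000, 3e-4, 500, 3e-5))
```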
5. Linear Decay
Simple linear decay: LR = η_max × (1 − t/T). Used in ViT and some vision models.
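The same idea with a linear decay in place of the cosine (again a hypothetical helper, illustrative only):

```python
def warmup_linear_lr(step: int, total_steps: int, peak_lr: float,
                     warmup_steps: int, min_lr: float = 0.0) -> float:
    """LR at a given step: linear warmup to peak_lr, then linear decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr + (min_lr - peak_lr) * min(1.0, progress)
```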
🎯 Expert Tips
Use warmup for large batches
5–10% of total steps. Prevents early divergence when LR is high.
SGD → linear, Adam → sqrt
Match scaling rule to optimizer. Adam benefits from sqrt scaling.
Cosine for LLMs
Cosine decay with min LR ≈ 10% of max is standard for pre-training.
Grid search on small runs
Test 2–3 LR values on 1–5% of data before full training.
⚖️ Scaling Rules Comparison
| Rule | Formula | Best For | Batch 256→8192 |
|---|---|---|---|
| Linear | LR × (B_new/B_base) | SGD, Goyal-style | 32× LR |
| Sqrt | LR × √(B_new/B_base) | Adam, AdamW | 5.66× LR |
| Constant | LR unchanged | Small batch change | 1× LR |
❓ Frequently Asked Questions
When to use linear vs sqrt scaling?
Linear: SGD, large-batch ImageNet-style training (Goyal 2017). Sqrt: Adam/AdamW, when linear causes instability.
How long should warmup be?
Typically 5–10% of total steps. For very large batches (8K+) and LLM pre-training, 2–5K warmup steps are common.
What is cosine annealing?
LR decays smoothly following a cosine curve from max to min. From Loshchilov & Hutter 2017 SGDR. Standard for LLM pre-training.
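In formula form, with t the step since warmup ended and T the decay length: η(t) = η_min + ½(η_max − η_min)(1 + cos(πt/T)).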
Why does large batch need higher LR?
Larger batch → fewer updates per epoch. To match small-batch behavior, scale LR so effective step size stays similar.
Can I use linear scaling with Adam?
Sometimes, but sqrt is often more stable. Adam has adaptive per-parameter scaling; linear can overshoot.
What is the min LR in cosine decay?
Often 1–10% of max LR. Llama uses 0.1× max. Prevents LR from going to zero for potential fine-tuning.
How does gradient accumulation affect LR?
Effective batch = micro batch × accum steps. Scale LR by effective batch, not micro batch. See Gradient Accumulation Calculator.
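A quick illustrative sketch with made-up numbers (the sqrt rule is shown, as for AdamW):

```python
micro_batch = 8       # batch actually fed to the model per step
accum_steps = 16      # gradients accumulated over this many micro-batches
effective_batch = micro_batch * accum_steps   # 128

base_lr, base_batch = 1e-4, 32
# Scale the LR by the effective batch, not the micro batch
new_lr = base_lr * (effective_batch / base_batch) ** 0.5   # 2e-4
print(effective_batch, new_lr)
```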
What is SGDR warm restarts?
Periodically reset LR to max and restart cosine. Helps escape local minima. From Loshchilov & Hutter 2017.
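A minimal sketch of a warm-restart schedule with fixed-length cycles; the original SGDR also lets each cycle grow by a factor T_mult, which this simplified version omits (sgdr_lr is a hypothetical name):

```python
import math

def sgdr_lr(step: int, cycle_len: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay that restarts from peak_lr every cycle_len steps."""
    t = (step % cycle_len) / max(1, cycle_len)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```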
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual optimal learning rates depend on model architecture, dataset, and optimizer. Linear and sqrt scaling are heuristics; always validate with small-scale experiments. Warmup and scheduler choices vary by task. For production training, follow established recipes (e.g., Llama, BERT) and tune on your setup.
Related Calculators
Confusion Matrix & Classification Metrics Calculator
Compute Accuracy, Precision, Recall, F1, MCC, Specificity, and ROC-AUC from confusion matrix values.
Neural Network Parameter Counter
Count total parameters for neural network architectures. Supports Linear, Conv2D, Embedding, LayerNorm, and MultiHeadAttention layers.
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
AI Fairness & Bias Calculator
Calculate demographic parity, equalized odds, equal opportunity, and disparate impact ratio. Based on IBM AIF360 and Microsoft Fairlearn.
Attention Head Configuration Calculator
Configure MHA, MQA, and GQA attention. Calculate head counts, dimensions, KV cache savings, and memory per attention type.
Compute-Optimal Model Size Calculator (Chinchilla)
Find the compute-optimal model size and training tokens given a compute budget using Chinchilla scaling laws.