
Batch Size & Learning Rate Scaling

Calculate scaled learning rates with the linear and square-root scaling rules (Goyal et al. 2017), plus warmup, cosine, and linear schedules. Includes ImageNet, BERT, LLM, and ViT presets.

Concept Fundamentals

  • Linear Rule: LR ∝ batch_size (linear scaling)
  • Sqrt Rule: LR ∝ √batch_size (square-root scaling)
  • Warmup: gradual LR increase for training stability
  • Goyal et al. 2017: the paper behind large-batch linear scaling

Why This ML Metric Matters

Why: When you increase batch size, you need to scale LR to maintain convergence. Linear for SGD, sqrt for Adam/AdamW.

How: Linear rule: η_new = η_base × (B_new/B_base). Sqrt rule: η_new = η_base × √(B_new/B_base). Warmup ramps the LR from 0 to its peak; see the sketch after the list below.

  • Linear for SGD
  • Sqrt for Adam
  • 5–10% warmup typical
  • Cosine standard for LLMs
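
A minimal sketch of both rules in plain Python (the scale_lr helper below is illustrative, not a library API):

```python
import math

def scale_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Scale a reference LR to a new batch size using the linear or sqrt rule."""
    ratio = new_batch / base_batch
    if rule == "linear":       # Goyal et al. 2017: LR grows in proportion to batch size
        return base_lr * ratio
    if rule == "sqrt":         # more conservative; often preferred for Adam/AdamW
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")

print(scale_lr(0.1, 256, 8192, "linear"))  # 3.2
print(scale_lr(0.1, 256, 8192, "sqrt"))    # ≈ 0.566
```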

Batch Size & Learning Rate Calculator

Linear and square-root scaling rules (Goyal et al. 2017), with warmup, cosine, and linear schedules and ImageNet, BERT, LLM, and ViT presets.


Inputs

  • Base LR (e.g., 0.1 for SGD, 2e-5 for Adam)
  • Base batch size (the reference batch)
  • New batch size (the batch to scale to)
  • Warmup steps (linear ramp to the peak LR)
  • Total steps (full training length)
Example output (base LR 0.1, batch 256 → 8192):

  • Linear LR: 3.2
  • Sqrt LR: ≈ 0.566
  • Recommended LR: 3.2 (linear scaling rule)

[Charts: LR schedule over training steps; batch–LR relationship curve]



📋 Key Takeaways

  • Linear scaling: LR ∝ batch size — double batch → double LR (Goyal 2017, SGD)
  • Sqrt scaling: LR ∝ √batch — use for Adam/AdamW, more conservative
  • Warmup prevents instability when scaling to large batches — typically 5–10% of total steps
  • Cosine annealing smoothly decays LR to a minimum — standard for LLM pre-training
  • Too high LR → divergence; too low → slow convergence — tune with small runs first

💡 Did You Know

  • 📐 ImageNet ResNet-50: Goyal et al. scaled batch from 256 to 8192 using linear LR scaling with 5K warmup
  • 🤖 LLMs like Llama use cosine decay with warmup — typically 2–5K warmup steps for 70B models
  • Smith 2018 super-convergence: one-cycle LR can train 10× faster with higher peak LR
  • 🔄 SGDR (Loshchilov 2017) adds warm restarts to cosine — helps escape local minima
  • 📊 Adam/AdamW: sqrt scaling often works better than linear for large batch sizes
  • 🎯 BERT fine-tuning: 2e-5 base LR, 10% warmup, cosine decay is a common recipe
  • 👁️ ViT (Dosovitskiy 2021) uses linear decay with long warmup (10K steps) for stability
  • 📉 Yang 2024 Power Scheduler generalizes cosine with power-law decay for better control

📖 How It Works

1. Linear Scaling (Goyal 2017)

When you increase the batch size k×, you take k× fewer optimizer steps per epoch. Increasing the LR k× compensates, keeping the total parameter movement per epoch roughly constant. Works well for SGD.

2. Sqrt Scaling

For Adam/AdamW, the gradient noise (standard deviation, not variance) scales as 1/√batch. Sqrt scaling, LR ∝ √(B_new/B_base), is often more stable than linear.
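
To see how quickly the two rules diverge as batch size grows, here is a quick comparison; the 3e-4 base LR is just an illustrative Adam-style value:

```python
import math

base_lr, base_batch = 3e-4, 256  # illustrative Adam-style reference point
for new_batch in (512, 1024, 4096, 8192):
    ratio = new_batch / base_batch
    print(f"B={new_batch:5d}  linear={base_lr * ratio:.2e}  sqrt={base_lr * math.sqrt(ratio):.2e}")
```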

3. Warmup

Start with low LR and linearly ramp to peak over warmup steps. Prevents instability when scaling to large batches.
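
A minimal warmup helper, assuming a linear ramp from zero:

```python
def warmup_lr(step: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup: ramp the LR from 0 at step 0 to peak_lr at warmup_steps."""
    return peak_lr * min(step, warmup_steps) / warmup_steps

print(warmup_lr(500, 1000, 3.2))  # 1.6, halfway through warmup
```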

4. Cosine Annealing

After warmup, decay LR following a cosine curve to a minimum. Smooth decay helps final convergence. Standard in LLM training.
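
A sketch of warmup followed by cosine decay to a floor; the step counts and LR values below are illustrative, not a prescription:

```python
import math

def cosine_schedule(step: int, total_steps: int, warmup_steps: int,
                    peak_lr: float, min_lr: float) -> float:
    """Linear warmup, then cosine decay from peak_lr down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# e.g., a Llama-style floor at 10% of the peak LR
lr = cosine_schedule(step=50_000, total_steps=100_000, warmup_steps=2_000,
                     peak_lr=3e-4, min_lr=3e-5)
```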

5. Linear Decay

Simple linear decay: LR = η_max × (1 − t/T). Used in ViT and some vision models.
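
The same formula as a one-line helper; ViT-style recipes would pair this with a long warmup like the one sketched above:

```python
def linear_decay_lr(step: int, total_steps: int, peak_lr: float) -> float:
    """Linear decay: LR = peak_lr * (1 - t/T), reaching 0 at the final step."""
    return peak_lr * (1.0 - step / total_steps)
```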

🎯 Expert Tips

Use warmup for large batches

5–10% of total steps. Prevents early divergence when LR is high.

SGD → linear, Adam → sqrt

Match scaling rule to optimizer. Adam benefits from sqrt scaling.

Cosine for LLMs

Cosine decay with min LR ≈ 10% of max is standard for pre-training.

Grid search on small runs

Test 2–3 LR values on 1–5% of your data before committing to a full run; see the sketch below.
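
A skeleton of that workflow; short_run is a hypothetical stand-in for your own abbreviated training loop:

```python
candidate_lrs = [1e-4, 3e-4, 1e-3]  # span roughly one order of magnitude
for lr in candidate_lrs:
    # val_loss = short_run(lr, data_fraction=0.02)  # hypothetical 1-5% subset run
    print(f"queue short run with lr={lr:.0e}")
```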

⚖️ Scaling Rules Comparison

Rule       Formula                 Best For             Batch 256→8192
Linear     LR × (B_new/B_base)     SGD, Goyal-style     32× LR
Sqrt       LR × √(B_new/B_base)    Adam, AdamW          5.66× LR
Constant   LR unchanged            Small batch change   1× LR
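
The last column can be checked directly:

```python
import math

ratio = 8192 / 256
print(ratio)                       # 32.0 -> linear rule multiplies the LR by 32×
print(round(math.sqrt(ratio), 2))  # 5.66 -> sqrt rule multiplies the LR by ≈5.66×
```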

❓ Frequently Asked Questions

When to use linear vs sqrt scaling?

Linear: SGD, large-batch ImageNet-style training (Goyal 2017). Sqrt: Adam/AdamW, when linear causes instability.

How long should warmup be?

Typically 5–10% of total steps. For very large batches (8K+) and LLM pre-training, a fixed 2–5K warmup steps is common.

What is cosine annealing?

LR decays smoothly following a cosine curve from max to min. From Loshchilov & Hutter 2017 SGDR. Standard for LLM pre-training.

Why does large batch need higher LR?

Larger batch → fewer updates per epoch. To match small-batch behavior, scale LR so effective step size stays similar.

Can I use linear scaling with Adam?

Sometimes, but sqrt is often more stable. Adam has adaptive per-parameter scaling; linear can overshoot.

What is the min LR in cosine decay?

Often 1–10% of max LR. Llama uses 0.1× max. Prevents LR from going to zero for potential fine-tuning.

How does gradient accumulation affect LR?

Effective batch = micro batch × accum steps. Scale LR by effective batch, not micro batch. See Gradient Accumulation Calculator.
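
A quick illustration with made-up sizes:

```python
micro_batch, accum_steps = 8, 32               # illustrative values
effective_batch = micro_batch * accum_steps    # 256 samples per optimizer step
base_lr, base_batch = 0.1, 64
scaled_lr = base_lr * (effective_batch / base_batch)  # scale by the effective batch: 0.4
```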

What is SGDR warm restarts?

Periodically reset LR to max and restart cosine. Helps escape local minima. From Loshchilov & Hutter 2017.
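
A simplified sketch with a fixed cycle length (the original SGDR also lets cycles grow by a multiplier after each restart):

```python
import math

def sgdr_lr(step: int, cycle_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine annealing with warm restarts: LR snaps back to peak_lr every cycle_steps."""
    progress = (step % cycle_steps) / cycle_steps  # resets to 0 at each restart
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```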

📊 LR Scaling by the Numbers

  • 32×: linear scaling, batch 256 → 8192
  • 5.66×: sqrt scaling, batch 256 → 8192
  • 5–10%: typical warmup fraction of total steps
  • 0.1×: common cosine min LR (fraction of max)

⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual optimal learning rates depend on model architecture, dataset, and optimizer. Linear and sqrt scaling are heuristics; always validate with small-scale experiments. Warmup and scheduler choices vary by task. For production training, follow established recipes (e.g., Llama, BERT) and tune on your setup.
