Batch Size & Learning Rate Scaling
Calculate optimal learning rates with the linear and sqrt scaling rules (Goyal 2017). Supports warmup, cosine, and linear schedules, with ImageNet, BERT, LLM, and ViT presets.
Why This ML Metric Matters
Why: When you increase batch size, you need to scale LR to maintain convergence. Linear for SGD, sqrt for Adam/AdamW.
How: Linear: η_new = η_base × (B_new/B_base). Sqrt: η_new = η_base × √(B_new/B_base). Warmup ramps LR from 0 to peak.
- Linear for SGD
- Sqrt for Adam
- 5–10% warmup typical
- Cosine standard for LLMs
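For example (illustrative numbers): scaling from batch 256 at LR 0.01 to batch 1024 gives 0.01 × (1024/256) = 0.04 under the linear rule and 0.01 × √(1024/256) = 0.02 under the sqrt rule.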
Batch Size & Learning Rate Calculator
Linear and sqrt scaling rules (Goyal 2017). Warmup, cosine, linear schedules. ImageNet, BERT, LLM, ViT presets.
📊 Quick Examples
Interactive inputs and two charts: LR Schedule Over Training Steps and Batch–LR Relationship Curve.
🤖 AI & ML Facts
ImageNet ResNet-50: Goyal et al. scaled the batch from 256 to 8192 using linear LR scaling with a gradual warmup over the first 5 epochs
— Goyal 2017
LLMs like Llama use cosine decay with warmup — typically 2–5K warmup steps for 70B models
— Llama
Smith 2018 super-convergence: one-cycle LR can train 10× faster with higher peak LR
— Smith 2018
Adam/AdamW: sqrt scaling often works better than linear for large batch sizes
— Best practice
📋 Key Takeaways
- Linear scaling: LR ∝ batch size — double batch → double LR (Goyal 2017, SGD)
- Sqrt scaling: LR ∝ √batch — use for Adam/AdamW, more conservative
- Warmup prevents instability when scaling to large batches — typically 5–10% of total steps
- Cosine annealing smoothly decays LR to a minimum — standard for LLM pre-training
- Too high LR → divergence; too low → slow convergence — tune with small runs first
📖 How It Works
1. Linear Scaling (Goyal 2017)
When you increase the batch size k×, you take k× fewer optimizer steps per epoch; increasing the LR k× keeps total progress per epoch roughly the same. Works well for SGD.
2. Sqrt Scaling
For Adam/AdamW, the gradient noise (standard deviation) shrinks as 1/√batch, so sqrt scaling, LR ∝ √(B_new/B_base), is often more stable than linear.
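A minimal Python sketch of both rules (scale_lr is a hypothetical helper name, not a library API):

```python
import math

def scale_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Scale a base learning rate when the batch size changes.

    "linear": eta_new = eta_base * (B_new / B_base)       (SGD, Goyal 2017)
    "sqrt":   eta_new = eta_base * sqrt(B_new / B_base)   (Adam / AdamW)
    """
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")

# Batch 256 -> 8192, matching the comparison table below
print(scale_lr(0.1, 256, 8192, "linear"))   # 3.2    (32x the base LR)
print(scale_lr(0.001, 256, 8192, "sqrt"))   # ~0.0057 (5.66x the base LR)
```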
3. Warmup
Start with low LR and linearly ramp to peak over warmup steps. Prevents instability when scaling to large batches.
4. Cosine Annealing
After warmup, decay LR following a cosine curve to a minimum. Smooth decay helps final convergence. Standard in LLM training.
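A minimal sketch of linear warmup followed by cosine decay, with step counts and LR values chosen only for illustration (warmup_cosine_lr is a hypothetical helper, not a library API):

```python
import math

def warmup_cosine_lr(step: int, total_steps: int, peak_lr: float,
                     warmup_steps: int, min_lr: float = 0.0) -> float:
    """LR at a given step: linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr over the warmup phase
        return peak_lr * step / max(1, warmup_steps)
    # Progress through the decay phase, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Example: 10k total steps, 500 warmup steps, peak 3e-4, min 3e-5 (10% of peak)
for s in (0, 250, 500, 5000, 10_000):
    print(s, warmup_cosine_lr(s, 10_000, 3e-4, 500, 3e-5))
```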
5. Linear Decay
Simple linear decay: LR = η_max × (1 − t/T). Used in ViT and some vision models.
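The same idea with a linear decay in place of the cosine (again a hypothetical helper, illustrative only):

```python
def warmup_linear_lr(step: int, total_steps: int, peak_lr: float,
                     warmup_steps: int, min_lr: float = 0.0) -> float:
    """LR at a given step: linear warmup to peak_lr, then linear decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr + (min_lr - peak_lr) * min(1.0, progress)
```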
🎯 Expert Tips
Use warmup for large batches
5–10% of total steps. Prevents early divergence when LR is high.
SGD → linear, Adam → sqrt
Match scaling rule to optimizer. Adam benefits from sqrt scaling.
Cosine for LLMs
Cosine decay with min LR ≈ 10% of max is standard for pre-training.
Grid search on small runs
Test 2–3 LR values on 1–5% of data before full training.
⚖️ Scaling Rules Comparison
| Rule | Formula | Best For | Batch 256→8192 |
|---|---|---|---|
| Linear | LR × (B_new/B_base) | SGD, Goyal-style | 32× LR |
| Sqrt | LR × √(B_new/B_base) | Adam, AdamW | 5.66× LR |
| Constant | LR unchanged | Small batch change | 1× LR |
❓ Frequently Asked Questions
When to use linear vs sqrt scaling?
Linear: SGD, large-batch ImageNet-style training (Goyal 2017). Sqrt: Adam/AdamW, when linear causes instability.
How long should warmup be?
Typically 5–10% of total steps. For very large batches (8K+) and LLM pre-training, 2–5K warmup steps are common.
What is cosine annealing?
LR decays smoothly following a cosine curve from max to min. From Loshchilov & Hutter 2017 SGDR. Standard for LLM pre-training.
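In formula form, with t the step since warmup ended and T the decay length: η(t) = η_min + ½(η_max − η_min)(1 + cos(πt/T)).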
Why does large batch need higher LR?
Larger batch → fewer updates per epoch. To match small-batch behavior, scale LR so effective step size stays similar.
Can I use linear scaling with Adam?
Sometimes, but sqrt is often more stable. Adam has adaptive per-parameter scaling; linear can overshoot.
What is the min LR in cosine decay?
Often 1–10% of max LR. Llama uses 0.1× max. Prevents LR from going to zero for potential fine-tuning.
How does gradient accumulation affect LR?
Effective batch = micro batch × accum steps. Scale LR by effective batch, not micro batch. See Gradient Accumulation Calculator.
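A quick illustrative sketch with made-up numbers (the sqrt rule is shown, as for AdamW):

```python
micro_batch = 8       # batch actually fed to the model per step
accum_steps = 16      # gradients accumulated over this many micro-batches
effective_batch = micro_batch * accum_steps   # 128

base_lr, base_batch = 1e-4, 32
# Scale the LR by the effective batch, not the micro batch
new_lr = base_lr * (effective_batch / base_batch) ** 0.5   # 2e-4
print(effective_batch, new_lr)
```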
What is SGDR warm restarts?
Periodically reset LR to max and restart cosine. Helps escape local minima. From Loshchilov & Hutter 2017.
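A minimal sketch of a warm-restart schedule with fixed-length cycles; the original SGDR also lets each cycle grow by a factor T_mult, which this simplified version omits (sgdr_lr is a hypothetical name):

```python
import math

def sgdr_lr(step: int, cycle_len: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay that restarts from peak_lr every cycle_len steps."""
    t = (step % cycle_len) / max(1, cycle_len)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```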
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual optimal learning rates depend on model architecture, dataset, and optimizer. Linear and sqrt scaling are heuristics; always validate with small-scale experiments. Warmup and scheduler choices vary by task. For production training, follow established recipes (e.g., Llama, BERT) and tune on your setup.
Related Calculators
Confusion Matrix & Classification Metrics Calculator
Compute Accuracy, Precision, Recall, F1, MCC, Specificity, and ROC-AUC from confusion matrix values.
Neural Network Parameter Counter
Count total parameters for neural network architectures. Supports Linear, Conv2D, Embedding, LayerNorm, and MultiHeadAttention layers.
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
AI Fairness & Bias Calculator
Calculate demographic parity, equalized odds, equal opportunity, and disparate impact ratio. Based on IBM AIF360 and Microsoft Fairlearn.
Attention Head Configuration Calculator
Configure MHA, MQA, and GQA attention. Calculate head counts, dimensions, KV cache savings, and memory per attention type.
Compute-Optimal Model Size Calculator (Chinchilla)
Find the compute-optimal model size and training tokens given a compute budget using Chinchilla scaling laws.