GPU VRAM Requirements
Estimate GPU memory for training and inference. Model, optimizer, gradient, activation breakdown. FP32 to INT4.
Why This ML Metric Matters
Why: VRAM limits model size and batch size. Proper estimates prevent OOM and inform GPU purchasing.
How: Inference: model × bytes × 1.2. Training: model + optimizer (2×FP32) + gradients + activations.
Estimate VRAM for Training & Inference
From 7B to 180B — calculate memory for FP32, FP16, BF16, INT8, INT4. Plan your GPU setup before you buy.
🤖 AI & ML Facts
70B FP16 needs ~140GB VRAM — fits on 2×A100-80GB or 1×H200
— Memory calc
AdamW triples memory vs inference: model + gradients + 2×FP32 optimizer states
— ZeRO paper
Gradient checkpointing saves 3–5× activation memory at ~20–30% more compute
— HuggingFace
INT4 uses 0.5 bytes/param — 8× smaller than FP32, fitting a 70B model on a single 80GB GPU
— Quantization
📋 Key Takeaways
- GPU VRAM is the #1 bottleneck for LLM training and inference
- Precision halves memory at each step: FP32→FP16/BF16 (2×), FP16→INT8 (2×), INT8→INT4 (2×)
- AdamW optimizer roughly triples memory vs inference — optimizer states are stored in FP32
- Activations scale with batch size × sequence length — reduce these first on OOM
- Gradient checkpointing trades compute for memory — can cut activation memory 3–5×
📖 How It Works
1. Bytes per Parameter
FP32=4, FP16/BF16=2, INT8=1, INT4=0.5. Model memory = params × bytes.
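Step 1 as a minimal sketch in pure Python (model size in billions of parameters, memory in GB, using the common 1 GB ≈ 1e9 bytes approximation; names are illustrative):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def model_memory_gb(params_billions, precision):
    """Raw weight memory: params × bytes/param (1 GB ≈ 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

print(model_memory_gb(7, "fp16"))   # 14.0
print(model_memory_gb(70, "int4"))  # 35.0
```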
2. Inference
Model weights plus ~20% headroom for the KV cache, temporary buffers, and framework overhead.
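The inference estimate in code — note that the 20% overhead factor is this calculator's rule of thumb, not a measured constant:

```python
def inference_vram_gb(params_billions, bytes_per_param, overhead=1.2):
    """Weights plus ~20% for KV cache, temporary buffers, and framework overhead."""
    return params_billions * bytes_per_param * overhead

# 7B model in FP16: 7 × 2 × 1.2 ≈ 16.8 GB
print(round(inference_vram_gb(7, 2), 1))  # 16.8
```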
3. Training Components
Model + gradients (same precision) + optimizer states (FP32 for AdamW) + activations (batch×seq×hidden×layers).
4. AdamW Overhead
Momentum and variance each take params × 4 bytes, so total optimizer state = 2 × params × 4 bytes = 8 bytes/param.
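Steps 3–4 combined into a sketch of the training-side total, excluding activations (assumes AdamW with both states kept in FP32, as described above):

```python
def training_vram_gb(params_billions, bytes_per_param):
    """Model + gradients (same precision) + AdamW momentum/variance in FP32."""
    model = params_billions * bytes_per_param
    gradients = params_billions * bytes_per_param
    optimizer = 2 * params_billions * 4  # momentum + variance, 4 bytes each
    return model + gradients + optimizer

# 7B in FP16: 14 + 14 + 56 = 84 GB before activations
print(training_vram_gb(7, 2))  # 84
```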
5. Activations
Estimated from batch size, sequence length, hidden dim, and layer count. Dominates at large batch×seq.
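Step 5 as a rough sketch. Real activation memory depends heavily on architecture and implementation; the `tensors_per_layer` multiplier below is an illustrative assumption (a single hidden-sized tensor per layer would undercount attention and MLP intermediates):

```python
def activation_vram_gb(batch, seq_len, hidden, layers,
                       bytes_per_param=2, tensors_per_layer=16):
    """batch × seq × hidden × layers, scaled by an assumed number of
    stored tensors per layer (architecture-dependent; here a guess)."""
    elements = batch * seq_len * hidden * layers * tensors_per_layer
    return elements * bytes_per_param / 1e9

# e.g. batch 8, seq 2048, hidden 4096, 32 layers:
print(round(activation_vram_gb(8, 2048, 4096, 32), 1))  # ≈ 68.7 GB
```

With a large batch × sequence product, this term dwarfs the weights — which is why reducing batch size is the first OOM remedy.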
🎯 Expert Tips
Use mixed precision
FP16/BF16 halves memory with minimal quality loss. Use gradient scaling for stability.
Gradient checkpointing
Recompute activations in backward pass. Cuts activation memory 3–5× at ~20% compute cost.
ZeRO sharding
Split optimizer states and gradients across GPUs. Enables training models that don't fit on one GPU.
Estimate before buying
Run this calculator before purchasing GPUs. 70B FP16 needs 140GB+ — plan for multi-GPU or quantization.
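A small helper in the same spirit (the function name is illustrative; it counts only weight memory — real deployments need extra room for the KV cache or optimizer states on top):

```python
import math

def gpus_needed(weights_gb, gpu_gb):
    """Minimum GPUs to hold the weights alone; leave headroom in practice."""
    return math.ceil(weights_gb / gpu_gb)

print(gpus_needed(140, 80))   # 70B FP16 → 2× A100-80GB
print(gpus_needed(140, 141))  # 70B FP16 → 1× H200
print(gpus_needed(35, 80))    # 70B INT4 → 1× A100-80GB
```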
⚖️ Precision Comparison
| Precision | Bytes/Param | 7B Model | 70B Model | Typical Use |
|---|---|---|---|---|
| FP32 | 4 | 28 GB | 280 GB | Full precision, rare |
| FP16/BF16 | 2 | 14 GB | 140 GB | Training standard |
| INT8 | 1 | 7 GB | 70 GB | Quantized inference |
| INT4 | 0.5 | 3.5 GB | 35 GB | Extreme compression |
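The table's weight-memory columns can be reproduced directly (GB ≈ params in billions × bytes/param):

```python
precisions = [("FP32", 4.0), ("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]

for name, bytes_per in precisions:
    print(f"{name:>9}: 7B = {7 * bytes_per:g} GB, 70B = {70 * bytes_per:g} GB")
```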
❓ Frequently Asked Questions
How much VRAM does a 70B model need?
FP16: ~140GB. INT8: ~70GB. INT4: ~35GB. For training with AdamW, add FP32 optimizer states (~560GB) and FP16 gradients (~140GB), plus activations.
Why does AdamW triple memory?
AdamW stores momentum and variance in FP32 (8 bytes/param total) — that alone is 2× the FP32 weights, so model + optimizer states is already 3× the weight memory, before gradients and activations.
What is gradient checkpointing?
Instead of storing all activations for backward pass, recompute them on demand. Saves 3–5× activation memory at ~20% compute overhead.
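A simplified memory model shows where the 3–5× figure can come from. Suppose each layer normally stores one activation "unit"; with a checkpoint every k layers you keep n/k boundary activations plus at most k layers' worth during recompute, minimized near k = √n. This uniform-segment model is a simplification, not how any particular framework accounts memory:

```python
import math

def activation_units(n_layers, ckpt_every=None):
    """Stored units: n without checkpointing, ≈ n/k + k with it."""
    if ckpt_every is None:
        return n_layers
    return n_layers / ckpt_every + ckpt_every

n = 64
k = round(math.sqrt(n))              # 8
full = activation_units(n)           # 64
ckpt = activation_units(n, k)        # 64/8 + 8 = 16
print(f"savings ≈ {full / ckpt:.0f}×")  # savings ≈ 4×
```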
Can I run 70B on a single GPU?
Yes, with INT4 quantization (~35GB) on an A100-80GB or H100. FP16 needs 2×A100-80GB or 1×H200 (141GB).
What is ZeRO?
ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and parameters across GPUs to reduce per-GPU memory.
FP16 vs BF16 for training?
Both use 2 bytes/param. BF16 has better numerical range, preferred for training. FP16 is fine for inference.
How accurate is this calculator?
Estimates are within ~20%. Real usage depends on framework, CUDA kernels, and implementation. Use for planning.
What if I get OOM?
Reduce batch size, use gradient accumulation, enable gradient checkpointing, or switch to lower precision / quantization.
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual VRAM usage depends on framework (PyTorch, JAX), CUDA version, model architecture, and implementation details. Activation memory estimates are approximate. For production, validate with profiling tools (nvidia-smi, PyTorch memory profiler) and test on target hardware.