
GPU VRAM Requirements

Estimate GPU memory for training and inference. Model, optimizer, gradient, activation breakdown. FP32 to INT4.

Concept Fundamentals

  • Model Memory — params × bytes (weight storage)
  • Optimizer — 2–4× model size (Adam states)
  • Activations — ∝ batch × seq_len (reduced via gradient checkpointing)
  • Application — VRAM planning, GPU selection guide
Why This ML Metric Matters

Why: VRAM limits model size and batch size. Proper estimates prevent OOM and inform GPU purchasing.

How: Inference ≈ params × bytes/param × 1.2 (overhead). Training = model + gradients + AdamW optimizer states (2 × params × 4 bytes, FP32) + activations.
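The two rules of thumb can be checked in a few lines of Python — a minimal sketch in decimal GB (the calculator's own numbers use slightly different unit conventions):

```python
# Quick check of the two rules of thumb above, in decimal GB.
params_b = 7        # model size in billions of parameters
bytes_fp16 = 2      # FP16/BF16 bytes per parameter

# Inference: model x bytes x 1.2 (KV cache, temp buffers, framework overhead)
inference_gb = params_b * bytes_fp16 * 1.2
print(f"7B FP16 inference: ~{inference_gb:.1f} GB")   # ~16.8 GB

# Training: model + gradients + 2x FP32 AdamW states (activations extra)
model_gb = params_b * bytes_fp16
train_gb = model_gb + model_gb + params_b * 2 * 4
print(f"7B BF16 training, before activations: ~{train_gb} GB")   # ~84 GB
```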

🎮
GPU MEMORY IS THE #1 BOTTLENECK

Estimate VRAM for Training & Inference

From 7B to 180B — calculate memory for FP32, FP16, BF16, INT8, INT4. Plan your GPU setup before you buy.

Worked Example

7B parameters, BF16, training — fits on: H200

  Model         13.04 GB
  Optimizer     52.15 GB
  Gradients     13.04 GB
  Activations    8.00 GB
  Total VRAM    86.23 GB



📋 Key Takeaways

  • GPU VRAM is the #1 bottleneck for LLM training and inference
  • Precision halves memory: FP32→FP16/BF16 (2×), FP16→INT8 (2×), INT8→INT4 (2×)
  • AdamW optimizer triples memory vs inference — optimizer states stored in FP32
  • Activations scale with batch size × sequence length — reduce for OOM
  • Gradient checkpointing trades compute for memory — can cut activation memory 3–5×

💡 Did You Know

📊 70B FP16 model needs ~140GB VRAM — fits on 2×A100-80GB or 1×H200
AdamW triples memory vs inference: model + gradients + 2×FP32 optimizer states
🔧 Gradient checkpointing recomputes activations during backward — saves 3–5× activation memory
🤗 HuggingFace Accelerate has a built-in model size estimator for quick checks
🎯 Mixed precision (FP16/BF16) halves model and gradient memory with minimal quality loss
📐 INT4 uses 0.5 bytes/param — 8× smaller than FP32, enables 70B on a single 80GB GPU
🔀 ZeRO sharding splits optimizer states across GPUs — enables training larger models
📈 Activations dominate at large batch×seq — reduce batch size first when OOM

📖 How It Works

1. Bytes per Parameter

FP32=4, FP16/BF16=2, INT8=1, INT4=0.5. Model memory = params × bytes.

2. Inference

Model weights + 20% overhead for KV cache, temp buffers, and framework overhead.

3. Training Components

Model + gradients (same precision) + optimizer states (FP32 for AdamW) + activations (batch×seq×hidden×layers).

4. AdamW Overhead

Momentum and variance each = params × 4 bytes. Total optimizer = 2 × params × 4.

5. Activations

Estimated from batch size, sequence length, hidden dim, and layer count. Dominates at large batch×seq.
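Putting steps 1–5 together — a minimal sketch, assuming decimal GB and a deliberately simplified activation formula (batch × seq × hidden × layers × 2 bytes; real frameworks store several tensors per layer):

```python
# Bytes per parameter by precision (step 1)
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def inference_gb(params_b, precision="fp16"):
    """Step 2: weights plus ~20% overhead (KV cache, temp buffers)."""
    return params_b * BYTES_PER_PARAM[precision] * 1.2

def activations_gb(batch, seq_len, hidden, layers, bytes_per=2):
    """Step 5: rough estimate -- one activation tensor per layer."""
    return batch * seq_len * hidden * layers * bytes_per / 1e9

def training_gb(params_b, precision="bf16", act_gb=0.0):
    """Steps 3-4: model + gradients (same precision) + FP32 AdamW states."""
    model = params_b * BYTES_PER_PARAM[precision]
    optimizer = params_b * 2 * 4   # momentum + variance, 4 bytes each (step 4)
    return model + model + optimizer + act_gb

act = activations_gb(batch=4, seq_len=2048, hidden=4096, layers=32)
print(f"activations: ~{act:.1f} GB")
print(f"7B BF16 training: ~{training_gb(7, act_gb=act):.1f} GB")
print(f"70B FP16 inference: ~{inference_gb(70):.0f} GB")
```

The estimator reproduces the headline numbers in this page to within the stated ~20% tolerance.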

🎯 Expert Tips

Use mixed precision

FP16/BF16 halves memory with minimal quality loss. Use gradient (loss) scaling for FP16 stability; BF16's wider exponent range usually makes it unnecessary.

Gradient checkpointing

Recompute activations in backward pass. Cuts activation memory 3–5× at ~20% compute cost.
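The memory/compute tradeoff can be illustrated with the classic √L checkpointing scheme — a simplified model for intuition, not any specific framework's policy: keep ~√L evenly spaced checkpoints and recompute one segment at a time during backward.

```python
import math

def backward_activation_gb(layers, per_layer_gb, checkpointing=False):
    """Peak activation memory held during the backward pass."""
    if not checkpointing:
        return layers * per_layer_gb                 # every layer's activations kept
    k = math.isqrt(layers)                           # ~sqrt(L) stored checkpoints
    return (k + layers // k) * per_layer_gb          # checkpoints + one recomputed segment

full = backward_activation_gb(64, 0.25)                        # 16.0 GB
ckpt = backward_activation_gb(64, 0.25, checkpointing=True)    # 4.0 GB
print(f"{full / ckpt:.0f}x less activation memory")            # 4x -- within the 3-5x range
```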

ZeRO sharding

Split optimizer states and gradients across GPUs. Enables training models that don't fit on one GPU.

Estimate before buying

Run this calculator before purchasing GPUs. 70B FP16 needs 140GB+ — plan for multi-GPU or quantization.

⚖️ Precision Comparison

Precision    Bytes/Param   7B Model   70B Model   Typical Use
FP32         4             28 GB      280 GB      Full precision, rare
FP16/BF16    2             14 GB      140 GB      Training standard
INT8         1             7 GB       70 GB       Quantized inference
INT4         0.5           3.5 GB     35 GB       Extreme compression
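The table follows directly from params × bytes/param; a short loop reproduces it (decimal GB):

```python
# Model memory = params (billions) x bytes per parameter -> decimal GB
bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}
for precision, b in bytes_per_param.items():
    print(f"{precision:>10}: 7B = {7 * b:g} GB, 70B = {70 * b:g} GB")
```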

❓ Frequently Asked Questions

How much VRAM does a 70B model need?

FP16: ~140GB. INT8: ~70GB. INT4: ~35GB. For training with AdamW, add gradients (~140GB) and FP32 optimizer states (~560GB), plus activations.

Why does AdamW triple memory?

AdamW stores momentum and variance in FP32 (8 bytes/param total). Added to FP32 weights and gradients, that's ~3× the inference footprint — and the multiple is even larger with mixed-precision weights, since the optimizer states stay full size.
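Concretely, for a 70B model trained with BF16 weights and FP32 AdamW states (decimal GB, activations excluded):

```python
params_b = 70
model     = params_b * 2        # BF16 weights: 140 GB
grads     = params_b * 2        # gradients, same precision: 140 GB
optimizer = params_b * 2 * 4    # AdamW momentum + variance in FP32: 560 GB
print(model + grads + optimizer)   # 840 GB before activations
```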

What is gradient checkpointing?

Instead of storing all activations for backward pass, recompute them on demand. Saves 3–5× activation memory at ~20% compute overhead.

Can I run 70B on a single GPU?

Yes, with INT4 quantization (~35GB) on an A100-80GB or H100. FP16 needs 2×A100-80GB or 1×H200 (141GB).

What is ZeRO?

ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and parameters across GPUs to reduce per-GPU memory.
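A back-of-envelope per-GPU model for the three ZeRO stages — my own simplification, which ignores communication buffers, FP32 master weights, and activations:

```python
def zero_per_gpu_gb(params_b, n_gpus, stage, bytes_model=2):
    """Per-GPU memory for model + gradients + AdamW states under ZeRO stages 1-3."""
    model = params_b * bytes_model
    grads = params_b * bytes_model
    opt   = params_b * 8            # 2x FP32 AdamW states
    if stage >= 1: opt   /= n_gpus  # ZeRO-1: shard optimizer states
    if stage >= 2: grads /= n_gpus  # ZeRO-2: also shard gradients
    if stage >= 3: model /= n_gpus  # ZeRO-3: also shard parameters
    return model + grads + opt

for stage in (1, 2, 3):
    print(f"ZeRO-{stage}, 7B model on 8 GPUs: {zero_per_gpu_gb(7, 8, stage):.2f} GB/GPU")
```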

FP16 vs BF16 for training?

Both use 2 bytes/param. BF16 has better numerical range, preferred for training. FP16 is fine for inference.

How accurate is this calculator?

Estimates are within ~20%. Real usage depends on framework, CUDA kernels, and implementation. Use for planning.

What if I get OOM?

Reduce batch size, use gradient accumulation, enable gradient checkpointing, or switch to lower precision / quantization.
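Of these, gradient accumulation is often the first thing to try: run several micro-batches and sum gradients before each optimizer step, so activation memory scales with the micro-batch while the effective batch is unchanged. A sketch of the arithmetic (all numbers illustrative):

```python
target_batch = 32      # effective batch size the training recipe calls for
micro_batch  = 4       # largest batch that actually fits in VRAM
accum_steps  = target_batch // micro_batch    # gradient-accumulation steps: 8

act_gb_per_sample = 0.5    # illustrative activation cost per sample
print(micro_batch * act_gb_per_sample, "GB held instead of",
      target_batch * act_gb_per_sample, "GB")   # 2.0 GB held instead of 16.0 GB
```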

📊 GPU VRAM by the Numbers

140GB
70B FP16
3×
AdamW Overhead
80GB
H100 VRAM
0.5
INT4 bytes/param

⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual VRAM usage depends on framework (PyTorch, JAX), CUDA version, model architecture, and implementation details. Activation memory estimates are approximate. For production, validate with profiling tools (nvidia-smi, PyTorch memory profiler) and test on target hardware.
