GPU VRAM Requirements
Estimate GPU memory for training and inference. Model, optimizer, gradient, activation breakdown. FP32 to INT4.
Why This ML Metric Matters
Why: VRAM limits model size and batch size. Proper estimates prevent OOM and inform GPU purchasing.
How: Inference: model × bytes × 1.2. Training: model + optimizer (2×FP32) + gradients + activations.
Estimate VRAM for Training & Inference
From 7B to 180B — calculate memory for FP32, FP16, BF16, INT8, INT4. Plan your GPU setup before you buy.
🤖 AI & ML Facts
70B FP16 needs ~140GB VRAM — fits on 2×A100-80GB or 1×H200
— Memory calc
AdamW triples memory vs inference: model + gradients + 2×FP32 optimizer states
— ZeRO paper
Gradient checkpointing saves 3–5× activation memory at ~20–30% more compute
— HuggingFace
INT4 uses 0.5 bytes/param — 8× smaller than FP32, fitting a 70B model on a single 80GB GPU
— Quantization
📋 Key Takeaways
- GPU VRAM is the #1 bottleneck for LLM training and inference
- Precision halves memory at each step: FP32→FP16/BF16 (2×), FP16→INT8 (2×), INT8→INT4 (2×)
- AdamW optimizer roughly triples memory vs inference — optimizer states are stored in FP32
- Activations scale with batch size × sequence length — reduce these first on OOM
- Gradient checkpointing trades compute for memory — can cut activation memory 3–5×
📖 How It Works
1. Bytes per Parameter
FP32=4, FP16/BF16=2, INT8=1, INT4=0.5. Model memory = params × bytes.
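Step 1 as a minimal sketch in pure Python (model size in billions of parameters, memory in GB, using the common 1 GB ≈ 1e9 bytes approximation; names are illustrative):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def model_memory_gb(params_billions, precision):
    """Raw weight memory: params × bytes/param (1 GB ≈ 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

print(model_memory_gb(7, "fp16"))   # 14.0
print(model_memory_gb(70, "int4"))  # 35.0
```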
2. Inference
Model weights plus ~20% headroom for the KV cache, temporary buffers, and framework overhead.
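The inference estimate in code — note that the 20% overhead factor is this calculator's rule of thumb, not a measured constant:

```python
def inference_vram_gb(params_billions, bytes_per_param, overhead=1.2):
    """Weights plus ~20% for KV cache, temporary buffers, and framework overhead."""
    return params_billions * bytes_per_param * overhead

# 7B model in FP16: 7 × 2 × 1.2 ≈ 16.8 GB
print(round(inference_vram_gb(7, 2), 1))  # 16.8
```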
3. Training Components
Model + gradients (same precision) + optimizer states (FP32 for AdamW) + activations (batch×seq×hidden×layers).
4. AdamW Overhead
Momentum and variance each take params × 4 bytes, so total optimizer state = 2 × params × 4 bytes = 8 bytes/param.
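Steps 3–4 combined into a sketch of the training-side total, excluding activations (assumes AdamW with both states kept in FP32, as described above):

```python
def training_vram_gb(params_billions, bytes_per_param):
    """Model + gradients (same precision) + AdamW momentum/variance in FP32."""
    model = params_billions * bytes_per_param
    gradients = params_billions * bytes_per_param
    optimizer = 2 * params_billions * 4  # momentum + variance, 4 bytes each
    return model + gradients + optimizer

# 7B in FP16: 14 + 14 + 56 = 84 GB before activations
print(training_vram_gb(7, 2))  # 84
```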
5. Activations
Estimated from batch size, sequence length, hidden dim, and layer count. Dominates at large batch×seq.
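Step 5 as a rough sketch. Real activation memory depends heavily on architecture and implementation; the `tensors_per_layer` multiplier below is an illustrative assumption (a single hidden-sized tensor per layer would undercount attention and MLP intermediates):

```python
def activation_vram_gb(batch, seq_len, hidden, layers,
                       bytes_per_param=2, tensors_per_layer=16):
    """batch × seq × hidden × layers, scaled by an assumed number of
    stored tensors per layer (architecture-dependent; here a guess)."""
    elements = batch * seq_len * hidden * layers * tensors_per_layer
    return elements * bytes_per_param / 1e9

# e.g. batch 8, seq 2048, hidden 4096, 32 layers:
print(round(activation_vram_gb(8, 2048, 4096, 32), 1))  # ≈ 68.7 GB
```

With a large batch × sequence product, this term dwarfs the weights — which is why reducing batch size is the first OOM remedy.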
🎯 Expert Tips
Use mixed precision
FP16/BF16 halves memory with minimal quality loss. Use gradient scaling for stability.
Gradient checkpointing
Recompute activations in backward pass. Cuts activation memory 3–5× at ~20% compute cost.
ZeRO sharding
Split optimizer states and gradients across GPUs. Enables training models that don't fit on one GPU.
Estimate before buying
Run this calculator before purchasing GPUs. 70B FP16 needs 140GB+ — plan for multi-GPU or quantization.
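A small helper in the same spirit (the function name is illustrative; it counts only weight memory — real deployments need extra room for the KV cache or optimizer states on top):

```python
import math

def gpus_needed(weights_gb, gpu_gb):
    """Minimum GPUs to hold the weights alone; leave headroom in practice."""
    return math.ceil(weights_gb / gpu_gb)

print(gpus_needed(140, 80))   # 70B FP16 → 2× A100-80GB
print(gpus_needed(140, 141))  # 70B FP16 → 1× H200
print(gpus_needed(35, 80))    # 70B INT4 → 1× A100-80GB
```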
⚖️ Precision Comparison
| Precision | Bytes/Param | 7B Model | 70B Model | Typical Use |
|---|---|---|---|---|
| FP32 | 4 | 28 GB | 280 GB | Full precision, rare |
| FP16/BF16 | 2 | 14 GB | 140 GB | Training standard |
| INT8 | 1 | 7 GB | 70 GB | Quantized inference |
| INT4 | 0.5 | 3.5 GB | 35 GB | Extreme compression |
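The table's weight-memory columns can be reproduced directly (GB ≈ params in billions × bytes/param):

```python
precisions = [("FP32", 4.0), ("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]

for name, bytes_per in precisions:
    print(f"{name:>9}: 7B = {7 * bytes_per:g} GB, 70B = {70 * bytes_per:g} GB")
```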
❓ Frequently Asked Questions
How much VRAM does a 70B model need?
FP16: ~140GB. INT8: ~70GB. INT4: ~35GB. For training with AdamW, add FP32 optimizer states (~560GB) and FP16 gradients (~140GB), plus activations.
Why does AdamW triple memory?
AdamW stores momentum and variance in FP32 (8 bytes/param total) — that alone is 2× the FP32 weights, so model + optimizer states is already 3× the weight memory, before gradients and activations.
What is gradient checkpointing?
Instead of storing all activations for backward pass, recompute them on demand. Saves 3–5× activation memory at ~20% compute overhead.
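A simplified memory model shows where the 3–5× figure can come from. Suppose each layer normally stores one activation "unit"; with a checkpoint every k layers you keep n/k boundary activations plus at most k layers' worth during recompute, minimized near k = √n. This uniform-segment model is a simplification, not how any particular framework accounts memory:

```python
import math

def activation_units(n_layers, ckpt_every=None):
    """Stored units: n without checkpointing, ≈ n/k + k with it."""
    if ckpt_every is None:
        return n_layers
    return n_layers / ckpt_every + ckpt_every

n = 64
k = round(math.sqrt(n))              # 8
full = activation_units(n)           # 64
ckpt = activation_units(n, k)        # 64/8 + 8 = 16
print(f"savings ≈ {full / ckpt:.0f}×")  # savings ≈ 4×
```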
Can I run 70B on a single GPU?
Yes, with INT4 quantization (~35GB) on an A100-80GB or H100. FP16 needs 2×A100-80GB or 1×H200 (141GB).
What is ZeRO?
ZeRO (Zero Redundancy Optimizer) partitions optimizer states, gradients, and parameters across GPUs to reduce per-GPU memory.
FP16 vs BF16 for training?
Both use 2 bytes/param. BF16 has better numerical range, preferred for training. FP16 is fine for inference.
How accurate is this calculator?
Estimates are within ~20%. Real usage depends on framework, CUDA kernels, and implementation. Use for planning.
What if I get OOM?
Reduce batch size, use gradient accumulation, enable gradient checkpointing, or switch to lower precision / quantization.
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual VRAM usage depends on framework (PyTorch, JAX), CUDA version, model architecture, and implementation details. Activation memory estimates are approximate. For production, validate with profiling tools (nvidia-smi, PyTorch memory profiler) and test on target hardware.