
KV Cache Memory Estimation

Estimate KV cache memory for LLM inference across MHA, MQA, and GQA attention variants, including PagedAttention-based serving with vLLM. Plan memory for models such as Llama, GPT-4, and Mistral.

Concept Fundamentals

  • KV Memory — key-value cache size: 2 × L × H_KV × d × seq × batch × bytes
  • Keys + Values — both stored per layer during autoregressive decoding
  • Growth — linear in seq_len; the memory bottleneck for long context
  • MQA / GQA — multi-query / grouped-query attention shrink the cache

Why This ML Metric Matters

Why: The KV cache dominates memory for long-context inference, so capacity planning must account for it alongside model weights.

How: M_KV = 2 × L × H_KV × d × S × B × b, where L is layers, H_KV is KV heads, d is head dimension, S is sequence length, B is batch size, and b is bytes per element. GQA/MQA reduce H_KV; FP8 halves b.

  • GQA yields a 4–8× smaller cache than MHA
  • PagedAttention (vLLM) reduces fragmentation
  • KV cache can exceed model weights at 32K+ context
  • FP8 halves memory vs FP16
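The formula above can be sketched as a small function. The configuration used here (80 layers, 8 KV heads, head_dim 128, which resembles Llama-3-70B-class models) is an assumption for illustration, not a vendor-published spec:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """M_KV = 2 x L x H_KV x d x S x B x b; the leading 2 covers both K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Llama-3-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=32)
print(f"{size:,} bytes = {size / 2**30:.2f} GiB")  # 42,949,672,960 bytes = 40.00 GiB
```

Passing `bytes_per_elem=1` models an FP8 or INT8 cache and halves the result.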
KV CACHE DOMINATES LONG-CONTEXT MEMORY

Estimate KV Cache for MHA, MQA, GQA

Based on PagedAttention (Kwon et al. 2023) and vLLM. Plan memory for Llama, GPT-4, Mistral, Falcon, Gemma.


Inputs

  • transformer layers
  • query heads
  • key/value heads (MQA = 1)
  • dim per head
  • sequence length (tokens)
  • batch size (concurrent sequences)
Example Result (80 layers, GQA with 8 KV heads, seq = 4096, batch = 32, FP16)

  • KV Cache: 40.00 GB (42,949,672,960 bytes)
  • Model Weights: 130.39 GB
  • Total: 170.39 GB

KV Cache vs Sequence Length

Memory Breakdown


🤖 AI & ML Facts

  • 🧠 PagedAttention (Kwon et al. 2023) enables vLLM to serve long-context LLMs with high throughput.
  • 👁️ GQA (Llama, Mistral) uses 8 KV heads for 32–64 Q heads — 4–8× memory savings vs MHA.
  • 📦 KV cache can exceed model size at 32K+ context with large batch.
  • FP8 KV cache on H100 halves memory with minimal quality loss (TensorRT).

📋 Key Takeaways

  • KV cache scales linearly with layers, KV heads, head dim, sequence length, and batch size
  • GQA and MQA reduce KV cache by sharing keys/values across query heads — MQA uses 1 KV head
  • PagedAttention (vLLM) enables non-contiguous KV storage and reduces fragmentation
  • FP8/INT8 halves KV cache vs FP16 — trade quality for memory
  • Long context (32K+) can dominate memory — KV cache can exceed model weights
  • The factor of 2 in the formula accounts for both Key and Value tensors per layer

💡 Did You Know

  • 📊 Llama 3 70B with batch=32, seq=4096 needs ~40 GB of KV cache alone — a large share of total GPU memory
  • MQA (Falcon) uses 1 KV head — a 64× smaller KV cache than MHA for 64-head models
  • 🔧 vLLM PagedAttention reduces fragmentation by 2–4× vs naive contiguous allocation
  • 🤗 GQA (Llama, Gemma) balances quality and memory — an 8:1 ratio is common for 70B models
  • 🎯 FP8 cuts KV cache in half — supported natively by H100 for inference
  • 📐 The factor of 2 in the formula covers K and V — keys and values are each stored per layer per head
  • 🔀 Sliding window (Mistral) caps the effective KV length — reducing memory for long sequences
  • 📈 Batch size × sequence length is the main lever — reduce batch for long context
  • 🧠 Without a KV cache, each new token would recompute keys and values for all previous tokens
  • 🌐 PagedAttention was inspired by OS virtual memory — pages enable flexible allocation

📖 How It Works

1. Attention Types

MHA: each Q head has its own K,V. GQA: Q heads share KV heads (e.g., 8:1). MQA: all Q heads share 1 KV head.
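Since query heads never appear in the cache-size formula, the three variants differ only in `kv_heads`. A quick sketch, using an assumed 80-layer model with 64 query heads (the other parameters are illustrative, not tied to a specific released model):

```python
def kv_cache_gib(kv_heads, layers=80, head_dim=128, seq=4096, batch=32, nbytes=2):
    # Only kv_heads differs between MHA, GQA, and MQA; Q heads do not appear
    return 2 * layers * kv_heads * head_dim * seq * batch * nbytes / 2**30

# Hypothetical 80-layer model with 64 query heads, FP16 cache
for name, kv_heads in [("MHA", 64), ("GQA 8:1", 8), ("MQA", 1)]:
    print(f"{name:8s} {kv_cache_gib(kv_heads):7.2f} GiB")
# MHA      320.00 GiB
# GQA 8:1   40.00 GiB
# MQA        5.00 GiB
```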

2. KV Cache Formula

2 × layers × KV_heads × head_dim × seq_len × batch × bytes. The 2 accounts for both K and V tensors.

3. PagedAttention

vLLM stores KV cache in non-contiguous memory blocks (pages), reducing fragmentation and enabling efficient sharing.

4. Precision

FP16=2 bytes, FP8/INT8=1 byte. Lower precision halves KV cache with potential quality trade-offs.

5. Scaling

KV cache grows linearly with sequence length — 32K context needs 8× more than 4K.

6. Autoregressive Generation

During decoding, each new token attends to all previous K,V. Caching avoids O(n²) recomputation per step.
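A minimal sketch of one cached decoding step, for a single head with plain Python lists (no batching, no projections — purely to show why appending K,V beats recomputing them):

```python
import math

def decode_step(k_cache, v_cache, q, k_new, v_new):
    """One decode step: append the new token's K,V, then attend over the cache.
    Vectors are plain lists of floats; single head, no batching (illustration)."""
    k_cache.append(k_new)          # cached keys grow by one row per token
    v_cache.append(v_new)
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]
    m = max(scores)                # stable softmax over all cached positions
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # output = attention-weighted sum of cached values
    return [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(d)]

# Each step costs O(seq x d) attention work; the K,V of past tokens are reused
k_cache, v_cache = [], []
out = decode_step(k_cache, v_cache, q=[1.0, 0.0], k_new=[1.0, 0.0], v_new=[2.0, 3.0])
print(out)                         # first token attends only to itself: [2.0, 3.0]
```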

🎯 Expert Tips

Use GQA for large models

Llama 70B uses 8:1 GQA — 8× smaller KV cache than MHA with minimal quality loss.

vLLM + PagedAttention

Use vLLM for serving — PagedAttention reduces fragmentation and improves throughput.

Reduce batch for long context

KV cache ∝ batch × seq. For 32K context, use batch=1 or small batches.

FP8 on H100

H100 supports FP8 natively — halves KV cache with minimal latency impact.

⚖️ Attention Type Comparison

Type | KV Heads  | Memory  | Example Models
MHA  | = Q heads | Highest | GPT-2, Llama 2 7B
GQA  | < Q heads | Medium  | Llama 3, Mistral, Gemma
MQA  | 1         | Lowest  | Falcon, PaLM

❓ Frequently Asked Questions

What is KV cache?

Key-Value cache stores the key and value tensors from previous tokens during autoregressive generation. Each new token attends to all previous K,V — without caching, recomputation would be prohibitive.

Why does GQA reduce memory?

GQA groups query heads to share KV heads. Llama 70B uses 64 Q heads but only 8 KV heads — 8× fewer K,V tensors to store.

What is PagedAttention?

PagedAttention (vLLM) stores KV cache in non-contiguous memory blocks (pages), reducing fragmentation. Similar to OS virtual memory for LLM serving.

When does KV cache exceed model weights?

For long sequences (32K+) and large batch sizes, KV cache can exceed model memory. Llama 70B with batch=32, seq=4096 already needs ~40GB of KV cache vs ~140GB of FP16 weights; since the cache grows linearly with sequence length, the same batch at 32K context would need ~320GB.
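The crossover can be checked directly. The config below (80 layers, 8 KV heads, head_dim 128) is an assumed Llama-70B-style setup, used only to make the arithmetic concrete:

```python
# Assumed Llama-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16
def kv_gib(seq, batch, layers=80, kv_heads=8, head_dim=128, nbytes=2):
    return 2 * layers * kv_heads * head_dim * seq * batch * nbytes / 2**30

weights_gib = 70e9 * 2 / 2**30      # ~130.4 GiB of FP16 weights
print(f"{kv_gib(4096, 32):.1f}")    # 40.0  -> well below the weights
print(f"{kv_gib(32768, 16):.1f}")   # 160.0 -> exceeds the weights
```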

FP8 vs FP16 for KV cache?

FP8 halves memory. H100 supports FP8 natively. Quality impact is usually small for inference. Use for memory-constrained deployments.

How accurate is this calculator?

Formula is exact. Real usage may vary with framework overhead, fragmentation, and implementation. Use for planning and capacity estimation.

MHA vs MQA for inference?

MQA (1 KV head) minimizes memory but can hurt quality. GQA (e.g., 8:1) is a good compromise. MHA is best quality but highest memory.

How to reduce KV cache memory?

Use GQA/MQA, lower precision (FP8/INT8), reduce batch size, use sliding window (Mistral), or chunked attention for very long context.

Why is the formula multiplied by 2?

The 2 accounts for both Key and Value tensors. Each attention head produces separate K and V projections that must be cached.

Does KV cache affect throughput?

Yes. Larger KV cache means more memory bandwidth per token. PagedAttention and continuous batching in vLLM optimize for throughput.

📊 KV Cache by the Numbers

  • 2× — K + V tensors per layer
  • 8:1 — GQA ratio (Llama 70B)
  • 1 — MQA KV heads
  • FP8 — half of FP16 memory

⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. The KV cache formula is standard; actual memory usage depends on framework (vLLM, HuggingFace, etc.), PagedAttention implementation, and hardware. For production, validate with profiling tools and test on target deployment.
