KV Cache Memory Estimation
Estimate KV cache memory for LLM inference across MHA, MQA, and GQA attention, with notes on PagedAttention and vLLM. Plan memory for models such as Llama, GPT-4, and Mistral.
Why This ML Metric Matters
Why: the KV cache dominates memory for long-context inference. Formula: 2 × layers × kv_heads × head_dim × seq × batch × bytes.
How: M_KV = 2 × L × H_KV × d × S × B × b, where L = layers, H_KV = KV heads, d = head dimension, S = sequence length, B = batch size, b = bytes per element. GQA/MQA reduce H_KV; FP8 halves b.
- GQA 4–8× smaller than MHA
- PagedAttention in vLLM
- KV cache can exceed model size at 32K+
- FP8 halves memory
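For example, using the published Llama 2 70B configuration (80 layers, 8 KV heads, head dim 128) at 4,096 tokens, batch 1, FP16: 2 × 80 × 8 × 128 × 4,096 × 1 × 2 bytes ≈ 1.34 GB of KV cache per sequence.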
Estimate KV Cache for MHA, MQA, GQA
Based on PagedAttention (Kwon et al. 2023) and vLLM. Plan memory for Llama, GPT-4, Mistral, Falcon, Gemma.
[Interactive calculator: load quick examples, set inputs, and view the KV cache vs. sequence length and memory breakdown charts]
🤖 AI & ML Facts
PagedAttention (Kwon et al. 2023) enables vLLM to serve long-context LLMs with high throughput
— PagedAttention
GQA (Llama, Mistral) uses 8 KV heads for 32–64 Q heads — 4–8× memory savings vs MHA
— GQA
KV cache can exceed model size at 32K+ context with large batch
— Deployment
FP8 KV cache on H100 halves memory with minimal quality loss
— TensorRT
📋 Key Takeaways
- KV cache scales linearly with layers, KV heads, head dim, sequence length, and batch size
- GQA and MQA reduce KV cache by sharing keys/values across query heads — MQA uses 1 KV head
- PagedAttention (vLLM) enables non-contiguous KV storage and reduces fragmentation
- FP8/INT8 halves KV cache vs FP16 — trade quality for memory
- Long context (32K+) can dominate memory — KV cache often exceeds model weights
- The factor of 2 in the formula accounts for both Key and Value tensors per layer
📖 How It Works
1. Attention Types
MHA: each Q head has its own K,V. GQA: Q heads share KV heads (e.g., 8:1). MQA: all Q heads share 1 KV head.
2. KV Cache Formula
2 × layers × KV_heads × head_dim × seq_len × batch × bytes. The 2 accounts for both K and V tensors (a code sketch follows this section).
3. PagedAttention
vLLM stores KV cache in non-contiguous memory blocks (pages), reducing fragmentation and enabling efficient sharing.
4. Precision
FP16=2 bytes, FP8/INT8=1 byte. Lower precision halves KV cache with potential quality trade-offs.
5. Scaling
KV cache grows linearly with sequence length — 32K context needs 8× more than 4K.
6. Autoregressive Generation
During decoding, each new token attends to the K,V of all previous tokens. Caching them avoids recomputing the key and value projections for the entire prefix at every step (see the decode-loop sketch below).
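The snippet below is a minimal Python sketch of steps 2, 4, and 5: the size formula, the dtype-bytes factor, and linear scaling with sequence length. The Llama-2-70B-style configuration (80 layers, 8 KV heads, head dim 128) and the function name kv_cache_bytes are illustrative assumptions, not output from any particular framework.

```python
# Minimal sketch: KV cache bytes = 2 * layers * kv_heads * head_dim * seq * batch * bytes_per_element
DTYPE_BYTES = {"fp16": 2, "bf16": 2, "fp8": 1, "int8": 1}

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype="fp16"):
    """Exact size of the cached K and V tensors; framework overhead not included."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * DTYPE_BYTES[dtype]

# Illustrative Llama-2-70B-style GQA config: 80 layers, 8 KV heads, head dim 128
cfg = dict(layers=80, kv_heads=8, head_dim=128, batch=1)

# Linear scaling with sequence length (step 5): 32K needs 8x the memory of 4K
for seq in (4_096, 8_192, 16_384, 32_768):
    fp16 = kv_cache_bytes(seq_len=seq, dtype="fp16", **cfg)
    fp8 = kv_cache_bytes(seq_len=seq, dtype="fp8", **cfg)   # FP8 halves memory (step 4)
    print(f"seq={seq:>6}: FP16 {fp16 / 1e9:5.2f} GB | FP8 {fp8 / 1e9:5.2f} GB")
```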
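To make step 6 concrete, here is a toy single-head decode loop in NumPy that appends one token's K and V to a growing cache each step. The random weights, dimensions, and function name decode_step are placeholders for illustration, not any framework's API.

```python
import numpy as np

# Toy single-layer, single-head decoder with a growing KV cache.
# Weights are random stand-ins; only the caching pattern matters here.
d = 64                                    # head dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))

k_cache = np.zeros((0, d))                # cached keys,   shape (tokens_so_far, d)
v_cache = np.zeros((0, d))                # cached values, shape (tokens_so_far, d)

def decode_step(x, k_cache, v_cache):
    """x: (d,) hidden state of the newest token. Returns output and updated caches."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache = np.vstack([k_cache, k])     # append instead of recomputing K,V for the prefix
    v_cache = np.vstack([v_cache, v])
    scores = k_cache @ q / np.sqrt(d)     # attend to every cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache

for step in range(8):                     # generate 8 tokens
    x = rng.standard_normal(d)            # stand-in for the previous layer's output
    out, k_cache, v_cache = decode_step(x, k_cache, v_cache)

print(k_cache.shape, v_cache.shape)       # (8, 64) (8, 64): cache grows linearly with tokens
```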
🎯 Expert Tips
Use GQA for large models
Llama 70B uses 8:1 GQA — 8× smaller KV cache than MHA with minimal quality loss.
vLLM + PagedAttention
Use vLLM for serving — PagedAttention reduces fragmentation and improves throughput.
Reduce batch for long context
KV cache ∝ batch × seq. For 32K context, use batch=1 or small batches (a rough sizing sketch follows these tips).
FP8 on H100
H100 supports FP8 natively — halves KV cache with minimal latency impact.
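As a rough capacity check for the batch-size tip above, the sketch below solves the KV formula for the largest batch that fits a given budget. The 80 GB GPU, the ~14 GB FP16 weight figure, the overhead allowance, and the Llama-2-7B-style MHA config (32 layers, 32 KV heads, head dim 128) are assumptions for illustration only.

```python
# Rough capacity check (assumed numbers): largest batch whose KV cache fits the remaining memory.
GB = 1024**3

def max_batch(gpu_mem_gb, weights_gb, layers, kv_heads, head_dim, seq_len,
              kv_bytes=2, overhead_gb=2):
    """Solve 2*L*H_kv*d*S*B*b <= budget for B. Overhead is a placeholder for activations/runtime."""
    budget = (gpu_mem_gb - weights_gb - overhead_gb) * GB
    per_seq = 2 * layers * kv_heads * head_dim * seq_len * kv_bytes
    return int(budget // per_seq)

# Illustrative 7B-class MHA model (~14 GB FP16 weights) on an assumed 80 GB GPU:
for seq in (4_096, 32_768):
    print(seq, max_batch(80, 14, layers=32, kv_heads=32, head_dim=128, seq_len=seq))
# Prints roughly: batch 32 fits at 4K context, but only batch 4 at 32K.
```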
⚖️ Attention Type Comparison
| Type | KV Heads | Memory | Models |
|---|---|---|---|
| MHA | = Q heads | Highest | GPT-3, Llama 2 7B |
| GQA | < Q heads | Medium | Llama 3, Mistral 7B |
| MQA | 1 | Lowest | Falcon, PaLM |
❓ Frequently Asked Questions
What is KV cache?
Key-Value cache stores the key and value tensors from previous tokens during autoregressive generation. Each new token attends to all previous K,V — without caching, recomputation would be prohibitive.
Why does GQA reduce memory?
GQA groups query heads to share KV heads. Llama 70B uses 64 Q heads but only 8 KV heads — 8× fewer K,V tensors to store.
What is PagedAttention?
PagedAttention (vLLM) stores KV cache in non-contiguous memory blocks (pages), reducing fragmentation. The approach is analogous to virtual-memory paging in an operating system, applied to LLM serving.
When does KV cache exceed model weights?
For long sequences (32K+) and large batch sizes, KV cache can exceed model memory. Llama 70B with batch=32, seq=4096 has ~40GB KV cache vs ~140GB of FP16 weights; at 32K context the same batch would need roughly 8× that (~340GB), far exceeding the weights.
FP8 vs FP16 for KV cache?
FP8 halves memory. H100 supports FP8 natively. Quality impact is usually small for inference. Use for memory-constrained deployments.
How accurate is this calculator?
The formula is exact for the cache tensors themselves; real usage varies with framework overhead, fragmentation, and implementation details. Use it for planning and capacity estimation.
MHA vs MQA for inference?
MQA (1 KV head) minimizes memory but can hurt quality. GQA (e.g., 8:1) is a good compromise. MHA is best quality but highest memory.
How to reduce KV cache memory?
Use GQA/MQA, lower precision (FP8/INT8), reduce batch size, use sliding window (Mistral), or chunked attention for very long context.
Why is the formula multiplied by 2?
The 2 accounts for both Key and Value tensors. Each attention head produces separate K and V projections that must be cached.
Does KV cache affect throughput?
Yes. Larger KV cache means more memory bandwidth per token. PagedAttention and continuous batching in vLLM optimize for throughput.
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. The KV cache formula is standard; actual memory usage depends on framework (vLLM, HuggingFace, etc.), PagedAttention implementation, and hardware. For production, validate with profiling tools and test on target deployment.