
KV Cache Memory Estimation

Estimate KV cache memory for LLM inference across MHA, MQA, and GQA attention variants, including PagedAttention-based serving with vLLM. Plan memory for models such as Llama, GPT-4, and Mistral.

Concept Fundamentals

  • KV Memory — key-value cache size: 2 × L × H_KV × d × seq × batch × bytes
  • Keys + Values — both stored per layer during autoregressive decoding
  • Growth — linear in seq_len; the memory bottleneck for long context
  • MQA / GQA — multi-query / grouped-query attention shrink the cache

Why This ML Metric Matters

Why: The KV cache dominates memory for long-context inference, so capacity planning must account for it alongside model weights.

How: M_KV = 2 × L × H_KV × d × S × B × b, where L is layers, H_KV is KV heads, d is head dimension, S is sequence length, B is batch size, and b is bytes per element. GQA/MQA reduce H_KV; FP8 halves b.

  • GQA yields a 4–8× smaller cache than MHA
  • PagedAttention (vLLM) reduces fragmentation
  • KV cache can exceed model weights at 32K+ context
  • FP8 halves memory vs FP16
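The formula above can be sketched as a small function. The configuration used here (80 layers, 8 KV heads, head_dim 128, which resembles Llama-3-70B-class models) is an assumption for illustration, not a vendor-published spec:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """M_KV = 2 x L x H_KV x d x S x B x b; the leading 2 covers both K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Llama-3-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096, batch=32)
print(f"{size:,} bytes = {size / 2**30:.2f} GiB")  # 42,949,672,960 bytes = 40.00 GiB
```

Passing `bytes_per_elem=1` models an FP8 or INT8 cache and halves the result.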
KV CACHE DOMINATES LONG-CONTEXT MEMORY

Estimate KV Cache for MHA, MQA, GQA

Based on PagedAttention (Kwon et al. 2023) and vLLM. Plan memory for Llama, GPT-4, Mistral, Falcon, Gemma.


Inputs

  • transformer layers
  • query heads
  • key/value heads (MQA = 1)
  • dim per head
  • sequence length (tokens)
  • batch size (concurrent sequences)
Example Result (80 layers, GQA with 8 KV heads, seq = 4096, batch = 32, FP16)

  • KV Cache: 40.00 GB (42,949,672,960 bytes)
  • Model Weights: 130.39 GB
  • Total: 170.39 GB

KV Cache vs Sequence Length

Memory Breakdown


🤖 AI & ML Facts

  • 🧠 PagedAttention (Kwon et al. 2023) enables vLLM to serve long-context LLMs with high throughput.
  • 👁️ GQA (Llama, Mistral) uses 8 KV heads for 32–64 Q heads — 4–8× memory savings vs MHA.
  • 📦 KV cache can exceed model size at 32K+ context with large batch.
  • FP8 KV cache on H100 halves memory with minimal quality loss (TensorRT).

📋 Key Takeaways

  • KV cache scales linearly with layers, KV heads, head dim, sequence length, and batch size
  • GQA and MQA reduce KV cache by sharing keys/values across query heads — MQA uses 1 KV head
  • PagedAttention (vLLM) enables non-contiguous KV storage and reduces fragmentation
  • FP8/INT8 halves KV cache vs FP16 — trade quality for memory
  • Long context (32K+) can dominate memory — KV cache can exceed model weights
  • The factor of 2 in the formula accounts for both Key and Value tensors per layer

💡 Did You Know

  • 📊 Llama 3 70B with batch=32, seq=4096 needs ~40 GB of KV cache alone — a large share of total GPU memory
  • MQA (Falcon) uses 1 KV head — a 64× smaller KV cache than MHA for 64-head models
  • 🔧 vLLM PagedAttention reduces fragmentation by 2–4× vs naive contiguous allocation
  • 🤗 GQA (Llama, Gemma) balances quality and memory — an 8:1 ratio is common for 70B models
  • 🎯 FP8 cuts KV cache in half — supported natively by H100 for inference
  • 📐 The factor of 2 in the formula covers K and V — keys and values are each stored per layer per head
  • 🔀 Sliding window (Mistral) caps the effective KV length — reducing memory for long sequences
  • 📈 Batch size × sequence length is the main lever — reduce batch for long context
  • 🧠 Without a KV cache, each new token would recompute keys and values for all previous tokens
  • 🌐 PagedAttention was inspired by OS virtual memory — pages enable flexible allocation

📖 How It Works

1. Attention Types

MHA: each Q head has its own K,V. GQA: Q heads share KV heads (e.g., 8:1). MQA: all Q heads share 1 KV head.
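Since query heads never appear in the cache-size formula, the three variants differ only in `kv_heads`. A quick sketch, using an assumed 80-layer model with 64 query heads (the other parameters are illustrative, not tied to a specific released model):

```python
def kv_cache_gib(kv_heads, layers=80, head_dim=128, seq=4096, batch=32, nbytes=2):
    # Only kv_heads differs between MHA, GQA, and MQA; Q heads do not appear
    return 2 * layers * kv_heads * head_dim * seq * batch * nbytes / 2**30

# Hypothetical 80-layer model with 64 query heads, FP16 cache
for name, kv_heads in [("MHA", 64), ("GQA 8:1", 8), ("MQA", 1)]:
    print(f"{name:8s} {kv_cache_gib(kv_heads):7.2f} GiB")
# MHA      320.00 GiB
# GQA 8:1   40.00 GiB
# MQA        5.00 GiB
```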

2. KV Cache Formula

2 × layers × KV_heads × head_dim × seq_len × batch × bytes. The 2 accounts for both K and V tensors.

3. PagedAttention

vLLM stores KV cache in non-contiguous memory blocks (pages), reducing fragmentation and enabling efficient sharing.

4. Precision

FP16=2 bytes, FP8/INT8=1 byte. Lower precision halves KV cache with potential quality trade-offs.

5. Scaling

KV cache grows linearly with sequence length — 32K context needs 8× more than 4K.

6. Autoregressive Generation

During decoding, each new token attends to all previous K,V. Caching avoids O(n²) recomputation per step.
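A minimal sketch of one cached decoding step, for a single head with plain Python lists (no batching, no projections — purely to show why appending K,V beats recomputing them):

```python
import math

def decode_step(k_cache, v_cache, q, k_new, v_new):
    """One decode step: append the new token's K,V, then attend over the cache.
    Vectors are plain lists of floats; single head, no batching (illustration)."""
    k_cache.append(k_new)          # cached keys grow by one row per token
    v_cache.append(v_new)
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]
    m = max(scores)                # stable softmax over all cached positions
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # output = attention-weighted sum of cached values
    return [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(d)]

# Each step costs O(seq x d) attention work; the K,V of past tokens are reused
k_cache, v_cache = [], []
out = decode_step(k_cache, v_cache, q=[1.0, 0.0], k_new=[1.0, 0.0], v_new=[2.0, 3.0])
print(out)                         # first token attends only to itself: [2.0, 3.0]
```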

🎯 Expert Tips

Use GQA for large models

Llama 70B uses 8:1 GQA — 8× smaller KV cache than MHA with minimal quality loss.

vLLM + PagedAttention

Use vLLM for serving — PagedAttention reduces fragmentation and improves throughput.

Reduce batch for long context

KV cache ∝ batch × seq. For 32K context, use batch=1 or small batches.

FP8 on H100

H100 supports FP8 natively — halves KV cache with minimal latency impact.

⚖️ Attention Type Comparison

Type | KV Heads  | Memory  | Example Models
MHA  | = Q heads | Highest | GPT-2, Llama 2 7B
GQA  | < Q heads | Medium  | Llama 3, Mistral, Gemma
MQA  | 1         | Lowest  | Falcon, PaLM

❓ Frequently Asked Questions

What is KV cache?

Key-Value cache stores the key and value tensors from previous tokens during autoregressive generation. Each new token attends to all previous K,V — without caching, recomputation would be prohibitive.

Why does GQA reduce memory?

GQA groups query heads to share KV heads. Llama 70B uses 64 Q heads but only 8 KV heads — 8× fewer K,V tensors to store.

What is PagedAttention?

PagedAttention (vLLM) stores KV cache in non-contiguous memory blocks (pages), reducing fragmentation. Similar to OS virtual memory for LLM serving.

When does KV cache exceed model weights?

For long sequences (32K+) and large batch sizes, KV cache can exceed model memory. Llama 70B with batch=32, seq=4096 already needs ~40GB of KV cache vs ~140GB of FP16 weights; since the cache grows linearly with sequence length, the same batch at 32K context would need ~320GB.
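The crossover can be checked directly. The config below (80 layers, 8 KV heads, head_dim 128) is an assumed Llama-70B-style setup, used only to make the arithmetic concrete:

```python
# Assumed Llama-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16
def kv_gib(seq, batch, layers=80, kv_heads=8, head_dim=128, nbytes=2):
    return 2 * layers * kv_heads * head_dim * seq * batch * nbytes / 2**30

weights_gib = 70e9 * 2 / 2**30      # ~130.4 GiB of FP16 weights
print(f"{kv_gib(4096, 32):.1f}")    # 40.0  -> well below the weights
print(f"{kv_gib(32768, 16):.1f}")   # 160.0 -> exceeds the weights
```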

FP8 vs FP16 for KV cache?

FP8 halves memory. H100 supports FP8 natively. Quality impact is usually small for inference. Use for memory-constrained deployments.

How accurate is this calculator?

Formula is exact. Real usage may vary with framework overhead, fragmentation, and implementation. Use for planning and capacity estimation.

MHA vs MQA for inference?

MQA (1 KV head) minimizes memory but can hurt quality. GQA (e.g., 8:1) is a good compromise. MHA is best quality but highest memory.

How to reduce KV cache memory?

Use GQA/MQA, lower precision (FP8/INT8), reduce batch size, use sliding window (Mistral), or chunked attention for very long context.

Why is the formula multiplied by 2?

The 2 accounts for both Key and Value tensors. Each attention head produces separate K and V projections that must be cached.

Does KV cache affect throughput?

Yes. Larger KV cache means more memory bandwidth per token. PagedAttention and continuous batching in vLLM optimize for throughput.

📊 KV Cache by the Numbers

  • 2× — K + V tensors per layer
  • 8:1 — GQA ratio (Llama 70B)
  • 1 — MQA KV heads
  • FP8 — half of FP16 memory

⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. The KV cache formula is standard; actual memory usage depends on framework (vLLM, HuggingFace, etc.), PagedAttention implementation, and hardware. For production, validate with profiling tools and test on target deployment.
