Attention Head Configuration
Configure multi-head (MHA), multi-query (MQA), and grouped-query (GQA) attention. Based on Vaswani et al. (2017), Shazeer (2019, MQA), and Ainslie et al. (2023, GQA). Design attention for architectures like Llama, GPT, Falcon, Mistral, and BERT.
Why Attention Head Configuration Matters
Why: GQA cuts the KV cache 8× vs. MHA (Llama 3 70B). MQA minimizes memory (Falcon). Attention params = 4 × model_dim² per layer, so head count affects the KV cache, not the parameter count.
How: model_dim = num_Q_heads × head_dim. MHA: KV heads = Q heads. GQA: fewer KV heads, each shared by a group of Q heads. MQA: a single KV head. KV cache savings = Q_heads / KV_heads; see the worked sketch after the list below.
- GQA 8:1 (Llama 3 70B)
- MQA: 1 KV head (Falcon)
- Params = 4 × d_model² per layer
- KV cache scales with KV heads
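A worked sketch of these formulas, assuming a simplified per-layer model (the helper name and the bias-free 4 × d_model² count are illustrative simplifications):

```python
# Illustrative summary of a head configuration: parameter count and
# KV-cache savings (ignores biases and the slightly smaller K/V
# projections that GQA/MQA use in practice).
def attention_summary(n_q_heads: int, n_kv_heads: int, head_dim: int) -> dict:
    d_model = n_q_heads * head_dim       # model_dim = Q_heads x head_dim
    return {
        "d_model": d_model,
        "attn_params_per_layer": 4 * d_model**2,   # Q, K, V, O projections
        "kv_cache_savings_vs_mha": n_q_heads / n_kv_heads,
    }

# Llama-3-70B-style config: 64 Q heads, 8 KV heads, head_dim 128
print(attention_summary(64, 8, 128))
# {'d_model': 8192, 'attn_params_per_layer': 268435456, 'kv_cache_savings_vs_mha': 8.0}
```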
Configure Attention Architecture
The calculator's charts compare memory across attention types, parameter counts (grouped bars), and KV heads used vs. saved for your inputs.
🤖 AI & ML Facts
Llama 3 70B uses 64 Q heads but only 8 KV heads (8:1 GQA) — 8× smaller KV cache than MHA
— Llama
Falcon uses MQA with a single shared KV head; for a 64-head model that means a 64× smaller KV cache than MHA
— Falcon
Attention params = 4 × model_dim² per layer — independent of head count
— Vaswani
GQA was introduced to bridge the quality gap between MHA and MQA
— Ainslie 2023
📋 Key Takeaways
- MHA: each Q head has its own K and V heads; best quality, highest memory
- GQA: Q heads share KV heads in groups; balances quality and memory (Llama, Mistral)
- MQA: a single shared KV head; lowest memory, may hurt quality (Falcon)
- KV cache savings = Q heads / KV heads; GQA 8:1 gives an 8× smaller KV cache
- Attention params = 4 × model_dim² per layer, independent of head count
- model_dim = num_Q_heads × head_dim; the three values must stay consistent
📖 How It Works
1. Multi-Head Attention (MHA)
Each of the H_Q query heads has its own K and V heads, so H_KV = H_Q. Full expressiveness, highest KV cache.
2. Grouped-Query Attention (GQA)
Q heads are grouped; each group shares one KV head. H_KV < H_Q. Balance of quality and memory.
3. Multi-Query Attention (MQA)
All Q heads share a single KV head. H_KV = 1. Minimal KV cache, may reduce quality.
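A minimal PyTorch sketch of the KV-head sharing behind all three variants (shapes and the function name are illustrative; real implementations add masking, positional encoding, and a KV cache):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq, head_dim)
    # k, v: (batch, n_kv_heads, seq, head_dim), n_q_heads % n_kv_heads == 0.
    # n_kv_heads == n_q_heads gives MHA; n_kv_heads == 1 gives MQA.
    group_size = q.shape[1] // k.shape[1]
    # Expand each KV head so `group_size` consecutive Q heads share it.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# GQA 8:1 with Llama-3-70B-like head counts at a toy sequence length
q = torch.randn(1, 64, 16, 128)
k = torch.randn(1, 8, 16, 128)
v = torch.randn(1, 8, 16, 128)
out = grouped_query_attention(q, k, v)   # (1, 64, 16, 128)
```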
4. Attention Parameters
In MHA, the Q, K, V, and O projections each have d_model × d_model params, for a total of 4 × d_model² per layer. (GQA/MQA shrink the K and V projections to d_model × (H_KV × head_dim), a modest further saving.)
5. KV Cache Savings
Savings = H_Q / H_KV. GQA 8:1 → 8× smaller KV cache than MHA.
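A quick sketch demonstrating steps 4 and 5 together: with d_model fixed, the parameter count stays constant across head splits while KV-cache savings track the Q:KV ratio (the configs below are illustrative):

```python
# Fixed d_model: attention parameters are 4 * d_model^2 regardless of the
# head split, while KV-cache savings equal n_q_heads / n_kv_heads.
d_model = 8192
attn_params = 4 * d_model**2
for label, n_q, n_kv in [("MHA", 64, 64), ("GQA 8:1", 64, 8), ("MQA", 64, 1)]:
    print(f"{label}: {attn_params:,} params/layer, "
          f"{n_q // n_kv}x smaller KV cache than MHA")
```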
🎯 Expert Tips
Prefer GQA for large models
8:1 or 4:1 GQA balances quality and memory. Llama, Mistral, Gemma use GQA.
MQA for inference speed
Falcon uses MQA — minimal KV cache. Use when memory is critical.
head_dim × num_heads = model_dim
Keep model_dim = num_heads × head_dim consistent. Common head_dim values: 128 or 256.
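A sketch of the consistency checks this tip implies (a hypothetical helper, not part of any library):

```python
# Illustrative sanity checks before committing to an attention config.
def validate_attention_config(d_model: int, n_q_heads: int,
                              n_kv_heads: int, head_dim: int) -> None:
    assert d_model == n_q_heads * head_dim, \
        "model_dim must equal num_Q_heads * head_dim"
    assert n_q_heads % n_kv_heads == 0, \
        "Q heads must split evenly into KV groups"

validate_attention_config(8192, 64, 8, 128)   # Llama-3-70B-like: passes
```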
Pair with KV Cache Calculator
Use KV Cache Calculator to estimate memory for your attention config.
⚖️ Attention Type Comparison
| Type | KV Heads | Memory | Models |
|---|---|---|---|
| MHA | = Q heads | Highest | GPT-3, BERT |
| GQA | < Q heads | Medium | Llama 3, Mistral, Gemma |
| MQA | 1 | Lowest | Falcon, PaLM |
❓ Frequently Asked Questions
What is MHA vs GQA vs MQA?
MHA: each Q head has its own K,V. GQA: Q heads share KV heads in groups. MQA: all Q heads share 1 KV head. GQA balances quality and memory.
Why does GQA reduce memory?
KV cache stores K,V per KV head. GQA uses fewer KV heads, so less memory. Llama 70B: 64 Q heads, 8 KV heads → 8× smaller KV cache.
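To see the effect in bytes, here is a rough sketch using the standard KV-cache sizing formula (per sequence, ignoring framework overhead; the helper name is illustrative):

```python
# KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/element.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Llama-3-70B-like: 80 layers, head_dim 128, 8K context, fp16
as_mha = kv_cache_bytes(80, 64, 128, 8192)   # hypothetical 64 KV heads
as_gqa = kv_cache_bytes(80, 8, 128, 8192)    # actual 8 KV heads
print(f"MHA: {as_mha / 2**30:.1f} GiB vs GQA: {as_gqa / 2**30:.1f} GiB")
# MHA: 20.0 GiB vs GQA: 2.5 GiB per 8K-token sequence
```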
Does head count affect parameter count?
No. Attention params = 4 × model_dim² per layer. model_dim = num_Q_heads × head_dim. Head count affects KV cache, not param count.
What is a good GQA ratio?
8:1 (e.g., 64 Q / 8 KV) is common for 70B models. 4:1 for smaller models. Higher ratio = more memory savings, potential quality trade-off.
When to use MQA?
When inference memory/speed is critical and quality loss is acceptable. Falcon uses MQA. GQA is usually preferred for new models.
How is model_dim related to heads?
model_dim = num_Q_heads × head_dim. E.g., 64 heads × 128 dim = 8192 model_dim.
⚠️ Disclaimer: This calculator provides estimates for educational and architecture design. Actual model behavior depends on training, data, and implementation. Use with KV Cache and GPU VRAM calculators for deployment planning.
Related Calculators
- Mixture of Experts (MoE) Efficiency Calculator: Calculate total vs. active parameters for MoE models. Compare Mixtral, DeepSeek, and Switch Transformer architectures.
- KV Cache Size Estimator: Calculate KV cache memory for LLM inference with MHA, MQA, and GQA attention types. Based on PagedAttention research.
- Activation Memory Calculator: Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
- AI Fairness & Bias Calculator: Calculate demographic parity, equalized odds, equal opportunity, and disparate impact ratio. Based on IBM AIF360 and Microsoft Fairlearn.
- Batch Size & Learning Rate Calculator: Calculate optimal learning rates using linear and square-root scaling rules. Visualize warmup and cosine/linear schedules.
- Compute-Optimal Model Size Calculator (Chinchilla): Find the compute-optimal model size and training tokens given a compute budget using Chinchilla scaling laws.