🧠 Attention Head Configuration

Configure multi-head (MHA), multi-query (MQA), and grouped-query (GQA) attention. Based on Vaswani et al. 2017, Shazeer 2019 (MQA), and Ainslie et al. 2023 (GQA). Design attention configurations for Llama, GPT, Falcon, Mistral, and BERT.

Concept Fundamentals
  • Head Dimension (per-head size): d_k = d_model / n_heads
  • Multi-Head (parallel attention): Concat(head_i) × W_O
  • QKV (Query, Key, Value): 3 projections per head
  • MQA/GQA (memory optimization): key-value sharing

Use the calculator below to run the computations.

Why This ML Metric Matters

Why: GQA reduces KV cache 8× vs MHA (Llama 70B). MQA minimizes memory (Falcon). Attention params = 4×model_dim² — head count affects KV cache, not params.

How: model_dim = Q_heads × head_dim. MHA: KV heads = Q heads. GQA: fewer KV heads, each shared by a group of Q heads. MQA: 1 KV head. Savings = Q_heads / KV_heads.

  • GQA 8:1 Llama 70B
  • MQA 1 KV head
  • Params = 4d²
  • KV cache scales with KV heads

Configure Attention Architecture



Inputs

  • Query heads: 64
  • Key/value heads (MQA = 1): 8
  • Dim per head: 128
  • Transformer layers: 80

Calculated

  • Model dim: 8,192
  • Attn params: 21.47 B (all layers)
  • KV savings: 8.0×
  • KV heads: 8
  • Q per KV: 8
  • Type: GQA

[Charts: memory comparison across attention types; parameter comparison (grouped bar); KV heads used vs. saved]

1. Model Dimension
d_model = H_Q × d_head = 64 × 128 = 8192
2. Attention Params per Layer
P_attn = 4 × d_model² = 4 × 8192² = 268,435,456
3. KV Cache Savings
Savings = H_Q / H_KV = 64 / 8 = 8.0×
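
These three steps can be reproduced with a few lines of Python. The sketch below mirrors the calculator's arithmetic under the page's simplified 4 × d_model² parameter rule; the function name and the 80-layer figure (Llama-3-70B-style) are illustrative assumptions.

```python
# Minimal sketch of the worked example above, assuming the page's
# 4 * d_model^2 rule for attention parameters (exact for MHA).

def attention_metrics(q_heads, kv_heads, head_dim, layers):
    d_model = q_heads * head_dim                  # 1. model dimension
    params_per_layer = 4 * d_model ** 2           # 2. Q, K, V, O projections
    total_params = params_per_layer * layers      #    summed over all layers
    kv_savings = q_heads / kv_heads               # 3. KV cache reduction vs. MHA
    return d_model, params_per_layer, total_params, kv_savings

d_model, per_layer, total, savings = attention_metrics(q_heads=64, kv_heads=8,
                                                       head_dim=128, layers=80)
print(d_model)         # 8192
print(per_layer)       # 268435456 (~0.27 B per layer)
print(total / 1e9)     # ~21.47 B
print(savings)         # 8.0
```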



📋 Key Takeaways

  • MHA: each Q head has its own K,V — best quality, highest memory
  • GQA: Q heads share KV heads in groups — balances quality and memory (Llama, Mistral)
  • MQA: single shared KV head — lowest memory, may hurt quality (Falcon)
  • KV cache savings = Q heads / KV heads — GQA 8:1 gives 8× smaller KV cache
  • Attention params = 4 × model_dim² per layer — independent of head count
  • model_dim = num_Q_heads × head_dim — must be consistent
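
The last two takeaways amount to a pair of consistency checks. Here is a minimal, illustrative sketch of such a check; the function name is hypothetical and not part of the calculator.

```python
def check_attention_config(q_heads, kv_heads, head_dim, d_model):
    # model_dim must equal num_Q_heads * head_dim
    assert q_heads * head_dim == d_model, "d_model must equal q_heads * head_dim"
    # each KV head must serve a whole group of Q heads (MHA: one Q per KV, MQA: one group)
    assert q_heads % kv_heads == 0, "q_heads must be divisible by kv_heads"
    return q_heads // kv_heads  # Q heads per KV head (the GQA group size)

print(check_attention_config(q_heads=64, kv_heads=8, head_dim=128, d_model=8192))  # 8
```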

💡 Did You Know

🦙 Llama 3 70B uses 64 Q heads but only 8 KV heads (8:1 GQA) — 8× smaller KV cache than MHA
🦅 Falcon 40B uses MQA with 1 KV head — 64× smaller KV cache than MHA for 64-head models
🤖 GPT-3 uses full MHA with 96 heads — each head has its own K,V
📐 BERT uses 12 heads with 64-dim each — model_dim = 768
GQA (Ainslie 2023) was introduced to bridge the quality gap between MHA and MQA
📊 Attention params scale as O(d²) — head count affects KV cache, not param count
🔧 MQA trades quality for speed — GQA is the recommended compromise
📈 Not All Attention Needed (Li 2024) explores further head reduction

📖 How It Works

1. Multi-Head Attention (MHA)

Each of the H_Q query heads has its own K and V heads, so H_KV = H_Q. Full expressiveness, highest KV cache.

2. Grouped-Query Attention (GQA)

Q heads are grouped; each group shares one KV head. H_KV < H_Q. Balance of quality and memory.

3. Multi-Query Attention (MQA)

All Q heads share a single KV head. H_KV = 1. Minimal KV cache, may reduce quality.

4. Attention Parameters

Q, K, V, O projections each have d_model × d_model params, so Total = 4 × d_model² per layer. (This is exact for MHA; with GQA/MQA the K and V projections shrink to d_model × (H_KV × d_head), so the true count is somewhat lower. This page uses the simpler MHA formula throughout.)

5. KV Cache Savings

Savings = H_Q / H_KV. GQA 8:1 → 8× smaller KV cache than MHA.
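
The sketch below shows the mechanics of KV sharing in plain NumPy: query heads keep their own activations, while each K/V head is repeated across its group of query heads. Shapes, names, and the toy sizes are illustrative assumptions; real implementations fuse this sharing into the attention kernel instead of materializing the repeat.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Scaled dot-product attention with shared KV heads.

    q: (h_q, seq, d_head); k, v: (h_kv, seq, d_head)
    h_kv == h_q -> MHA, 1 < h_kv < h_q -> GQA, h_kv == 1 -> MQA
    """
    h_q, seq, d_head = q.shape
    h_kv = k.shape[0]
    assert h_q % h_kv == 0, "query heads must divide evenly into KV groups"
    group = h_q // h_kv

    # Repeat each KV head for every query head in its group (the "sharing").
    k = np.repeat(k, group, axis=0)   # (h_q, seq, d_head)
    v = np.repeat(v, group, axis=0)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)         # (h_q, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over keys
    return weights @ v                                           # (h_q, seq, d_head)

# Toy GQA example: 8 query heads sharing 2 KV heads (4:1 groups)
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 32))
kv = rng.standard_normal((2, 16, 32))
print(grouped_query_attention(q, kv, kv).shape)  # (8, 16, 32)
```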

🎯 Expert Tips

Prefer GQA for large models

8:1 or 4:1 GQA balances quality and memory. Llama, Mistral, Gemma use GQA.

MQA for inference speed

Falcon uses MQA — minimal KV cache. Use when memory is critical.

head_dim × num_heads = model_dim

Keep model_dim consistent. Common: 128 or 256 head_dim.

Pair with KV Cache Calculator

Use KV Cache Calculator to estimate memory for your attention config.
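
As a rough stand-in for that calculator, the sketch below estimates KV cache size with the standard formula (2 tensors per layer × KV heads × head_dim × sequence length × batch × bytes per element). The helper name and the FP16 assumption are illustrative; actual runtime footprints will differ.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # 2 = one K tensor and one V tensor per layer; bytes_per_elem=2 assumes FP16/BF16
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Llama-3-70B-style GQA (8 KV heads) vs. the same model with MHA (64 KV heads), 8K context
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=8192)
print(gqa / 2**30, mha / 2**30, mha / gqa)  # ~2.5 GiB vs ~20.0 GiB, ratio 8.0
```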

⚖️ Attention Type Comparison

Type   KV Heads    Memory    Models
MHA    = Q heads   Highest   GPT-3, BERT
GQA    < Q heads   Medium    Llama 3, Mistral, Gemma
MQA    1           Lowest    Falcon, PaLM

❓ Frequently Asked Questions

What is MHA vs GQA vs MQA?

MHA: each Q head has its own K,V. GQA: Q heads share KV heads in groups. MQA: all Q heads share 1 KV head. GQA balances quality and memory.

Why does GQA reduce memory?

KV cache stores K,V per KV head. GQA uses fewer KV heads, so less memory. Llama 70B: 64 Q heads, 8 KV heads → 8× smaller KV cache.

Does head count affect parameter count?

No. Attention params = 4 × model_dim² per layer. model_dim = num_Q_heads × head_dim. Head count affects KV cache, not param count.
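
A quick numeric check of this answer, using the page's 4 × model_dim² rule: holding model_dim fixed and changing the head split leaves the count unchanged. The exact-count helper is an illustrative aside showing that GQA/MQA K,V projections are in fact slightly smaller; both function names are hypothetical.

```python
def attn_params_simple(d_model):
    # Page's rule: Q, K, V, O projections are each d_model x d_model
    return 4 * d_model ** 2

def attn_params_exact(d_model, kv_heads, head_dim):
    # Exact count when K and V project to kv_heads * head_dim columns (GQA/MQA)
    return 2 * d_model * d_model + 2 * d_model * kv_heads * head_dim

print(attn_params_simple(64 * 128))   # 268435456 -- 64 heads of dim 128
print(attn_params_simple(128 * 64))   # 268435456 -- 128 heads of dim 64, same d_model
print(attn_params_exact(8192, kv_heads=8, head_dim=128))  # 150994944, slightly lower
```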

What is a good GQA ratio?

8:1 (e.g., 64 Q / 8 KV) is common for 70B models. 4:1 for smaller models. Higher ratio = more memory savings, potential quality trade-off.

When to use MQA?

When inference memory/speed is critical and quality loss is acceptable. Falcon uses MQA. GQA is usually preferred for new models.

How is model_dim related to heads?

model_dim = num_Q_heads × head_dim. E.g., 64 heads × 128 dim = 8192 model_dim.

📊 Attention by the Numbers

  • 4 Q, K, V, O projections
  • 8:1 GQA ratio (Llama 70B)
  • 1 MQA KV head
  • O(d²) params scale

⚠️ Disclaimer: This calculator provides estimates for educational and architecture design. Actual model behavior depends on training, data, and implementation. Use with KV Cache and GPU VRAM calculators for deployment planning.


Related Calculators