🧠 Attention Head Configuration

Configure multi-head (MHA), multi-query (MQA), and grouped-query (GQA) attention. Based on Vaswani et al. 2017, Shazeer 2019 (MQA), and Ainslie et al. 2023 (GQA). Design attention configurations for Llama, GPT, Falcon, Mistral, and BERT.

Concept Fundamentals
  • Head Dimension (per-head size): d_k = d_model / n_heads
  • Multi-Head (parallel attention): Concat(head_i) × W_O
  • QKV (Query, Key, Value): 3 projections per head
  • MQA/GQA (memory optimization): key-value sharing

Use the calculator below to run the computations.

Why This ML Metric Matters

Why: GQA reduces KV cache 8× vs MHA (Llama 70B). MQA minimizes memory (Falcon). Attention params = 4×model_dim² — head count affects KV cache, not params.

How: model_dim = Q_heads × head_dim. MHA: KV heads = Q heads. GQA: fewer KV heads, each shared by a group of Q heads. MQA: 1 KV head. Savings = Q_heads / KV_heads.

  • GQA 8:1 Llama 70B
  • MQA 1 KV head
  • Params = 4d²
  • KV cache scales with KV heads

Configure Attention Architecture



Inputs

  • Query heads: 64
  • Key/value heads (MQA = 1): 8
  • Dim per head: 128
  • Transformer layers: 80

Calculated

  • Model dim: 8,192
  • Attn params: 21.47 B (all layers)
  • KV savings: 8.0×
  • KV heads: 8
  • Q per KV: 8
  • Type: GQA

[Charts: memory comparison across attention types; parameter comparison (grouped bar); KV heads used vs. saved]

1. Model Dimension
d_model = H_Q × d_head = 64 × 128 = 8192
2. Attention Params per Layer
P_attn = 4 × d_model² = 4 × 8192² = 268,435,456
3. KV Cache Savings
Savings = H_Q / H_KV = 64 / 8 = 8.0×
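
These three steps can be reproduced with a few lines of Python. The sketch below mirrors the calculator's arithmetic under the page's simplified 4 × d_model² parameter rule; the function name and the 80-layer figure (Llama-3-70B-style) are illustrative assumptions.

```python
# Minimal sketch of the worked example above, assuming the page's
# 4 * d_model^2 rule for attention parameters (exact for MHA).

def attention_metrics(q_heads, kv_heads, head_dim, layers):
    d_model = q_heads * head_dim                  # 1. model dimension
    params_per_layer = 4 * d_model ** 2           # 2. Q, K, V, O projections
    total_params = params_per_layer * layers      #    summed over all layers
    kv_savings = q_heads / kv_heads               # 3. KV cache reduction vs. MHA
    return d_model, params_per_layer, total_params, kv_savings

d_model, per_layer, total, savings = attention_metrics(q_heads=64, kv_heads=8,
                                                       head_dim=128, layers=80)
print(d_model)         # 8192
print(per_layer)       # 268435456 (~0.27 B per layer)
print(total / 1e9)     # ~21.47 B
print(savings)         # 8.0
```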



📋 Key Takeaways

  • MHA: each Q head has its own K,V — best quality, highest memory
  • GQA: Q heads share KV heads in groups — balances quality and memory (Llama, Mistral)
  • MQA: single shared KV head — lowest memory, may hurt quality (Falcon)
  • KV cache savings = Q heads / KV heads — GQA 8:1 gives 8× smaller KV cache
  • Attention params = 4 × model_dim² per layer — independent of head count
  • model_dim = num_Q_heads × head_dim — must be consistent
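
The last two takeaways amount to a pair of consistency checks. Here is a minimal, illustrative sketch of such a check; the function name is hypothetical and not part of the calculator.

```python
def check_attention_config(q_heads, kv_heads, head_dim, d_model):
    # model_dim must equal num_Q_heads * head_dim
    assert q_heads * head_dim == d_model, "d_model must equal q_heads * head_dim"
    # each KV head must serve a whole group of Q heads (MHA: one Q per KV, MQA: one group)
    assert q_heads % kv_heads == 0, "q_heads must be divisible by kv_heads"
    return q_heads // kv_heads  # Q heads per KV head (the GQA group size)

print(check_attention_config(q_heads=64, kv_heads=8, head_dim=128, d_model=8192))  # 8
```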

💡 Did You Know

🦙 Llama 3 70B uses 64 Q heads but only 8 KV heads (8:1 GQA) — 8× smaller KV cache than MHA
🦅 Falcon 40B uses MQA with 1 KV head — 64× smaller KV cache than MHA for 64-head models
🤖 GPT-3 uses full MHA with 96 heads — each head has its own K,V
📐 BERT uses 12 heads with 64-dim each — model_dim = 768
GQA (Ainslie 2023) was introduced to bridge the quality gap between MHA and MQA
📊 Attention params scale as O(d²) — head count affects KV cache, not param count
🔧 MQA trades quality for speed — GQA is the recommended compromise
📈 Not All Attention Needed (Li 2024) explores further head reduction

📖 How It Works

1. Multi-Head Attention (MHA)

Each of the H_Q query heads has its own K and V heads, so H_KV = H_Q. Full expressiveness, highest KV cache.

2. Grouped-Query Attention (GQA)

Q heads are grouped; each group shares one KV head. H_KV < H_Q. Balance of quality and memory.

3. Multi-Query Attention (MQA)

All Q heads share a single KV head. H_KV = 1. Minimal KV cache, may reduce quality.

4. Attention Parameters

Q, K, V, O projections each have d_model × d_model params, so Total = 4 × d_model² per layer. (This is exact for MHA; with GQA/MQA the K and V projections shrink to d_model × (H_KV × d_head), so the true count is somewhat lower. This page uses the simpler MHA formula throughout.)

5. KV Cache Savings

Savings = H_Q / H_KV. GQA 8:1 → 8× smaller KV cache than MHA.
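
The sketch below shows the mechanics of KV sharing in plain NumPy: query heads keep their own activations, while each K/V head is repeated across its group of query heads. Shapes, names, and the toy sizes are illustrative assumptions; real implementations fuse this sharing into the attention kernel instead of materializing the repeat.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Scaled dot-product attention with shared KV heads.

    q: (h_q, seq, d_head); k, v: (h_kv, seq, d_head)
    h_kv == h_q -> MHA, 1 < h_kv < h_q -> GQA, h_kv == 1 -> MQA
    """
    h_q, seq, d_head = q.shape
    h_kv = k.shape[0]
    assert h_q % h_kv == 0, "query heads must divide evenly into KV groups"
    group = h_q // h_kv

    # Repeat each KV head for every query head in its group (the "sharing").
    k = np.repeat(k, group, axis=0)   # (h_q, seq, d_head)
    v = np.repeat(v, group, axis=0)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)         # (h_q, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over keys
    return weights @ v                                           # (h_q, seq, d_head)

# Toy GQA example: 8 query heads sharing 2 KV heads (4:1 groups)
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 32))
kv = rng.standard_normal((2, 16, 32))
print(grouped_query_attention(q, kv, kv).shape)  # (8, 16, 32)
```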

🎯 Expert Tips

Prefer GQA for large models

8:1 or 4:1 GQA balances quality and memory. Llama, Mistral, Gemma use GQA.

MQA for inference speed

Falcon uses MQA — minimal KV cache. Use when memory is critical.

head_dim × num_heads = model_dim

Keep model_dim consistent. Common: 128 or 256 head_dim.

Pair with KV Cache Calculator

Use KV Cache Calculator to estimate memory for your attention config.
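
As a rough stand-in for that calculator, the sketch below estimates KV cache size with the standard formula (2 tensors per layer × KV heads × head_dim × sequence length × batch × bytes per element). The helper name and the FP16 assumption are illustrative; actual runtime footprints will differ.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # 2 = one K tensor and one V tensor per layer; bytes_per_elem=2 assumes FP16/BF16
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Llama-3-70B-style GQA (8 KV heads) vs. the same model with MHA (64 KV heads), 8K context
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=8192)
print(gqa / 2**30, mha / 2**30, mha / gqa)  # ~2.5 GiB vs ~20.0 GiB, ratio 8.0
```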

⚖️ Attention Type Comparison

Type   KV Heads    Memory    Models
MHA    = Q heads   Highest   GPT-3, BERT
GQA    < Q heads   Medium    Llama 3, Mistral, Gemma
MQA    1           Lowest    Falcon, PaLM

❓ Frequently Asked Questions

What is MHA vs GQA vs MQA?

MHA: each Q head has its own K,V. GQA: Q heads share KV heads in groups. MQA: all Q heads share 1 KV head. GQA balances quality and memory.

Why does GQA reduce memory?

KV cache stores K,V per KV head. GQA uses fewer KV heads, so less memory. Llama 70B: 64 Q heads, 8 KV heads → 8× smaller KV cache.

Does head count affect parameter count?

No. Attention params = 4 × model_dim² per layer. model_dim = num_Q_heads × head_dim. Head count affects KV cache, not param count.
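
A quick numeric check of this answer, using the page's 4 × model_dim² rule: holding model_dim fixed and changing the head split leaves the count unchanged. The exact-count helper is an illustrative aside showing that GQA/MQA K,V projections are in fact slightly smaller; both function names are hypothetical.

```python
def attn_params_simple(d_model):
    # Page's rule: Q, K, V, O projections are each d_model x d_model
    return 4 * d_model ** 2

def attn_params_exact(d_model, kv_heads, head_dim):
    # Exact count when K and V project to kv_heads * head_dim columns (GQA/MQA)
    return 2 * d_model * d_model + 2 * d_model * kv_heads * head_dim

print(attn_params_simple(64 * 128))   # 268435456 -- 64 heads of dim 128
print(attn_params_simple(128 * 64))   # 268435456 -- 128 heads of dim 64, same d_model
print(attn_params_exact(8192, kv_heads=8, head_dim=128))  # 150994944, slightly lower
```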

What is a good GQA ratio?

8:1 (e.g., 64 Q / 8 KV) is common for 70B models. 4:1 for smaller models. Higher ratio = more memory savings, potential quality trade-off.

When to use MQA?

When inference memory/speed is critical and quality loss is acceptable. Falcon uses MQA. GQA is usually preferred for new models.

How is model_dim related to heads?

model_dim = num_Q_heads × head_dim. E.g., 64 heads × 128 dim = 8192 model_dim.

📊 Attention by the Numbers

  • 4 Q, K, V, O projections
  • 8:1 GQA ratio (Llama 70B)
  • 1 MQA KV head
  • O(d²) params scale

⚠️ Disclaimer: This calculator provides estimates for educational and architecture design. Actual model behavior depends on training, data, and implementation. Use with KV Cache and GPU VRAM calculators for deployment planning.


Related Calculators