
Context Window Scaling

Standard attention has O(n²) memory — doubling context quadruples memory. FlashAttention reduces to O(n) via tiling and recomputation. Cost scaling ratio = (target/base)².
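
As a quick sanity check of that ratio (the context lengths here are just an example):

```python
# Quadratic cost scaling: growing the context 4x makes attention ~16x as expensive.
base_ctx, target_ctx = 8_192, 32_768
print((target_ctx / base_ctx) ** 2)   # 16.0
```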

Concept Fundamentals

  • Attention: O(n²) memory (quadratic scaling)
  • KV Cache: O(n) per layer (linear with sequence)
  • Cost Scaling: ∝ sequence length (longer is more expensive)
  • Flash Attention: IO-aware optimization algorithm
Use the calculator below to run these computations for your own model and context lengths.

Why This ML Metric Matters

Why: Long context (128K, 200K, 1M) is critical for RAG and document understanding. Understanding scaling costs helps you plan deployment and choose an attention type.

How: Enter base and target context lengths, model dim, heads, layers. Select attention type (Standard/Flash/Linear). The calculator computes memory and throughput scaling.

  • O(n²) standard vs O(n) Flash
  • 4× context = 16× cost
  • Flash trades compute for memory
  • Gemini 1M needs Flash
📐 QUADRATIC ATTENTION SCALING

Standard vs Flash Attention Memory & Throughput

Based on Dao 2022/2023 FlashAttention. Compare O(n²) vs O(n) memory. Plan for GPT-4 128K, Claude 200K, Gemini 1M.


Inputs

  • Base context length (reference tokens)
  • Target context length (target tokens)
  • Model dimension (hidden size)
  • Attention heads
  • Transformer layers
  • Available VRAM
Example result (base 8192 → target 131072 tokens)

  • Cost Ratio: 256.0×
  • Standard Attention Memory: 25165824.0 GB
  • Flash Attention Memory: 384.0 GB
  • Throughput: 6.3% of base
  • Fits GPU: No

Chart: Memory vs Context Length (Quadratic vs Linear)

Chart: Throughput vs Context Length

1. Cost Scaling Ratio
ratio = (n_target/n_base)² = (131072/8192)² = 256.0
2. Standard Attention Memory
M_std = O(n²·d) ⇒ M_target ≈ 25165824.0 GB
3. Flash Attention Memory
M_flash = O(n·d) ⇒ M_target ≈ 384.0 GB
4. Throughput Degradation
Throughput ∝ 1/n (Flash) ⇒ 6.3% of base
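
The four steps can be scripted. The sketch below is illustrative, not the calculator's exact implementation: it assumes fp16 storage (2 bytes per element), scales memory by hidden size and layer count following the page's O(n²·d) / O(n·d) conventions, and the example values (d_model = 8192, 96 layers, 80 GB VRAM) are assumptions, so treat the memory figures as order-of-magnitude estimates.

```python
# Minimal sketch of the four steps above (not the calculator's exact code).
# Assumes fp16 (2 bytes/element) and per-layer O(n^2 * d) / O(n * d) storage;
# real implementations differ by constant factors.
GIB = 2 ** 30

def context_scaling(n_base, n_target, d_model, n_layers, vram_gib, bytes_per_elem=2):
    ratio = (n_target / n_base) ** 2                                    # 1. cost scaling ratio
    m_std = n_target ** 2 * d_model * n_layers * bytes_per_elem / GIB   # 2. standard attention memory
    m_flash = n_target * d_model * n_layers * bytes_per_elem / GIB      # 3. Flash attention memory
    throughput = n_base / n_target                                      # 4. Flash throughput vs base
    return {
        "cost_ratio": ratio,
        "standard_mem_gib": m_std,
        "flash_mem_gib": m_flash,
        "throughput_vs_base": throughput,
        "fits_gpu_with_flash": m_flash <= vram_gib,
    }

# Example values; d_model, layer count, and VRAM are illustrative assumptions.
print(context_scaling(n_base=8_192, n_target=131_072,
                      d_model=8_192, n_layers=96, vram_gib=80))
# cost_ratio = 256.0 and throughput_vs_base = 0.0625 (~6.3% of base);
# the memory figures are order-of-magnitude estimates only.
```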


🤖 AI & ML Facts

  • 📐 Standard attention has O(n²) memory: doubling context quadruples memory. (FlashAttention)
  • FlashAttention reduces memory to O(n) via tiling and recomputation. (Dao et al., 2022/2023)
  • 📊 Cost scaling ratio = (target/base)²: 4× context means 16× cost. (attention theory)
  • 📱 Gemini 1M and Claude 200K require FlashAttention for feasibility. (Anthropic, Google)

📋 Key Takeaways

  • Standard attention has O(n²) memory: doubling context quadruples memory
  • FlashAttention reduces memory to O(n) via tiling and recomputation (Dao 2022/2023)
  • Compute remains O(n²·d) for both: FlashAttention trades compute for memory
  • Cost scaling ratio = (target/base)²: 4× context = 16× cost
  • Throughput degrades with longer context: FlashAttention degrades ~1/n vs 1/n² for standard
  • Gemini 1M and Claude 200K require FlashAttention or linear approximations for feasibility

💡 Did You Know

  • 📊 Standard attention at 128K context can need 100+ GB of memory; FlashAttention cuts this to ~10–20 GB
  • FlashAttention-2 (Dao 2023) is about 2× faster than FlashAttention-1 thanks to better parallelism
  • 🔧 FlashAttention uses tiling: it processes attention in blocks and recomputes on the fly instead of storing the full O(n²) matrix
  • 🤗 Claude 3 supports 200K context and Gemini 1.5 Pro supports 1M tokens; both rely on memory-efficient attention
  • 🎯 Quadratic scaling: 8K→128K is 16× longer context but 256× higher attention cost
  • 📐 Linear attention (e.g., Performers) achieves O(n) compute but may sacrifice quality
  • 🔀 Sparse attention (Longformer, BigBird) reduces cost by attending to a subset of positions
  • 📈 The KV cache scales O(n), separate from the O(n²) attention matrix; both matter for long context
  • 🧠 Anthropic and Google use custom optimizations for million-token context windows
  • 🌐 H100 and A100 Tensor Cores provide the hardware support that FlashAttention kernels exploit

📖 How It Works

1. Standard Attention

Computes full QK^T matrix of size n×n. Memory O(n²), compute O(n²·d). Storing the matrix dominates at long context.
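
A minimal single-head NumPy sketch of this (illustrative only; the function name is ours, not from any library):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention that materializes the full n x n score matrix."""
    n, d = Q.shape
    S = Q @ K.T / np.sqrt(d)                      # (n, n) scores: the O(n^2) memory term
    P = np.exp(S - S.max(axis=1, keepdims=True))  # numerically stable softmax numerator
    P /= P.sum(axis=1, keepdims=True)             # normalize row-wise
    return P @ V                                  # (n, d) output
```

At 131072 tokens the score matrix S alone is 131072² values, roughly 64 GiB in fp32 per head per layer, which is exactly what the tiled variant in the next step avoids storing.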

2. FlashAttention

Tiles Q, K, V into blocks and computes attention chunk by chunk, keeping only running softmax statistics; intermediates are recomputed in the backward pass rather than stored. Memory O(n), compute O(n²·d).
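
A minimal NumPy sketch of the tiling idea (the real FlashAttention is a fused GPU kernel; the block sizes, online-softmax bookkeeping, and function name below are simplified assumptions for illustration):

```python
import numpy as np

def blocked_attention(Q, K, V, block_q=128, block_k=128):
    """Block-wise attention with a running (online) softmax; never builds the n x n matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(V)
    for qs in range(0, n, block_q):
        Qb = Q[qs:qs + block_q]
        m = np.full(len(Qb), -np.inf)       # running row-wise max of the scores
        l = np.zeros(len(Qb))               # running softmax denominator
        acc = np.zeros((len(Qb), d))        # running weighted sum of V rows
        for ks in range(0, n, block_k):
            S = scale * Qb @ K[ks:ks + block_k].T      # one (block_q, block_k) score tile
            m_new = np.maximum(m, S.max(axis=1))
            P = np.exp(S - m_new[:, None])
            rescale = np.exp(m - m_new)                # correct stats computed with the old max
            l = l * rescale + P.sum(axis=1)
            acc = acc * rescale[:, None] + P @ V[ks:ks + block_k]
            m = m_new
        out[qs:qs + block_q] = acc / l[:, None]
    return out
```

The output matches the naive softmax(QKᵀ/√d)·V result, but only per-block tiles and per-row statistics are ever held in memory, which is the O(n) behaviour described above.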

3. Cost Scaling

Doubling context → 4× memory (standard) or 2× (Flash). Quadratic ratio = (target/base)².

4. Throughput

More tokens per forward pass → more compute per token. Throughput drops roughly 1/n for Flash, 1/n² for standard.
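
A back-of-the-envelope version of that throughput model (the 1/n vs 1/n² relationship used on this page; a simplification, not a benchmark):

```python
def relative_throughput(n_base, n_target, flash=True):
    """Throughput at n_target relative to n_base, using the simple 1/n (Flash) vs 1/n^2 (standard) model."""
    r = n_base / n_target
    return r if flash else r ** 2

print(relative_throughput(8_192, 131_072, flash=True))   # 0.0625  -> ~6.3% of base
print(relative_throughput(8_192, 131_072, flash=False))  # ~0.0039 -> ~0.4% of base
```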

5. Linear Attention

Kernel tricks (e.g., Performers) achieve O(n) compute but approximate softmax — quality trade-off.
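
A minimal non-causal sketch of that reordering (the elu+1 feature map follows the Katharopoulos et al. 2020 "linear transformers" formulation; names and details here are illustrative assumptions, not any specific library's API):

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: keeps features positive so the normalizer is well defined
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """Reorders (QK^T)V as Q(K^T V) so cost is O(n * d^2) instead of O(n^2 * d)."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V                 # (d, d) summary of keys and values
    z = Kf.sum(axis=0)            # (d,) normalizer
    return (Qf @ kv) / (Qf @ z)[:, None]
```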

6. Long Context Models

Long-context models such as Gemini 1M and Claude 200K rely on memory-efficient attention plus architectural tweaks (e.g., MoE, sparse patterns) for efficiency; the exact recipes are not public.

🎯 Expert Tips

Use FlashAttention

Always prefer FlashAttention for context > 4K — 10–100× memory savings.

Plan for throughput drop

128K context ≈ 16× slower than 8K for Flash. Batch smaller or use chunked processing.

KV cache + attention

KV cache is O(n) — use KV Cache calculator for inference memory. Attention cost is separate.

H100 / A100

Use FP16/BF16 FlashAttention on Tensor Cores; fused kernels are typically 2–4× faster than naive, unfused attention implementations.

⚖️ Attention Type Comparison

Type                      Memory   Compute   Use Case
Standard                  O(n²)    O(n²·d)   Short context, debugging
FlashAttention            O(n)     O(n²·d)   Long context (4K+)
Linear (e.g., Performer)  O(n)     O(n·d)    Very long context, quality trade-off

❓ Frequently Asked Questions

Why is attention O(n²)?

Each of n query positions attends to all n key positions, producing an n×n attention matrix. Storing and computing this matrix scales quadratically.
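
A quick way to see where the quadratic term bites (fp16, one head, one layer; purely back-of-the-envelope):

```python
# Size of a single fp16 n x n attention score matrix (one head, one layer).
for n in (8_192, 32_768, 131_072):
    print(f"{n:>7} tokens: {n * n * 2 / 2**30:7.2f} GiB")
# 8192 -> ~0.13 GiB, 32768 -> 2 GiB, 131072 -> 32 GiB, before multiplying by heads and layers.
```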

How does FlashAttention reduce memory?

FlashAttention tiles Q,K,V into blocks and computes attention in chunks. It recomputes blocks on-the-fly instead of storing the full n×n matrix, trading compute for memory.

What is the cost scaling ratio?

Cost scaling ratio = (target_context / base_context)². Doubling context = 4× cost. 8K→128K = 16× longer = 256× cost.

When does throughput degrade?

Longer context means more tokens per forward pass. Throughput (tokens/sec) drops roughly 1/n for FlashAttention, 1/n² for standard attention.

Standard vs Flash vs linear?

Standard: exact, O(n²) memory. Flash: exact, O(n) memory. Linear: approximate, O(n) compute — may hurt quality.

How accurate is this calculator?

Formulas are standard. Actual memory depends on implementation, precision, and framework. Use for planning and capacity estimation.

What about KV cache?

KV cache is O(n) and separate from attention matrix. Use the KV Cache calculator for inference memory. This calculator focuses on attention scaling.

Can I run 1M context on one GPU?

With FlashAttention and further optimizations, million-token contexts become feasible on high-end accelerators (Gemini 1.5 Pro serves 1M tokens in production). Standard attention would require 1000+ GB for the attention matrices alone.

Why does FlashAttention have the same compute?

FlashAttention still computes all n² attention scores but in blocks. It saves memory by not materializing the full matrix, not by reducing FLOPs.
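
In FLOP terms (counting only the two big matmuls, QKᵀ and PV; d_head = 128 is an assumed example, and this is a rough estimate rather than a profiler number):

```python
def attention_matmul_flops(n, d_head):
    """Multiply-add FLOPs for QK^T plus P @ V, per head per layer: both are ~2*n^2*d."""
    return 2 * n * n * d_head + 2 * n * n * d_head

# Tiled or not, the same n^2 scores get computed, so the FLOP count is unchanged.
print(attention_matmul_flops(131_072, 128) / 1e12, "TFLOPs per head per layer")
```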

What is linear attention?

Linear attention (e.g., Performers, Linear Transformers) uses kernel tricks to achieve O(n) compute. Quality can degrade vs softmax attention.

📊 Context Scaling by the Numbers

  • O(n²): standard attention memory
  • O(n): FlashAttention memory
  • 256×: cost of scaling 8K → 128K
  • 1M: Gemini context window

⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Formulas follow FlashAttention (Dao 2022/2023) and standard attention theory. Actual memory and throughput depend on implementation, hardware, precision, and framework. For production, validate with profiling and benchmarks on target deployment.
