Context Window Scaling
Standard attention needs O(n²) memory: doubling the context quadruples memory. FlashAttention reduces memory to O(n) via tiling and recomputation. Cost scaling ratio = (target/base)².
Why This ML Metric Matters
Why: Long context (128K, 200K, 1M) is critical for RAG and document understanding. Understanding scaling costs helps plan deployment and choose attention type.
How: Enter base and target context lengths, model dim, heads, layers. Select attention type (Standard/Flash/Linear). Calculator computes memory and throughput scaling.
- O(n²) standard vs O(n) Flash
- 4× context = 16× cost
- Flash trades compute for memory
- Gemini 1M needs Flash
Standard vs Flash Attention Memory & Throughput
Based on Dao 2022/2023 FlashAttention. Compare O(n²) vs O(n) memory. Plan for GPT-4 128K, Claude 200K, Gemini 1M.
📊 Quick Examples — Click to Load
Inputs
Memory vs Context Length (Quadratic vs Linear)
Throughput vs Context Length
🤖 AI & ML Facts
Standard attention has O(n²) memory — doubling context quadruples memory
— FlashAttention
FlashAttention reduces memory to O(n) via tiling and recomputation (Dao 2022/2023)
— Dao et al.
Cost scaling ratio = (target/base)² — 4× context = 16× cost
— Attention theory
Gemini 1M and Claude 200K contexts require FlashAttention-class memory-efficient attention for feasibility
— Anthropic, Google
📋 Key Takeaways
- Standard attention has O(n²) memory: doubling context quadruples memory
- FlashAttention reduces memory to O(n) via tiling and recomputation (Dao 2022/2023)
- Compute remains O(n²·d) for both: FlashAttention trades compute for memory
- Cost scaling ratio = (target/base)²: 4× context = 16× cost
- Throughput degrades with longer context: in this calculator's model, roughly 1/n for FlashAttention vs 1/n² for standard
- Gemini 1M and Claude 200K require FlashAttention or linear approximations for feasibility
📖 How It Works
1. Standard Attention
Computes full QK^T matrix of size n×n. Memory O(n²), compute O(n²·d). Storing the matrix dominates at long context.
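A minimal single-head NumPy sketch (unbatched, illustrative names) makes the quadratic term concrete: the `scores` array is n×n, so at n = 128K a single FP16 copy of it is already ~32 GiB.

```python
# A minimal sketch of standard (naive) attention, assuming single-head,
# unbatched (n, d) inputs. The n x n `scores` matrix is what makes memory
# O(n^2): at n = 128K in FP16, that one array alone is ~32 GiB.
import numpy as np

def standard_attention(Q, K, V):
    """Q, K, V: (n, d) arrays. Materializes the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n): the O(n^2) term
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n, d)

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(standard_attention(Q, K, V).shape)  # (512, 64)
```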
2. FlashAttention
Tiles Q, K, V into blocks and computes attention chunk by chunk with an online softmax, recomputing blocks in the backward pass instead of storing the full n×n matrix. Memory O(n), compute O(n²·d).
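The NumPy sketch below illustrates the tiling idea (forward pass only, single head; an illustration of the technique, not the fused CUDA kernel). It streams over K/V blocks with an online softmax, so the n×n matrix is never materialized:

```python
# A FlashAttention-style forward pass sketch in NumPy, assuming the same
# (n, d) inputs as the naive version above. It streams over K/V blocks
# with an online softmax, so peak extra memory is O(n * block) rather
# than O(n^2). Real FlashAttention also tiles the queries, fuses this
# into one GPU kernel, and recomputes blocks in the backward pass.
import numpy as np

def flash_attention_forward(Q, K, V, block=128):
    n, d = Q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max per query (stability)
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)             # (n, block) only
        new_max = np.maximum(row_max, scores.max(axis=-1))
        correction = np.exp(row_max - new_max)     # rescale old partials
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]
```

On small random inputs, its output matches the naive `standard_attention` above to floating-point tolerance, which is the point: same math, no n×n matrix.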
3. Cost Scaling
Doubling context → 4× memory (standard) or 2× (Flash). Quadratic ratio = (target/base)².
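The ratio itself is one line of arithmetic; the helper below (a hypothetical name, not the calculator's actual code) reproduces the 8K→128K example used elsewhere on this page:

```python
# A toy version of the scaling formula this page describes, assuming
# quadratic memory for standard attention and linear memory for Flash.
def scaling_ratios(base_ctx: int, target_ctx: int) -> dict:
    r = target_ctx / base_ctx
    return {
        "context_ratio": r,
        "standard_memory_x": r ** 2,  # n x n matrix grows quadratically
        "flash_memory_x": r,          # O(n) memory grows linearly
    }

print(scaling_ratios(8_192, 131_072))
# {'context_ratio': 16.0, 'standard_memory_x': 256.0, 'flash_memory_x': 16.0}
```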
4. Throughput
More tokens per forward pass → more attention compute per token. In this calculator's model, throughput drops roughly as 1/n for Flash and 1/n² for standard.
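Expressed as code, that model is just a pair of ratios (a sketch of the page's simplified model, not profiled numbers; the function name is illustrative):

```python
# Relative tokens/sec vs a baseline context, under this page's simplified
# model (an assumption, not a benchmark): 1/n for Flash, 1/n^2 for standard.
def relative_throughput(base_ctx: int, target_ctx: int, attention: str = "flash") -> float:
    r = target_ctx / base_ctx
    return 1 / r if attention == "flash" else 1 / r ** 2

print(relative_throughput(8_192, 131_072, "flash"))     # 0.0625   (~16x slower)
print(relative_throughput(8_192, 131_072, "standard"))  # ~0.0039  (~256x slower)
```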
5. Linear Attention
Kernel tricks (e.g., Performers) achieve O(n) compute but approximate softmax — quality trade-off.
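A compact sketch of the kernel trick, assuming the ELU(x)+1 feature map from Katharopoulos et al.'s Linear Transformers: associativity lets φ(K)ᵀV be computed once as a d×d matrix, so cost is linear in the sequence length.

```python
# A minimal linear-attention sketch (in the spirit of Linear Transformers),
# assuming the ELU(x) + 1 feature map. Associativity lets us form
# phi(K)^T V as a (d, d) matrix once, so compute is linear in n
# (O(n * d^2)) rather than O(n^2 * d) -- at the cost of approximating softmax.
import numpy as np

def linear_attention(Q, K, V):
    phi = lambda x: np.where(x > 0, x + 1, np.exp(x))  # ELU(x) + 1: positive features
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V               # (d, d): no n x n matrix anywhere
    norm = Qf @ Kf.sum(axis=0)  # softmax-style normalizer per query
    return (Qf @ KV) / norm[:, None]
```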
6. Long Context Models
Gemini 1M, Claude 200K use FlashAttention + architectural tweaks (e.g., MoE, sparse patterns) for efficiency.
🎯 Expert Tips
Use FlashAttention
Always prefer FlashAttention for context > 4K — 10–100× memory savings.
Plan for throughput drop
128K context ≈ 16× slower than 8K for Flash. Use smaller batches or chunked processing.
KV cache + attention
KV cache is O(n) — use KV Cache calculator for inference memory. Attention cost is separate.
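For a rough sense of scale, a back-of-envelope KV-cache estimate looks like the following (generic parameter names, not any framework's API; the 7B-style config is only an example):

```python
# Back-of-envelope KV-cache size: grows linearly in context length,
# unlike the attention matrix. 2x accounts for storing both K and V.
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, batch=1, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * ctx_len * batch * dtype_bytes

# e.g. a Llama-2-7B-like config (32 layers, 32 KV heads, head_dim 128) at 8K:
print(kv_cache_bytes(32, 32, 128, 8_192) / 2**30, "GiB")  # 4.0 GiB
```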
H100 / A100
Use FP16/BF16 FlashAttention on Tensor Cores: typically 2–4× faster than unfused standard attention.
⚖️ Attention Type Comparison
| Type | Memory | Compute | Use Case |
|---|---|---|---|
| Standard | O(n²) | O(n²·d) | Short context, debugging |
| FlashAttention | O(n) | O(n²·d) | Long context (4K+) |
| Linear (e.g. Performer) | O(n) | O(n·d) | Very long, quality trade-off |
❓ Frequently Asked Questions
Why is attention O(n²)?
Each of n query positions attends to all n key positions, producing an n×n attention matrix. Storing and computing this matrix scales quadratically.
How does FlashAttention reduce memory?
FlashAttention tiles Q,K,V into blocks and computes attention in chunks. It recomputes blocks on-the-fly instead of storing the full n×n matrix, trading compute for memory.
What is the cost scaling ratio?
Cost scaling ratio = (target_context / base_context)². Doubling context = 4× cost. 8K→128K = 16× longer = 256× cost.
When does throughput degrade?
Longer context means more tokens per forward pass and more attention compute per token. In this calculator's model, throughput (tokens/sec) drops roughly as 1/n for FlashAttention and 1/n² for standard attention.
Standard vs Flash vs linear?
Standard: exact, O(n²) memory. Flash: exact, O(n) memory. Linear: approximate, O(n) compute — may hurt quality.
How accurate is this calculator?
Formulas are standard. Actual memory depends on implementation, precision, and framework. Use for planning and capacity estimation.
What about KV cache?
KV cache is O(n) and separate from attention matrix. Use the KV Cache calculator for inference memory. This calculator focuses on attention scaling.
Can I run 1M context on one GPU?
With FlashAttention and further optimizations, models like Gemini 1.5 Pro serve 1M-token contexts on high-end accelerators. Standard attention is infeasible at that scale: a single 1M×1M FP16 attention matrix is ~2 TB, per head, per layer.
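The arithmetic behind that claim, as a quick check:

```python
# One FP16 attention matrix at 1M tokens, per head per layer.
n, fp16_bytes = 1_000_000, 2
print(n * n * fp16_bytes / 1e12, "TB")  # 2.0 TB
```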
Why does FlashAttention have same compute?
FlashAttention still computes all n² attention scores but in blocks. It saves memory by not materializing the full matrix, not by reducing FLOPs.
What is linear attention?
Linear attention (e.g., Performers, Linear Transformers) uses kernel tricks to achieve O(n) compute. Quality can degrade vs softmax attention.
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Formulas follow FlashAttention (Dao 2022/2023) and standard attention theory. Actual memory and throughput depend on implementation, hardware, precision, and framework. For production, validate with profiling and benchmarks on target deployment.
Related Calculators
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
Gradient Accumulation Steps Calculator
Calculate accumulation steps to achieve target effective batch size on limited GPU memory. Based on DeepSpeed ZeRO research.
Inference Throughput & Latency Calculator
Estimate tokens/sec, time-to-first-token, and inter-token latency for LLM serving on various GPU configurations.
KV Cache Size Estimator
Calculate KV cache memory for LLM inference with MHA, MQA, and GQA attention types. Based on PagedAttention research.
Model Distillation Size Calculator
Plan teacher-to-student model compression. Calculate size ratios, expected accuracy retention, and training tokens needed.
Model Quantization Tradeoff Calculator
Compare GPTQ, AWQ, and GGUF quantization methods. Calculate memory savings, speed gains, and accuracy tradeoffs.