Neural Network FLOPs Estimation
Calculate floating-point operations for transformer layers: Attention (O(S²d)), FFN (O(Sd²)), Embedding, and LayerNorm. From Llama 3 to BERT — understand compute cost per layer.
Why This ML Metric Matters
Why: FLOPs drive training time and cost. Attention scales O(S²) — long contexts are expensive. FFN scales O(Sd²). MFU measures how well you use GPU compute.
How: Attention: 4S²d + 8Sd². FFN: 16Sd². Embedding: 2SVd. LayerNorm: 8Sd per layer (two norms of ~4Sd each). Total = sum over layers; a short sketch of these formulas follows the list below.
- • Attention O(S²)
- • FFN O(Sd²)
- • MFU 30–50% typical
- • Backward ~2× forward
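A minimal Python sketch of the formulas above (the function and argument names are our own; the embedding term counts the 2SVd logit projection once per model, not per layer):

```python
def transformer_forward_flops(seq_len, d_model, n_layers, vocab):
    """Analytical forward FLOPs per the formulas above (1 multiply-add = 2 FLOPs)."""
    S, d = seq_len, d_model
    attention = 4 * S**2 * d + 8 * S * d**2   # scores + attn*V, plus QKV/out projections
    ffn = 16 * S * d**2                       # d -> 4d -> d
    layernorm = 8 * S * d                     # two norms of ~4Sd per block
    embedding = 2 * S * vocab * d             # output logit projection, once per model
    return n_layers * (attention + ffn + layernorm) + embedding

# BERT-base-like shape: S=512, d=768, 12 layers, V=30522
print(f"{transformer_forward_flops(512, 768, 12, 30522):.2e} FLOPs")
```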
Calculate FLOPs for Attention, FFN, Embedding & LayerNorm
Plan MFU, scaling, and optimization.
🤖 AI & ML Facts
Llama 3 70B forward pass: ~0.14 TFLOP per token (≈2 FLOPs per parameter), plus attention overhead at long sequences
— Architecture
Attention dominates at long sequences; FFN dominates at short seq with large d
— Vaswani
MFU = (model FLOPs × batch) / (peak GPU FLOP/s × time), the key training-efficiency metric
— Best practice
FlashAttention-2 reduces attention memory and can improve throughput 2–4×
— Optimization
Reference table: classical transformer FLOPs (as of Q1 2026)
Formulas follow standard big-O accounting for multi-head attention (O(S²d)) and FFN (O(Sd²)). Real hardware often does better in practice through kernel fusion, FlashAttention, and sparsity; this UI reports analytical FLOPs, not vendor TFLOP/s.
| Block | Scaling (per layer, order) |
|---|---|
| Self-attention | S²d terms from scores and attn×V, plus Sd² projections (QKV, output) |
| FFN | Linear in S, quadratic in d (width) |
| Training | Forward + backward ≈ 3× forward FLOPs (rule of thumb) |
📋 Key Takeaways
- • Attention scales as O(S²d) — quadratic in sequence length; long contexts are expensive
- • FFN scales as O(Sd²): linear in sequence length, quadratic in hidden dim (crossover vs. attention sketched after this list)
- • MFU (Model FLOPs Utilization) = achieved TFLOP/s ÷ theoretical peak; 30–50% is typical
- • Sparse attention (e.g., Longformer, BigBird) reduces the S² term; FlashAttention computes exact attention with the same FLOPs but far less memory traffic
- • Activation checkpointing trades compute for memory — recomputes activations in backward
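One consequence of the takeaways above: setting the per-layer terms equal, 4S²d + 8Sd² = 16Sd² reduces to S = 2d, so attention overtakes the FFN once sequence length exceeds twice the hidden dimension. A quick check (the hidden size here is illustrative):

```python
def attn_flops(S, d):
    return 4 * S**2 * d + 8 * S * d**2  # scores/attn*V + projections

def ffn_flops(S, d):
    return 16 * S * d**2                # d -> 4d -> d

d = 4096
for S in (2048, 8192, 32768):
    print(f"S={S:>6}: attention/FFN = {attn_flops(S, d) / ffn_flops(S, d):.2f}")
# Prints 0.62, 1.00, 2.50: the ratio is S/(4d) + 1/2, hitting 1.0 exactly at S = 2d.
```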
📖 How It Works
1. Attention
Q, K, V projections (3 × 2Sd² = 6Sd²), attention scores QKᵀ (2S²d), attention-weighted values (2S²d), output projection (2Sd²). Total: 4S²d + 8Sd².
2. FFN
Two linear layers: d→4d (8Sd²) and 4d→d (8Sd²). Total: 16Sd² with intermediate=4d.
3. Embedding
Token lookup is a near-free gather; the 2×S×V×d comes from the output logit projection. Significant for large vocabularies, especially at short sequence lengths.
4. LayerNorm
Mean, variance, scale, shift: ~4Sd per norm, with two LayerNorms per block (pre-attn, pre-FFN), so ~8Sd per layer.
5. Batch Scaling
S = batchSize × seqLength. All formulas scale linearly with batch size. A worked numeric example follows.
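Putting steps 1–5 together for a Llama-3-8B-like shape (d = 4096, 32 layers, V = 128256, batch 1 × seq 8192). Note this applies the classical formulas above; real Llama 3 uses GQA and a SwiGLU FFN, so treat the numbers as ballpark:

```python
S, d, L, V = 8192, 4096, 32, 128256

attn  = 4 * S**2 * d + 8 * S * d**2   # step 1
ffn   = 16 * S * d**2                 # step 2
norm  = 8 * S * d                     # step 4: two norms per block
embed = 2 * S * V * d                 # step 3: logit projection, once

total = L * (attn + ffn + norm) + embed
for name, f in [("attention", L * attn), ("ffn", L * ffn),
                ("layernorm", L * norm), ("embedding", embed)]:
    print(f"{name:>9}: {f:.2e} FLOPs ({100 * f / total:.1f}%)")
```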
🎯 Expert Tips
Sparse attention for long seq
Local + global patterns cut the S² term to near-linear (roughly O(S·w) for window w; O(S log S) for some patterns). Worthwhile for 8K+ context; see the comparison after these tips.
FlashAttention
Fused kernels cut memory traffic (FLOPs are unchanged). 2–4× attention speedup is common.
Activation checkpointing
Recompute activations during the backward pass: ~20% extra compute for 3–5× activation-memory savings.
Profile with flopth/MMEngine
Validate estimates against actual model runs. Framework overhead matters.
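To make the sparse-attention tip concrete, here is a rough comparison of the S² score/value terms under full vs. sliding-window attention (the window size and helper names are our own illustration; QKV/output projections are unchanged either way):

```python
def full_score_flops(S, d):
    return 4 * S**2 * d        # QK^T plus attn*V

def windowed_score_flops(S, d, w):
    return 4 * S * w * d       # each query sees at most w keys

S, d, w = 32768, 4096, 512
full, local = full_score_flops(S, d), windowed_score_flops(S, d, w)
print(f"full: {full:.2e}  windowed: {local:.2e}  ({full / local:.0f}x fewer)")
```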
⚖️ FLOPs by Layer Type
| Component | Formula | Scaling | Typical Share |
|---|---|---|---|
| Attention | 4S²d + 8Sd² | O(S²d) | 40–60% at long seq |
| FFN | 16Sd² | O(Sd²) | 30–50% |
| Embedding | 2SVd | O(SVd) | <5% (small V) |
| LayerNorm | 8Sd (2 norms × 4Sd) | O(Sd) | <5% |
❓ Frequently Asked Questions
What are FLOPs?
Floating-point operations. A multiply and an add each count as one FLOP, so one multiply-accumulate is 2 FLOPs (the source of the factor 2 in these formulas). Used to measure the compute cost of neural networks.
Why does attention scale as S²?
The attention matrix is S×S (each token attends to every other token). Both QKᵀ and softmax×V scale with S²; the quick check below shows the quadratic growth.
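A minimal numeric check (the hidden size is illustrative):

```python
d = 1024
for S in (1024, 2048, 4096):
    qkt = 2 * S**2 * d   # QK^T alone; softmax*V doubles this again
    print(f"S={S}: {qkt:.2e} FLOPs")
# Each doubling of S multiplies the quadratic terms by 4.
```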
What is MFU?
Model FLOPs Utilization = achieved FLOP/s ÷ theoretical peak. 30–50% is typical for training; batched prefill can run higher, while autoregressive decoding is usually memory-bound and lands lower. A worked example follows.
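A minimal MFU computation, assuming the ~6 FLOPs/param/token training rule of thumb; the model size, batch, step time, and the 312 TFLOP/s A100 BF16 peak are illustrative inputs, not a benchmark:

```python
params = 7e9                  # 7B-parameter model (illustrative)
tokens_per_step = 4 * 8192    # batch 4 x seq 8192
step_time_s = 10.0
peak_flops = 312e12           # A100 BF16 dense peak, FLOP/s

achieved = 6 * params * tokens_per_step / step_time_s  # FLOP/s actually sustained
print(f"MFU = {achieved / peak_flops:.1%}")            # ~44%
```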
Do these formulas include backward pass?
No. Training backward pass is ~2× forward. Total training FLOPs ≈ 3× forward (forward + backward).
How accurate are these estimates?
Within ~10–20%. Actual FLOPs depend on implementation (e.g., fused kernels, sparse attention).
What about inference vs training?
Inference = forward only. Training = forward + backward + optimizer. This calculator gives forward FLOPs.
How to reduce FLOPs?
Sparse attention, smaller models, quantization (fewer effective FLOPs), pruning, distillation.
Relation to C=6PD (Chinchilla)?
C=6PD estimates total training FLOPs (P parameters, D tokens). This calculator gives per-forward-pass FLOPs. Training FLOPs ≈ 6 × params × tokens; a quick consistency check follows.
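The consistency is easy to verify: forward ≈ 2 FLOPs per parameter per token, training ≈ 3× forward, hence 6PD (the parameter count and token budget below are illustrative):

```python
P, D = 8e9, 15e12                      # 8B params, 15T training tokens (illustrative)
forward_per_token = 2 * P              # matmul-dominated forward pass
training = 3 * forward_per_token * D   # forward + ~2x backward
assert training == 6 * P * D
print(f"total training compute ~ {training:.2e} FLOPs")
```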
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual FLOPs depend on implementation (fused kernels, sparse attention, FlashAttention), framework overhead, and hardware. Use flopth, MMEngine, or PyTorch profilers for production validation. MFU and training costs require additional factors (utilization, memory bandwidth).
Related Calculators
LLM Training Cost Estimator
Estimate LLM training costs using the C=6PD formula. Calculate GPU hours, total FLOPs, and dollar costs based on Chinchilla scaling laws.
Machine Learning
Compute-Optimal Model Size Calculator (Chinchilla)
Find the compute-optimal model size and training tokens given a compute budget using Chinchilla scaling laws.
Machine Learning
GPU VRAM / Memory Requirements Calculator
Calculate GPU memory requirements for training and inference. Compare FP32, FP16, BF16, INT8, and INT4 precision formats.
Machine Learning
Token Count & LLM API Cost Calculator
Compare token costs across OpenAI, Anthropic, Google, and Mistral. Calculate input vs output token pricing for any LLM API.
Machine Learning
LoRA / QLoRA Fine-Tuning Parameter Calculator
Calculate trainable parameters, memory savings, and adapter sizes for LoRA and QLoRA fine-tuning of large language models.
Machine Learning
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
Machine Learning