
Neural Network FLOPs Estimation

Calculate floating-point operations for transformer layers: Attention (O(S²d)), FFN (O(Sd²)), Embedding, and LayerNorm. From Llama 3 to BERT — understand compute cost per layer.

Concept Fundamentals

  • Forward pass: ~2× params FLOPs per token (multiply-accumulate ops)
  • Full training step: ~6× params FLOPs per token (forward + backward + update)
  • Throughput: FLOP/s utilization measures hardware efficiency
  • Application: compute budgeting and training cost estimation

Use the calculator below to run neural computations.

Why This ML Metric Matters

Why: FLOPs drive training time and cost. Attention scales as O(S²d), so long contexts are expensive; the FFN scales as O(Sd²). MFU measures how much of the GPU's peak compute you actually use.

How: per block, attention costs 8S²d + 4Sd² and the FFN 16Sd²; each of the two LayerNorms adds 4Sd (8Sd per block), and the embedding adds 2SVd once. Total = sum over blocks plus embedding (sketched in code after the list below).

  • Attention O(S²)
  • FFN O(Sd²)
  • MFU 30–50% typical
  • Backward ~2× forward
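
These formulas drop straight into a back-of-envelope script. A minimal Python sketch using this page's conventions (the function name is illustrative, not a library API):

```python
# Forward-pass FLOPs for a transformer, per the formulas above.
# Analytical counts only; fused kernels and sparsity change real execution.
def transformer_forward_flops(layers: int, d: int, seq: int, vocab: int) -> dict:
    attention = layers * (8 * seq**2 * d + 4 * seq * d**2)  # per-block attention
    ffn = layers * 16 * seq * d**2                          # d -> 4d -> d MLP
    layernorm = layers * 2 * (4 * seq * d)                  # two LayerNorms per block
    embedding = 2 * seq * vocab * d                         # counted once, not per block
    parts = {"attention": attention, "ffn": ffn,
             "layernorm": layernorm, "embedding": embedding}
    parts["total"] = sum(parts.values())
    return parts
```

Multiply the result by batch size for batched forward passes.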

Worked Example (BERT-base defaults)

Inputs:
  • transformer blocks: 12
  • model dimension: 768
  • attention heads: 12 (head count does not change these totals)
  • sequence length: 512
  • vocabulary size: 30522
  • FFN hidden: 3072 (4d)
  • batch size: 1

Calculated forward-pass FLOPs:
  • Total: 115.85 GFLOP (attention share 29%)
  • Attention: 33.82 GFLOP
  • FFN: 57.98 GFLOP
  • Embedding: 24.00 GFLOP
  • LayerNorm: 37.75 MFLOP

[Charts: FLOPs by layer type · FLOPs share (%) · cumulative FLOPs per layer]

1. Attention per layer
8S²d + 4Sd² = 8 × 512² × 768 + 4 × 512 × 768² = 2.82 GFLOP
2. FFN per layer
16Sd² = 16 × 512 × 768² = 4.83 GFLOP
3. Embedding
2SVd = 2 × 512 × 30522 × 768 = 24.00 GFLOP
4. LayerNorm per layer
4Sd = 4 × 512 × 768 = 1.57 MFLOP (two per block)
5. Total
Total = 12 × (Attn + FFN + 2 × LN) + Emb = 115.85 GFLOP
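
The same arithmetic as a standalone Python check (variable names are illustrative); it reproduces the totals above:

```python
# BERT-base worked example: L=12 blocks, d=768, S=512, V=30522.
L, d, S, V = 12, 768, 512, 30522
attn_layer = 8 * S**2 * d + 4 * S * d**2  # ~2.82 GFLOP
ffn_layer = 16 * S * d**2                 # ~4.83 GFLOP
ln_layer = 4 * S * d                      # ~1.57 MFLOP each, two per block
emb = 2 * S * V * d                       # ~24.00 GFLOP
total = L * (attn_layer + ffn_layer + 2 * ln_layer) + emb
print(f"{total / 1e9:.2f} GFLOP")         # -> 115.85 GFLOP
```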



Reference Table: Classical Transformer FLOPs

Formulas follow standard big-O accounting for multi-head attention (O(S²d)) and FFN (O(Sd²)). Kernel fusion, FlashAttention, and sparsity change what the hardware actually executes, so this page reports analytical FLOPs, not achieved vendor TFLOP/s.

Block | Scaling (per layer)
Self-attention | 8S²d + 4Sd²: the S²d score/value terms dominate at long S; QKV + output projections contribute the Sd² terms
FFN | 16Sd²: linear in S, quadratic in width d
Training | Forward + backward ≈ 3× forward FLOPs (rule of thumb)

📋 Key Takeaways

  • Attention scales as O(S²d): quadratic in sequence length, so long contexts are expensive
  • FFN scales as O(Sd²): linear in sequence length, quadratic in hidden dimension
  • MFU (Model FLOPs Utilization) = achieved TFLOP/s ÷ theoretical peak; 30–50% is typical
  • Sparse attention (e.g., Longformer) reduces the S² term; FlashAttention keeps exact attention but cuts memory traffic instead
  • Activation checkpointing trades compute for memory by recomputing activations in the backward pass

💡 Did You Know

🧮Llama 3 70B forward pass: ~1.5 PFLOP per token at seq 2048
Attention dominates at long sequences; FFN dominates at short seq with large d
📐Vaswani et al. 2017 derived O(S²d) complexity for self-attention
🔧flopth and MMEngine provide FLOPs profiling for PyTorch models
🎯MFU = (model FLOPs × batch) / (GPU TFLOP/s × time) — key training metric
📉FlashAttention-2 reduces attention memory and can improve throughput 2–4×
🔀Korthikanti et al. 2022: activation recomputation trades 20% compute for 3–5× memory savings
📈Doubling sequence length quadruples the quadratic (S²) part of attention FLOPs

📖 How It Works

1. Attention

Q, K, V and output projections contribute the Sd² terms; the S×S score matrix (QKᵀ) and its product with V contribute the S²d terms. This calculator's convention totals 8S²d + 4Sd² per layer. Note that strict 2-FLOPs-per-MAC accounting (four d×d projections, two S×S matmuls) gives 4S²d + 8Sd², so published constants differ.

2. FFN

Two linear layers: d→4d (8Sd²) and 4d→d (8Sd²). Total: 16Sd² with intermediate=4d.

3. Embedding

Token lookup + projection: 2×S×V×d. Dominant for large vocabularies.

4. LayerNorm

Mean, variance, scale, shift: 4Sd per LayerNorm, two per block (pre-attn, pre-FFN).

5. Batch Scaling

Total FLOPs = batch size × per-sequence FLOPs. Attention is quadratic only within each sequence, so evaluate the formulas at S = sequence length and multiply by batch size; folding batch into S would overcount the S² term. See the sketch below.
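
A short sketch of that accounting (the helper name is illustrative):

```python
# Batch scaling: per-sequence FLOPs at S = sequence length, times batch size.
# Folding batch into S would wrongly inflate the quadratic attention term.
def forward_flops(layers: int, d: int, S: int, V: int) -> int:
    per_block = 8 * S**2 * d + 4 * S * d**2 + 16 * S * d**2 + 2 * (4 * S * d)
    return layers * per_block + 2 * S * V * d

total = 8 * forward_flops(layers=12, d=768, S=512, V=30522)  # batch = 8
print(f"{total / 1e9:.2f} GFLOP")  # exactly 8x the batch-1 total
```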

🎯 Expert Tips

Sparse attention for long seq

Local + global patterns (Longformer-style) cut the dense S² cost to roughly O(S·w) for window width w; LSH-based variants reach O(S log S). Worth considering for 8K+ context.

FlashAttention

Fused, IO-aware kernels avoid materializing the S×S score matrix, cutting memory traffic; typically 2–4× attention speedup with FLOPs unchanged.
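
As one concrete route, PyTorch's built-in fused attention can dispatch to FlashAttention-style kernels on supported hardware; a minimal sketch:

```python
import torch
import torch.nn.functional as F

# Fused softmax(QK^T / sqrt(d_head)) @ V without materializing the full
# S x S score matrix in main memory. FLOPs unchanged; memory traffic drops.
q = torch.randn(1, 12, 512, 64)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 12, 512, 64)
v = torch.randn(1, 12, 512, 64)
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 12, 512, 64])
```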

Activation checkpointing

Recompute activations in backward. ~20% compute for 3–5× memory savings.
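
A minimal PyTorch sketch, checkpointing one FFN block:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Intermediate activations of `ffn` are discarded after the forward pass and
# recomputed during backward: ~one extra forward of compute, much less memory.
ffn = torch.nn.Sequential(
    torch.nn.Linear(768, 3072), torch.nn.GELU(), torch.nn.Linear(3072, 768))
x = torch.randn(512, 768, requires_grad=True)
y = checkpoint(ffn, x, use_reentrant=False)
y.sum().backward()  # ffn's forward runs a second time here
```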

Profile with flopth/MMEngine

Validate estimates against actual model runs. Framework overhead matters.
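
flopth and MMEngine APIs vary by version, so as one dependency-free option, PyTorch's own profiler can attach estimated FLOP counts to supported operators:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(768, 3072)
x = torch.randn(512, 768)

# with_flops=True asks the profiler to estimate FLOPs for matmul/conv ops.
with profile(activities=[ProfilerActivity.CPU], with_flops=True) as prof:
    model(x)
print(prof.key_averages().table(sort_by="flops", row_limit=5))
```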

⚖️ FLOPs by Layer Type

Component | Formula | Scaling | Typical share
Attention | 8S²d + 4Sd² | O(S²d) | 40–60% at long seq
FFN | 16Sd² | O(Sd²) | 30–50%
Embedding | 2SVd | O(SVd) | <5% (small V)
LayerNorm | 4Sd × 2L | O(Sd) | <5%

❓ Frequently Asked Questions

What are FLOPs?

Floating-point operations, a standard measure of a model's compute cost. Under the convention used here, one multiply-add (MAC) counts as two FLOPs.

Why does attention scale as S²?

The attention matrix is S×S (each token attends to every other). QK^T and softmax×V both scale with S².

What is MFU?

Model FLOPs Utilization = achieved throughput ÷ theoretical hardware peak. 30–50% is typical for large-scale training; autoregressive inference decoding is usually memory-bound and often lower.
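
A sketch of the computation (all inputs hypothetical; 312 TFLOP/s is the A100 bf16 peak):

```python
# MFU = achieved FLOP/s divided by theoretical peak FLOP/s.
def mfu(step_flops: float, step_time_s: float, peak_flops_per_s: float) -> float:
    return step_flops / step_time_s / peak_flops_per_s

# Hypothetical run: 6 * params * tokens FLOPs per training step,
# 25 s per step, on a 312 TFLOP/s accelerator.
P, tokens_per_step = 1.3e9, 0.5e6
print(f"MFU = {mfu(6 * P * tokens_per_step, 25.0, 312e12):.0%}")  # -> MFU = 50%
```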

Do these formulas include backward pass?

No. Training backward pass is ~2× forward. Total training FLOPs ≈ 3× forward (forward + backward).
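
In numbers, using the worked example's batch-1 forward cost:

```python
# Rule of thumb: one training step ~ 3x forward (backward ~ 2x forward).
forward_gflop = 115.85                 # from the worked example above
train_step_gflop = 3 * forward_gflop
print(f"~{train_step_gflop:.0f} GFLOP per batch-1 training step")  # ~348
```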

How accurate are these estimates?

Within ~10–20%. Actual FLOPs depend on implementation (e.g., fused kernels, sparse attention).

What about inference vs training?

Inference = forward only. Training = forward + backward + optimizer. This calculator gives forward FLOPs.

How to reduce FLOPs?

Sparse attention, smaller models, quantization (fewer effective FLOPs), pruning, distillation.

Relation to C=6PD (Chinchilla)?

C=6PD estimates total training FLOPs. This calculator gives per-forward FLOPs. Training FLOPs ≈ 6 × params × tokens.
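
A sketch with hypothetical model and data sizes (312 TFLOP/s peak and 40% MFU are illustrative assumptions):

```python
# Chinchilla-style budget: C ~ 6 * P * D (2PD forward, x3 for training).
P = 7e9         # parameters (hypothetical 7B model)
D = 1.4e12      # training tokens
C = 6 * P * D   # total training FLOPs
gpu_days = C / (312e12 * 0.40) / 86400  # seconds -> days at 40% MFU
print(f"C = {C:.2e} FLOPs -> ~{gpu_days:.0f} GPU-days")
```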

📊 FLOPs by the Numbers

  • Attention: O(S²)
  • FFN: O(Sd²)
  • Typical MFU: 30–50%
  • Backward vs forward: ~2×

⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual FLOPs depend on implementation (fused kernels, sparse attention, FlashAttention), framework overhead, and hardware. Use flopth, MMEngine, or PyTorch profilers for production validation. MFU and training costs require additional factors (utilization, memory bandwidth).
