
Neural Network FLOPs Estimation

Calculate floating-point operations for transformer layers: Attention (O(S²d)), FFN (O(Sd²)), Embedding, and LayerNorm. From Llama 3 to BERT — understand compute cost per layer.

Concept Fundamentals

  • Forward pass: ~2× params FLOPs per token (multiply-accumulate ops)
  • Full training step: ~6× params FLOPs per token (forward + backward + update)
  • Throughput: FLOP/s utilization measures hardware efficiency
  • Application: compute budgeting and training cost estimation

Use the calculator below to run neural computations.

Why This ML Metric Matters

Why: FLOPs drive training time and cost. Attention scales as O(S²d), so long contexts are expensive; the FFN scales as O(Sd²). MFU measures how much of the GPU's peak compute you actually use.

How: per block, attention costs 8S²d + 4Sd² and the FFN 16Sd²; each of the two LayerNorms adds 4Sd (8Sd per block), and the embedding adds 2SVd once. Total = sum over blocks plus embedding (sketched in code after the list below).

  • Attention O(S²)
  • FFN O(Sd²)
  • MFU 30–50% typical
  • Backward ~2× forward
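
These formulas drop straight into a back-of-envelope script. A minimal Python sketch using this page's conventions (the function name is illustrative, not a library API):

```python
# Forward-pass FLOPs for a transformer, per the formulas above.
# Analytical counts only; fused kernels and sparsity change real execution.
def transformer_forward_flops(layers: int, d: int, seq: int, vocab: int) -> dict:
    attention = layers * (8 * seq**2 * d + 4 * seq * d**2)  # per-block attention
    ffn = layers * 16 * seq * d**2                          # d -> 4d -> d MLP
    layernorm = layers * 2 * (4 * seq * d)                  # two LayerNorms per block
    embedding = 2 * seq * vocab * d                         # counted once, not per block
    parts = {"attention": attention, "ffn": ffn,
             "layernorm": layernorm, "embedding": embedding}
    parts["total"] = sum(parts.values())
    return parts
```

Multiply the result by batch size for batched forward passes.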

Worked Example (BERT-base defaults)

Inputs:
  • transformer blocks: 12
  • model dimension: 768
  • attention heads: 12 (head count does not change these totals)
  • sequence length: 512
  • vocabulary size: 30522
  • FFN hidden: 3072 (4d)
  • batch size: 1

Calculated forward-pass FLOPs:
  • Total: 115.85 GFLOP (attention share 29%)
  • Attention: 33.82 GFLOP
  • FFN: 57.98 GFLOP
  • Embedding: 24.00 GFLOP
  • LayerNorm: 37.75 MFLOP

[Charts: FLOPs by layer type · FLOPs share (%) · cumulative FLOPs per layer]

1. Attention per layer
8S²d + 4Sd² = 8 × 512² × 768 + 4 × 512 × 768² = 2.82 GFLOP
2. FFN per layer
16Sd² = 16 × 512 × 768² = 4.83 GFLOP
3. Embedding
2SVd = 2 × 512 × 30522 × 768 = 24.00 GFLOP
4. LayerNorm per layer
4Sd = 4 × 512 × 768 = 1.57 MFLOP (two per block)
5. Total
Total = 12 × (Attn + FFN + 2 × LN) + Emb = 115.85 GFLOP
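
The same arithmetic as a standalone Python check (variable names are illustrative); it reproduces the totals above:

```python
# BERT-base worked example: L=12 blocks, d=768, S=512, V=30522.
L, d, S, V = 12, 768, 512, 30522
attn_layer = 8 * S**2 * d + 4 * S * d**2  # ~2.82 GFLOP
ffn_layer = 16 * S * d**2                 # ~4.83 GFLOP
ln_layer = 4 * S * d                      # ~1.57 MFLOP each, two per block
emb = 2 * S * V * d                       # ~24.00 GFLOP
total = L * (attn_layer + ffn_layer + 2 * ln_layer) + emb
print(f"{total / 1e9:.2f} GFLOP")         # -> 115.85 GFLOP
```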



Reference Table: Classical Transformer FLOPs

Formulas follow standard big-O accounting for multi-head attention (O(S²d)) and FFN (O(Sd²)). Kernel fusion, FlashAttention, and sparsity change what the hardware actually executes, so this page reports analytical FLOPs, not achieved vendor TFLOP/s.

Block | Scaling (per layer)
Self-attention | 8S²d + 4Sd²: the S²d score/value terms dominate at long S; QKV + output projections contribute the Sd² terms
FFN | 16Sd²: linear in S, quadratic in width d
Training | Forward + backward ≈ 3× forward FLOPs (rule of thumb)

📋 Key Takeaways

  • Attention scales as O(S²d): quadratic in sequence length, so long contexts are expensive
  • FFN scales as O(Sd²): linear in sequence length, quadratic in hidden dimension
  • MFU (Model FLOPs Utilization) = achieved TFLOP/s ÷ theoretical peak; 30–50% is typical
  • Sparse attention (e.g., Longformer) reduces the S² term; FlashAttention keeps exact attention but cuts memory traffic instead
  • Activation checkpointing trades compute for memory by recomputing activations in the backward pass

💡 Did You Know

🧮Llama 3 70B forward pass: ~1.5 PFLOP per token at seq 2048
Attention dominates at long sequences; FFN dominates at short seq with large d
📐Vaswani et al. 2017 derived O(S²d) complexity for self-attention
🔧flopth and MMEngine provide FLOPs profiling for PyTorch models
🎯MFU = (model FLOPs × batch) / (GPU TFLOP/s × time) — key training metric
📉FlashAttention-2 reduces attention memory and can improve throughput 2–4×
🔀Korthikanti et al. 2022: activation recomputation trades 20% compute for 3–5× memory savings
📈Doubling sequence length quadruples the quadratic (S²) part of attention FLOPs

📖 How It Works

1. Attention

Q, K, V and output projections contribute the Sd² terms; the S×S score matrix (QKᵀ) and its product with V contribute the S²d terms. This calculator's convention totals 8S²d + 4Sd² per layer. Note that strict 2-FLOPs-per-MAC accounting (four d×d projections, two S×S matmuls) gives 4S²d + 8Sd², so published constants differ.

2. FFN

Two linear layers: d→4d (8Sd²) and 4d→d (8Sd²). Total: 16Sd² with intermediate=4d.

3. Embedding

Token lookup + projection: 2×S×V×d. Dominant for large vocabularies.

4. LayerNorm

Mean, variance, scale, shift: 4Sd per LayerNorm, two per block (pre-attn, pre-FFN).

5. Batch Scaling

Total FLOPs = batch size × per-sequence FLOPs. Attention is quadratic only within each sequence, so evaluate the formulas at S = sequence length and multiply by batch size; folding batch into S would overcount the S² term. See the sketch below.
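
A short sketch of that accounting (the helper name is illustrative):

```python
# Batch scaling: per-sequence FLOPs at S = sequence length, times batch size.
# Folding batch into S would wrongly inflate the quadratic attention term.
def forward_flops(layers: int, d: int, S: int, V: int) -> int:
    per_block = 8 * S**2 * d + 4 * S * d**2 + 16 * S * d**2 + 2 * (4 * S * d)
    return layers * per_block + 2 * S * V * d

total = 8 * forward_flops(layers=12, d=768, S=512, V=30522)  # batch = 8
print(f"{total / 1e9:.2f} GFLOP")  # exactly 8x the batch-1 total
```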

🎯 Expert Tips

Sparse attention for long seq

Local + global patterns (Longformer-style) cut the dense S² cost to roughly O(S·w) for window width w; LSH-based variants reach O(S log S). Worth considering for 8K+ context.

FlashAttention

Fused, IO-aware kernels avoid materializing the S×S score matrix, cutting memory traffic; typically 2–4× attention speedup with FLOPs unchanged.
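
As one concrete route, PyTorch's built-in fused attention can dispatch to FlashAttention-style kernels on supported hardware; a minimal sketch:

```python
import torch
import torch.nn.functional as F

# Fused softmax(QK^T / sqrt(d_head)) @ V without materializing the full
# S x S score matrix in main memory. FLOPs unchanged; memory traffic drops.
q = torch.randn(1, 12, 512, 64)  # (batch, heads, seq, head_dim)
k = torch.randn(1, 12, 512, 64)
v = torch.randn(1, 12, 512, 64)
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 12, 512, 64])
```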

Activation checkpointing

Recompute activations in backward. ~20% compute for 3–5× memory savings.
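
A minimal PyTorch sketch, checkpointing one FFN block:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Intermediate activations of `ffn` are discarded after the forward pass and
# recomputed during backward: ~one extra forward of compute, much less memory.
ffn = torch.nn.Sequential(
    torch.nn.Linear(768, 3072), torch.nn.GELU(), torch.nn.Linear(3072, 768))
x = torch.randn(512, 768, requires_grad=True)
y = checkpoint(ffn, x, use_reentrant=False)
y.sum().backward()  # ffn's forward runs a second time here
```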

Profile with flopth/MMEngine

Validate estimates against actual model runs. Framework overhead matters.
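
flopth and MMEngine APIs vary by version, so as one dependency-free option, PyTorch's own profiler can attach estimated FLOP counts to supported operators:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(768, 3072)
x = torch.randn(512, 768)

# with_flops=True asks the profiler to estimate FLOPs for matmul/conv ops.
with profile(activities=[ProfilerActivity.CPU], with_flops=True) as prof:
    model(x)
print(prof.key_averages().table(sort_by="flops", row_limit=5))
```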

⚖️ FLOPs by Layer Type

Component | Formula | Scaling | Typical share
Attention | 8S²d + 4Sd² | O(S²d) | 40–60% at long seq
FFN | 16Sd² | O(Sd²) | 30–50%
Embedding | 2SVd | O(SVd) | <5% (small V)
LayerNorm | 4Sd × 2L | O(Sd) | <5%

❓ Frequently Asked Questions

What are FLOPs?

Floating-point operations, a standard measure of a model's compute cost. Under the convention used here, one multiply-add (MAC) counts as two FLOPs.

Why does attention scale as S²?

The attention matrix is S×S (each token attends to every other). QK^T and softmax×V both scale with S².

What is MFU?

Model FLOPs Utilization = achieved throughput ÷ theoretical hardware peak. 30–50% is typical for large-scale training; autoregressive inference decoding is usually memory-bound and often lower.
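
A sketch of the computation (all inputs hypothetical; 312 TFLOP/s is the A100 bf16 peak):

```python
# MFU = achieved FLOP/s divided by theoretical peak FLOP/s.
def mfu(step_flops: float, step_time_s: float, peak_flops_per_s: float) -> float:
    return step_flops / step_time_s / peak_flops_per_s

# Hypothetical run: 6 * params * tokens FLOPs per training step,
# 25 s per step, on a 312 TFLOP/s accelerator.
P, tokens_per_step = 1.3e9, 0.5e6
print(f"MFU = {mfu(6 * P * tokens_per_step, 25.0, 312e12):.0%}")  # -> MFU = 50%
```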

Do these formulas include backward pass?

No. Training backward pass is ~2× forward. Total training FLOPs ≈ 3× forward (forward + backward).
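
In numbers, using the worked example's batch-1 forward cost:

```python
# Rule of thumb: one training step ~ 3x forward (backward ~ 2x forward).
forward_gflop = 115.85                 # from the worked example above
train_step_gflop = 3 * forward_gflop
print(f"~{train_step_gflop:.0f} GFLOP per batch-1 training step")  # ~348
```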

How accurate are these estimates?

Within ~10–20%. Actual FLOPs depend on implementation (e.g., fused kernels, sparse attention).

What about inference vs training?

Inference = forward only. Training = forward + backward + optimizer. This calculator gives forward FLOPs.

How to reduce FLOPs?

Sparse attention, smaller models, quantization (fewer effective FLOPs), pruning, distillation.

Relation to C=6PD (Chinchilla)?

C=6PD estimates total training FLOPs. This calculator gives per-forward FLOPs. Training FLOPs ≈ 6 × params × tokens.
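
A sketch with hypothetical model and data sizes (312 TFLOP/s peak and 40% MFU are illustrative assumptions):

```python
# Chinchilla-style budget: C ~ 6 * P * D (2PD forward, x3 for training).
P = 7e9         # parameters (hypothetical 7B model)
D = 1.4e12      # training tokens
C = 6 * P * D   # total training FLOPs
gpu_days = C / (312e12 * 0.40) / 86400  # seconds -> days at 40% MFU
print(f"C = {C:.2e} FLOPs -> ~{gpu_days:.0f} GPU-days")
```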

📊 FLOPs by the Numbers

  • Attention: O(S²)
  • FFN: O(Sd²)
  • Typical MFU: 30–50%
  • Backward vs forward: ~2×

⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual FLOPs depend on implementation (fused kernels, sparse attention, FlashAttention), framework overhead, and hardware. Use flopth, MMEngine, or PyTorch profilers for production validation. MFU and training costs require additional factors (utilization, memory bandwidth).
