🧮 Neural Network Parameter Counting

Count total parameters for transformer architectures: Embedding, Multi-Head Attention, LayerNorm, and FFN. From Llama 3 70B to BERT — understand model size and plan VRAM, training cost, and scaling.

Concept Fundamentals
  • Dense Layer: (input+1) × output (weights + biases)
  • Conv Layer: k² × c_in × c_out (kernel parameters)
  • Embedding: vocab × d_model (token embeddings)
  • Total Params: sum of all layers (model size measure)

Use the calculator below to run neural computations.

Why This ML Metric Matters

Why: Parameter count drives VRAM needs, inference cost, and training budget. FFN dominates (~66%); MHA contributes ~33%. Chinchilla scaling uses ~20× params in tokens.

How: Embedding = V×d. MHA = 4d²+4d per layer. LayerNorm = 4d per layer. FFN = 2dm+m+d per layer. Total = Embedding + L×(MHA+LN+FFN).

  • FFN ~66%, MHA ~33%
  • VRAM ≈ 2 bytes/param FP16
  • Chinchilla: 20× params in tokens
  • LayerNorm <1%
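
The formulas in the "How" note above translate directly into a few lines of Python. A minimal sketch (the function name and return format are illustrative, not from any particular library) that reproduces the calculator's estimate:

```python
def transformer_params(num_layers, d_model, vocab_size, ffn_hidden):
    """Estimate trainable parameters of a standard transformer stack."""
    embedding = vocab_size * d_model                       # V × d token embedding
    mha = 4 * d_model**2 + 4 * d_model                     # Q, K, V, O weights + biases
    layernorm = 4 * d_model                                # two LayerNorms (gamma, beta) per block
    ffn = 2 * d_model * ffn_hidden + ffn_hidden + d_model  # d→m and m→d linears + biases
    per_layer = mha + layernorm + ffn
    total = embedding + num_layers * per_layer
    return {"embedding": embedding, "per_layer": per_layer, "total": total}

# BERT-base-like config: 12 layers, d = 768, vocab 30,522, FFN 3,072
print(transformer_params(12, 768, 30522, 3072))  # total ≈ 108.5M, matching the example below
```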
🧮 Transformer Parameter Counter

Count parameters for Embedding, MHA, LayerNorm & FFN


Inputs: number of transformer blocks (L), model dimension (d), number of attention heads, vocabulary size (V), FFN hidden size (m, often 4×d), and dimension per head.
Example output (12 layers, d = 768, vocab 30,522):
  • Total Params: 108.50M
  • Embedding: 23.44M
  • MHA: 28.35M
  • LayerNorm: 36.86K
  • FFN: 56.67M (≈52% of total)

Charts (interactive): parameters by layer type, parameter distribution (%), and cumulative parameters per layer.

1. Embedding
V × d = 30522 × 768 = 23.44M
2. MHA per layer
4d² + 4d = 4 × 768² + 4 × 768 = 2.36M
3. LayerNorm per layer
2 × 2d = 4d = 4 × 768 = 3.07K
4. FFN per layer
2d × m + m + d = 2 × 768 × 3072 + 3072 + 768 = 4.72M
5. Total
Total = Emb + L × (MHA + LN + FFN) = 108.50M

For educational and informational purposes only. Verify exact counts against your framework or the model card.

🤖 AI & ML Facts

  • 🧮 Llama 3 70B has ~70B parameters; FFN contributes ~66% of total — Architecture
  • MHA has 4d² params (Q, K, V, O); FFN has ~8d² when intermediate = 4d — Vaswani
  • 📐 Chinchilla scaling: train on 20× params in tokens for compute-optimal — Chinchilla
  • 📉 Quantization (INT8/INT4) reduces memory, not parameter count — Best practice

📋 Key Takeaways

  • FFN dominates parameter count (~66%) in standard transformers; MHA ~33%
  • Embedding scales with vocab × dim — large vocabularies add significant params
  • LayerNorm is negligible (<1%) but essential for training stability
  • Chinchilla: compute-optimal tokens ≈ 20× parameters; C = 6PD
  • VRAM ≈ 2 bytes/param for FP16, 4 for FP32 — use for memory planning

💡 Did You Know

  • 📐 Linear layer: in×out + out (bias). Conv2D: K² × in × out + out
  • 🔧 flopth and HuggingFace model cards validate parameter counts
  • 🔀 LoRA adds ~0.1% params for fine-tuning; full fine-tune = 100%
  • 📈 Doubling hidden dim quadruples MHA and FFN parameters

📖 How It Works

1. Embedding

Token embedding: vocab × dim. One matrix maps token IDs to hidden vectors.

2. Multi-Head Attention (MHA)

Q, K, V, O projections: 4 × (d×d) weights + 4×d biases = 4d² + 4d. The projections are stored as full d×d matrices that are split across heads (each head works on a d/h-dimensional slice), so the head count does not change the parameter count.

3. LayerNorm

Gamma and beta: 2×d per LayerNorm. Two per block (pre-attn, pre-FFN) = 4d per layer.

4. FFN

Two linear layers: d→intermediate (d×m + m) and intermediate→d (m×d + d). Total: 2dm + m + d.

5. Grand Total

Embedding + numLayers × (MHA + LayerNorm + FFN). Sum all components.
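
As a sanity check on the per-block formula, the sketch below compares it against PyTorch's built-in encoder layer (assuming PyTorch is installed; the BERT-base-like dimensions are only an example):

```python
# Compare the per-block formula with a real PyTorch encoder layer
import torch.nn as nn

d, heads, m = 768, 12, 3072
block = nn.TransformerEncoderLayer(d_model=d, nhead=heads, dim_feedforward=m)
counted = sum(p.numel() for p in block.parameters())
formula = (4 * d**2 + 4 * d) + (2 * d * m + m + d) + 4 * d  # MHA + FFN + two LayerNorms
print(counted, formula)  # both come out to ~7.09M for this configuration
```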

🎯 Expert Tips

Validate with framework counts

Keras model.summary(), PyTorch parameter iteration, and HuggingFace model cards give exact counts. Use them for production validation.
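
A minimal sketch of that validation with Hugging Face Transformers (bert-base-uncased is just an example checkpoint; any model works):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total = {total/1e6:.2f}M, trainable = {trainable/1e6:.2f}M")
```

Expect a figure slightly above the 108.5M estimate used here, since BERT also carries position and segment embeddings plus a pooler layer.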

VRAM planning

FP16: 2 bytes/param. 70B model ≈ 140GB. Add optimizer states for training.
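
A rough back-of-the-envelope helper; the 16 bytes/param training figure assumes mixed-precision Adam (FP16 weights and gradients plus FP32 master weights and two optimizer states) and ignores activation memory:

```python
def vram_gb(num_params, bytes_per_param=2):
    """Weights-only memory in GB (FP16 default: 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

params_70b = 70e9
print(f"inference, FP16: {vram_gb(params_70b):.0f} GB")      # ≈ 140 GB
print(f"training, Adam:  {vram_gb(params_70b, 16):.0f} GB")   # ≈ 1120 GB before activations
```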

Chinchilla scaling

Compute-optimal: tokens ≈ 20× params. C = 6PD for total compute.
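
In code, the rule of thumb looks like this (order-of-magnitude guidance, not a guarantee):

```python
def chinchilla_budget(num_params):
    """Compute-optimal token count and total training compute (C = 6PD)."""
    tokens = 20 * num_params          # ~20 tokens per parameter
    flops = 6 * num_params * tokens   # C = 6 × P × D
    return tokens, flops

tokens, flops = chinchilla_budget(70e9)
print(f"~{tokens/1e12:.1f}T tokens, ~{flops:.1e} FLOPs")  # ≈ 1.4T tokens, ≈ 5.9e23 FLOPs
```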

Compare architectures

Use presets to compare Llama, BERT, GPT-2, ViT parameter distributions.

⚖️ Parameters by Layer Type

Component | Formula | Scaling | Typical Share
Embedding | V × d | O(Vd) | 1–5%
MHA | 4d² + 4d | O(d²) | ~33%
LayerNorm | 4d | O(d) | <1%
FFN | 2dm + m + d | O(dm) | ~66%

❓ Frequently Asked Questions

What are trainable parameters?

Weights and biases that are updated during training. Embedding, Linear, LayerNorm, MHA, and FFN layers all have trainable parameters.

Why does FFN dominate parameter count?

FFN has two large linear layers (d→4d and 4d→d). With intermediate=4d, that's 8d² + 5d vs MHA's 4d² + 4d. FFN is typically 2× MHA.
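
A quick numeric check of that ratio (using d = 768 from the example above):

```python
d = 768
ffn = 8 * d**2 + 5 * d  # two linears with intermediate = 4d, plus biases
mha = 4 * d**2 + 4 * d  # Q, K, V, O projections, plus biases
print(ffn / mha)        # ≈ 2.0, so FFN holds roughly two thirds of each block
```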

How does this relate to VRAM?

VRAM ≈ params × bytes/param. FP16=2, FP32=4. 70B FP16 ≈ 140GB. Add optimizer states (2× params in FP32) for training.

What about tied embeddings?

Many LLMs tie input and output embeddings. This calculator counts token embedding only. Tied output would not add extra params.

How accurate are these estimates?

Within ~5% for standard transformers. Actual models may have RoPE, RMSNorm, or SwiGLU variants that slightly change counts.
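
As an example of such a variant, a SwiGLU FFN (used in Llama-style models) has three weight projections and typically no biases, which changes the per-layer FFN count. A hedged sketch (the 14,336 intermediate size is a Llama-3-8B-style value used only for illustration):

```python
def ffn_params_gelu(d, m):
    return 2 * d * m + m + d   # up (d→m) and down (m→d) linears with biases

def ffn_params_swiglu(d, m):
    return 3 * d * m           # gate and up (d→m) plus down (m→d), no biases

d = 4096
print(ffn_params_gelu(d, 4 * d))    # ≈ 134.2M per layer
print(ffn_params_swiglu(d, 14336))  # ≈ 176.2M per layer
```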

What is Chinchilla scaling?

Compute-optimal training uses ~20× params in tokens. C = 6PD where P=params, D=tokens. Undertraining wastes compute.

How to reduce parameters?

Pruning, distillation, LoRA (low-rank adaptation), quantization (reduces memory, not count), and model compression.
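
To see why LoRA is so cheap, each adapted weight matrix only gains two low-rank factors (d×r and r×d). The sketch below uses an assumed 7B-class configuration (32 layers, d = 4096, rank 8, adapting the four attention projections) purely for illustration:

```python
d, rank, layers = 4096, 8, 32
adapted_per_layer = 4                              # e.g. Q, K, V, O projections
lora = layers * adapted_per_layer * 2 * d * rank   # A (d×r) + B (r×d) per adapted matrix
base = 7e9                                         # assumed base model size
print(f"LoRA params: {lora/1e6:.1f}M ({100 * lora / base:.2f}% of base)")  # ≈ 8.4M, ≈ 0.12%
```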

Why count parameters?

Planning VRAM, inference cost, training budget. Parameter count correlates with model capacity and compute requirements.

📊 Parameters by the Numbers

  • FFN share: ~66%
  • MHA share: ~33%
  • FP16: 2 bytes per param
  • Chinchilla: 20× params in tokens

⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual parameter counts depend on implementation (e.g., fused LayerNorm, SwiGLU vs GELU, RoPE). Use model.summary(), flopth, or HuggingFace model cards for production validation. VRAM and training cost estimates require additional factors (precision, optimizer, activation memory).
