Neural Network Parameter Counting
Count total parameters for transformer architectures: Embedding, Multi-Head Attention, LayerNorm, and FFN. From Llama 3 70B to BERT — understand model size and plan VRAM, training cost, and scaling.
Why This ML Metric Matters
Why: Parameter count drives VRAM needs, inference cost, and training budget. FFN dominates (~66%); MHA contributes ~33%. Chinchilla scaling calls for ~20 training tokens per parameter.
How: With V = vocab size, d = hidden dim, m = FFN intermediate size, and L = number of layers: Embedding = V×d. MHA = 4d² + 4d per layer. LayerNorm = 4d per layer. FFN = 2dm + m + d per layer. Total = Embedding + L×(MHA + LN + FFN).
- FFN ~66%, MHA ~33%
- VRAM ≈ 2 bytes/param FP16
- Chinchilla: 20× params in tokens
- LayerNorm <1%
Count Parameters for Embedding, MHA, LayerNorm & FFN
🤖 AI & ML Facts
- Llama 3 70B has ~70B parameters; FFN contributes ~66% of the total (architecture)
- MHA has 4d² params (Q, K, V, O); FFN has ~8d² when intermediate = 4d (Vaswani et al., 2017)
- Chinchilla scaling: train on ~20× params in tokens for compute-optimal results (Hoffmann et al., 2022)
- Quantization (INT8/INT4) reduces memory, not parameter count (best practice)
📋 Key Takeaways
- FFN dominates parameter count (~66%) in standard transformers; MHA ~33%
- Embedding scales with vocab × dim — large vocabularies add significant params
- LayerNorm is negligible (<1%) but essential for training stability
- Chinchilla: compute-optimal tokens ≈ 20× parameters; C = 6PD
- VRAM ≈ 2 bytes/param for FP16, 4 for FP32 — use for memory planning
📖 How It Works
1. Embedding
Token embedding: vocab × dim. One matrix maps token IDs to hidden vectors.
2. Multi-Head Attention (MHA)
Q, K, V, O projections: 4 × (d×d) weights + 4×d biases = 4d² + 4d. The heads are slices of these shared full-width matrices, so the head count does not change the parameter total.
3. LayerNorm
Gamma and beta: 2×d per LayerNorm. Two per block (pre-attn, pre-FFN) = 4d per layer.
4. FFN
Two linear layers: d→intermediate (d×m + m) and intermediate→d (m×d + d). Total: 2dm + m + d.
5. Grand Total
Embedding + numLayers × (MHA + LayerNorm + FFN). Sum all components.
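A minimal sketch of the grand-total formula in Python. The function name and the GPT-2-small-style config are illustrative, not taken from any library:

```python
def transformer_params(V: int, d: int, m: int, L: int) -> int:
    """Estimate total parameters: Embedding + L x (MHA + LayerNorm + FFN)."""
    embedding = V * d              # token embedding matrix
    mha = 4 * d * d + 4 * d        # Q, K, V, O projections plus biases
    layernorm = 4 * d              # two LayerNorms per block (gamma + beta each)
    ffn = 2 * d * m + m + d        # d->m and m->d linear layers plus biases
    return embedding + L * (mha + layernorm + ffn)

# GPT-2-small-style config: 123,651,840, within ~1% of the reported 124M
# (the gap is learned position embeddings and a final LayerNorm).
print(f"{transformer_params(V=50257, d=768, m=3072, L=12):,}")
```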
🎯 Expert Tips
Validate with framework tools
Keras's model.summary() prints exact counts; in PyTorch, sum p.numel() over model.parameters(). HuggingFace model cards also list official totals. Use these for production validation.
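For example, a generic PyTorch helper (the function name is ours; parameters() and numel() are standard torch.nn.Module calls):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for any nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# One encoder layer with d=768, m=3072 gives 7,087,872 params, exactly
# matching the per-layer formula (4d^2 + 4d) + 4d + (2dm + m + d) above.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
print(count_params(layer))
```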
VRAM planning
FP16: 2 bytes/param. 70B model ≈ 140GB. Add optimizer states for training.
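A rough sketch, counting weight memory only; the ~16 bytes/param figure for mixed-precision Adam (FP16 weights and gradients plus FP32 master weights and two moments) is a common rule of thumb, not an exact accounting:

```python
def vram_gb(params: float, bytes_per_param: float = 2) -> float:
    """Weight memory only -- excludes activations, KV cache, and framework overhead."""
    return params * bytes_per_param / 1e9

print(vram_gb(70e9))       # 140.0 GB: FP16 inference weights for a 70B model
print(vram_gb(70e9, 16))   # 1120.0 GB: rough mixed-precision Adam training footprint
```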
Chinchilla scaling
Compute-optimal: tokens ≈ 20× params. C = 6PD for total compute.
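A small sketch of both rules (the function name is illustrative):

```python
def chinchilla(params: float) -> tuple[float, float]:
    """Compute-optimal tokens (~20x params) and total training FLOPs (C = 6PD)."""
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

tokens, flops = chinchilla(70e9)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")  # 1.40e+12 tokens, 5.88e+23 FLOPs
```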
Compare architectures
Use presets to compare Llama, BERT, GPT-2, ViT parameter distributions.
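For instance, reusing the transformer_params() sketch from How It Works (the configs below are standard published values; reported totals run slightly higher because of positional and token-type embeddings):

```python
presets = {
    "BERT-base":   dict(V=30522, d=768, m=3072, L=12),  # reported ~110M
    "GPT-2 small": dict(V=50257, d=768, m=3072, L=12),  # reported ~124M
}
for name, cfg in presets.items():
    print(f"{name}: {transformer_params(**cfg):,}")
```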
⚖️ Parameters by Layer Type
| Component | Formula | Scaling | Typical Share |
|---|---|---|---|
| Embedding | V × d | O(Vd) | 1–5% (large LLMs; much higher in small models) |
| MHA | 4d² + 4d | O(d²) | ~33% |
| LayerNorm | 4d | O(d) | <1% |
| FFN | 2dm + m + d | O(dm) | ~66% |
❓ Frequently Asked Questions
What are trainable parameters?
Weights and biases that are updated during training. Embedding, Linear, LayerNorm, MHA, and FFN layers all have trainable parameters.
Why does FFN dominate parameter count?
FFN has two large linear layers (d→4d and 4d→d). With intermediate=4d, that's 8d² + 5d vs MHA's 4d² + 4d. FFN is typically 2× MHA.
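For a concrete (hypothetical) d = 4096 config:

```python
d = 4096                   # hidden size
mha = 4 * d**2 + 4 * d     # 67,125,248
ffn = 8 * d**2 + 5 * d     # 134,238,208 with intermediate = 4d
print(ffn / mha)           # ~2.0: FFN carries twice MHA's parameters per layer
```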
How does this relate to VRAM?
VRAM ≈ params × bytes/param. FP16=2, FP32=4. 70B FP16 ≈ 140GB. Add optimizer states (2× params in FP32) for training.
What about tied embeddings?
Many LLMs tie input and output embeddings. This calculator counts token embedding only. Tied output would not add extra params.
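A quick way to check tying on a HuggingFace model (assumes the transformers library is installed; GPT-2 is a known tied example):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tied = (model.get_input_embeddings().weight.data_ptr()
        == model.get_output_embeddings().weight.data_ptr())
print(tied)  # True: GPT-2's lm_head shares storage with its token embedding
```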
How accurate are these estimates?
Within ~5% for standard transformers. Real models often diverge slightly: RoPE replaces learned position embeddings (adding no parameters), RMSNorm halves the per-norm count (scale only, no bias), and SwiGLU adds a third FFN matrix (3dm instead of 2dm).
What is Chinchilla scaling?
Compute-optimal training uses ~20× params in tokens. C = 6PD where P=params, D=tokens. Undertraining wastes compute.
How to reduce parameters?
Pruning, distillation, LoRA (low-rank adaptation), quantization (reduces memory, not count), and model compression.
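As a sketch of why LoRA is parameter-efficient (the helper and the rank-8 setup are hypothetical):

```python
def lora_params(d: int, r: int, n_matrices: int) -> int:
    """LoRA adds two low-rank factors (d x r and r x d) per adapted weight matrix."""
    return n_matrices * 2 * d * r

# Rank-8 adapters on Q and V projections across 32 layers of a d=4096 model:
print(lora_params(d=4096, r=8, n_matrices=2 * 32))  # 4,194,304 (~0.06% of 7B)
```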
Why count parameters?
Planning VRAM, inference cost, training budget. Parameter count correlates with model capacity and compute requirements.
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual parameter counts depend on implementation (e.g., fused LayerNorm, SwiGLU vs GELU, RoPE). Use model.summary(), flopth, or HuggingFace model cards for production validation. VRAM and training cost estimates require additional factors (precision, optimizer, activation memory).
Related Calculators
Batch Size & Learning Rate Calculator
Calculate optimal learning rates using linear and square root scaling rules. Visualize warmup and cosine/linear schedules.
Confusion Matrix & Classification Metrics Calculator
Compute Accuracy, Precision, Recall, F1, MCC, Specificity, and ROC-AUC from confusion matrix values.
Compute-Optimal Model Size Calculator (Chinchilla)
Find the compute-optimal model size and training tokens given a compute budget using Chinchilla scaling laws.
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
AI Fairness & Bias Calculator
Calculate demographic parity, equalized odds, equal opportunity, and disparate impact ratio. Based on IBM AIF360 and Microsoft Fairlearn.
Attention Head Configuration Calculator
Configure MHA, MQA, and GQA attention. Calculate head counts, dimensions, KV cache savings, and memory per attention type.