Neural Network Parameter Counting
Count total parameters for transformer architectures: Embedding, Multi-Head Attention, LayerNorm, and FFN. From Llama 3 70B to BERT — understand model size and plan VRAM, training cost, and scaling.
Why This ML Metric Matters
Why: Parameter count drives VRAM needs, inference cost, and training budget. FFN dominates (~66%); MHA contributes ~33%. Chinchilla scaling calls for ~20 training tokens per parameter.
How: With V = vocab size, d = hidden dim, m = FFN intermediate size, and L = number of layers: Embedding = V×d. MHA = 4d² + 4d per layer. LayerNorm = 4d per layer. FFN = 2dm + m + d per layer. Total = Embedding + L×(MHA + LN + FFN).
- FFN ~66%, MHA ~33%
- VRAM ≈ 2 bytes/param FP16
- Chinchilla: 20× params in tokens
- LayerNorm <1%
Count Parameters for Embedding, MHA, LayerNorm & FFN
🤖 AI & ML Facts
- Llama 3 70B has ~70B parameters; FFN contributes ~66% of the total (architecture)
- MHA has 4d² params (Q, K, V, O); FFN has ~8d² when intermediate = 4d (Vaswani et al., 2017)
- Chinchilla scaling: train on ~20× params in tokens for compute-optimal results (Hoffmann et al., 2022)
- Quantization (INT8/INT4) reduces memory, not parameter count (best practice)
📋 Key Takeaways
- FFN dominates parameter count (~66%) in standard transformers; MHA ~33%
- Embedding scales with vocab × dim — large vocabularies add significant params
- LayerNorm is negligible (<1%) but essential for training stability
- Chinchilla: compute-optimal tokens ≈ 20× parameters; C = 6PD
- VRAM ≈ 2 bytes/param for FP16, 4 for FP32 — use for memory planning
📖 How It Works
1. Embedding
Token embedding: vocab × dim. One matrix maps token IDs to hidden vectors.
2. Multi-Head Attention (MHA)
Q, K, V, O projections: 4 × (d×d) weights + 4×d biases = 4d² + 4d. The heads are slices of these shared full-width matrices, so the head count does not change the parameter total.
3. LayerNorm
Gamma and beta: 2×d per LayerNorm. Two per block (pre-attn, pre-FFN) = 4d per layer.
4. FFN
Two linear layers: d→intermediate (d×m + m) and intermediate→d (m×d + d). Total: 2dm + m + d.
5. Grand Total
Embedding + numLayers × (MHA + LayerNorm + FFN). Sum all components.
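A minimal sketch of the grand-total formula in Python. The function name and the GPT-2-small-style config are illustrative, not taken from any library:

```python
def transformer_params(V: int, d: int, m: int, L: int) -> int:
    """Estimate total parameters: Embedding + L x (MHA + LayerNorm + FFN)."""
    embedding = V * d              # token embedding matrix
    mha = 4 * d * d + 4 * d        # Q, K, V, O projections plus biases
    layernorm = 4 * d              # two LayerNorms per block (gamma + beta each)
    ffn = 2 * d * m + m + d        # d->m and m->d linear layers plus biases
    return embedding + L * (mha + layernorm + ffn)

# GPT-2-small-style config: 123,651,840, within ~1% of the reported 124M
# (the gap is learned position embeddings and a final LayerNorm).
print(f"{transformer_params(V=50257, d=768, m=3072, L=12):,}")
```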
🎯 Expert Tips
Validate with framework tools
Keras's model.summary() prints exact counts; in PyTorch, sum p.numel() over model.parameters(). HuggingFace model cards also list official totals. Use these for production validation.
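For example, a generic PyTorch helper (the function name is ours; parameters() and numel() are standard torch.nn.Module calls):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for any nn.Module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# One encoder layer with d=768, m=3072 gives 7,087,872 params, exactly
# matching the per-layer formula (4d^2 + 4d) + 4d + (2dm + m + d) above.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072)
print(count_params(layer))
```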
VRAM planning
FP16: 2 bytes/param. 70B model ≈ 140GB. Add optimizer states for training.
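A rough sketch, counting weight memory only; the ~16 bytes/param figure for mixed-precision Adam (FP16 weights and gradients plus FP32 master weights and two moments) is a common rule of thumb, not an exact accounting:

```python
def vram_gb(params: float, bytes_per_param: float = 2) -> float:
    """Weight memory only -- excludes activations, KV cache, and framework overhead."""
    return params * bytes_per_param / 1e9

print(vram_gb(70e9))       # 140.0 GB: FP16 inference weights for a 70B model
print(vram_gb(70e9, 16))   # 1120.0 GB: rough mixed-precision Adam training footprint
```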
Chinchilla scaling
Compute-optimal: tokens ≈ 20× params. C = 6PD for total compute.
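A small sketch of both rules (the function name is illustrative):

```python
def chinchilla(params: float) -> tuple[float, float]:
    """Compute-optimal tokens (~20x params) and total training FLOPs (C = 6PD)."""
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

tokens, flops = chinchilla(70e9)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")  # 1.40e+12 tokens, 5.88e+23 FLOPs
```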
Compare architectures
Use presets to compare Llama, BERT, GPT-2, ViT parameter distributions.
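For instance, reusing the transformer_params() sketch from How It Works (the configs below are standard published values; reported totals run slightly higher because of positional and token-type embeddings):

```python
presets = {
    "BERT-base":   dict(V=30522, d=768, m=3072, L=12),  # reported ~110M
    "GPT-2 small": dict(V=50257, d=768, m=3072, L=12),  # reported ~124M
}
for name, cfg in presets.items():
    print(f"{name}: {transformer_params(**cfg):,}")
```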
⚖️ Parameters by Layer Type
| Component | Formula | Scaling | Typical Share |
|---|---|---|---|
| Embedding | V × d | O(Vd) | 1–5% (large LLMs; much higher in small models) |
| MHA | 4d² + 4d | O(d²) | ~33% |
| LayerNorm | 4d | O(d) | <1% |
| FFN | 2dm + m + d | O(dm) | ~66% |
❓ Frequently Asked Questions
What are trainable parameters?
Weights and biases that are updated during training. Embedding, Linear, LayerNorm, MHA, and FFN layers all have trainable parameters.
Why does FFN dominate parameter count?
FFN has two large linear layers (d→4d and 4d→d). With intermediate=4d, that's 8d² + 5d vs MHA's 4d² + 4d. FFN is typically 2× MHA.
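For a concrete (hypothetical) d = 4096 config:

```python
d = 4096                   # hidden size
mha = 4 * d**2 + 4 * d     # 67,125,248
ffn = 8 * d**2 + 5 * d     # 134,238,208 with intermediate = 4d
print(ffn / mha)           # ~2.0: FFN carries twice MHA's parameters per layer
```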
How does this relate to VRAM?
VRAM ≈ params × bytes/param. FP16=2, FP32=4. 70B FP16 ≈ 140GB. Add optimizer states (2× params in FP32) for training.
What about tied embeddings?
Many LLMs tie input and output embeddings. This calculator counts token embedding only. Tied output would not add extra params.
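A quick way to check tying on a HuggingFace model (assumes the transformers library is installed; GPT-2 is a known tied example):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
tied = (model.get_input_embeddings().weight.data_ptr()
        == model.get_output_embeddings().weight.data_ptr())
print(tied)  # True: GPT-2's lm_head shares storage with its token embedding
```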
How accurate are these estimates?
Within ~5% for standard transformers. Real models often diverge slightly: RoPE replaces learned position embeddings (adding no parameters), RMSNorm halves the per-norm count (scale only, no bias), and SwiGLU adds a third FFN matrix (3dm instead of 2dm).
What is Chinchilla scaling?
Compute-optimal training uses ~20× params in tokens. C = 6PD where P=params, D=tokens. Undertraining wastes compute.
How to reduce parameters?
Pruning, distillation, LoRA (low-rank adaptation), quantization (reduces memory, not count), and model compression.
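As a sketch of why LoRA is parameter-efficient (the helper and the rank-8 setup are hypothetical):

```python
def lora_params(d: int, r: int, n_matrices: int) -> int:
    """LoRA adds two low-rank factors (d x r and r x d) per adapted weight matrix."""
    return n_matrices * 2 * d * r

# Rank-8 adapters on Q and V projections across 32 layers of a d=4096 model:
print(lora_params(d=4096, r=8, n_matrices=2 * 32))  # 4,194,304 (~0.06% of 7B)
```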
Why count parameters?
Planning VRAM, inference cost, training budget. Parameter count correlates with model capacity and compute requirements.
⚠️ Disclaimer: This calculator provides estimates for educational and planning purposes. Actual parameter counts depend on implementation (e.g., fused LayerNorm, SwiGLU vs GELU, RoPE). Use model.summary(), flopth, or HuggingFace model cards for production validation. VRAM and training cost estimates require additional factors (precision, optimizer, activation memory).
Related Calculators
Batch Size & Learning Rate Calculator
Calculate optimal learning rates using linear and square root scaling rules. Visualize warmup and cosine/linear schedules.
Confusion Matrix & Classification Metrics Calculator
Compute Accuracy, Precision, Recall, F1, MCC, Specificity, and ROC-AUC from confusion matrix values.
Compute-Optimal Model Size Calculator (Chinchilla)
Find the compute-optimal model size and training tokens given a compute budget using Chinchilla scaling laws.
Activation Memory Calculator
Estimate activation memory with and without gradient checkpointing. Based on NVIDIA selective recomputation research.
AI Fairness & Bias Calculator
Calculate demographic parity, equalized odds, equal opportunity, and disparate impact ratio. Based on IBM AIF360 and Microsoft Fairlearn.
Attention Head Configuration Calculator
Configure MHA, MQA, and GQA attention. Calculate head counts, dimensions, KV cache savings, and memory per attention type.