MoE Efficiency
Total vs active parameters for Mixture of Experts models. Sparse activation: scale capacity without proportional compute.
Why This ML Metric Matters
Why: MoE models (Mixtral, DeepSeek, Switch) route each token to a subset of experts. Compute scales with active params; memory holds all.
How: Total = shared + experts × expertSize. Active = shared + topK × expertSize. Efficiency = active/total.
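As a minimal sketch, the formulas above in Python; the function name and the 2B-shared / 8×5B-expert sizes are illustrative, not taken from any real model:

```python
def moe_params(shared, n_experts, expert_size, top_k):
    # Total = shared + N * expert_size; Active = shared + K * expert_size
    total = shared + n_experts * expert_size
    active = shared + top_k * expert_size
    return total, active, active / total

# Hypothetical config: 2B shared params, 8 experts of 5B each, top-2 routing
total, active, eff = moe_params(2e9, 8, 5e9, 2)
print(f"total={total/1e9:.0f}B active={active/1e9:.0f}B efficiency={eff:.1%}")
# → total=42B active=12B efficiency=28.6%
```

Routing overhead (the router network itself, typically under 1% of total) is omitted here.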
Total vs Active Params for MoE Models
Based on Mixtral (Jiang 2024), Switch (Fedus 2022), DeepSeek-V2, Sparsely-Gated MoE (Shazeer 2017).
📋 Quick Examples — Click to Load
Inputs
Total vs Active Params
Expert Utilization
⚠️ For educational and informational purposes only. Verify with a qualified professional.
🤖 AI & ML Facts
Mixtral 8×7B: 46.7B total, 12.9B active per token → ~28% efficiency
— Jiang et al.
Compute ∝ active params only; memory holds all experts
— MoE architecture
DeepSeek-V2: 236B total, 21B active — MLA routing
— DeepSeek-AI
Load balancing loss prevents router collapse (few experts dominating)
— Shazeer 2017
📌 Key Takeaways
- MoE models scale capacity (total params) without proportional compute — only active experts run per token
- Efficiency = activeParams / totalParams — Mixtral ~28%, DeepSeek-V2 ~9%, Switch top-1 ~1.6%
- Memory holds all experts; compute is proportional to active params only
- Load balancing loss prevents expert collapse — critical for training
- Top-K routing: higher K = more compute, better quality; lower K = faster, sparser
- Shared params (embeddings, attention) are always active — keeping them small relative to the experts maximizes the sparsity benefit for a given total
🔍 How It Works
1. Expert Layers
MoE replaces dense FFN with N expert FFNs. Each token is routed to top-K experts by a router network.
2. Total vs Active Params
Total = shared + N × expert_size. Active = shared + K × expert_size. Only active experts compute per token.
3. Memory vs Compute
Memory: all experts loaded. Compute: proportional to active params only. Efficiency = active/total.
4. Load Balancing
Load balance loss encourages uniform expert usage. Without it, experts can collapse (all tokens → few experts).
5. Routing Overhead
Router network + auxiliary losses add small overhead. Typically <1% of total params.
6. Scaling
More experts โ more capacity. Top-K choice trades quality (higher K) vs speed (lower K).
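The routing in steps 1 and 6 can be sketched as a plain-Python top-K softmax router; the function name and the example logits are made up for illustration:

```python
import math

def route_top_k(logits, k):
    """Select the k highest-scoring experts; softmax-normalize their weights."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in idx]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(idx, exps)]

# One token's router logits over 4 experts; top-2 selects experts 1 and 3
print(route_top_k([1.0, 3.0, 0.5, 2.0], k=2))
```

Whether the softmax runs over all N logits (Switch) or only the selected K (Mixtral) varies by implementation; it changes the weights but not the selection.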
🎯 Expert Tips
Balance top-K and experts
Top-2 with 8 experts (Mixtral) balances quality and efficiency. Top-1 (Switch) maximizes sparsity.
Load balance loss is critical
Use a weight of 0.001–0.01. Prevents expert collapse during training.
Minimize shared params
Shared params are always active, so they set a floor on per-token compute. Keeping shared small relative to the experts saves more compute for the same total capacity.
Expert parallelism for scale
Shard experts across GPUs. Each GPU holds subset of experts; routing selects which to run.
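The "load balance loss is critical" tip can be made concrete with the Switch-Transformer-style auxiliary loss, alpha · N · Σᵢ fᵢ·pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and pᵢ is the mean router probability for expert i. The function and variable names below are ours, not from any library:

```python
def load_balance_loss(router_probs, assignments, n_experts, alpha=0.01):
    # f[i]: fraction of tokens dispatched to expert i
    # p[i]: mean router probability assigned to expert i
    n_tokens = len(assignments)
    f = [0.0] * n_experts
    p = [0.0] * n_experts
    for probs, a in zip(router_probs, assignments):
        f[a] += 1.0 / n_tokens
        for i in range(n_experts):
            p[i] += probs[i] / n_tokens
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly uniform routing minimizes the loss at exactly alpha
print(load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2))  # → 0.01
```

Skewed routing (many tokens to one expert) raises the loss, pushing the router back toward uniform usage.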
⚖️ MoE Architecture Comparison
| Model | Experts | Top-K | Total (B) | Active (B) | Efficiency |
|---|---|---|---|---|---|
| Mixtral 8×7B | 8 | 2 | 46.7 | 12.9 | ~28% |
| DeepSeek-V2 | 64 | 6 | 236 | 21 | ~9% |
| Switch-C | 64 | 1 | 1576 | 25 | ~1.6% |
| Grok-1 | 8 | 2 | 314 | ~80 | ~25% |
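The efficiency column is just active/total; recomputing it from the table's own numbers (parameter counts in billions, as listed above):

```python
# (total_B, active_B) pairs from the comparison table
rows = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V2": (236.0, 21.0),
    "Switch-C": (1576.0, 25.0),
    "Grok-1": (314.0, 80.0),
}
for name, (total, active) in rows.items():
    print(f"{name}: {active / total:.1%}")
# → 27.6%, 8.9%, 1.6%, 25.5% — matching the ~ values in the table
```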
❓ Frequently Asked Questions
What is MoE efficiency?
Efficiency = activeParams / totalParams. It measures how much of the model actually computes per token. Higher efficiency means less "wasted" capacity.
Why does memory hold all experts?
At inference, the router selects which experts to run, but all expert weights must be in VRAM. Only the selected experts perform computation.
What is load balancing loss?
Auxiliary loss that encourages uniform expert usage. Without it, routers can collapse: all tokens route to a few experts, wasting capacity.
Top-1 vs top-2 routing?
Top-1 (Switch): maximum sparsity, fastest. Top-2 (Mixtral): better quality, 2× expert compute per token. Top-6 (DeepSeek): higher quality, more compute.
How does MoE compare to dense?
Dense: all params compute every token. MoE: only active params compute. Same quality with fewer FLOPs, but more memory (all experts).
What is routing overhead?
Params in the router network + auxiliary losses. Typically <1% of total. This calculator models it as a percentage of total.
When to use more experts?
More experts → more capacity, better specialization. But routing gets harder and load imbalance increases. 8–64 experts is common for production.
How accurate is this calculator?
Formulas are standard. Real models may have layer-wise variation (experts per layer). Use for architecture comparison and planning.
Shared params impact?
Shared (embeddings, attention) are always active, so they set a floor on active params. A large shared fraction shrinks the sparsity benefit — consider expert-heavy MoE layers.
MoE inference optimization?
Expert parallelism, expert caching, and batch routing. See GPU VRAM and Inference Throughput calculators for deployment sizing.
⚠️ Disclaimer: This calculator provides estimates for educational and architecture planning. Formulas follow standard MoE literature (Mixtral, Switch, DeepSeek). Real models may have layer-wise expert counts, variable expert sizes, and implementation-specific overhead. For production, validate with actual model configs and profiling.