MoE Efficiency
Total vs active parameters for Mixture of Experts models. Sparse activation: scale capacity without proportional compute.
Why This ML Metric Matters
Why: MoE models (Mixtral, DeepSeek, Switch) route each token to a subset of experts. Compute scales with active params; memory holds all.
How: Total = shared + experts × expertSize. Active = shared + topK × expertSize. Efficiency = active/total.
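As a minimal sketch, the formulas above in Python; the function name and the 2B-shared / 8×5B-expert sizes are illustrative, not taken from any real model:

```python
def moe_params(shared, n_experts, expert_size, top_k):
    # Total = shared + N * expert_size; Active = shared + K * expert_size
    total = shared + n_experts * expert_size
    active = shared + top_k * expert_size
    return total, active, active / total

# Hypothetical config: 2B shared params, 8 experts of 5B each, top-2 routing
total, active, eff = moe_params(2e9, 8, 5e9, 2)
print(f"total={total/1e9:.0f}B active={active/1e9:.0f}B efficiency={eff:.1%}")
# → total=42B active=12B efficiency=28.6%
```

Routing overhead (the router network itself, typically under 1% of total) is omitted here.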
Total vs Active Params for MoE Models
Based on Mixtral (Jiang 2024), Switch (Fedus 2022), DeepSeek-V2, Sparsely-Gated MoE (Shazeer 2017).
📋 Quick Examples — Click to Load
Inputs
Total vs Active Params
Expert Utilization
⚠️ For educational and informational purposes only. Verify with a qualified professional.
🤖 AI & ML Facts
Mixtral 8×7B: 46.7B total, 12.9B active per token → ~28% efficiency
— Jiang et al.
Compute ∝ active params only; memory holds all experts
— MoE architecture
DeepSeek-V2: 236B total, 21B active — MLA routing
— DeepSeek-AI
Load balancing loss prevents router collapse (few experts dominating)
— Shazeer 2017
📌 Key Takeaways
- MoE models scale capacity (total params) without proportional compute — only active experts run per token
- Efficiency = activeParams / totalParams — Mixtral ~28%, DeepSeek-V2 ~9%, Switch top-1 ~1.6%
- Memory holds all experts; compute is proportional to active params only
- Load balancing loss prevents expert collapse — critical for training
- Top-K routing: higher K = more compute, better quality; lower K = faster, sparser
- Shared params (embeddings, attention) are always active — keeping them small relative to the experts maximizes the sparsity benefit for a given total
🔍 How It Works
1. Expert Layers
MoE replaces dense FFN with N expert FFNs. Each token is routed to top-K experts by a router network.
2. Total vs Active Params
Total = shared + N × expert_size. Active = shared + K × expert_size. Only active experts compute per token.
3. Memory vs Compute
Memory: all experts loaded. Compute: proportional to active params only. Efficiency = active/total.
4. Load Balancing
Load balance loss encourages uniform expert usage. Without it, experts can collapse (all tokens → few experts).
5. Routing Overhead
Router network + auxiliary losses add small overhead. Typically <1% of total params.
6. Scaling
More experts โ more capacity. Top-K choice trades quality (higher K) vs speed (lower K).
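The routing in steps 1 and 6 can be sketched as a plain-Python top-K softmax router; the function name and the example logits are made up for illustration:

```python
import math

def route_top_k(logits, k):
    """Select the k highest-scoring experts; softmax-normalize their weights."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in idx]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(idx, exps)]

# One token's router logits over 4 experts; top-2 selects experts 1 and 3
print(route_top_k([1.0, 3.0, 0.5, 2.0], k=2))
```

Whether the softmax runs over all N logits (Switch) or only the selected K (Mixtral) varies by implementation; it changes the weights but not the selection.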
🎯 Expert Tips
Balance top-K and experts
Top-2 with 8 experts (Mixtral) balances quality and efficiency. Top-1 (Switch) maximizes sparsity.
Load balance loss is critical
Use a weight of 0.001–0.01. Prevents expert collapse during training.
Minimize shared params
Shared params are always active, so they set a floor on per-token compute. Keeping shared small relative to the experts saves more compute for the same total capacity.
Expert parallelism for scale
Shard experts across GPUs. Each GPU holds subset of experts; routing selects which to run.
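The "load balance loss is critical" tip can be made concrete with the Switch-Transformer-style auxiliary loss, alpha · N · Σᵢ fᵢ·pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and pᵢ is the mean router probability for expert i. The function and variable names below are ours, not from any library:

```python
def load_balance_loss(router_probs, assignments, n_experts, alpha=0.01):
    # f[i]: fraction of tokens dispatched to expert i
    # p[i]: mean router probability assigned to expert i
    n_tokens = len(assignments)
    f = [0.0] * n_experts
    p = [0.0] * n_experts
    for probs, a in zip(router_probs, assignments):
        f[a] += 1.0 / n_tokens
        for i in range(n_experts):
            p[i] += probs[i] / n_tokens
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Perfectly uniform routing minimizes the loss at exactly alpha
print(load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2))  # → 0.01
```

Skewed routing (many tokens to one expert) raises the loss, pushing the router back toward uniform usage.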
⚖️ MoE Architecture Comparison
| Model | Experts | Top-K | Total (B) | Active (B) | Efficiency |
|---|---|---|---|---|---|
| Mixtral 8×7B | 8 | 2 | 46.7 | 12.9 | ~28% |
| DeepSeek-V2 | 64 | 6 | 236 | 21 | ~9% |
| Switch-C | 64 | 1 | 1576 | 25 | ~1.6% |
| Grok-1 | 8 | 2 | 314 | ~80 | ~25% |
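The efficiency column is just active/total; recomputing it from the table's own numbers (parameter counts in billions, as listed above):

```python
# (total_B, active_B) pairs from the comparison table
rows = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V2": (236.0, 21.0),
    "Switch-C": (1576.0, 25.0),
    "Grok-1": (314.0, 80.0),
}
for name, (total, active) in rows.items():
    print(f"{name}: {active / total:.1%}")
# → 27.6%, 8.9%, 1.6%, 25.5% — matching the ~ values in the table
```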
❓ Frequently Asked Questions
What is MoE efficiency?
Efficiency = activeParams / totalParams. It measures how much of the model actually computes per token. Higher efficiency means less "wasted" capacity.
Why does memory hold all experts?
At inference, the router selects which experts to run, but all expert weights must be in VRAM. Only the selected experts perform computation.
What is load balancing loss?
Auxiliary loss that encourages uniform expert usage. Without it, routers can collapse: all tokens route to a few experts, wasting capacity.
Top-1 vs top-2 routing?
Top-1 (Switch): maximum sparsity, fastest. Top-2 (Mixtral): better quality, 2× expert compute per token. Top-6 (DeepSeek): higher quality, more compute.
How does MoE compare to dense?
Dense: all params compute every token. MoE: only active params compute. Same quality with fewer FLOPs, but more memory (all experts).
What is routing overhead?
Params in the router network + auxiliary losses. Typically <1% of total. This calculator models it as a percentage of total.
When to use more experts?
More experts → more capacity, better specialization. But routing gets harder and load imbalance increases. 8–64 experts is common for production.
How accurate is this calculator?
Formulas are standard. Real models may have layer-wise variation (experts per layer). Use for architecture comparison and planning.
Shared params impact?
Shared (embeddings, attention) are always active, so they set a floor on active params. A large shared fraction shrinks the sparsity benefit — consider expert-heavy MoE layers.
MoE inference optimization?
Expert parallelism, expert caching, and batch routing. See GPU VRAM and Inference Throughput calculators for deployment sizing.
⚠️ Disclaimer: This calculator provides estimates for educational and architecture planning. Formulas follow standard MoE literature (Mixtral, Switch, DeepSeek). Real models may have layer-wise expert counts, variable expert sizes, and implementation-specific overhead. For production, validate with actual model configs and profiling.