MoE Model Sparsity

Two definitions of sparsity across Mixture-of-Experts language models.
Definition 1: Active Parameters / Total Parameters  ·  Definition 2: Selected Experts (incl. shared) / Total Experts
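As a concrete reference, here is a minimal Python sketch of the two definitions. The numbers in the example are illustrative (roughly DeepSeek-V3-shaped), not rows taken from the table below.

```python
def parameter_sparsity(active_params: float, total_params: float) -> float:
    """Definition 1: fraction of parameters activated per token (active / total)."""
    return active_params / total_params


def expert_sparsity(top_k: int, shared: int, routed: int) -> float:
    """Definition 2: selected experts (top-k routed + shared) over total experts (routed + shared)."""
    return (top_k + shared) / (routed + shared)


# Illustrative numbers only, not a row from the data table.
print(f"{parameter_sparsity(37e9, 671e9):.1%}")   # ~5.5%
```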


Parameter Sparsity

Active parameters vs total parameters. Diagonal lines show the activation ratio (active/total).

Expert Sparsity

Selected experts (top-k + shared) vs total experts (routed + shared). Diagonal lines show the selection ratio.
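A small illustrative calculation (the counts are made up, not a row from the table) shows why counting the always-on shared expert matters for the selection ratio:

```python
# Illustrative config: 256 routed experts, top-8 routing, 1 shared expert.
top_k, shared, routed = 8, 1, 256

routed_only = top_k / routed                         # 8 / 256 ≈ 3.1% (ignores the always-on shared expert)
incl_shared = (top_k + shared) / (routed + shared)   # 9 / 257 ≈ 3.5% (definition used in the chart)
print(f"{routed_only:.1%} vs {incl_shared:.1%}")     # 3.1% vs 3.5%
```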

Sparsity Evolution

Activation ratio over time. Bubble size reflects total parameter count. Models are trending toward higher sparsity (lower activation ratio).

Model Scale Over Time

Total parameters (bars) and active parameters (dots) for each MoE model, ordered by release date.

Sparsity Distribution

Distribution of activation ratios across all models. Left = more sparse, right = less sparse.

Data

All model configurations. Sources: HuggingFace config.json files and official technical reports.
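For models hosted on the Hub, the expert counts can be read straight from config.json. A minimal sketch of that lookup follows; the key names cover DeepSeek-, Mixtral-, and Qwen-style configs and are examples rather than a universal schema, since field names vary by architecture.

```python
import json
from huggingface_hub import hf_hub_download

def load_expert_config(repo_id: str) -> dict:
    """Pull expert counts from a model's config.json on the Hub."""
    path = hf_hub_download(repo_id=repo_id, filename="config.json")
    with open(path) as f:
        cfg = json.load(f)
    return {
        # Key names differ per architecture; these cover DeepSeek-, Mixtral-, and Qwen-style configs.
        "routed_experts": cfg.get("n_routed_experts") or cfg.get("num_local_experts") or cfg.get("num_experts"),
        "shared_experts": cfg.get("n_shared_experts", 0),
        "experts_per_token": cfg.get("num_experts_per_tok"),
    }

print(load_expert_config("deepseek-ai/DeepSeek-V3"))
# e.g. {'routed_experts': 256, 'shared_experts': 1, 'experts_per_token': 8}
```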

* LongCat Flash/Lite also include zero-computation (identity) experts in the routing pool (256 and 128 respectively), enabling dynamic compute allocation; a toy sketch follows these notes.
* LongCat Flash Lite allocates ~31B of its 68.5B total params to an N-gram embedding table.
* Mistral Large 3 params are LLM-only (673B/39B); the full multimodal model is 675B/41B, including a 2.5B vision encoder.
* GLM-5 shared expert count estimated from GLM family pattern (all prior GLM MoE models use 1 shared expert).
* ERNIE 5 expert architecture not publicly disclosed; only total/active params are known (<3% activation of 2.4T).
* ERNIE 5 date is the announcement date (Baidu World 2025); weights remain closed.
* GLM-5 has not been publicly released yet (expected ~Feb 2026); config estimated from prior GLM models.
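A toy sketch of the zero-computation idea (my own illustration, not the LongCat implementation): identity experts sit in the same routing pool as real FFN experts, so whenever the router assigns one of a token's top-k slots to an identity expert, that slot does no FFN work and the token's compute drops accordingly.

```python
import torch

n_ffn, n_identity, top_k, d = 4, 4, 2, 8
router = torch.nn.Linear(d, n_ffn + n_identity)          # scores FFN and identity experts together
ffn_experts = [torch.nn.Linear(d, d) for _ in range(n_ffn)]

@torch.no_grad()
def moe_layer(x: torch.Tensor) -> torch.Tensor:          # x: (tokens, d)
    weights, idx = router(x).softmax(-1).topk(top_k, dim=-1)
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], idx[t]):
            # Indices >= n_ffn are zero-computation experts: pass the token through unchanged.
            y = x[t] if e >= n_ffn else ffn_experts[e](x[t])
            out[t] += w * y
    return out

print(moe_layer(torch.randn(3, d)).shape)                # torch.Size([3, 8])
```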