Two definitions of sparsity across Mixture-of-Experts language models.
Definition 1: Active Parameters / Total Parameters
Definition 2: Selected Experts (incl. shared) / Total Experts
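As a worked example of how the two definitions diverge, the sketch below applies both to DeepSeek-V3's widely reported configuration (671B total / 37B active parameters; 256 routed experts with top-8 routing plus 1 shared expert). These numbers are used purely to illustrate the arithmetic and are not taken from the charts.

```python
# Worked example: the two sparsity definitions applied to a
# DeepSeek-V3-style configuration (illustrative numbers only).

total_params_b = 671    # total parameters, in billions
active_params_b = 37    # parameters activated per token, in billions

routed_experts = 256    # experts available to the router
shared_experts = 1      # always-on shared experts
top_k = 8               # routed experts selected per token

# Definition 1: Active Parameters / Total Parameters
activation_ratio = active_params_b / total_params_b  # ~0.055

# Definition 2: Selected Experts (incl. shared) / Total Experts
selection_ratio = (top_k + shared_experts) / (routed_experts + shared_experts)
# ~0.035 -- lower than Definition 1, because attention, embeddings, and dense
# layers count toward active parameters but not toward expert selection.

print(f"Definition 1 (param activation): {activation_ratio:.1%}")  # 5.5%
print(f"Definition 2 (expert selection): {selection_ratio:.1%}")   # 3.5%
```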
Active parameters vs total parameters. Diagonal lines show the activation ratio (active/total).
Selected experts (top-k + shared) vs total experts (routed + shared). Diagonal lines show the selection ratio.
All model configurations. Sources: HuggingFace config.json files and official technical reports.
* LongCat Flash and Flash Lite also include zero-computation (identity) experts in the routing pool (256 and 128, respectively), enabling dynamic compute allocation.
* LongCat Flash Lite allocates ~31B of its 68.5B total params to an N-gram embedding table.
* Mistral Large 3 params are LLM-only (673B total / 39B active); the full multimodal model is 675B/41B, including a 2.5B vision encoder.
* GLM-5 shared expert count is estimated from the GLM family pattern (all prior GLM MoE models use 1 shared expert).
* ERNIE 5 expert architecture is not publicly disclosed; only total/active params are known (<3% of its 2.4T total parameters activated).
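For reference, here is a minimal sketch of how the expert counts behind Definition 2 can be read out of a downloaded config.json. It assumes DeepSeek-style field names (n_routed_experts, n_shared_experts, num_experts_per_tok); other model families name these fields differently, so the lookup keys are an assumption to adapt per model. Active/total parameter counts for Definition 1 are generally not in the config and come from the technical reports.

```python
import json

# Sketch: compute the expert-selection ratio (Definition 2) from a config.json.
# Field names follow DeepSeek-style configs and are an assumption -- other
# families use e.g. num_local_experts or num_experts, so adapt the keys.

def expert_selection_ratio(config_path: str) -> float:
    with open(config_path) as f:
        cfg = json.load(f)

    routed = cfg["n_routed_experts"]          # experts in the routing pool
    shared = cfg.get("n_shared_experts", 0)   # always-on experts, if any
    top_k = cfg["num_experts_per_tok"]        # routed experts chosen per token

    # Definition 2: Selected Experts (incl. shared) / Total Experts
    return (top_k + shared) / (routed + shared)

if __name__ == "__main__":
    # Usage: point at a locally downloaded config.json
    print(f"{expert_selection_ratio('config.json'):.1%}")
```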