
Mixture-of-Experts Transformer

A Transformer architecture that improves scaling efficiency by activating only a few selected expert subnetworks, rather than the full model, for each input.

Mixture-of-Experts Transformer architectures increase model capacity without requiring every parameter to be active for each input. A learned routing mechanism (the gating network) decides which expert subnetworks process each incoming token, typically the top-k experts by routing score. Because only those experts run, compute per token grows far more slowly than total parameter count, trading a small routing overhead for a favorable balance between computational cost and model scale. In large-scale systems, this embodies the idea of efficient specialization: experts can focus on different input patterns while sharing the rest of the network.
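The routing idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the gating weights, expert sizes, and top-k choice below are all hypothetical, and the per-token loop is written for clarity rather than speed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: model width, expert hidden width, expert count, experts per token.
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2

W_gate = rng.normal(size=(d_model, n_experts))          # router (gating) weights
W1 = rng.normal(size=(n_experts, d_model, d_ff)) * 0.1  # first layer of each expert
W2 = rng.normal(size=(n_experts, d_ff, d_model)) * 0.1  # second layer of each expert

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x):
    """Route each token to its top_k experts and mix their outputs."""
    probs = softmax(x @ W_gate)                  # (tokens, n_experts) routing scores
    top_idx = np.argsort(probs, axis=-1)[:, -top_k:]  # top_k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = top_idx[t]
        # Renormalize routing weights over the selected experts only.
        weights = probs[t, chosen] / probs[t, chosen].sum()
        for w, e in zip(weights, chosen):
            h = np.maximum(x[t] @ W1[e], 0.0)    # expert feed-forward with ReLU
            out[t] += w * (h @ W2[e])
    return out

tokens = rng.normal(size=(3, d_model))
y = moe_forward(tokens)
print(y.shape)  # (3, 8)
```

Each token runs through only `top_k` of the `n_experts` expert networks, so the per-token cost stays roughly constant as more experts are added, even though total parameter count grows with `n_experts`.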