AI Glossary

Mixture of Experts

A model architecture that routes each input to a subset of specialized sub-networks (experts) for efficient scaling.

Overview

Mixture of Experts (MoE) is a neural network architecture where a routing network (gate) selects a subset of specialized sub-networks (experts) to process each input. Only the selected experts are activated, allowing the total model to have many more parameters while keeping computational cost per input manageable.
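The gating mechanism described above can be sketched in a few lines. This is a minimal toy illustration, not any production implementation; the dimensions, weight matrices, and the `moe_forward` helper are all hypothetical, and each "expert" is reduced to a single linear map for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (illustrative only).
d_model, num_experts, top_k = 4, 8, 2

# Each "expert" would normally be a small feed-forward network;
# here it is reduced to a single weight matrix.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(num_experts)]
gate_w = rng.standard_normal((d_model, num_experts))

def moe_forward(x):
    """Route a single token vector x to its top-k experts."""
    logits = x @ gate_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over experts
    chosen = np.argsort(probs)[-top_k:]            # indices of the top-k experts
    weights = probs[chosen] / probs[chosen].sum()  # renormalize over the chosen ones
    # Only the selected experts run; the remaining ones contribute no compute.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)
```

Note that the output has the same shape as the input: an MoE layer is a drop-in replacement for a dense feed-forward layer, only the internal routing differs.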

Key Details

Modern MoE models like Mixtral, Switch Transformer, and reportedly GPT-4 use sparse MoE layers within transformers, where each token is routed to the top-k experts (typically k=1 or k=2). This enables training models with trillions of parameters at the computational cost of a much smaller dense model. Key challenges include load balancing across experts, training instability, and higher memory requirements despite lower compute.
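The load-balancing challenge mentioned above is commonly addressed with an auxiliary loss that pushes the router toward using all experts evenly; the sketch below follows the general shape of the Switch Transformer's auxiliary loss, with hypothetical batch sizes and randomly generated gate probabilities standing in for real router outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
num_tokens, num_experts = 64, 8

# Hypothetical gate outputs for a batch of tokens (softmax over experts).
logits = rng.standard_normal((num_tokens, num_experts))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

top1 = probs.argmax(axis=-1)                               # top-1 routing decision per token
f = np.bincount(top1, minlength=num_experts) / num_tokens  # fraction of tokens per expert
p = probs.mean(axis=0)                                     # mean gate probability per expert

# Auxiliary loss: equals 1.0 when both f and p are perfectly uniform,
# and grows when the router concentrates tokens on a few experts.
aux_loss = num_experts * np.sum(f * p)
print(aux_loss)
```

Adding a small multiple of this loss to the training objective discourages the degenerate solution where the gate collapses onto one or two favored experts.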

Related Concepts

transformer, model sharding, scaling law


Last updated: March 5, 2026