Model Sharding
Splitting a large model across multiple devices so each device holds only a portion of the parameters.
Overview
Model sharding distributes a large neural network's parameters across multiple GPUs or machines, enabling training and inference of models too large to fit on a single device. Common strategies include tensor parallelism (splitting individual weight matrices across devices), pipeline parallelism (assigning contiguous groups of layers to different devices), and expert parallelism (distributing the experts of a mixture-of-experts layer across devices).
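The tensor-parallel strategy above can be illustrated with a minimal single-machine sketch (NumPy only, no real devices): each simulated "device" holds one column shard of a linear layer's weight matrix, computes a partial output independently, and a concatenation plays the role of the all-gather collective. All names here are illustrative, not any framework's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_devices = 8, 12, 4

x = rng.standard_normal((2, d_in))      # batch of input activations
W = rng.standard_normal((d_in, d_out))  # full weight matrix (never stored whole in practice)

# Column-wise shards: simulated device i holds W[:, i*s:(i+1)*s]
shards = np.split(W, n_devices, axis=1)

# Each device computes its partial output with no communication;
# the column-parallel matmul only synchronizes at the gather step.
partials = [x @ w_shard for w_shard in shards]
y_sharded = np.concatenate(partials, axis=1)  # stands in for all-gather

# The sharded computation matches the unsharded one exactly.
assert np.allclose(y_sharded, x @ W)
```

In a real deployment each shard lives on a different GPU, so the concatenation becomes a cross-device all-gather, which is where the communication overhead discussed below comes from.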
Key Details
Effective sharding requires careful attention to communication overhead, load balancing, and memory distribution. Frameworks such as Megatron-LM, DeepSpeed (ZeRO), PyTorch FSDP (Fully Sharded Data Parallel), and JAX's pjit (since merged into jax.jit with sharding annotations) provide these strategies out of the box. Model sharding is essential for training and serving frontier models with hundreds of billions of parameters.
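To make the memory-distribution point concrete, here is a toy sketch of FSDP/ZeRO-style parameter sharding, again simulated with NumPy on one machine: a flat parameter vector is split evenly across devices so each holds only its shard, and a simulated all-gather reassembles the full parameters just before they are needed for computation. The `all_gather` helper is a hypothetical stand-in for the real collective, not a library call.

```python
import numpy as np

n_devices = 4
full_params = np.arange(20, dtype=np.float64)  # stand-in flat parameter vector

# Shard: each simulated device stores 1/n_devices of the parameters,
# so steady-state parameter memory per device is ~1/n of the total.
shards = np.split(full_params, n_devices)
per_device_bytes = shards[0].nbytes

def all_gather(shards):
    """Simulated collective: reconstruct the full parameter vector on demand."""
    return np.concatenate(shards)

# Before a layer's forward pass, its parameters are gathered, used, and
# then freed again, trading extra communication for lower peak memory.
gathered = all_gather(shards)
assert np.array_equal(gathered, full_params)
assert per_device_bytes * n_devices == full_params.nbytes
```

This is the core trade-off the frameworks automate: per-device memory shrinks roughly linearly with the number of devices, at the cost of gather/scatter communication on every use of the parameters.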
Related Concepts
tensor parallelism • pipeline parallelism • distributed training