Model Parallelism
Distributing a single AI model across multiple GPUs or machines, necessary when a model is too large to fit in the memory of a single device.
Types
Tensor parallelism: Split individual layers across devices (e.g., shard a large matrix multiplication so each device computes a slice of the output).
Pipeline parallelism: Assign different layers (or groups of layers) to different devices, with activations flowing from stage to stage.
Expert parallelism: Route different experts in Mixture-of-Experts (MoE) models to different devices.
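As a rough illustration of the tensor-parallel idea, the sketch below splits one matrix multiplication column-wise across two simulated shards; the dimensions and shard count are made up, and on real hardware each shard would live on its own GPU with an all-gather collective replacing the concatenation.

```python
import torch

# Toy column-parallel linear layer: the weight matrix is split along its
# output dimension, each shard computes its slice of the result, and the
# slices are concatenated. (In a real system each shard lives on its own
# GPU and the concatenation becomes an all-gather.)
torch.manual_seed(0)

batch, d_in, d_out, n_shards = 4, 8, 16, 2
x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)

# Reference: the full, unsharded matmul.
y_full = x @ w

# Tensor-parallel version: shard the weight columns across "devices".
w_shards = torch.chunk(w, n_shards, dim=1)   # one piece per device
y_shards = [x @ w_k for w_k in w_shards]     # each device computes locally
y_parallel = torch.cat(y_shards, dim=1)      # gather the partial outputs

assert torch.allclose(y_full, y_parallel, atol=1e-6)
```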
Challenges
Communication overhead between devices.
Pipeline bubbles: idle time while a stage waits for upstream or downstream stages (see the calculation after this list).
Memory imbalance across devices.
Complex implementation and debugging.
Interaction with data parallelism, with which it is often combined in practice.
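To make the pipeline bubble concrete: with a simple GPipe-style schedule, p stages, and m microbatches per batch, roughly (p - 1) / (m + p - 1) of each device's time is idle. The helper below is a small illustrative calculation under that assumption, not tied to any particular framework.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate idle fraction of a simple (GPipe-style) pipeline schedule.

    With p stages and m microbatches, a batch takes roughly m + p - 1
    "ticks" to flow through the pipeline, of which p - 1 are ramp-up and
    ramp-down bubbles on each device.
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# More microbatches shrink the bubble: 4 stages, 1 vs. 16 microbatches.
print(pipeline_bubble_fraction(4, 1))   # 0.75  -> mostly idle
print(pipeline_bubble_fraction(4, 16))  # ~0.16 -> much better utilization
```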
Frameworks
Megatron-LM (NVIDIA) pioneered tensor parallelism for transformers.
DeepSpeed ZeRO partitions optimizer states, gradients, and parameters across data-parallel ranks.
PyTorch FSDP provides built-in parameter, gradient, and optimizer-state sharding.
These frameworks abstract away much of the distributed-training complexity.
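As one example of how these frameworks are used, here is a minimal PyTorch FSDP sketch. The toy model, the dummy loss, and the one-process-per-GPU setup are illustrative assumptions; real jobs are launched with torchrun and typically add auto-wrap policies, mixed precision, and checkpointing.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # One process per GPU, e.g. `torchrun --nproc_per_node=8 train.py`.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Illustrative toy model; a real transformer would use an auto-wrap
    # policy so each block becomes its own FSDP unit.
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # gathering full parameters only while each wrapped unit computes.
    model = FSDP(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()   # dummy loss just to show one step
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```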