Pipeline Parallelism
A distributed training strategy that splits a model's layers across multiple GPUs, with each GPU processing a different stage of the forward/backward pass.
How It Works
Model layers are divided into sequential stages, one per GPU. The mini-batch is split into micro-batches that flow through the pipeline: while GPU 2 processes micro-batch 1, GPU 1 can already start on micro-batch 2. This keeps most GPUs busy most of the time, though idle "bubbles" remain at the start and end of each batch while the pipeline fills and drains.
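A minimal forward-pass sketch of this idea in PyTorch, assuming two CUDA devices are available; the layer sizes, micro-batch count, and device names are illustrative, and a real implementation would also pipeline the backward pass:

```python
import torch
import torch.nn as nn

# Split a toy model into two stages, one per GPU.
stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:1")

batch = torch.randn(32, 512)    # full mini-batch
micro_batches = batch.chunk(4)  # 4 micro-batches of 8 samples each

outputs = []
inflight = None  # activation traveling from stage 0 to stage 1
for mb in micro_batches:
    # Because CUDA kernels launch asynchronously on each device's stream,
    # stage 0 can start on the next micro-batch while stage 1 is still
    # working on the previous one's activations.
    act = stage0(mb.to("cuda:0"))
    if inflight is not None:
        outputs.append(stage1(inflight))
    inflight = act.to("cuda:1")
outputs.append(stage1(inflight))  # drain the pipeline

result = torch.cat([o.to("cuda:0") for o in outputs])
```

The explicit `inflight` handoff makes the pipeline visible: at any moment each GPU holds a different micro-batch, which is the source of the overlap described above.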
In Practice
Pipeline parallelism is commonly combined with data parallelism and tensor parallelism ("3D parallelism") to train the largest models. DeepSpeed and Megatron-LM implement pipeline parallelism directly; PyTorch's FSDP provides sharded data parallelism that can be composed with pipeline stages.
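A hedged sketch of declaring pipeline stages with DeepSpeed's PipelineModule; the layer sizes, stage count, and loss function here are illustrative, and a real run is launched with the `deepspeed` CLI so that ranks and environment variables are set up:

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()  # set up the process group for the stages

# Express the model as a flat list of layers; DeepSpeed partitions the
# list into `num_stages` consecutive stages, one per group of GPUs.
layers = [
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
]
model = PipelineModule(layers=layers, num_stages=2, loss_fn=nn.MSELoss())

# deepspeed.initialize(...) would then wrap `model` in a pipeline engine
# whose train_batch() call schedules the micro-batches automatically.
```

Handing the engine a flat layer list is what lets the framework choose stage boundaries and micro-batch schedules without changes to the training loop.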