Data Parallelism
A distributed training strategy that replicates the full model on each of multiple GPUs, with each GPU processing a different subset of the training data.
How It Works
Each GPU holds a complete copy of the model. Every mini-batch is split across the GPUs; each GPU computes gradients on its slice, the gradients are averaged across all GPUs (typically via an all-reduce), and every replica then applies the same weight update, keeping the copies in sync. Throughput scales near-linearly with the number of GPUs, up to the communication cost of gradient synchronization.
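The averaging step above can be sketched in pure Python, without real GPUs or a communication library. This is a toy simulation, not PyTorch DDP: the scalar model, data, and learning rate are illustrative assumptions. It checks that averaging per-worker shard gradients reproduces the full-batch gradient, which is why all replicas stay identical after each step.

```python
def grad(w, xs, ys):
    """Mean gradient of squared error 0.5*(w*x - y)^2 over one data shard."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_step(w, batch_x, batch_y, num_workers, lr=0.1):
    """One simulated data-parallel update: shard the batch, average gradients."""
    shard = len(batch_x) // num_workers
    # Each "worker" computes a gradient on its slice of the mini-batch.
    grads = [
        grad(w, batch_x[i*shard:(i+1)*shard], batch_y[i*shard:(i+1)*shard])
        for i in range(num_workers)
    ]
    # All-reduce stand-in: average the gradients, then update every replica.
    g = sum(grads) / num_workers
    return w - lr * g

batch_x = [1.0, 2.0, 3.0, 4.0]
batch_y = [2.0, 4.0, 6.0, 8.0]  # generated with true weight 2.0

w_parallel = data_parallel_step(1.0, batch_x, batch_y, num_workers=2)
w_single = 1.0 - 0.1 * grad(1.0, batch_x, batch_y)
print(abs(w_parallel - w_single) < 1e-12)  # equal-size shards: mean of shard means == full-batch mean
```

Note the equivalence holds exactly here because the shards are equal-sized; frameworks handle uneven last batches separately.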
FSDP
Fully Sharded Data Parallelism (PyTorch FSDP) shards model parameters, gradients, and optimizer states across GPUs, dramatically reducing per-GPU memory. Full parameters are all-gathered on demand for each layer's forward and backward pass and freed afterward, combining the throughput of data parallelism with the memory savings of model parallelism.