Modern AI models are too large and training datasets too vast for a single GPU. GPT-4 was trained on thousands of GPUs over months. Even fine-tuning a 70-billion parameter model requires distributing the workload across multiple devices. Distributed training is the collection of techniques that makes this possible, enabling deep learning to scale from a single GPU to thousands working in concert.
Why Distribute Training
There are two fundamental reasons to distribute training. First, speed: a model that takes a week to train on one GPU might train in a day on eight GPUs. Second, capacity: models with billions of parameters simply do not fit in the memory of a single GPU. Distributed training addresses both constraints through different parallelism strategies.
Data Parallelism
Data parallelism is the simplest and most common distributed training strategy. Each GPU holds a complete copy of the model. The training batch is split across GPUs, each processes its shard independently, and then gradients are averaged across all GPUs before updating the model weights. This effectively multiplies your batch size by the number of GPUs.
How It Works
- Each GPU receives a copy of the model weights
- A large batch of data is split into mini-batches, one per GPU
- Each GPU computes the forward pass and loss on its mini-batch
- Each GPU computes gradients through backpropagation
- Gradients are synchronized across all GPUs using an all-reduce operation
- Each GPU applies the averaged gradients to update its model copy
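The six steps above can be sketched as a toy, single-process simulation in plain Python: the "devices" are just entries in a list, and the all-reduce is an ordinary average. (This is an illustration of the algorithm, not how DDP is actually implemented.)

```python
# Toy simulation of data-parallel SGD on the quadratic loss
# L(w) = (w*x - y)^2. No real GPUs: "devices" are list entries.

def grad(w, x, y):
    """Gradient of (w*x - y)^2 with respect to w."""
    return 2 * (w * x - y) * x

def data_parallel_step(w, batch, num_devices, lr=0.1):
    # Steps 1-2: every device starts from the same weights; split the batch.
    shard_size = len(batch) // num_devices
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_devices)]

    # Steps 3-4: each device computes gradients on its own shard.
    local_grads = [sum(grad(w, x, y) for x, y in shard) / len(shard)
                   for shard in shards]

    # Step 5: all-reduce, i.e. average gradients across devices.
    avg_grad = sum(local_grads) / num_devices

    # Step 6: every device applies the same averaged update.
    return w - lr * avg_grad

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # data from y = 2x
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, batch, num_devices=2)
print(round(w, 3))  # converges toward the true slope 2.0
```

Because every device applies the identical averaged gradient, all model copies stay bit-for-bit in sync after every step, which is the invariant DDP maintains.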
PyTorch's DistributedDataParallel (DDP) implements this pattern efficiently. DDP overlaps gradient communication with backward computation, hiding much of the communication overhead. For most models that fit on a single GPU, DDP is all you need for multi-GPU training.
"Data parallelism is the bread and butter of distributed training. If your model fits on one GPU, start with DDP. It is simple to implement, scales nearly linearly, and covers the vast majority of use cases."
Model Parallelism
When a model is too large to fit on a single GPU, you must split it across devices. Model parallelism divides the model's layers or tensors across GPUs so that each device holds only a fraction of the parameters.
Tensor Parallelism
Tensor parallelism splits individual layers across GPUs. For example, a large matrix multiplication can be partitioned so that each GPU computes a portion of the output and the results are combined. This requires tight synchronization and high-bandwidth interconnects (like NVLink) because GPUs must exchange intermediate activations at every layer. Megatron-LM from NVIDIA is the standard framework for tensor parallelism in transformer models.
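The column-split described above can be illustrated in plain Python, with each "device" holding a slice of the weight matrix and an all-gather (here, list concatenation) recombining the partial outputs. This sketches the partitioning scheme only, not Megatron-LM's actual implementation.

```python
# Toy column-parallel linear layer y = x @ W: W is split column-wise
# across two "devices"; each computes its slice of the output.

def matmul(x, W):
    """x: length-k vector, W: k x n matrix as nested lists."""
    n = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n)]

def split_columns(W, num_devices):
    """Partition W column-wise into one slice per device."""
    per = len(W[0]) // num_devices
    return [[row[d * per:(d + 1) * per] for row in W]
            for d in range(num_devices)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each device multiplies x by its own column slice.
partials = [matmul(x, W_shard) for W_shard in split_columns(W, 2)]

# All-gather: concatenating the partial outputs recovers the full result.
y_parallel = [v for p in partials for v in p]
print(y_parallel)                       # [11.0, 14.0, 17.0, 20.0]
print(y_parallel == matmul(x, W))       # True: matches the unsharded layer
```

In a real transformer, this all-gather of activations happens at every sharded layer, which is why tensor parallelism demands the high-bandwidth interconnects mentioned above.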
Pipeline Parallelism
Pipeline parallelism assigns different layers to different GPUs, creating a pipeline where data flows through devices sequentially. The challenge is that naive pipeline execution wastes GPU time while devices wait for their input. Micro-batching and pipeline scheduling (GPipe's fill-drain approach or PipeDream's 1F1B schedule) mitigate this pipeline bubble overhead.
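The bubble overhead has a simple closed form for a GPipe-style fill-drain schedule: with p pipeline stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). A quick calculation shows why micro-batching matters:

```python
# Pipeline bubble fraction for a GPipe-style fill-drain schedule:
# with p stages and m micro-batches, devices sit idle for a
# (p - 1) / (m + p - 1) fraction of the step.

def bubble_fraction(num_stages, num_microbatches):
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

# A single batch through 4 stages leaves the pipeline mostly idle.
print(bubble_fraction(4, 1))   # 0.75
# Splitting into 16 micro-batches shrinks the bubble substantially.
print(round(bubble_fraction(4, 16), 3))  # 0.158
```

Increasing m shrinks the bubble but also shrinks each micro-batch, so in practice m is tuned against per-micro-batch efficiency; schedules like 1F1B additionally reduce the activation memory held during the fill phase.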
Key Takeaway
Data parallelism scales by replicating the model. Model parallelism scales by splitting the model. Most large-scale training combines both: tensor parallelism within a node (leveraging NVLink), pipeline parallelism across nodes, and data parallelism across groups of nodes.
Fully Sharded Data Parallel (FSDP)
PyTorch's FSDP represents a middle ground between pure data parallelism and model parallelism. Instead of each GPU holding a complete model copy, FSDP shards the model parameters, gradients, and optimizer states across GPUs. Each GPU holds only a fraction of the model's memory footprint but reconstructs the full parameters on the fly when needed for computation.
FSDP is inspired by Microsoft's ZeRO (Zero Redundancy Optimizer) stages. At its most aggressive setting, FSDP can reduce per-GPU memory usage to roughly 1/N of the full model (where N is the number of GPUs), enabling training of much larger models without the complexity of manual model parallelism.
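The memory savings are easy to estimate with the accounting used in the ZeRO paper: under mixed-precision Adam, model states cost roughly 16 bytes per parameter (2 bytes fp16 weights + 2 bytes fp16 gradients + 12 bytes fp32 master weights, momentum, and variance), and full sharding divides that by the number of GPUs. A back-of-the-envelope sketch (activations and buffers excluded):

```python
# Rough per-GPU memory for model states under full sharding (ZeRO-3 /
# FSDP), using the ZeRO paper's mixed-precision Adam accounting of
# ~16 bytes per parameter. Activations and temp buffers not included.

def model_state_gb(num_params, num_gpus, bytes_per_param=16):
    return num_params * bytes_per_param / num_gpus / 1e9

params = 7e9  # a 7B-parameter model
print(round(model_state_gb(params, 1), 1))  # 112.0 GB: no single GPU fits this
print(round(model_state_gb(params, 8), 1))  # 14.0 GB per GPU across 8 GPUs
```

This is why a model whose unsharded optimizer states dwarf any single GPU's memory can still be trained with FSDP on a modest cluster.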
DeepSpeed and Megatron
DeepSpeed
Microsoft's DeepSpeed library provides a comprehensive suite of distributed training optimizations. Its ZeRO optimizer comes in three stages of increasing memory efficiency. DeepSpeed also offers ZeRO-Offload (offloading optimizer states to CPU memory), ZeRO-Infinity (extending offloading to NVMe storage), and inference optimizations for serving large models.
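DeepSpeed is driven by a JSON configuration file. As a rough illustration, a config enabling ZeRO stage 3 with optimizer offload to CPU might look like the sketch below; the field names follow DeepSpeed's config schema, but treat this as an outline and consult the DeepSpeed documentation for a working setup.

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" }
  }
}
```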
Megatron-LM
NVIDIA's Megatron-LM specializes in training very large transformer models. It combines tensor parallelism, pipeline parallelism, and data parallelism in a 3D parallelism approach that has been used to train models with hundreds of billions of parameters. Megatron-DeepSpeed combines the best of both libraries.
Communication Patterns
The efficiency of distributed training depends heavily on communication overhead. The key collective operations are:
- All-Reduce: Each device has a tensor; the operation returns the sum (or average) of all tensors to every device. Used for gradient synchronization in data parallelism
- All-Gather: Each device has a shard; the operation returns the concatenated full tensor to every device. Used in FSDP for parameter reconstruction
- Reduce-Scatter: Each device has a tensor; the operation reduces (sums) and distributes shards. Used in FSDP for gradient reduction
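The input/output contracts of these three collectives can be illustrated in a single-process sketch, modeling each "device" as one entry in a list. Real implementations such as NCCL use bandwidth-optimal ring or tree algorithms; only the semantics are shown here.

```python
# Single-process illustration of the three collectives. Each "device"
# is one entry in a list; only the input/output contract is modeled.

def all_reduce(tensors):
    """Every device ends up with the elementwise sum of all tensors."""
    total = [sum(vals) for vals in zip(*tensors)]
    return [list(total) for _ in tensors]

def all_gather(shards):
    """Every device ends up with the concatenation of all shards."""
    full = [v for shard in shards for v in shard]
    return [list(full) for _ in shards]

def reduce_scatter(tensors):
    """Sum across devices; each device keeps one shard of the result."""
    total = [sum(vals) for vals in zip(*tensors)]
    per = len(total) // len(tensors)
    return [total[i * per:(i + 1) * per] for i in range(len(tensors))]

devices = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(all_reduce(devices))           # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(devices))       # [[11, 22], [33, 44]]
print(all_gather([[1, 2], [3, 4]]))  # [[1, 2, 3, 4], [1, 2, 3, 4]]
```

Note that an all-reduce is equivalent to a reduce-scatter followed by an all-gather, which is exactly the decomposition FSDP exploits to shard gradients and parameters.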
NVIDIA's NCCL library provides highly optimized implementations of these operations for GPU clusters. Network topology matters enormously: NVLink offers up to 900 GB/s between GPUs on the same node (fourth-generation NVLink on H100-class hardware), while InfiniBand typically provides 200-400 Gb/s between nodes; note the units, gigabytes versus gigabits, which makes the intra-node link more than an order of magnitude faster. This is why training systems typically use tensor parallelism within nodes and data or pipeline parallelism across nodes.
Practical Considerations
Learning Rate Scaling
When scaling batch size with data parallelism, the learning rate typically needs to increase proportionally. The linear scaling rule multiplies the base learning rate by the number of GPUs. However, this can destabilize training at very large batch sizes, so techniques like learning rate warmup (gradually increasing the learning rate over the first few hundred steps) are essential.
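Combining the linear scaling rule with warmup might look like the following sketch; the base learning rate, warmup length, and schedule shape are illustrative choices, not prescribed values.

```python
# Linear scaling rule with warmup (illustrative values): the base
# learning rate is multiplied by the number of GPUs, then ramped up
# linearly over the first `warmup_steps` steps to avoid instability.

def scaled_lr(base_lr, num_gpus, step, warmup_steps=500):
    target = base_lr * num_gpus  # linear scaling rule
    if step < warmup_steps:
        # Ramp linearly from near zero up to the scaled target.
        return target * (step + 1) / warmup_steps
    return target

# Base LR of 1e-3 on 8 GPUs: the scaled target is 8e-3.
print(scaled_lr(1e-3, 8, step=0))     # tiny LR on the first step
print(scaled_lr(1e-3, 8, step=499))   # reaches 0.008 at end of warmup
print(scaled_lr(1e-3, 8, step=1000))  # holds 0.008 thereafter
```

In practice the post-warmup phase is usually followed by a decay schedule (cosine or linear), omitted here for brevity.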
Checkpointing
Large distributed training runs are expensive and vulnerable to hardware failures. Regular checkpointing saves model weights, optimizer states, and training metadata so that training can resume from the last checkpoint rather than starting over. Frameworks like DeepSpeed provide efficient distributed checkpointing that saves and loads sharded states in parallel.
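A framework-agnostic sketch of the save/resume cycle is below, using only the standard library; a real run would save sharded tensors per rank (e.g. via `torch.save` or DeepSpeed's checkpoint API), and the file layout here is purely illustrative.

```python
# Checkpointing sketch with the standard library. The "state" is a
# plain dict standing in for sharded weights and optimizer tensors.

import json
import os
import tempfile

def save_checkpoint(path, step, weights, optimizer_state):
    state = {"step": step, "weights": weights,
             "optimizer_state": optimizer_state}
    # Write to a temp file first, then atomically rename, so a crash
    # mid-write never leaves a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(path, step=1000, weights=[0.1, 0.2],
                optimizer_state={"momentum": [0.0, 0.0]})
resumed = load_checkpoint(path)
print(resumed["step"])  # training resumes from step 1000, not step 0
```

The write-then-rename pattern matters at scale: a node failure during checkpointing must not destroy the only valid checkpoint.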
Debugging Distributed Training
Debugging distributed code is inherently difficult. Common issues include hanging processes (usually caused by communication deadlocks), inconsistent results (from incorrect gradient synchronization), and memory imbalances (from uneven model partitioning). Start by verifying your setup works with 2 GPUs before scaling to larger clusters.
Distributed training is a deep field that continues to evolve as models grow larger and hardware advances. The fundamental principles of data parallelism, model parallelism, and efficient communication remain constant even as the specific tools and techniques improve.
Key Takeaway
Start with DDP for models that fit on one GPU. Use FSDP or DeepSpeed ZeRO when models approach GPU memory limits. For the largest models, combine tensor, pipeline, and data parallelism. Always validate your distributed setup at small scale before investing in large training runs.
