Tensor Parallelism
A distributed strategy that splits individual layers of a model across multiple GPUs, enabling training and serving of models too large to fit on a single device.
How It Works
Large matrix multiplications in attention and feedforward layers are partitioned across GPUs: weight matrices are split along rows or columns, each GPU computes a partial result on its shard, and the partials are combined via all-reduce communication. Because this communication happens in every layer, tensor parallelism requires fast inter-GPU connections (e.g., NVLink) and is typically confined to GPUs within a single node.
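The sharding pattern above can be sketched numerically. The following is a minimal single-process simulation (using NumPy arrays in place of per-GPU tensors, with a plain sum standing in for the all-reduce) of the Megatron-style pairing of a column-parallel layer followed by a row-parallel layer; the variable names and the two-GPU setup are illustrative assumptions, not any library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 2                            # simulated device count (assumption)
x = rng.standard_normal((4, 8))       # input activations, replicated on every GPU
A = rng.standard_normal((8, 16))      # first weight matrix (e.g., feedforward up-projection)
B = rng.standard_normal((16, 8))      # second weight matrix (e.g., down-projection)

# Column-parallel: each GPU holds a slice of A's columns.
A_shards = np.split(A, n_gpus, axis=1)
# Row-parallel: each GPU holds the matching slice of B's rows.
B_shards = np.split(B, n_gpus, axis=0)

# Each GPU computes its partial output independently, with no
# communication between the two matmuls.
partials = [(x @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]

# All-reduce: sum the partial results across GPUs (here, a plain sum).
z = sum(partials)

# The sharded computation matches the unsharded one.
assert np.allclose(z, (x @ A) @ B)
```

Pairing the column split with a row split is the key design choice: the intermediate activations stay sharded between the two matmuls, so only one all-reduce is needed per layer pair.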
Usage
Essential for training and serving the largest models (70B+ parameters). It is commonly combined with pipeline parallelism and data parallelism in 3D parallelism strategies. Megatron-LM pioneered tensor parallelism for transformers.