AI Glossary

Distributed Training

Training AI models across multiple GPUs or machines simultaneously, essential for large models that exceed the memory and compute capacity of a single device.

Parallelism Strategies

Data parallelism: each GPU holds a full replica of the model and trains on a different batch of data; gradients are synchronized (typically averaged) across replicas after each backward pass.
Tensor parallelism: individual layers, such as large weight matrices, are split across GPUs that cooperate to compute each layer.
Pipeline parallelism: consecutive groups of layers are placed on different GPUs, and micro-batches flow through the stages like an assembly line.
Expert parallelism: in mixture-of-experts models, the experts are distributed across GPUs and tokens are routed to the devices holding their assigned experts.
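To make the data-parallel case concrete, here is a minimal sketch using PyTorch's DistributedDataParallel wrapper. The model, loss, batch size, and launch method are illustrative assumptions, not taken from any particular training setup.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Assumes launch via `torchrun --nproc_per_node=<gpus> train.py`;
# the model and data below are placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")  # one process per GPU
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    model = torch.nn.Linear(512, 512).to(device)   # placeholder model
    model = DDP(model, device_ids=[device.index])  # adds gradient all-reduce
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(32, 512, device=device)  # each rank sees its own batch
        loss = model(x).pow(2).mean()             # placeholder loss
        opt.zero_grad()
        loss.backward()  # DDP averages gradients across ranks during backward
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```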

Frameworks

PyTorch FSDP (Fully Sharded Data Parallel), DeepSpeed (Microsoft), Megatron-LM (NVIDIA), and JAX/XLA (Google) are the most widely used options. These frameworks handle the complexities of inter-device communication, gradient synchronization, and memory management.
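As an illustration of how such a framework is typically invoked, here is a hedged sketch of wrapping a model with PyTorch FSDP. The exact API surface varies by PyTorch version (newer releases also offer a fully_shard variant), and the model below is a placeholder.

```python
# Sketch: sharding a placeholder model with PyTorch FSDP.
# Assumes launch via torchrun with one process per GPU.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(       # placeholder model
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks,
# gathering full parameters only for the layers currently being computed,
# so per-GPU memory scales down with cluster size.
model = FSDP(model)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
```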

Challenges

Communication overhead between devices, which grows with cluster size.
Maintaining training stability at scale, where loss spikes and numerical issues become more common.
Hardware failures, which become routine in large clusters and require checkpointing and restart strategies.
Efficient utilization of expensive GPU resources.
Debugging, which is significantly harder for distributed systems than for single-device training.
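To give a rough sense of the communication-overhead point, here is a back-of-envelope estimate of per-step gradient all-reduce traffic under a ring all-reduce. The parameter count, precision, cluster size, and bandwidth figures are all illustrative assumptions, not measurements.

```python
# Back-of-envelope: ring all-reduce traffic per GPU per training step.
# All numbers below are illustrative assumptions.
params = 7e9           # 7B-parameter model (assumed)
bytes_per_grad = 2     # fp16/bf16 gradients
n_gpus = 64            # cluster size (assumed)
bandwidth = 100e9      # 100 GB/s effective interconnect (assumed)

# A ring all-reduce sends roughly 2*(N-1)/N of the gradient bytes per GPU.
traffic = 2 * (n_gpus - 1) / n_gpus * params * bytes_per_grad
print(f"~{traffic / 1e9:.1f} GB per GPU per step, "
      f"~{traffic / bandwidth * 1e3:.0f} ms at the assumed bandwidth")
```

Under these assumptions the synchronization alone moves tens of gigabytes per GPU per step, which is why overlapping communication with computation and using gradient compression or sharding are common optimizations.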


Last updated: March 5, 2026