AI Glossary

Data Parallelism

A distributed training strategy that replicates the model on multiple GPUs, with each GPU processing a different subset of the training data.

How It Works

Each GPU holds a complete copy of the model. A mini-batch is split across the GPUs, each GPU computes gradients on its portion, and the gradients are then averaged across all GPUs (typically via an all-reduce) before the weights are updated. Because every replica applies the same averaged gradient, the replicas stay in sync. Throughput scales nearly linearly with the number of GPUs, minus the communication cost of the gradient averaging step.
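The mechanics above can be sketched in a single process with NumPy, simulating the workers as plain Python objects (a toy illustration, not a real multi-GPU setup; the model, data, and worker count are made up for the example):

```python
import numpy as np

def grad(w, X, y):
    # Gradient of mean-squared-error loss 0.5 * mean((Xw - y)^2) w.r.t. w.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full mini-batch of 8 examples
y = rng.normal(size=8)
w = np.zeros(3)               # model weights, replicated on every "worker"

# Split the mini-batch across 4 simulated workers; each computes a
# gradient on its own shard using the same replicated weights.
shards = zip(np.array_split(X, 4), np.array_split(y, 4))
local_grads = [grad(w, Xs, ys) for Xs, ys in shards]

# "All-reduce": average the per-worker gradients, then apply the same
# update on every replica so they stay identical.
avg_grad = np.mean(local_grads, axis=0)
w -= 0.1 * avg_grad

# With equal shard sizes, the averaged shard gradients equal the
# full-batch gradient exactly.
assert np.allclose(avg_grad, grad(np.zeros(3), X, y))
```

The final assertion shows why the scheme is correct: averaging equally sized shard gradients reproduces the gradient that a single worker would have computed on the whole mini-batch.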

FSDP

Fully Sharded Data Parallelism (PyTorch FSDP) shards model parameters, gradients, and optimizer states across GPUs, dramatically reducing per-GPU memory. It combines data parallelism efficiency with the memory savings of model parallelism.
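The core idea can be illustrated with a toy single-process sketch (this is not the PyTorch FSDP API; the parameter count, worker count, and per-worker gradients are invented for illustration): each worker permanently stores only its shard of the flat parameter vector, the full parameters exist only transiently during compute, and gradients are reduce-scattered so each worker updates only the parameters it owns.

```python
import numpy as np

n_workers = 4
full_params = np.arange(8, dtype=float)          # 8 parameters total
shards = np.array_split(full_params, n_workers)  # each worker keeps 2

# Forward/backward pass: all-gather the shards to materialize the full
# parameter vector just long enough to run the computation.
gathered = np.concatenate(shards)
assert np.array_equal(gathered, full_params)

# Each worker produces a full-length gradient from its own data (made-up
# constant gradients here), then reduce-scatter: average across workers
# and keep only the locally owned shard of the result.
local_grads = [np.full(8, float(rank)) for rank in range(n_workers)]
avg_grad = np.mean(local_grads, axis=0)            # the "reduce"
grad_shards = np.array_split(avg_grad, n_workers)  # the "scatter"

# Sharded update: each worker touches only the parameters (and, in a real
# system, the optimizer state) for its own shard.
lr = 0.1
shards = [p - lr * g for p, g in zip(shards, grad_shards)]
```

The memory saving comes from the steady state: outside the brief all-gather, each worker stores only 1/N of the parameters, gradients, and optimizer state, rather than a full replica as in plain data parallelism.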

Last updated: March 5, 2026