Data Parallelism
A distributed training strategy that replicates the full model on each of multiple GPUs, with each GPU processing a different subset of the training data.
How It Works
Each GPU holds a complete copy of the model. Every mini-batch is split across the GPUs; each GPU computes gradients on its slice, the gradients are averaged across all GPUs (typically via an all-reduce), and every replica then applies the same weight update, keeping the copies in sync. Throughput scales near-linearly with the number of GPUs, up to the communication cost of gradient synchronization.
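The averaging step above can be sketched in pure Python, without real GPUs or a communication library. This is a toy simulation, not PyTorch DDP: the scalar model, data, and learning rate are illustrative assumptions. It checks that averaging per-worker shard gradients reproduces the full-batch gradient, which is why all replicas stay identical after each step.

```python
def grad(w, xs, ys):
    """Mean gradient of squared error 0.5*(w*x - y)^2 over one data shard."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_step(w, batch_x, batch_y, num_workers, lr=0.1):
    """One simulated data-parallel update: shard the batch, average gradients."""
    shard = len(batch_x) // num_workers
    # Each "worker" computes a gradient on its slice of the mini-batch.
    grads = [
        grad(w, batch_x[i*shard:(i+1)*shard], batch_y[i*shard:(i+1)*shard])
        for i in range(num_workers)
    ]
    # All-reduce stand-in: average the gradients, then update every replica.
    g = sum(grads) / num_workers
    return w - lr * g

batch_x = [1.0, 2.0, 3.0, 4.0]
batch_y = [2.0, 4.0, 6.0, 8.0]  # generated with true weight 2.0

w_parallel = data_parallel_step(1.0, batch_x, batch_y, num_workers=2)
w_single = 1.0 - 0.1 * grad(1.0, batch_x, batch_y)
print(abs(w_parallel - w_single) < 1e-12)  # equal-size shards: mean of shard means == full-batch mean
```

Note the equivalence holds exactly here because the shards are equal-sized; frameworks handle uneven last batches separately.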
FSDP
Fully Sharded Data Parallelism (PyTorch FSDP) shards model parameters, gradients, and optimizer states across GPUs, dramatically reducing per-GPU memory. Full parameters are all-gathered on demand for each layer's forward and backward pass and freed afterward, combining the throughput of data parallelism with the memory savings of model parallelism.