AI Glossary

GPU Cluster

A computing infrastructure of many interconnected GPUs used to train large AI models; such clusters are the essential hardware for frontier AI development.

Scale

Frontier model training uses clusters of thousands to tens of thousands of GPUs. Meta's training infrastructure exceeds 600,000 GPUs. A single H100 costs roughly $30-40K, so a 10,000-GPU cluster costs $300-400M+ before networking and operations.
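The cost arithmetic above can be sketched as a quick back-of-the-envelope calculation. The per-GPU price range is the illustrative $30-40K figure from this entry, not a vendor quote:

```python
# Per-GPU price range (USD) -- illustrative H100 figures, an assumption
GPU_PRICE_LOW = 30_000
GPU_PRICE_HIGH = 40_000

def cluster_gpu_cost(num_gpus: int) -> tuple[int, int]:
    """Return (low, high) USD cost for the GPUs alone,
    excluding networking, storage, power, and operations."""
    return num_gpus * GPU_PRICE_LOW, num_gpus * GPU_PRICE_HIGH

low, high = cluster_gpu_cost(10_000)
print(f"${low / 1e6:.0f}M-${high / 1e6:.0f}M")  # prints "$300M-$400M"
```

Note this covers only the GPUs themselves; networking, storage, cooling, and operations add substantially to the total.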

Key Components

GPU nodes (typically 8 GPUs per node), high-bandwidth interconnects (InfiniBand, NVLink), high-speed storage (parallel file systems), cooling infrastructure, and cluster management software.
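Given the typical 8-GPUs-per-node layout above, the node count for a cluster of a given size follows directly. A minimal sketch, assuming 8 GPUs per node:

```python
import math

GPUS_PER_NODE = 8  # typical node size, per the component list above

def nodes_needed(num_gpus: int) -> int:
    """Number of 8-GPU nodes required to host num_gpus GPUs."""
    return math.ceil(num_gpus / GPUS_PER_NODE)

print(nodes_needed(10_000))  # prints 1250
```

So a 10,000-GPU cluster corresponds to on the order of 1,250 nodes, each of which must be wired into the high-bandwidth interconnect fabric.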


Last updated: March 5, 2026