GPU Cluster
A computing infrastructure of many interconnected GPUs used to train large AI models; the essential hardware for frontier AI development.
Scale
Frontier model training uses clusters of thousands to tens of thousands of GPUs; Meta's training infrastructure exceeds 600,000 GPUs. A single H100 costs roughly $30-40K, so a 10,000-GPU cluster costs $300-400M+ before networking and operations.
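The cost figures above can be reproduced with a back-of-envelope calculation. This is an illustrative sketch: the per-GPU price range comes from the text, and the function name and interface are assumptions for the example, not a standard tool.

```python
# Back-of-envelope GPU-only cost estimate for a cluster, using the
# illustrative $30-40K per-H100 price range quoted above. Excludes
# networking, storage, cooling, and operations.

def cluster_gpu_cost(num_gpus: int,
                     price_low: float = 30_000,
                     price_high: float = 40_000) -> tuple[float, float]:
    """Return (low, high) hardware cost in dollars for the GPUs alone."""
    return num_gpus * price_low, num_gpus * price_high

low, high = cluster_gpu_cost(10_000)
print(f"GPU-only cost: ${low / 1e6:.0f}M-${high / 1e6:.0f}M")  # $300M-$400M
```

The real total is meaningfully higher: interconnects, storage, facilities, and power typically add a substantial fraction on top of the GPU hardware itself.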
Key Components
GPU nodes (typically 8 GPUs per node), high-bandwidth interconnects (NVLink within a node, InfiniBand between nodes), high-speed storage (parallel file systems), cooling infrastructure, and cluster management software.