Compute Cluster
A group of computers (typically GPU servers) that work together to train large AI models, connected by high-speed networking.
Hardware
Modern clusters use NVIDIA H100/H200 GPUs, connected via NVLink (within nodes) and InfiniBand (between nodes). A frontier model training cluster may have 10,000-100,000 GPUs. Total cost can exceed $1 billion.
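These figures can be turned into a back-of-envelope throughput estimate. The sketch below assumes a per-GPU peak of roughly 989 TFLOP/s (NVIDIA's dense BF16 spec for the H100 SXM) and a model FLOPS utilization (MFU) of 40%, both illustrative numbers rather than measurements from any particular cluster:

```python
# Assumed peak: ~989 TFLOP/s dense BF16 per H100 SXM (vendor spec).
# Real training achieves only a fraction of peak (MFU, often 30-50%).
H100_PEAK_BF16_FLOPS = 989e12

def cluster_flops(num_gpus: int, mfu: float = 0.4) -> float:
    """Sustained training throughput in FLOP/s at a given MFU."""
    return num_gpus * H100_PEAK_BF16_FLOPS * mfu

# A 10,000-GPU cluster at 40% MFU sustains ~4e18 FLOP/s.
sustained = cluster_flops(10_000)

# Time to spend a hypothetical 1e25-FLOP training budget:
days = 1e25 / sustained / 86_400  # roughly a month
```

Under these assumptions a 10,000-GPU cluster works through a 1e25-FLOP budget in about 29 days, which is consistent with the weeks-to-months run lengths noted below.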
Networking
GPU-to-GPU communication bandwidth is critical for distributed training. InfiniBand provides 400-800 Gbps per link. Network topology (fat-tree, dragonfly) affects collective communication performance. Network failures are common at scale.
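Why link bandwidth matters can be seen from a simple model of the ring all-reduce, the collective most gradient synchronization relies on: each GPU must move roughly 2(N-1)/N times the tensor size over its link. A minimal bandwidth-only sketch (ignoring latency and compute/communication overlap; the 140 GB gradient size is an illustrative figure for a 70B-parameter model in BF16):

```python
def ring_allreduce_seconds(tensor_bytes: float, num_gpus: int,
                           link_gbps: float) -> float:
    """Bandwidth-limited lower bound for a ring all-reduce.

    Each GPU sends and receives 2*(N-1)/N times the tensor size over
    its link; latency and overlap with compute are ignored.
    """
    link_bytes_per_s = link_gbps * 1e9 / 8          # Gbps -> bytes/s
    traffic = 2 * (num_gpus - 1) / num_gpus * tensor_bytes
    return traffic / link_bytes_per_s

# Gradients of a 70B-parameter model in BF16 (~140 GB)
# over 400 Gbps links, across 1024 GPUs:
t = ring_allreduce_seconds(140e9, num_gpus=1024, link_gbps=400)
```

At 400 Gbps this gives about 5.6 seconds per full gradient all-reduce, which is why doubling link bandwidth (or overlapping communication with backward compute) directly shortens step time.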
Operations
Jobs are scheduled with tools such as SLURM or Kubernetes. Checkpointing provides fault tolerance. Power and cooling requirements are enormous. Training runs can take weeks to months, so hardware failures are expected and must be handled gracefully.