Compute Cluster
A group of computers (typically GPU servers) that work together to train large AI models, connected by high-speed networking.
Hardware
Modern clusters use NVIDIA H100/H200 GPUs, connected via NVLink (within nodes) and InfiniBand (between nodes). A frontier model training cluster may have 10,000-100,000 GPUs. Total cost can exceed $1 billion.
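These figures can be turned into a back-of-envelope throughput estimate. The sketch below assumes a per-GPU peak of roughly 989 TFLOP/s (NVIDIA's dense BF16 spec for the H100 SXM) and a model FLOPS utilization (MFU) of 40%, both illustrative numbers rather than measurements from any particular cluster:

```python
# Assumed peak: ~989 TFLOP/s dense BF16 per H100 SXM (vendor spec).
# Real training achieves only a fraction of peak (MFU, often 30-50%).
H100_PEAK_BF16_FLOPS = 989e12

def cluster_flops(num_gpus: int, mfu: float = 0.4) -> float:
    """Sustained training throughput in FLOP/s at a given MFU."""
    return num_gpus * H100_PEAK_BF16_FLOPS * mfu

# A 10,000-GPU cluster at 40% MFU sustains ~4e18 FLOP/s.
sustained = cluster_flops(10_000)

# Time to spend a hypothetical 1e25-FLOP training budget:
days = 1e25 / sustained / 86_400  # roughly a month
```

Under these assumptions a 10,000-GPU cluster works through a 1e25-FLOP budget in about 29 days, which is consistent with the weeks-to-months run lengths noted below.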
Networking
GPU-to-GPU communication bandwidth is critical for distributed training. InfiniBand provides 400-800 Gbps per link. Network topology (fat-tree, dragonfly) affects collective communication performance. Network failures are common at scale.
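Why link bandwidth matters can be seen from a simple model of the ring all-reduce, the collective most gradient synchronization relies on: each GPU must move roughly 2(N-1)/N times the tensor size over its link. A minimal bandwidth-only sketch (ignoring latency and compute/communication overlap; the 140 GB gradient size is an illustrative figure for a 70B-parameter model in BF16):

```python
def ring_allreduce_seconds(tensor_bytes: float, num_gpus: int,
                           link_gbps: float) -> float:
    """Bandwidth-limited lower bound for a ring all-reduce.

    Each GPU sends and receives 2*(N-1)/N times the tensor size over
    its link; latency and overlap with compute are ignored.
    """
    link_bytes_per_s = link_gbps * 1e9 / 8          # Gbps -> bytes/s
    traffic = 2 * (num_gpus - 1) / num_gpus * tensor_bytes
    return traffic / link_bytes_per_s

# Gradients of a 70B-parameter model in BF16 (~140 GB)
# over 400 Gbps links, across 1024 GPUs:
t = ring_allreduce_seconds(140e9, num_gpus=1024, link_gbps=400)
```

At 400 Gbps this gives about 5.6 seconds per full gradient all-reduce, which is why doubling link bandwidth (or overlapping communication with backward compute) directly shortens step time.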
Operations
Jobs are scheduled with tools such as SLURM or Kubernetes. Checkpointing provides fault tolerance. Power and cooling requirements are enormous. Training runs can take weeks to months, so hardware failures are expected and must be handled gracefully.