AI Hardware & Infrastructure

The physical foundation powering artificial intelligence — from GPUs and custom silicon to data centers and cloud platforms that make modern AI possible.

1. The GPU Revolution

Why GPUs Dominate AI Training

Graphics Processing Units (GPUs) were originally designed for rendering pixels in video games, but their architecture turned out to be perfectly suited for the matrix multiplications that underpin deep learning. While a CPU excels at sequential, complex tasks with a handful of powerful cores, a GPU distributes thousands of simpler computations across thousands of cores simultaneously.

Modern deep learning involves multiplying enormous matrices — a forward pass through a transformer model with billions of parameters requires trillions of floating-point operations. GPUs handle these through massive parallelism, high memory bandwidth, and specialized tensor cores designed specifically for mixed-precision matrix math.
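To make the scale concrete, a widely used rule of thumb puts the forward-pass cost at roughly two FLOPs per parameter per token (one multiply and one add in each matrix multiplication). A minimal sketch; the function name and figures are illustrative:

```python
# Back-of-envelope FLOP count for a dense transformer forward pass.
# Rule of thumb: ~2 FLOPs per parameter per token.

def forward_flops(n_params: float, n_tokens: int) -> float:
    """Approximate FLOPs for one forward pass over n_tokens."""
    return 2.0 * n_params * n_tokens

# A 70B-parameter model generating a single token:
flops = forward_flops(70e9, 1)
print(f"{flops / 1e9:.0f} GFLOPs per token")  # 140 GFLOPs
```

At hundreds of such operations per second per user, the need for massively parallel hardware follows directly.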

CPU vs GPU: Architectural Comparison

CPU — Few Powerful Cores: 8-64 cores optimized for sequential tasks, complex branching, and low-latency single-thread performance.

GPU — Thousands of Small Cores: 10,000+ CUDA/Tensor cores optimized for parallel matrix operations, high throughput, and massive data parallelism.

NVIDIA's Dominance in AI

NVIDIA has established an unassailable lead in AI compute, not just through hardware but through its CUDA software ecosystem, which has become the de facto standard for AI development. Virtually every major deep learning framework — PyTorch, TensorFlow, JAX — is optimized for CUDA, creating a powerful moat.

NVIDIA A100 (Ampere, 2020)

The workhorse that trained most of the foundation models we use today. The A100 introduced third-generation Tensor Cores with TF32 precision and multi-instance GPU (MIG) technology, allowing a single GPU to be partitioned into up to seven independent instances. Available in 40 GB and 80 GB HBM2e variants, the A100 was the first GPU purpose-built for the AI data center era.

NVIDIA H100 (Hopper, 2022)

A generational leap that roughly tripled AI training performance over the A100. The H100 introduced the Transformer Engine with FP8 precision, dramatically accelerating transformer-based workloads. Its fourth-generation NVLink provides 900 GB/s of GPU-to-GPU bandwidth, and it became the most sought-after chip in computing history, with wait times stretching to months.

NVIDIA H200 (Hopper, 2024)

An enhanced H100 with 141 GB of HBM3e memory (up from 80 GB HBM3), providing 4.8 TB/s of memory bandwidth. The H200 is particularly impactful for large language model inference, where memory capacity determines the maximum model size that can fit on a single GPU.
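A quick back-of-envelope calculation shows why memory capacity gates model size. Counting weights alone (ignoring KV cache and activation overhead), with a hypothetical helper:

```python
def weight_memory_gb(n_params: float, bytes_per_weight: float = 2) -> float:
    """Memory for model weights alone, in decimal GB (FP16/BF16 = 2 bytes)."""
    return n_params * bytes_per_weight / 1e9

# A 70B-parameter model:
print(weight_memory_gb(70e9, 2))    # 140.0 GB -> just fits in an H200's 141 GB
print(weight_memory_gb(70e9, 0.5))  # 35.0 GB at 4-bit quantization
```

In practice the KV cache and activations add further overhead, so the 141 GB H200 is the first single GPU that can comfortably serve 70B-class models in half precision.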

NVIDIA B100/B200 (Blackwell, 2024-2025)

NVIDIA's Blackwell architecture represents another major leap. The B200 features a dual-die design with 192 GB of HBM3e, delivering up to 2.5x the training performance and 5x the inference performance of the H100. The GB200 "superchip" pairs two B200 GPUs with a Grace CPU via NVLink-C2C, forming the building block of next-generation AI supercomputers.

AMD MI300X: The Challenger

AMD's Instinct MI300X is the most credible alternative to NVIDIA's data center GPUs. With 192 GB of HBM3 memory — significantly more than the H100's 80 GB — the MI300X is particularly attractive for large model inference where memory capacity is the bottleneck. AMD's ROCm software stack has matured considerably, with PyTorch support improving, though it still trails CUDA in ecosystem breadth.

GPU Specifications Comparison

| GPU | Architecture | VRAM | Memory BW | FP16 TFLOPS | Interconnect | TDP |
|-----|--------------|------|-----------|-------------|--------------|-----|
| NVIDIA A100 SXM | Ampere | 80 GB HBM2e | 2.0 TB/s | 312 | NVLink 3 (600 GB/s) | 400W |
| NVIDIA H100 SXM | Hopper | 80 GB HBM3 | 3.35 TB/s | 990 | NVLink 4 (900 GB/s) | 700W |
| NVIDIA H200 | Hopper | 141 GB HBM3e | 4.8 TB/s | 990 | NVLink 4 (900 GB/s) | 700W |
| NVIDIA B200 | Blackwell | 192 GB HBM3e | 8.0 TB/s | 2,250 | NVLink 5 (1,800 GB/s) | 1,000W |
| AMD MI300X | CDNA 3 | 192 GB HBM3 | 5.3 TB/s | 1,307 | Infinity Fabric | 750W |

Key Insight: Raw TFLOPS numbers don't tell the whole story. Real-world performance depends heavily on software optimization, memory bandwidth utilization, and interconnect speed for multi-GPU workloads. NVIDIA's CUDA ecosystem often delivers 20-40% higher utilization rates than competing platforms due to mature tooling and compiler optimizations.
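The utilization point is often quantified as Model FLOPs Utilization (MFU): the fraction of a chip's peak throughput that a training job actually sustains. The numbers below are illustrative:

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Model FLOPs Utilization: fraction of peak compute actually sustained."""
    return achieved_tflops / peak_tflops

# An H100 with a ~990 TFLOPS dense FP16 peak, sustaining 400 TFLOPS
# during training, runs at roughly 40% MFU -- a typical well-tuned figure.
print(f"{mfu(400, 990):.0%}")
```

Two chips with identical peak TFLOPS can therefore differ substantially in delivered training throughput depending on kernels, compilers, and interconnect.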

2. AI Accelerators & Custom Silicon

While GPUs are general-purpose parallel processors adapted for AI, a growing number of companies are designing application-specific integrated circuits (ASICs) purpose-built for machine learning workloads. These custom chips sacrifice flexibility for efficiency, often delivering better performance-per-watt for specific AI tasks.

Google TPU (v1 – v5e)

Google's Tensor Processing Units pioneered custom AI silicon. TPU v1 (2016) was inference-only; a TPU v4 pod (2022) delivers 1.1 exaflops of compute, and TPU v5e (2023) optimizes for cost-efficient training and inference at scale. Designed around a systolic array architecture that excels at matrix multiplications, TPUs power Google Search, Gmail, Translate, and Gemini.
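The systolic array idea can be sketched in a few lines: a grid of processing elements, each performing one multiply-accumulate per cycle as operands stream through, with no trips to external memory mid-computation. This toy simulation (sequential Python standing in for parallel hardware) reproduces a plain matrix multiply:

```python
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Output-stationary systolic matmul sketch: each (i, j) 'processing
    element' does one multiply-accumulate per time step as operands pass by."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for t in range(K):             # one "cycle" per step of the shared dimension
        for i in range(M):
            for j in range(N):     # in hardware, every PE fires in parallel
                C[i, j] += A[i, t] * B[t, j]
    return C

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

In silicon, the grid of PEs computes all M×N accumulations simultaneously each cycle, which is why the architecture maps so well to dense matrix math.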

Available through Google Cloud TPU, making custom silicon accessible without owning the hardware.


Apple Neural Engine

Integrated into Apple's M-series and A-series chips, the Neural Engine handles on-device ML tasks like Face ID, computational photography, Siri, and Apple Intelligence features. The M4 Neural Engine delivers up to 38 TOPS (trillion operations per second), enabling powerful AI without cloud dependency.

Represents the edge AI paradigm: processing happens locally on the device for privacy, low latency, and offline capability.


AWS Trainium & Inferentia

Trainium2 is Amazon's custom training chip; AWS positions it as delivering up to 4x the performance of first-generation Trainium and meaningfully better price-performance than comparable GPU instances for supported workloads. Inferentia2 targets inference with high throughput and low latency at reduced cost. Both are tightly integrated with AWS SageMaker and the Neuron SDK.

AWS uses these chips to provide cost-competitive alternatives to NVIDIA GPU instances for customers willing to use the Neuron compiler.


Intel Gaudi 3

Intel's Gaudi accelerators (acquired through Habana Labs) target cost-effective AI training. Gaudi 3 offers performance competitive with the H100 for training workloads at a lower price point. It features 128 GB of HBM2e and integrated Ethernet networking, eliminating the need for separate InfiniBand switches.


Cerebras Wafer-Scale Engine

The most radical approach to AI hardware: a single chip the size of an entire silicon wafer. The WSE-3 contains 4 trillion transistors, 900,000 AI cores, and 44 GB of on-chip SRAM with 21 PB/s of memory bandwidth. By eliminating chip-to-chip communication bottlenecks, Cerebras achieves extraordinary performance for sparse and large-model training.


Groq LPU (Language Processing Unit)

Groq's LPU takes a fundamentally different approach: a deterministic architecture with no external memory (HBM). All data flows through on-chip SRAM in a predictable pattern, eliminating the memory bandwidth bottleneck. This enables extraordinary inference speeds — generating 500+ tokens per second for LLMs — at predictable latency.

Optimized exclusively for inference, not training. Ideal for real-time conversational AI.


The Build vs. Buy Decision: Companies like Google, Amazon, and Apple build custom silicon to reduce dependency on NVIDIA and optimize for their specific workloads. However, the cost of designing, fabricating, and maintaining custom chips runs into billions of dollars — making it viable only for hyperscalers processing AI at enormous scale.

3. Training Infrastructure

Training a frontier AI model is one of the most demanding engineering challenges in computing. It requires orchestrating thousands of GPUs across a data center, moving petabytes of data, and sustaining computation for weeks or months without failure.

Data Center Design for AI

AI data centers differ fundamentally from traditional cloud data centers in three critical dimensions: power density (AI racks can draw 40-100+ kW versus roughly 10 kW for conventional racks), cooling (liquid cooling becomes mandatory at these densities), and network design (training traffic demands a high-bandwidth, low-latency fabric connecting every GPU).

GPU Clusters & Interconnects

Modern training clusters are built in a hierarchy of interconnects, each optimized for different communication distances:

  • Tensor Cores: matrix units within a single GPU
  • NVLink: GPU-to-GPU links at 900 GB/s
  • NVSwitch: full-bandwidth switching among the 8 GPUs within a node
  • InfiniBand: node-to-node networking at 400 Gb/s
  • Data center fabric: the cluster-wide network spanning 10,000+ GPUs

Distributed Training Strategies

No single GPU can train a frontier model alone. Distributed training splits the workload using several complementary strategies: data parallelism (replicating the model and splitting batches across GPUs), tensor parallelism (splitting individual layers across GPUs), pipeline parallelism (assigning groups of layers to different GPUs as pipeline stages), and sharded optimizer states (ZeRO/FSDP, which partition optimizer state, gradients, and parameters to save memory).

The Cost of Training Frontier Models

| Model | Organization | Estimated Training Cost | GPU Hours (approx.) |
|-------|--------------|-------------------------|---------------------|
| GPT-3 (175B) | OpenAI | $4-5M | 3.6M A100-hours |
| GPT-4 | OpenAI | $50-100M+ | Tens of millions |
| Llama 3.1 405B | Meta | $30-60M | 30.8M H100-hours |
| Gemini Ultra | Google | $50-100M+ | TPU v4 pods |
| Claude 3.5 Sonnet | Anthropic | $30-50M (est.) | Not disclosed |

Beyond Compute Cost: The total cost of training includes data curation, human annotation (RLHF), engineering salaries, electricity, and the opportunity cost of GPU allocation. For frontier labs, the fully loaded cost of a single training run can exceed $200M when all factors are included.
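Figures like these can be sanity-checked with the common 6ND approximation (training FLOPs ≈ 6 × parameters × training tokens). The token count, sustained throughput, and hourly rate below are illustrative assumptions, not disclosed figures:

```python
def training_gpu_hours(n_params: float, n_tokens: float,
                       sustained_tflops: float) -> float:
    """GPU-hours from the 6*N*D training-FLOPs rule of thumb,
    divided by sustained per-GPU throughput."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / (sustained_tflops * 1e12) / 3600

# A 405B-parameter model on ~15T tokens at 400 TFLOPS sustained per GPU:
hours = training_gpu_hours(405e9, 15e12, 400)
print(f"{hours / 1e6:.1f}M GPU-hours")       # ~25.3M, close to Meta's 30.8M
print(f"${hours * 2.0 / 1e6:.0f}M")          # at a hypothetical $2/GPU-hour
```

The gap between the estimate and reported numbers reflects real-world MFU below 40%, restarts after failures, and ablation runs.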

4. Inference Infrastructure

While training captures the headlines, inference — running trained models to generate predictions — is widely estimated to account for 80-90% of total AI compute spending in production. Optimizing inference is where hardware meets real-world economics.

Latency vs. Throughput Tradeoffs

Inference workloads face a fundamental tension between latency (how quickly a single request completes) and throughput (how many total tokens the system serves per second across all requests).

Hardware and software choices differ dramatically depending on which dimension you optimize for. Groq's LPU, for instance, sacrifices cost-efficiency for extraordinary latency, while batched GPU inference maximizes throughput at the expense of per-request latency.
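The tension can be sketched with a toy cost model in which decoding is memory-bandwidth-bound, so each extra sequence in the batch adds only a small increment to step time (the coefficients are hypothetical):

```python
def step_time_ms(batch: int, base_ms: float = 20.0,
                 per_seq_ms: float = 0.5) -> float:
    """Toy decode-step cost: a fixed cost to stream the weights,
    plus a small per-sequence increment (illustrative coefficients)."""
    return base_ms + per_seq_ms * batch

for batch in (1, 8, 64):
    t = step_time_ms(batch)
    print(f"batch={batch:3d}  per-token latency={t:.1f} ms  "
          f"aggregate throughput={batch / t * 1000:.0f} tok/s")
```

Even in this crude model, going from batch 1 to batch 64 multiplies aggregate throughput ~25x while only roughly 2.5x-ing per-request latency, which is why serving stacks batch aggressively.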

Quantization for Deployment

Quantization reduces model weights from high-precision floating-point numbers to lower-precision integers, dramatically reducing memory usage and compute requirements:

| Precision | Bits per Weight | Memory Savings | Quality Impact | Best For |
|-----------|-----------------|----------------|----------------|----------|
| FP16/BF16 | 16 bits | Baseline | None | Training, high-fidelity inference |
| FP8 | 8 bits | 2x | Minimal | Training (Hopper+), fast inference |
| INT8 | 8 bits | 2x | Very low | Production inference |
| INT4 (GPTQ, AWQ) | 4 bits | 4x | Low-moderate | Edge deployment, cost-sensitive |
| GGUF Q4_K_M | ~4.8 bits (mixed) | ~3.3x | Low | Local/CPU inference (llama.cpp) |
| 1-2 bit (extreme) | 1-2 bits | 8-16x | Significant | Research, constrained edge devices |
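The core mechanic behind INT8 quantization fits in a few lines: scale the weights so the largest magnitude maps to 127, round to integers, and store the integers plus one scale factor. A minimal symmetric per-tensor sketch (production schemes like GPTQ and AWQ are considerably more sophisticated):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: max |w| maps to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale      # dequantize for use in matmuls
err = np.abs(w - w_hat).max()
print(f"max abs error: {err:.4f}")
assert err <= scale / 2 + 1e-6            # rounding error bounded by half a step
```

The storage drops from 4 bytes per weight to 1, while the worst-case error is bounded by half a quantization step.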

Edge AI vs. Cloud Inference

Cloud Inference

  • Access to the largest, most capable models
  • Scales elastically with demand
  • Requires internet connectivity
  • Data leaves the device (privacy concern)
  • Pay-per-use cost model

Edge AI Inference

  • Data stays on device (privacy-first)
  • Ultra-low latency (no network round trip)
  • Works offline
  • Limited to smaller, quantized models
  • Fixed hardware cost, no ongoing fees

Inference Optimization Techniques

Modern inference engines employ multiple strategies to maximize performance: continuous batching (adding and removing requests from a running batch on the fly), paged KV-cache management (as in vLLM's PagedAttention), speculative decoding (drafting tokens with a small model and verifying them with the large one), and fused attention kernels such as FlashAttention.
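One reason KV-cache management matters so much: the cached attention keys and values grow linearly with batch size and sequence length, and at scale can rival the weights themselves. A sketch using an illustrative Llama-3-70B-like shape (80 layers, 8 grouped-query KV heads, head dimension 128):

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per: int = 2) -> float:
    """Size of the K and V tensors cached per layer per token (FP16 = 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# 32 concurrent requests at an 8K context:
print(kv_cache_gb(80, 8, 128, seq_len=8192, batch=32))  # ~86 GB
```

At ~86 GB, the cache alone would overflow an H100; paging and eviction schemes exist precisely to keep this under control.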

5. Cloud AI Platforms

The three major cloud providers each offer comprehensive AI/ML platforms, though they differ in strengths, GPU availability, and ecosystem integration.

Amazon Web Services (AWS)

Broadest compute selection, custom silicon
  • Compute: P5 (H100), P4d (A100), Trainium, Inferentia
  • ML Platform: SageMaker (end-to-end ML)
  • Foundation Models: Amazon Bedrock (multi-provider)
  • Custom Silicon: Trainium2 for cost-effective training
  • Strengths: Largest market share, most services, custom chips for cost savings
  • Pricing Edge: Spot instances up to 90% discount

Microsoft Azure

OpenAI partnership, enterprise integration
  • Compute: ND H100 v5, ND A100 v4, Maia AI chip
  • ML Platform: Azure Machine Learning
  • Foundation Models: Azure OpenAI Service (GPT-4, o1)
  • Unique: Exclusive OpenAI API access, Copilot integration
  • Strengths: Enterprise customers, Microsoft 365 integration, hybrid cloud
  • Pricing Edge: Azure Reserved Instances

Google Cloud Platform (GCP)

Custom TPUs, AI-native infrastructure
  • Compute: A3 (H100), A2 (A100), Cloud TPU v5e/v5p
  • ML Platform: Vertex AI (end-to-end ML)
  • Foundation Models: Model Garden (Gemini, PaLM, open models)
  • Unique: TPU pods for massive parallel training
  • Strengths: Best-in-class AI/ML research pedigree, JAX/TPU ecosystem
  • Pricing Edge: Preemptible VMs, sustained use discounts

Beyond the Big Three: Specialized GPU cloud providers like CoreWeave, Lambda Labs, Together AI, and RunPod offer competitive pricing on GPU instances, often with shorter wait times for H100/A100 access. They are increasingly popular with AI startups and researchers who need raw compute without the complexity of full cloud platforms.

6. Future of AI Hardware

As AI models continue to scale, the semiconductor industry faces fundamental physical and economic challenges. Several emerging technologies aim to break through current limitations.

Optical Computing

Photonic processors use light instead of electrons for matrix multiplications, potentially achieving orders-of-magnitude improvements in energy efficiency. Companies like Lightmatter and Luminous Computing are developing photonic interconnects and compute elements. Optical computing is particularly promising for inference workloads where the same matrix operations are repeated billions of times.

Neuromorphic Chips

Inspired by biological neural networks, neuromorphic processors like Intel Loihi 2 and IBM NorthPole use event-driven, spike-based computation rather than clock-driven arithmetic. They excel at tasks involving temporal patterns and sparse data, consuming a fraction of the power of conventional chips. While not yet competitive for transformer-based LLMs, they show promise for edge AI, robotics, and sensor processing.

Quantum Machine Learning

Quantum computers could theoretically accelerate specific ML algorithms exponentially, particularly for optimization problems, molecular simulation, and certain types of feature mapping. However, current noisy intermediate-scale quantum (NISQ) devices remain far from practical utility for mainstream AI workloads. The timeline for "quantum advantage" in AI remains uncertain — likely a decade or more for meaningful impact.

The Energy Efficiency Challenge

AI's energy consumption is the industry's most pressing sustainability concern. Public estimates put a single frontier training run in the tens of gigawatt-hours, and serving inference at scale consumes far more in aggregate; power availability, grid interconnection, and cooling are increasingly the gating constraints on new data center buildouts.

Looking Ahead: The next decade will likely see a diversification of AI compute: GPUs for general training, custom ASICs for specific workloads, photonic interconnects for data movement, and neuromorphic chips for edge sensing. No single technology will dominate — the future is heterogeneous computing orchestrated by intelligent software stacks.
