AI Hardware & Infrastructure: GPUs, TPUs, Accelerators & Data Centers

1. The GPU Revolution

Why GPUs Dominate AI Training

Graphics Processing Units (GPUs) were originally designed for rendering pixels in video games, but their architecture turned out to be perfectly suited for the matrix multiplications that underpin deep learning. While a CPU excels at sequential, complex tasks with a handful of powerful cores, a GPU distributes thousands of simpler computations across thousands of cores simultaneously.

Modern deep learning involves multiplying enormous matrices — a forward pass through a transformer model with billions of parameters requires trillions of floating-point operations. GPUs handle these through massive parallelism, high memory bandwidth, and specialized tensor cores designed specifically for mixed-precision matrix math.

CPU vs GPU: Architectural Comparison

CPU — Few Powerful Cores

Core 1

Core 2

Core 3

Core 4

Core 5

Core 6

Core 7

Core 8

8-64 cores optimized for sequential tasks, complex branching, and low-latency single-thread performance.

GPU — Thousands of Small Cores

10,000+ CUDA/Tensor cores optimized for parallel matrix operations, high throughput, and massive data parallelism.

NVIDIA's Dominance in AI

NVIDIA has established an unassailable lead in AI compute, not just through hardware but through its CUDA software ecosystem, which has become the de facto standard for AI development. Virtually every major deep learning framework — PyTorch, TensorFlow, JAX — is optimized for CUDA, creating a powerful moat.

NVIDIA A100 (Ampere, 2020)

The workhorse that trained most of the foundation models we use today. The A100 introduced third-generation Tensor Cores with TF32 precision and multi-instance GPU (MIG) technology, allowing a single GPU to be partitioned into up to seven independent instances. Available in 40 GB and 80 GB HBM2e variants, the A100 was the first GPU purpose-built for the AI data center era.

NVIDIA H100 (Hopper, 2023)

A generational leap that roughly tripled AI training performance over the A100. The H100 introduced the Transformer Engine with FP8 precision, dramatically accelerating transformer-based workloads. Its fourth-generation NVLink provides 900 GB/s of GPU-to-GPU bandwidth, and it became the most sought-after chip in computing history, with wait times stretching to months.

NVIDIA H200 (Hopper, 2024)

An enhanced H100 with 141 GB of HBM3e memory (up from 80 GB HBM3), providing 4.8 TB/s of memory bandwidth. The H200 is particularly impactful for large language model inference, where memory capacity determines the maximum model size that can fit on a single GPU.

NVIDIA B100/B200 (Blackwell, 2024-2025)

NVIDIA's Blackwell architecture represents another major leap. The B200 features a dual-die design with 192 GB of HBM3e, delivering up to 2.5x the training performance and 5x the inference performance of the H100. The GB200 "superchip" pairs two B200 GPUs with a Grace CPU via NVLink-C2C, forming the building block of next-generation AI supercomputers.

AMD MI300X: The Challenger

AMD's Instinct MI300X is the most credible alternative to NVIDIA's data center GPUs. With 192 GB of HBM3 memory — significantly more than the H100's 80 GB — the MI300X is particularly attractive for large model inference where memory capacity is the bottleneck. AMD's ROCm software stack has matured considerably, with PyTorch support improving, though it still trails CUDA in ecosystem breadth.

GPU Specifications Comparison

GPU	Architecture	VRAM	Memory BW	FP16 TFLOPS	Interconnect	TDP
NVIDIA A100	Ampere	80 GB HBM2e	2.0 TB/s	312	NVLink 3 (600 GB/s)	300W
NVIDIA H100 SXM	Hopper	80 GB HBM3	3.35 TB/s	990	NVLink 4 (900 GB/s)	700W
NVIDIA H200	Hopper	141 GB HBM3e	4.8 TB/s	990	NVLink 4 (900 GB/s)	700W
NVIDIA B200	Blackwell	192 GB HBM3e	8.0 TB/s	2,250	NVLink 5 (1,800 GB/s)	1,000W
AMD MI300X	CDNA 3	192 GB HBM3	5.3 TB/s	1,307	Infinity Fabric	750W

Key Insight: Raw TFLOPS numbers don't tell the whole story. Real-world performance depends heavily on software optimization, memory bandwidth utilization, and interconnect speed for multi-GPU workloads. NVIDIA's CUDA ecosystem often delivers 20-40% higher utilization rates than competing platforms due to mature tooling and compiler optimizations.

2. AI Accelerators & Custom Silicon

While GPUs are general-purpose parallel processors adapted for AI, a growing number of companies are designing application-specific integrated circuits (ASICs) purpose-built for machine learning workloads. These custom chips sacrifice flexibility for efficiency, often delivering better performance-per-watt for specific AI tasks.

Google TPU (v1 – v5e)

Google's Tensor Processing Units pioneered custom AI silicon. TPU v1 (2016) was inference-only; by TPU v4 (2022), each pod delivers 1.1 exaflops. TPU v5e (2023) optimizes for cost-efficient training and inference at scale. Designed around a systolic array architecture that excels at matrix multiplications, TPUs power Google Search, Gmail, Translate, and Gemini.

Available through Google Cloud TPU, making custom silicon accessible without owning the hardware.

Google Cloud

Apple Neural Engine

Integrated into Apple's M-series and A-series chips, the Neural Engine handles on-device ML tasks like Face ID, computational photography, Siri, and Apple Intelligence features. The M4 Neural Engine delivers up to 38 TOPS (trillion operations per second), enabling powerful AI without cloud dependency.

Represents the edge AI paradigm: processing happens locally on the device for privacy, low latency, and offline capability.

On-Device AI

AWS Trainium & Inferentia

Trainium2 is Amazon's custom training chip, offering up to 4x better price-performance than comparable GPU instances for supported workloads. Inferentia2 targets inference with high throughput and low latency at reduced cost. Both are tightly integrated with AWS SageMaker and the Neuron SDK.

AWS uses these chips to provide cost-competitive alternatives to NVIDIA GPU instances for customers willing to use the Neuron compiler.

AWS Custom Silicon

Intel Gaudi 3

Intel's Gaudi accelerators (acquired through Habana Labs) target cost-effective AI training. Gaudi 3 offers competitive performance with H100 for training workloads at a lower price point. Features 96 GB HBM2e and integrated Ethernet networking, eliminating the need for separate InfiniBand switches.

Intel AI

Cerebras Wafer-Scale Engine

The most radical approach to AI hardware: a single chip the size of an entire silicon wafer. The WSE-3 contains 4 trillion transistors, 900,000 AI cores, and 44 GB of on-chip SRAM with 21 PB/s of memory bandwidth. By eliminating chip-to-chip communication bottlenecks, Cerebras achieves extraordinary performance for sparse and large-model training.

Wafer-Scale

Groq LPU (Language Processing Unit)

Groq's LPU takes a fundamentally different approach: a deterministic architecture with no external memory (HBM). All data flows through on-chip SRAM in a predictable pattern, eliminating the memory bandwidth bottleneck. This enables extraordinary inference speeds — generating 500+ tokens per second for LLMs — at predictable latency.

Optimized exclusively for inference, not training. Ideal for real-time conversational AI.

Inference-First

The Build vs. Buy Decision: Companies like Google, Amazon, and Apple build custom silicon to reduce dependency on NVIDIA and optimize for their specific workloads. However, the cost of designing, fabricating, and maintaining custom chips runs into billions of dollars — making it viable only for hyperscalers processing AI at enormous scale.

3. Training Infrastructure

Training a frontier AI model is one of the most demanding engineering challenges in computing. It requires orchestrating thousands of GPUs across a data center, moving petabytes of data, and sustaining computation for weeks or months without failure.

Data Center Design for AI

AI data centers differ fundamentally from traditional cloud data centers in three critical dimensions:

Power density: A single rack of 8 NVIDIA B200 GPUs can draw over 10 kW, compared to 5-7 kW for a typical cloud server rack. AI clusters of 10,000+ GPUs demand 50-100+ MW of power — equivalent to a small city.
Cooling: At these power densities, traditional air cooling is insufficient. Modern AI data centers increasingly use direct liquid cooling (cold plates on each GPU) or immersion cooling (submerging entire servers in dielectric fluid).
Networking: AI training requires constant communication between GPUs. The network fabric must support ultra-low latency, high-bandwidth all-to-all communication patterns that differ from typical web service traffic.

GPU Clusters & Interconnects

Modern training clusters are built in a hierarchy of interconnects, each optimized for different communication distances:

Tensor Cores

Within GPU

NVLink

GPU-to-GPU
900 GB/s

NVSwitch

Within Node
8 GPUs

InfiniBand

Node-to-Node
400 Gb/s

Data Center Fabric

10,000+ GPUs

NVLink: NVIDIA's proprietary GPU-to-GPU interconnect. NVLink 4 (Hopper) provides 900 GB/s bidirectional bandwidth. NVLink 5 (Blackwell) doubles this to 1,800 GB/s. Critical for tensor parallelism where individual layers are split across GPUs.
NVSwitch: Enables all-to-all communication within a node (typically 8 GPUs) at full NVLink bandwidth, creating a unified memory space.
InfiniBand (NDR/XDR): The dominant node-to-node fabric for AI clusters, providing 400 Gb/s (NDR) or 800 Gb/s (XDR) per port with extremely low latency. Essential for data parallelism and pipeline parallelism across nodes.
Ethernet alternatives: Ultra Ethernet Consortium (UEC) is developing AI-optimized Ethernet to challenge InfiniBand's dominance, with support from AMD, Intel, Google, Meta, and Microsoft.

Distributed Training Strategies

No single GPU can train a frontier model alone. Distributed training splits the workload using several complementary strategies:

Data parallelism: Each GPU holds a complete copy of the model but processes different batches of data. Gradients are synchronized across all GPUs after each step. Simple to implement but requires the model to fit in a single GPU's memory.
Tensor parallelism: Individual layers are split across multiple GPUs within a node, using NVLink for the frequent communication required. Best for very wide layers in large transformers.
Pipeline parallelism: Different layers of the model are placed on different GPUs. Data flows sequentially through the pipeline. Reduces memory requirements per GPU but introduces "pipeline bubbles" that reduce efficiency.
Expert parallelism: Used for Mixture-of-Experts (MoE) models, where different experts reside on different GPUs and only a subset are activated for each input token.
ZeRO (Zero Redundancy Optimizer): Partitions optimizer states, gradients, and parameters across GPUs to eliminate memory redundancy while maintaining data parallelism semantics. Developed by Microsoft DeepSpeed.

The Cost of Training Frontier Models

Model	Organization	Estimated Training Cost	GPU Hours (approx.)
GPT-3 (175B)	OpenAI	$4-5M	3.6M A100-hours
GPT-4	OpenAI	$50-100M+	Tens of millions
Llama 3.1 405B	Meta	$30-60M	30.8M H100-hours
Gemini Ultra	Google	$50-100M+	TPU v5e pods
Claude 3.5 Sonnet	Anthropic	$30-50M (est.)	Not disclosed

Beyond Compute Cost: The total cost of training includes data curation, human annotation (RLHF), engineering salaries, electricity, and the opportunity cost of GPU allocation. For frontier labs, the fully loaded cost of a single training run can exceed $200M when all factors are included.

4. Inference Infrastructure

While training captures the headlines, inference — running trained models to generate predictions — accounts for 80-90% of total AI compute spending in production. Optimizing inference is where hardware meets real-world economics.

Latency vs. Throughput Tradeoffs

Inference workloads face a fundamental tension:

Latency-sensitive: Chatbots, real-time translation, and interactive AI assistants need fast time-to-first-token (TTFT) and high tokens-per-second. Users notice delays beyond 200ms for the first token.
Throughput-sensitive: Batch processing, document analysis, and offline embedding generation prioritize processing the most requests per dollar, even if individual requests take longer.

Hardware and software choices differ dramatically depending on which dimension you optimize for. Groq's LPU, for instance, sacrifices cost-efficiency for extraordinary latency, while batched GPU inference maximizes throughput at the expense of per-request latency.

Quantization for Deployment

Quantization reduces model weights from high-precision floating-point numbers to lower-precision integers, dramatically reducing memory usage and compute requirements:

Precision	Bits per Weight	Memory Savings	Quality Impact	Best For
FP16/BF16	16 bits	Baseline	None	Training, high-fidelity inference
FP8	8 bits	2x	Minimal	Training (Hopper+), fast inference
INT8	8 bits	2x	Very low	Production inference
INT4 (GPTQ, AWQ)	4 bits	4x	Low-moderate	Edge deployment, cost-sensitive
GGUF Q4_K_M	~4.8 bits (mixed)	~3.3x	Low	Local/CPU inference (llama.cpp)
1-2 bit (extreme)	1-2 bits	8-16x	Significant	Research, constrained edge devices

Edge AI vs. Cloud Inference

Cloud Inference

Access to the largest, most capable models
Scales elastically with demand
Requires internet connectivity
Data leaves the device (privacy concern)
Pay-per-use cost model

Edge AI Inference

Data stays on device (privacy-first)
Ultra-low latency (no network round trip)
Works offline
Limited to smaller, quantized models
Fixed hardware cost, no ongoing fees

Inference Optimization Techniques

Modern inference engines employ multiple strategies to maximize performance:

Continuous batching: Instead of waiting for a fixed batch to fill, the server dynamically adds and removes requests from the batch as they arrive and complete. Implemented in vLLM, TensorRT-LLM, and TGI.
KV cache management: During autoregressive generation, the key-value pairs from previous tokens are cached to avoid redundant computation. PagedAttention (vLLM) manages KV cache like virtual memory, reducing waste from 60-80% to near zero.
Speculative decoding: A smaller "draft" model generates candidate tokens quickly, then the larger model verifies them in parallel. If most drafts are accepted, you get the quality of the large model at speeds closer to the small model. Can achieve 2-3x speedups.
Flash Attention: A memory-efficient attention algorithm that reduces GPU memory usage from O(n^2) to O(n) and accelerates attention computation by minimizing reads/writes to slow GPU HBM. Now standard in all major frameworks.
Model distillation: Training a smaller "student" model to mimic a larger "teacher" model's outputs, creating a more efficient model that retains most of the original's capability.

5. Cloud AI Platforms

The three major cloud providers each offer comprehensive AI/ML platforms, though they differ in strengths, GPU availability, and ecosystem integration.

Amazon Web Services (AWS)

Broadest compute selection, custom silicon

Compute: P5 (H100), P4d (A100), Trainium, Inferentia
ML Platform: SageMaker (end-to-end ML)
Foundation Models: Amazon Bedrock (multi-provider)
Custom Silicon: Trainium2 for cost-effective training
Strengths: Largest market share, most services, custom chips for cost savings
Pricing Edge: Spot instances up to 90% discount

Microsoft Azure

OpenAI partnership, enterprise integration

Compute: ND H100 v5, ND A100 v4, Maia AI chip
ML Platform: Azure Machine Learning
Foundation Models: Azure OpenAI Service (GPT-4, o1)
Unique: Exclusive OpenAI API access, Copilot integration
Strengths: Enterprise customers, Microsoft 365 integration, hybrid cloud
Pricing Edge: Azure Reserved Instances

Google Cloud Platform (GCP)

Custom TPUs, AI-native infrastructure

Compute: A3 (H100), A2 (A100), Cloud TPU v5e/v5p
ML Platform: Vertex AI (end-to-end ML)
Foundation Models: Model Garden (Gemini, PaLM, open models)
Unique: TPU pods for massive parallel training
Strengths: Best-in-class AI/ML research pedigree, JAX/TPU ecosystem
Pricing Edge: Preemptible VMs, sustained use discounts

Beyond the Big Three: Specialized GPU cloud providers like CoreWeave, Lambda Labs, Together AI, and RunPod offer competitive pricing on GPU instances, often with shorter wait times for H100/A100 access. They are increasingly popular with AI startups and researchers who need raw compute without the complexity of full cloud platforms.

6. Future of AI Hardware

As AI models continue to scale, the semiconductor industry faces fundamental physical and economic challenges. Several emerging technologies aim to break through current limitations.

Optical Computing

Photonic processors use light instead of electrons for matrix multiplications, potentially achieving orders-of-magnitude improvements in energy efficiency. Companies like Lightmatter and Luminous Computing are developing photonic interconnects and compute elements. Optical computing is particularly promising for inference workloads where the same matrix operations are repeated billions of times.

Neuromorphic Chips

Inspired by biological neural networks, neuromorphic processors like Intel Loihi 2 and IBM NorthPole use event-driven, spike-based computation rather than clock-driven arithmetic. They excel at tasks involving temporal patterns and sparse data, consuming a fraction of the power of conventional chips. While not yet competitive for transformer-based LLMs, they show promise for edge AI, robotics, and sensor processing.

Quantum Machine Learning

Quantum computers could theoretically accelerate specific ML algorithms exponentially, particularly for optimization problems, molecular simulation, and certain types of feature mapping. However, current noisy intermediate-scale quantum (NISQ) devices remain far from practical utility for mainstream AI workloads. The timeline for "quantum advantage" in AI remains uncertain — likely a decade or more for meaningful impact.

The Energy Efficiency Challenge

AI's energy consumption is the industry's most pressing sustainability concern:

Training a single large model can emit as much CO2 as five cars over their lifetimes.
AI data centers are projected to consume 3-4% of global electricity by 2030, up from roughly 1-2% today.
The demand is driving investment in nuclear power (SMRs), renewable energy PPAs, and next-generation cooling technology.
Algorithmic efficiency gains (better architectures, sparse models, efficient training recipes) are equally important as hardware improvements.

Looking Ahead: The next decade will likely see a diversification of AI compute: GPUs for general training, custom ASICs for specific workloads, photonic interconnects for data movement, and neuromorphic chips for edge sensing. No single technology will dominate — the future is heterogeneous computing orchestrated by intelligent software stacks.

AI Hardware & Infrastructure