Training an LLM is expensive, but serving it to millions of users is where the real cost accumulates. Every query requires a forward pass through billions of parameters, and the autoregressive generation process means each token requires a separate pass. Inference optimization is the art and science of making this process faster and cheaper, and it is critical for making LLMs practical at scale.

Understanding the Inference Bottleneck

LLM inference has two distinct phases with different performance characteristics:

Prefill Phase

The model processes the entire input prompt in parallel. This phase is compute-bound: the GPU is busy performing matrix multiplications across all input tokens simultaneously. Prefill latency grows roughly linearly with prompt length, with a quadratic attention term that becomes significant at very long contexts.

Decode Phase

The model generates output tokens one at a time. Each token requires reading the entire model's weights from memory. This is memory-bandwidth-bound: the GPU spends most of its time waiting for data to load from memory rather than performing computations. This is why LLM generation feels slow -- the hardware is underutilized.

The decode phase is the primary bottleneck in LLM serving. The model reads billions of parameters from memory for each token generated but only performs a relatively small computation, leaving the GPU's compute capacity largely idle.
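This bandwidth bound can be sanity-checked with back-of-envelope arithmetic: at batch size 1, every generated token must stream all model weights from GPU memory at least once, so memory bandwidth divided by weight bytes gives a throughput ceiling. The hardware numbers below are illustrative assumptions, not measurements of any specific GPU.

```python
def decode_tokens_per_sec(params_billion: float,
                          bytes_per_param: float,
                          mem_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode throughput (tokens/second).

    Each decode step must read every weight from memory once, so throughput
    cannot exceed bandwidth / total weight bytes.
    """
    weight_gb = params_billion * bytes_per_param  # GB of weights to stream
    return mem_bandwidth_gb_s / weight_gb

# A 70B model in FP16 (2 bytes/param) on a GPU with ~3350 GB/s of bandwidth
# (an assumed figure) can never exceed roughly 24 tokens/s per stream:
ceiling = decode_tokens_per_sec(70, 2.0, 3350)
print(f"{ceiling:.1f} tokens/s ceiling")  # prints "23.9 tokens/s ceiling"
```

Batching raises throughput precisely because the same weight read is amortized across many concurrent sequences, which is why the batching techniques below matter so much.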

Key Takeaway

LLM inference has two phases: compute-bound prefill (fast, parallel) and memory-bound decode (slow, sequential). Most optimization efforts focus on the decode phase where the bottleneck is memory bandwidth, not compute.

KV Caching

The most fundamental inference optimization is the key-value (KV) cache. During autoregressive generation, each new token's attention computation requires the keys and values from all previous tokens. Without caching, the model would recompute these for the entire sequence at every step.

The KV cache stores the key and value tensors for all previous tokens in every attention layer. When generating token n+1, only the new token's query, key, and value need to be computed; the previous tokens' keys and values are read from the cache. This eliminates re-encoding the entire prefix at every step, cutting the per-token work over the model weights from O(n) to O(1) (the attention dot products themselves still scale with the cached sequence length), which provides a massive speedup.

However, the KV cache consumes significant memory. For a 70B-parameter model with a 128K-token context, the KV cache can require 40+ GB per sequence; summed across a batch of concurrent requests, it can exceed the memory footprint of the model weights themselves.
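The cache size follows directly from the model shape: two tensors (K and V) per layer, each with one vector per KV head per token. The sketch below uses an assumed 70B-class configuration with grouped-query attention (80 layers, 8 KV heads, head dimension 128, FP16); the specific numbers are illustrative, not taken from any published model card.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache size for one sequence.

    Per token and per layer we store 2 tensors (K and V), each of shape
    [n_kv_heads, head_dim], at dtype_bytes per element.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Assumed 70B-class config with grouped-query attention, FP16, 128K context:
gb = kv_cache_bytes(seq_len=128 * 1024, n_layers=80, n_kv_heads=8,
                    head_dim=128) / 1024**3
print(f"{gb:.0f} GiB")  # prints "40 GiB" for a single 128K-token sequence
```

Note how grouped-query attention (8 KV heads instead of 64) already shrinks the cache 8x; without it, a single 128K sequence would need hundreds of gigabytes.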

Continuous Batching

Traditional static batching groups requests and processes them to completion together. The problem is that different requests finish at different times, and the whole batch must wait for the longest one. Continuous batching (also called iteration-level batching), pioneered by systems like Orca and vLLM, solves this by admitting new requests into the batch as soon as existing ones complete.

This dramatically improves GPU utilization and throughput. Instead of GPUs sitting idle while waiting for the slowest request in a batch, they are continuously processing new tokens for new requests.
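A toy simulation makes the scheduling difference concrete. The sketch below (hypothetical request names, one token per request per iteration) admits waiting requests the moment a batch slot frees up, rather than waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(requests: dict, max_batch: int) -> dict:
    """Simulate iteration-level scheduling (sketch, not a real scheduler).

    `requests` maps request id -> number of tokens to generate.
    Returns the decode iteration at which each request finished.
    """
    waiting = deque(requests.items())
    active = {}                # request id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or active:
        # Admit new requests the moment slots free up -- no waiting for
        # the slowest member of the batch.
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        step += 1
        for rid in list(active):  # one decode iteration: 1 token per request
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished_at[rid] = step
    return finished_at

print(continuous_batching({"a": 2, "b": 5, "c": 3}, max_batch=2))
# prints {'a': 2, 'b': 5, 'c': 5}
```

Request "c" enters as soon as "a" finishes at step 2 and completes at step 5; under static batching it could not start until both "a" and "b" finished, completing at step 8 instead.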

PagedAttention and vLLM

PagedAttention, introduced by the vLLM project, applies virtual memory concepts to KV cache management. Instead of allocating a contiguous block of memory for each request's KV cache, PagedAttention stores KV data in fixed-size pages that can be allocated on demand and shared between requests.

This provides several benefits:

  • Reduced memory waste: No need to pre-allocate for maximum sequence length; pages are allocated as needed
  • Memory sharing: Requests with shared prefixes (like system prompts) can share KV cache pages
  • Higher throughput: More requests can fit in GPU memory simultaneously
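The mechanics can be sketched with a toy page-table allocator. This is a simplified illustration in the spirit of PagedAttention, not vLLM's actual implementation: it tracks free pages, per-request page tables, and reference counts for prefix sharing, but omits details like copy-on-write when a forked sequence diverges.

```python
class PagedKVAllocator:
    """Toy paged KV-cache allocator (sketch only)."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free = list(range(num_pages))  # free physical page ids
        self.refcount = [0] * num_pages
        self.tables = {}                    # request id -> [page ids]
        self.lengths = {}                   # request id -> tokens stored

    def append_token(self, rid: str) -> None:
        """Reserve cache space for one more token of request `rid`."""
        table = self.tables.setdefault(rid, [])
        n = self.lengths.get(rid, 0)
        if n % self.page_size == 0:         # current page full: grab a new one
            page = self.free.pop()
            self.refcount[page] += 1
            table.append(page)
        self.lengths[rid] = n + 1

    def fork(self, parent: str, child: str) -> None:
        """Share the parent's pages with a new request (e.g. a common
        system-prompt prefix). Real systems copy-on-write on divergence."""
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for page in self.tables[child]:
            self.refcount[page] += 1

    def release(self, rid: str) -> None:
        """Free a finished request's pages once no one else holds them."""
        for page in self.tables.pop(rid):
            self.refcount[page] -= 1
            if self.refcount[page] == 0:
                self.free.append(page)
        del self.lengths[rid]
```

Because pages are allocated only as tokens arrive, a request that stops after 50 tokens never reserves the memory a worst-case 128K allocation would have wasted, and forked requests pay nothing for their shared prefix.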

Speculative Decoding

Speculative decoding addresses the sequential nature of autoregressive generation. A small, fast "draft" model generates several candidate tokens quickly. The large target model then verifies all candidates in parallel (a single forward pass). Tokens that match the target model's distribution are accepted; at the first rejected token, the target model's own prediction is substituted and drafting resumes from that point.

Since verification is parallel, speculative decoding can generate multiple tokens in the time it normally takes to generate one. Speedups of 2-3x are common, with the guarantee that output quality is identical to standard generation.
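The accept/reject loop can be sketched for the greedy-decoding case, where "matching the target's distribution" reduces to matching its argmax (full speculative decoding uses rejection sampling over probability distributions; this simplification preserves the identical-output guarantee only under greedy decoding). The toy integer "models" below are hypothetical stand-ins for real networks.

```python
def speculative_step(target, draft, prefix, k=4):
    """One speculative decoding step with greedy verification (sketch).

    `target` and `draft` each map a token sequence to its next token
    (greedy argmax). Returns the tokens committed this step; the result is
    identical to running `target` alone, one token at a time.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Target model checks every position -- in a real system this is one
    #    parallel forward pass; here it is unrolled for clarity.
    committed, ctx = [], list(prefix)
    for t in proposed:
        expected = target(ctx)
        if t != expected:
            committed.append(expected)   # substitute at first mismatch, stop
            return committed
        committed.append(t)
        ctx.append(t)
    committed.append(target(ctx))        # free bonus token if all k accepted
    return committed

# Toy models over integer tokens: the target counts up; the draft agrees
# except at every third position.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if len(seq) % 3 else seq[-1] + 2
print(speculative_step(target, draft, [0], k=4))  # prints [1, 2, 3]
```

Three tokens are committed for the cost of one target-model pass; a draft that agreed on all four proposals would commit five (the four drafts plus the bonus token).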

Key Takeaway

Speculative decoding uses a fast draft model to propose multiple tokens that a large model verifies in parallel, achieving 2-3x speedups with mathematically identical output quality.

Flash Attention

Flash Attention, developed by Tri Dao and collaborators, restructures the attention computation to minimize data movement between the GPU's high-bandwidth memory (HBM) and its fast on-chip SRAM. Standard attention materializes the full n × n attention matrix in HBM, which is slow for long sequences. Flash Attention computes attention in tiles, keeping intermediate results in SRAM as much as possible.

Flash Attention reduces both memory usage (no full attention matrix) and wall-clock time (fewer memory transfers), enabling longer context windows and faster processing. Flash Attention 2 and 3 further improved performance through better parallelism and hardware utilization.
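The key algorithmic trick is the online softmax: attention can be accumulated tile by tile with a running maximum and running denominator, so the full score vector never needs to exist at once. The NumPy sketch below shows this for a single query; the real kernel also tiles over queries and runs the loop in on-chip SRAM, which this simplification does not capture.

```python
import numpy as np

def tiled_attention(q, K, V, tile=4):
    """Attention output for one query via online softmax over K/V tiles.

    Numerically equivalent to softmax(K @ q) @ V, but never materializes
    the full score vector -- the idea behind Flash Attention's tiling.
    """
    m = -np.inf                                  # running max (for stability)
    denom = 0.0                                  # running softmax denominator
    acc = np.zeros_like(V[0], dtype=np.float64)  # running weighted sum
    for i in range(0, len(K), tile):
        s = K[i:i + tile] @ q                    # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                # rescale earlier partial sums
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ V[i:i + tile]
        m = m_new
    return acc / denom

# Check against the naive materialized softmax on random data:
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=3), rng.normal(size=(10, 3)), rng.normal(size=(10, 3))
w = np.exp(K @ q - (K @ q).max())
w /= w.sum()
assert np.allclose(tiled_attention(q, K, V), w @ V)
```

Because each tile's partial result is rescaled when a larger score appears, the final output matches the exact softmax to floating-point precision, while memory traffic per tile stays constant.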

Serving Frameworks

Several open-source frameworks implement these optimizations:

  • vLLM: PagedAttention, continuous batching, speculative decoding, and quantization support
  • TensorRT-LLM: NVIDIA's optimized inference engine with kernel fusion and INT8/FP8 quantization
  • llama.cpp: CPU and GPU inference with aggressive quantization, enabling LLMs on consumer hardware
  • SGLang: RadixAttention for efficient prefix caching and structured generation

The inference optimization landscape evolves rapidly. Each improvement makes LLMs cheaper and faster to serve, expanding the range of applications where they are economically viable. For production deployments, combining multiple techniques -- quantization, continuous batching, Flash Attention, and speculative decoding -- can provide order-of-magnitude improvements in cost efficiency.