The Transformer architecture is remarkable but expensive. Self-attention has O(N^2) complexity in both time and memory, meaning that doubling the sequence length quadruples the cost. As models grow larger and context windows extend to millions of tokens, the need for efficient Transformer variants has become critical. This survey covers the major approaches to making Transformers faster, from attention optimization to model compression and architectural alternatives.
The Efficiency Taxonomy
Approaches to Transformer efficiency can be organized into several categories, each targeting different bottlenecks in the computation pipeline.
Attention Pattern Modifications
The most direct approach is changing how attention is computed. Standard full attention allows every token to attend to every other token, creating the N^2 cost. By restricting the attention pattern, we can reduce this cost:
- Sparse attention: Only compute attention between a subset of token pairs. Longformer and BigBird combine local sliding window attention with a few global tokens, achieving O(N) complexity while maintaining the ability to propagate information across the full sequence.
- Sliding window attention: Mistral limits each token's attention to a fixed window of nearby tokens; longer-range dependencies are still captured because the receptive field compounds across layers.
- Multi-query/grouped query attention: Sharing key-value heads across multiple query heads reduces the KV cache size and memory bandwidth requirements during inference without significantly affecting quality.
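To make the sparse-pattern idea concrete, here is a minimal sketch of a Longformer-style attention mask combining a local sliding window with a few global tokens. The sizes and function name are illustrative, not any library's actual API:

```python
# Sparse attention mask: each token attends to a local window plus a few
# designated global tokens; global tokens attend to (and are attended by) all.
import numpy as np

def sparse_attention_mask(seq_len: int, window: int, global_tokens: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if token i may attend to token j."""
    idx = np.arange(seq_len)
    # Local band: positions within `window` of each other
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    # First `global_tokens` positions are global in both directions
    is_global = idx < global_tokens
    return local | is_global[:, None] | is_global[None, :]

mask = sparse_attention_mask(seq_len=16, window=2, global_tokens=2)
# The number of attended pairs grows linearly with seq_len, not quadratically.
```

Note that the mask alone does not save compute; real implementations use custom kernels that skip the masked-out pairs entirely.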
Linear Attention
Linear attention methods replace the softmax attention kernel with an alternative formulation that allows the computation to be rearranged from O(N^2) to O(N). The key idea is to apply a kernel feature map to the queries and keys so that the key-value product can be computed once and reused across all queries, instead of materializing the full N x N attention matrix.
While mathematically elegant, linear attention methods have historically struggled to match standard attention quality on language modeling tasks. The softmax's ability to create sharp, selective attention patterns appears to be important for performance. Recent work on gated linear attention has closed this gap by adding data-dependent gating mechanisms.
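The rearrangement can be shown in a few lines. This sketch uses the common feature map phi(x) = elu(x) + 1 as an illustrative choice (not any specific paper's exact recipe); the two functions compute identical outputs, but the second never forms the N x N matrix:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a simple positive feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

def attention_quadratic(Q, K, V):
    scores = phi(Q) @ phi(K).T                  # (N, N) -- the O(N^2) object
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def attention_linear(Q, K, V):
    kv = phi(K).T @ V                           # (d, d_v), computed once
    z = phi(K).sum(axis=0)                      # (d,) normalizer state
    return (phi(Q) @ kv) / (phi(Q) @ z)[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
assert np.allclose(attention_quadratic(Q, K, V), attention_linear(Q, K, V))
```

Because `kv` and `z` can be updated incrementally as new keys arrive, linear attention also admits a recurrent O(1)-per-token inference mode.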
"The quest for efficient Transformers is not about finding a single solution but assembling the right combination of techniques for each specific deployment scenario."
Key Takeaway
Attention efficiency is a spectrum: from exact optimizations like Flash Attention (same quality, less memory) to approximate methods like sparse and linear attention (slightly different quality, dramatically lower cost). The best choice depends on your quality requirements and computational constraints.
Hardware-Aware Optimizations
Some of the most impactful efficiency improvements come from optimizing how computations interact with hardware.
Flash Attention
Flash Attention, covered in detail in our dedicated article, is the gold standard for attention optimization. It computes exact standard attention while reducing memory from O(N^2) to O(N) and achieving 2-5x speedup through IO-aware tiling and kernel fusion. Flash Attention is now the default in most production systems.
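The numerical heart of Flash Attention is the "online softmax": the output is accumulated block by block with a running maximum, so the full score matrix never exists in memory. A minimal single-query sketch (illustrative, far from the real fused kernel):

```python
import numpy as np

def attention_online(q, K, V, block=4):
    """Exact softmax attention for one query, processing keys/values in blocks."""
    m = -np.inf                 # running max of scores seen so far
    denom = 0.0                 # running softmax denominator
    acc = np.zeros_like(V[0])   # running weighted sum of values
    for start in range(0, len(K), block):
        s = K[start:start + block] @ q          # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)               # rescale earlier partial sums
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / denom

rng = np.random.default_rng(1)
q = rng.normal(size=4)
K, V = rng.normal(size=(16, 4)), rng.normal(size=(16, 4))
w = np.exp(K @ q - (K @ q).max())
expected = (w / w.sum()) @ V                    # ordinary full-softmax result
assert np.allclose(attention_online(q, K, V), expected)
```

The assertion is the whole point: this is an exact optimization, not an approximation.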
Paged Attention (vLLM)
For inference serving, the KV cache is a major memory bottleneck. Paged Attention, introduced in the vLLM framework, manages KV cache memory like virtual memory in operating systems -- allocating memory in pages and eliminating waste from pre-allocated contiguous buffers. This can improve throughput by 2-4x in production serving scenarios.
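The bookkeeping behind paged KV caching can be sketched in miniature: logical token positions map to fixed-size physical pages via a per-sequence block table, and finished sequences return their pages to a shared pool. Names here are illustrative, not vLLM's API:

```python
PAGE_SIZE = 16  # tokens per physical page (example value)

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # seq_id -> list of physical page ids
        self.lengths = {}        # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve a cache slot for one new token; returns (page, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % PAGE_SIZE == 0:               # current page full: grab a new one
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1
        return table[n // PAGE_SIZE], n % PAGE_SIZE

    def free(self, seq_id):
        """Return a finished sequence's pages to the pool immediately."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
for _ in range(20):                           # 20 tokens -> only 2 pages used
    page, offset = cache.append_token(seq_id=0)
```

Memory is allocated one page at a time as sequences grow, so no space is wasted on pre-reserved maximum-length buffers.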
Continuous Batching
Traditional batch serving waits for all requests in a batch to complete before starting a new batch. Continuous batching inserts new requests into the batch as soon as a slot opens up, keeping GPU utilization high. Combined with paged attention, this enables dramatically higher throughput for inference serving.
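A toy simulation illustrates the throughput difference. Requests with different generation lengths share a fixed number of batch slots; continuous batching refills a slot the moment its request finishes, while static batching waits for the longest request in each batch. Purely illustrative, not a real scheduler:

```python
from collections import deque

def continuous_batching_steps(request_lengths, slots):
    pending = deque(request_lengths)
    active = []                                # remaining tokens per request
    steps = 0
    while pending or active:
        while pending and len(active) < slots:  # refill freed slots immediately
            active.append(pending.popleft())
        active = [r - 1 for r in active]        # one decode step for the batch
        active = [r for r in active if r > 0]   # finished requests leave now
        steps += 1
    return steps

def static_batching_steps(request_lengths, slots):
    steps = 0
    for i in range(0, len(request_lengths), slots):
        steps += max(request_lengths[i:i + slots])  # batch waits for longest
    return steps

reqs = [3, 50, 4, 5, 40, 6]
fast = continuous_batching_steps(reqs, slots=2)
slow = static_batching_steps(reqs, slots=2)
```

Short requests no longer sit behind long ones, which is exactly where the real-world throughput gains come from.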
Model Compression Techniques
Beyond architectural changes, several techniques reduce model size and computation without changing the architecture.
Quantization
Reducing weight precision from 16-bit to 8-bit, 4-bit, or even lower. Modern quantization methods like GPTQ and AWQ, and the quantized formats shipped as GGUF, can compress models by roughly 4x with minimal quality loss. The key insight is that trained weights have low effective precision -- most of the information lives in the most significant bits.
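As a baseline for intuition, here is the simplest possible scheme: symmetric per-tensor int8 quantization (GPTQ and AWQ are considerably more sophisticated, with per-group scales and calibration):

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 in [-127, 127] with a single shared scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Storage drops 4x vs float32; worst-case error is half a quantization step.
rel_err = np.abs(w - w_hat).max() / np.abs(w).max()
```

Production schemes add per-channel or per-group scales precisely because a single shared scale is dominated by outlier weights.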
Pruning
Removing unnecessary parameters from the model. Structured pruning removes entire attention heads or feed-forward neurons, enabling actual speedup without specialized hardware. Unstructured pruning removes individual weights, creating sparse matrices that require specialized kernels to exploit.
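A minimal sketch of structured magnitude pruning for a feed-forward layer: drop the hidden neurons whose weights have the smallest norms, shrinking both weight matrices so the layer runs faster with ordinary dense kernels. The scoring rule here (summed L2 norms) is one simple choice among many:

```python
import numpy as np

def prune_ffn(W_in, W_out, keep_ratio):
    """W_in: (d, h), W_out: (h, d). Remove the lowest-norm hidden units."""
    # Score each hidden unit by its incoming + outgoing weight norms
    norms = np.linalg.norm(W_in, axis=0) + np.linalg.norm(W_out, axis=1)
    n_keep = int(len(norms) * keep_ratio)
    keep = np.sort(np.argsort(norms)[-n_keep:])   # indices of surviving units
    return W_in[:, keep], W_out[keep, :]

rng = np.random.default_rng(3)
W_in, W_out = rng.normal(size=(64, 256)), rng.normal(size=(256, 64))
W_in_p, W_out_p = prune_ffn(W_in, W_out, keep_ratio=0.5)
```

Because entire columns and rows are removed, the pruned layer is just a smaller dense layer; no sparse kernels are needed.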
Knowledge Distillation
Training a smaller "student" model to mimic a larger "teacher." The student learns from the teacher's output probabilities, which contain richer information than hard labels. Distillation can create models that are 10x smaller with only modest quality degradation.
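The soft-label objective is typically a temperature-scaled KL divergence between teacher and student distributions. A sketch, with temperature T = 2.0 as an example hyperparameter value:

```python
import numpy as np

def softmax(x, T=1.0):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / T)
    return z / z.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    p = softmax(teacher_logits, T)    # softened teacher targets
    q = softmax(student_logits, T)
    # KL(p || q), scaled by T^2 so gradient magnitudes match the hard-label loss
    return (T * T) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[3.5, 1.2, 0.4], [0.0, 2.5, 0.5]])
loss = distill_loss(student, teacher)
```

The temperature flattens the distributions so the student also learns the teacher's ranking of wrong answers, not just its top choice. In practice this term is usually mixed with the ordinary cross-entropy on hard labels.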
Inference-Specific Optimizations
Several techniques specifically target inference efficiency, where patterns differ from training.
Speculative Decoding
Speculative decoding uses a small, fast draft model to generate candidate tokens, which the larger model verifies in parallel. Because verifying several tokens in one forward pass is far cheaper than generating them one at a time, this preserves the large model's output distribution exactly while producing multiple tokens per large-model pass. Speedups of 2-3x are common.
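The core of the method is the acceptance rule: a draft token sampled from the draft distribution q is kept with probability min(1, p/q) under the target distribution p, which provably preserves the target distribution. A toy sketch with fixed distributions (real implementations use per-position distributions and resample on rejection):

```python
import numpy as np

def verify_draft(draft_tokens, p_target, q_draft, rng):
    """Return the prefix of draft tokens the target model accepts."""
    accepted = []
    for t in draft_tokens:
        # Accept token t with probability min(1, p(t) / q(t))
        if rng.random() < min(1.0, p_target[t] / q_draft[t]):
            accepted.append(t)
        else:
            break   # first rejection ends the speculated run
    return accepted

p = np.array([0.6, 0.3, 0.1])        # target model's next-token distribution
q = np.array([0.5, 0.4, 0.1])        # draft model's distribution
draft = [0, 0, 2]
accepted = verify_draft(draft, p, q, np.random.default_rng(4))
```

Note the intuition: when the draft model agrees perfectly with the target (p = q), every draft token is accepted and the speedup is maximal.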
KV Cache Optimization
The KV cache grows with every generated token, eventually dominating memory usage. Techniques to optimize it include:
- KV cache quantization: Compressing cached keys and values to lower precision.
- Sliding window eviction: Discarding oldest entries beyond a fixed window.
- Attention sink: Keeping the first few tokens' cache (which receive disproportionate attention) while evicting middle tokens.
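The attention-sink policy reduces to a few lines of bookkeeping: keep the first few "sink" positions plus a recent window, and drop everything in between. A sketch (the parameter values are examples, not recommendations):

```python
def evict(cache_positions, num_sinks, window):
    """Keep the first num_sinks positions and the last `window` positions."""
    if len(cache_positions) <= num_sinks + window:
        return cache_positions          # cache still fits; nothing to evict
    return cache_positions[:num_sinks] + cache_positions[-window:]

positions = list(range(100))            # 100 cached token positions
kept = evict(positions, num_sinks=4, window=8)
# Cache size is now bounded at num_sinks + window regardless of context length.
```

This bounds KV memory at a constant regardless of how long generation runs, at the cost of forgetting the evicted middle of the context.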
Key Takeaway
Making Transformers efficient is a multi-dimensional challenge requiring techniques at every level: architecture (sparse/linear attention), hardware (Flash Attention, paged attention), compression (quantization, pruning), and inference (speculative decoding, KV cache optimization). The most effective systems combine multiple approaches.
Alternative Architectures
Some approaches abandon the Transformer framework entirely in pursuit of efficiency.
State Space Models (Mamba)
Mamba and its variants offer O(N) complexity with competitive quality by using selective state spaces instead of attention. Hybrid Mamba-Transformer architectures are emerging as a promising direction that combines SSM efficiency with attention precision.
RWKV
RWKV combines elements of RNNs and Transformers, offering linear complexity with a formulation that can be computed in both recurrent and parallel modes. It has shown competitive results on language modeling tasks.
Choosing the Right Approach
The optimal efficiency strategy depends on your specific constraints:
- Memory-constrained deployment: Focus on quantization (4-bit GPTQ/AWQ) and smaller model sizes.
- Latency-critical applications: Use speculative decoding, smaller models, and optimized serving frameworks like vLLM.
- Long-context requirements: Flash Attention for exact attention, sparse attention patterns for very long contexts, or SSMs for ultra-long sequences.
- High-throughput serving: Combine paged attention, continuous batching, and quantization for maximum requests per second.
- Training efficiency: Flash Attention, mixed precision training, and efficient data loading are the highest-impact optimizations.
Key Takeaway
The field of efficient Transformers is vast and rapidly evolving. The key principle is to identify your specific bottleneck -- memory, latency, throughput, or context length -- and apply the targeted optimization that addresses it. Most production systems stack multiple techniques for compound improvements.
