The Transformer architecture is remarkable but expensive. Self-attention has O(N^2) complexity in both time and memory, meaning that doubling the sequence length quadruples the cost. As models grow larger and context windows extend to millions of tokens, the need for efficient Transformer variants has become critical. This survey covers the major approaches to making Transformers faster, from attention optimization to model compression and architectural alternatives.
The Efficiency Taxonomy
Approaches to Transformer efficiency can be organized into several categories, each targeting different bottlenecks in the computation pipeline.
Attention Pattern Modifications
The most direct approach is changing how attention is computed. Standard full attention allows every token to attend to every other token, creating the N^2 cost. By restricting the attention pattern, we can reduce this cost:
- Sparse attention: Only compute attention between a subset of token pairs. Longformer and BigBird combine local sliding window attention with a few global tokens, achieving O(N) complexity while maintaining the ability to propagate information across the full sequence.
- Sliding window attention: Mistral limits each token's attention to a fixed window of nearby tokens; longer-range dependencies are still captured because the receptive field compounds across layers.
- Multi-query/grouped query attention: Sharing key-value heads across multiple query heads reduces the KV cache size and memory bandwidth requirements during inference without significantly affecting quality.
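To make the sparse-pattern idea concrete, here is a minimal sketch of a Longformer-style attention mask combining a local sliding window with a few global tokens. The sizes and function name are illustrative, not any library's actual API:

```python
# Sparse attention mask: each token attends to a local window plus a few
# designated global tokens; global tokens attend to (and are attended by) all.
import numpy as np

def sparse_attention_mask(seq_len: int, window: int, global_tokens: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True if token i may attend to token j."""
    idx = np.arange(seq_len)
    # Local band: positions within `window` of each other
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    # First `global_tokens` positions are global in both directions
    is_global = idx < global_tokens
    return local | is_global[:, None] | is_global[None, :]

mask = sparse_attention_mask(seq_len=16, window=2, global_tokens=2)
# The number of attended pairs grows linearly with seq_len, not quadratically.
```

Note that the mask alone does not save compute; real implementations use custom kernels that skip the masked-out pairs entirely.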
Linear Attention
Linear attention methods replace the softmax attention kernel with an alternative formulation that allows the computation to be rearranged from O(N^2) to O(N). The key idea is to apply a kernel feature map to the queries and keys so that the key-value product can be computed once and reused across all queries, instead of materializing the full N x N attention matrix.
While mathematically elegant, linear attention methods have historically struggled to match standard attention quality on language modeling tasks. The softmax's ability to create sharp, selective attention patterns appears to be important for performance. Recent work on gated linear attention has closed this gap by adding data-dependent gating mechanisms.
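The rearrangement can be shown in a few lines. This sketch uses the common feature map phi(x) = elu(x) + 1 as an illustrative choice (not any specific paper's exact recipe); the two functions compute identical outputs, but the second never forms the N x N matrix:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a simple positive feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

def attention_quadratic(Q, K, V):
    scores = phi(Q) @ phi(K).T                  # (N, N) -- the O(N^2) object
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def attention_linear(Q, K, V):
    kv = phi(K).T @ V                           # (d, d_v), computed once
    z = phi(K).sum(axis=0)                      # (d,) normalizer state
    return (phi(Q) @ kv) / (phi(Q) @ z)[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
assert np.allclose(attention_quadratic(Q, K, V), attention_linear(Q, K, V))
```

Because `kv` and `z` can be updated incrementally as new keys arrive, linear attention also admits a recurrent O(1)-per-token inference mode.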
"The quest for efficient Transformers is not about finding a single solution but assembling the right combination of techniques for each specific deployment scenario."
Key Takeaway
Attention efficiency is a spectrum: from exact optimizations like Flash Attention (same quality, less memory) to approximate methods like sparse and linear attention (slightly different quality, dramatically lower cost). The best choice depends on your quality requirements and computational constraints.
Hardware-Aware Optimizations
Some of the most impactful efficiency improvements come from optimizing how computations interact with hardware.
Flash Attention
Flash Attention, covered in detail in our dedicated article, is the gold standard for attention optimization. It computes exact standard attention while reducing memory from O(N^2) to O(N) and achieving 2-5x speedup through IO-aware tiling and kernel fusion. Flash Attention is now the default in most production systems.
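The numerical heart of Flash Attention is the "online softmax": the output is accumulated block by block with a running maximum, so the full score matrix never exists in memory. A minimal single-query sketch (illustrative, far from the real fused kernel):

```python
import numpy as np

def attention_online(q, K, V, block=4):
    """Exact softmax attention for one query, processing keys/values in blocks."""
    m = -np.inf                 # running max of scores seen so far
    denom = 0.0                 # running softmax denominator
    acc = np.zeros_like(V[0])   # running weighted sum of values
    for start in range(0, len(K), block):
        s = K[start:start + block] @ q          # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)               # rescale earlier partial sums
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / denom

rng = np.random.default_rng(1)
q = rng.normal(size=4)
K, V = rng.normal(size=(16, 4)), rng.normal(size=(16, 4))
w = np.exp(K @ q - (K @ q).max())
expected = (w / w.sum()) @ V                    # ordinary full-softmax result
assert np.allclose(attention_online(q, K, V), expected)
```

The assertion is the whole point: this is an exact optimization, not an approximation.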
Paged Attention (vLLM)
For inference serving, the KV cache is a major memory bottleneck. Paged Attention, introduced in the vLLM framework, manages KV cache memory like virtual memory in operating systems -- allocating memory in pages and eliminating waste from pre-allocated contiguous buffers. This can improve throughput by 2-4x in production serving scenarios.
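The bookkeeping behind paged KV caching can be sketched in miniature: logical token positions map to fixed-size physical pages via a per-sequence block table, and finished sequences return their pages to a shared pool. Names here are illustrative, not vLLM's API:

```python
PAGE_SIZE = 16  # tokens per physical page (example value)

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # seq_id -> list of physical page ids
        self.lengths = {}        # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve a cache slot for one new token; returns (page, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % PAGE_SIZE == 0:               # current page full: grab a new one
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1
        return table[n // PAGE_SIZE], n % PAGE_SIZE

    def free(self, seq_id):
        """Return a finished sequence's pages to the pool immediately."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
for _ in range(20):                           # 20 tokens -> only 2 pages used
    page, offset = cache.append_token(seq_id=0)
```

Memory is allocated one page at a time as sequences grow, so no space is wasted on pre-reserved maximum-length buffers.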
Continuous Batching
Traditional batch serving waits for all requests in a batch to complete before starting a new batch. Continuous batching inserts new requests into the batch as soon as a slot opens up, keeping GPU utilization high. Combined with paged attention, this enables dramatically higher throughput for inference serving.
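A toy simulation illustrates the throughput difference. Requests with different generation lengths share a fixed number of batch slots; continuous batching refills a slot the moment its request finishes, while static batching waits for the longest request in each batch. Purely illustrative, not a real scheduler:

```python
from collections import deque

def continuous_batching_steps(request_lengths, slots):
    pending = deque(request_lengths)
    active = []                                # remaining tokens per request
    steps = 0
    while pending or active:
        while pending and len(active) < slots:  # refill freed slots immediately
            active.append(pending.popleft())
        active = [r - 1 for r in active]        # one decode step for the batch
        active = [r for r in active if r > 0]   # finished requests leave now
        steps += 1
    return steps

def static_batching_steps(request_lengths, slots):
    steps = 0
    for i in range(0, len(request_lengths), slots):
        steps += max(request_lengths[i:i + slots])  # batch waits for longest
    return steps

reqs = [3, 50, 4, 5, 40, 6]
fast = continuous_batching_steps(reqs, slots=2)
slow = static_batching_steps(reqs, slots=2)
```

Short requests no longer sit behind long ones, which is exactly where the real-world throughput gains come from.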
Model Compression Techniques
Beyond architectural changes, several techniques reduce model size and computation without changing the architecture.
Quantization
Reducing weight precision from 16-bit to 8-bit, 4-bit, or even lower. Modern quantization methods like GPTQ and AWQ, and the quantized formats shipped as GGUF, can compress models by roughly 4x with minimal quality loss. The key insight is that trained weights have low effective precision -- most of the information lives in the most significant bits.
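As a baseline for intuition, here is the simplest possible scheme: symmetric per-tensor int8 quantization (GPTQ and AWQ are considerably more sophisticated, with per-group scales and calibration):

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 in [-127, 127] with a single shared scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Storage drops 4x vs float32; worst-case error is half a quantization step.
rel_err = np.abs(w - w_hat).max() / np.abs(w).max()
```

Production schemes add per-channel or per-group scales precisely because a single shared scale is dominated by outlier weights.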
Pruning
Removing unnecessary parameters from the model. Structured pruning removes entire attention heads or feed-forward neurons, enabling actual speedup without specialized hardware. Unstructured pruning removes individual weights, creating sparse matrices that require specialized kernels to exploit.
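A minimal sketch of structured magnitude pruning for a feed-forward layer: drop the hidden neurons whose weights have the smallest norms, shrinking both weight matrices so the layer runs faster with ordinary dense kernels. The scoring rule here (summed L2 norms) is one simple choice among many:

```python
import numpy as np

def prune_ffn(W_in, W_out, keep_ratio):
    """W_in: (d, h), W_out: (h, d). Remove the lowest-norm hidden units."""
    # Score each hidden unit by its incoming + outgoing weight norms
    norms = np.linalg.norm(W_in, axis=0) + np.linalg.norm(W_out, axis=1)
    n_keep = int(len(norms) * keep_ratio)
    keep = np.sort(np.argsort(norms)[-n_keep:])   # indices of surviving units
    return W_in[:, keep], W_out[keep, :]

rng = np.random.default_rng(3)
W_in, W_out = rng.normal(size=(64, 256)), rng.normal(size=(256, 64))
W_in_p, W_out_p = prune_ffn(W_in, W_out, keep_ratio=0.5)
```

Because entire columns and rows are removed, the pruned layer is just a smaller dense layer; no sparse kernels are needed.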
Knowledge Distillation
Training a smaller "student" model to mimic a larger "teacher." The student learns from the teacher's output probabilities, which contain richer information than hard labels. Distillation can create models that are 10x smaller with only modest quality degradation.
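The soft-label objective is typically a temperature-scaled KL divergence between teacher and student distributions. A sketch, with temperature T = 2.0 as an example hyperparameter value:

```python
import numpy as np

def softmax(x, T=1.0):
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / T)
    return z / z.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    p = softmax(teacher_logits, T)    # softened teacher targets
    q = softmax(student_logits, T)
    # KL(p || q), scaled by T^2 so gradient magnitudes match the hard-label loss
    return (T * T) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[3.5, 1.2, 0.4], [0.0, 2.5, 0.5]])
loss = distill_loss(student, teacher)
```

The temperature flattens the distributions so the student also learns the teacher's ranking of wrong answers, not just its top choice. In practice this term is usually mixed with the ordinary cross-entropy on hard labels.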
Inference-Specific Optimizations
Several techniques specifically target inference efficiency, where patterns differ from training.
Speculative Decoding
Speculative decoding uses a small, fast draft model to generate candidate tokens, which the larger model verifies in parallel. Because verifying several tokens in one forward pass is far cheaper than generating them one at a time, this preserves the large model's output distribution exactly while producing multiple tokens per large-model pass. Speedups of 2-3x are common.
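The core of the method is the acceptance rule: a draft token sampled from the draft distribution q is kept with probability min(1, p/q) under the target distribution p, which provably preserves the target distribution. A toy sketch with fixed distributions (real implementations use per-position distributions and resample on rejection):

```python
import numpy as np

def verify_draft(draft_tokens, p_target, q_draft, rng):
    """Return the prefix of draft tokens the target model accepts."""
    accepted = []
    for t in draft_tokens:
        # Accept token t with probability min(1, p(t) / q(t))
        if rng.random() < min(1.0, p_target[t] / q_draft[t]):
            accepted.append(t)
        else:
            break   # first rejection ends the speculated run
    return accepted

p = np.array([0.6, 0.3, 0.1])        # target model's next-token distribution
q = np.array([0.5, 0.4, 0.1])        # draft model's distribution
draft = [0, 0, 2]
accepted = verify_draft(draft, p, q, np.random.default_rng(4))
```

Note the intuition: when the draft model agrees perfectly with the target (p = q), every draft token is accepted and the speedup is maximal.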
KV Cache Optimization
The KV cache grows with every generated token, eventually dominating memory usage. Techniques to optimize it include:
- KV cache quantization: Compressing cached keys and values to lower precision.
- Sliding window eviction: Discarding oldest entries beyond a fixed window.
- Attention sink: Keeping the first few tokens' cache (which receive disproportionate attention) while evicting middle tokens.
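The attention-sink policy reduces to a few lines of bookkeeping: keep the first few "sink" positions plus a recent window, and drop everything in between. A sketch (the parameter values are examples, not recommendations):

```python
def evict(cache_positions, num_sinks, window):
    """Keep the first num_sinks positions and the last `window` positions."""
    if len(cache_positions) <= num_sinks + window:
        return cache_positions          # cache still fits; nothing to evict
    return cache_positions[:num_sinks] + cache_positions[-window:]

positions = list(range(100))            # 100 cached token positions
kept = evict(positions, num_sinks=4, window=8)
# Cache size is now bounded at num_sinks + window regardless of context length.
```

This bounds KV memory at a constant regardless of how long generation runs, at the cost of forgetting the evicted middle of the context.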
Key Takeaway
Making Transformers efficient is a multi-dimensional challenge requiring techniques at every level: architecture (sparse/linear attention), hardware (Flash Attention, paged attention), compression (quantization, pruning), and inference (speculative decoding, KV cache optimization). The most effective systems combine multiple approaches.
Alternative Architectures
Some approaches abandon the Transformer framework entirely in pursuit of efficiency.
State Space Models (Mamba)
Mamba and its variants offer O(N) complexity with competitive quality by using selective state spaces instead of attention. Hybrid Mamba-Transformer architectures are emerging as a promising direction that combines SSM efficiency with attention precision.
RWKV
RWKV combines elements of RNNs and Transformers, offering linear complexity with a formulation that can be computed in both recurrent and parallel modes. It has shown competitive results on language modeling tasks.
Choosing the Right Approach
The optimal efficiency strategy depends on your specific constraints:
- Memory-constrained deployment: Focus on quantization (4-bit GPTQ/AWQ) and smaller model sizes.
- Latency-critical applications: Use speculative decoding, smaller models, and optimized serving frameworks like vLLM.
- Long-context requirements: Flash Attention for exact attention, sparse attention patterns for very long contexts, or SSMs for ultra-long sequences.
- High-throughput serving: Combine paged attention, continuous batching, and quantization for maximum requests per second.
- Training efficiency: Flash Attention, mixed precision training, and efficient data loading are the highest-impact optimizations.
Key Takeaway
The field of efficient Transformers is vast and rapidly evolving. The key principle is to identify your specific bottleneck -- memory, latency, throughput, or context length -- and apply the targeted optimization that addresses it. Most production systems stack multiple techniques for compound improvements.
