The Transformer has dominated sequence modeling for years, but its quadratic attention cost has always been a limitation. State Space Models (SSMs) offer a fundamentally different approach: processing sequences in linear time using principles from continuous-time signal processing. Mamba, the most successful SSM to date, has shown that these models can match Transformer quality on language tasks while being significantly faster for long sequences. Could SSMs be the architecture that eventually supplements or replaces the Transformer?
What Are State Space Models?
State space models are inspired by control theory and signal processing. They model a sequence by mapping inputs to outputs through a hidden state that evolves over time according to a set of learned parameters. The key equations are:
h'(t) = Ah(t) + Bx(t)
y(t) = Ch(t) + Dx(t)
Here, h(t) is the hidden state, x(t) the input, y(t) the output, and A, B, C, D are learned matrices. The state evolves in continuous time; for practical computation on discrete token sequences, these equations are discretized (typically with a zero-order hold), yielding a discrete recurrence with matrices Ā and B̄.
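To make the discretization concrete, here is a minimal numpy sketch of a one-dimensional SSM. The scalar parameter values are made up for illustration: the continuous-time A and B are converted to discrete Ā and B̄ via a zero-order hold, and the resulting recurrence is run over a short input sequence.

```python
import numpy as np

# Toy 1-D SSM with hypothetical scalar parameters.
A, B, C, D = -0.5, 1.0, 1.0, 0.0   # continuous-time parameters
dt = 0.1                            # step size (delta)

# Zero-order-hold discretization (scalar case):
#   A_bar = exp(dt * A)
#   B_bar = (A_bar - 1) / A * B
A_bar = np.exp(dt * A)
B_bar = (A_bar - 1.0) / A * B

# Discrete recurrence over a token sequence:
#   h[k] = A_bar * h[k-1] + B_bar * x[k]
#   y[k] = C * h[k] + D * x[k]
x = np.array([1.0, 0.0, 0.0, 0.0])  # an impulse input
h, ys = 0.0, []
for xk in x:
    h = A_bar * h + B_bar * xk
    ys.append(C * h + D * xk)
```

With A negative, Ā lies in (0, 1), so the impulse response decays geometrically; that stability is why the state can be unrolled over long sequences.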
The remarkable property of SSMs is that they can be computed in two modes:
- Recurrent mode: Processing one token at a time by updating the hidden state, ideal for efficient inference.
- Convolutional mode: Processing the entire sequence in parallel using a convolution, ideal for efficient training.
This dual mode gives SSMs the best of both worlds: parallel training like Transformers and constant-time per-step inference like RNNs.
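The equivalence of the two modes can be checked directly. This sketch uses illustrative, already-discretized scalar parameters and computes the same output twice: once as a step-by-step recurrence, and once as a convolution with the kernel K[k] = C·Ā^k·B̄.

```python
import numpy as np

# Discrete SSM parameters (already discretized; values are illustrative).
A_bar, B_bar, C = 0.9, 0.5, 2.0
x = np.array([1.0, -1.0, 0.5, 2.0, 0.0])
L = len(x)

# Recurrent mode: one token at a time, O(1) work and state per step.
h, y_rec = 0.0, []
for xk in x:
    h = A_bar * h + B_bar * xk
    y_rec.append(C * h)
y_rec = np.array(y_rec)

# Convolutional mode: precompute the kernel K[k] = C * A_bar**k * B_bar
# once, then convolve it with the whole input in parallel.
K = C * (A_bar ** np.arange(L)) * B_bar
y_conv = np.convolve(x, K)[:L]   # truncate to the causal outputs

# Both modes produce identical outputs.
assert np.allclose(y_rec, y_conv)
```

The recurrent loop is what runs at inference time; the convolutional form is what makes training parallelizable.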
"State space models represent a return to recurrent computation, but with the mathematical sophistication to overcome the problems that plagued traditional RNNs."
The Evolution: From S4 to Mamba
S4: Structured State Spaces
The S4 model by Albert Gu et al. (2021) was the breakthrough that made SSMs competitive. The key innovation was a special initialization for the A matrix (the HiPPO initialization) that allowed the model to efficiently compress long-range dependencies into a fixed-size state. S4 excelled on tasks requiring very long-range reasoning, like the Long Range Arena benchmark, where Transformers struggled.
Mamba: Selective State Spaces
Mamba, introduced by Gu and Dao in late 2023, addressed S4's main limitation: the state transition parameters were the same for all inputs. Mamba introduced input-dependent (selective) parameters, allowing the model to dynamically decide what information to remember and what to forget based on the current input. This selectivity is analogous to the gating in LSTMs but applied within the SSM framework.
The key innovations in Mamba include:
- Selective scan: The B, C, and Δ (step-size) parameters are computed as functions of the input, allowing content-dependent reasoning.
- Hardware-aware implementation: A custom CUDA kernel that keeps the expanded state in fast SRAM rather than materializing it in HBM, similar in spirit to FlashAttention.
- Simplified architecture: Removing the attention and MLP blocks entirely, replacing them with a single Mamba block that combines selective SSM with gated linear units.
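A heavily simplified numpy sketch of the selective-scan idea, with made-up projection weights (the real Mamba block also includes gating, a short convolution, and per-channel Δ, all omitted here): B, C, and Δ are computed from each token, so the state update itself is content-dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, L = 4, 8, 6            # illustrative sizes

# Hypothetical projections that make B, C, and delta input-dependent.
W_B = rng.normal(size=(d_model, d_state)) * 0.1
W_C = rng.normal(size=(d_model, d_state)) * 0.1
W_dt = rng.normal(size=(d_model,)) * 0.1
A = -np.exp(rng.normal(size=(d_state,)))  # negative diagonal A for stability

x = rng.normal(size=(L, d_model))

h = np.zeros((d_model, d_state))
ys = []
for xk in x:                              # sequential scan over the sequence
    # Selectivity: parameters depend on the current token. (dt is a
    # scalar per token here; Mamba uses a per-channel delta.)
    dt = np.log1p(np.exp(xk @ W_dt + 1.0))  # softplus keeps delta positive
    Bk = xk @ W_B                            # (d_state,)
    Ck = xk @ W_C                            # (d_state,)
    A_bar = np.exp(dt * A)                   # diagonal discretization
    # Large dt -> "write" the input strongly; small dt -> "hold" the state.
    h = A_bar * h + (dt * Bk) * xk[:, None]
    ys.append(h @ Ck)                        # read out via input-dependent C
y = np.stack(ys)                          # (L, d_model)
```

Because Ā, B, and C now vary per token, the convolutional shortcut no longer applies; Mamba instead relies on a parallel associative scan, which this sequential loop stands in for.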
Key Takeaway
Mamba's selective state spaces allow the model to dynamically decide what to remember from the input, giving it content-dependent reasoning ability similar to attention but with linear time complexity instead of quadratic.
Mamba vs Transformers: Performance Comparison
Mamba has shown compelling results across several dimensions:
- Language modeling: Mamba matches or exceeds Transformer models of the same size on standard language modeling benchmarks.
- Long sequences: Mamba's linear complexity gives it a significant advantage on very long sequences where Transformers become prohibitively expensive.
- Inference speed: Mamba's recurrent inference mode processes tokens in constant time per step, compared to the growing cost per token in Transformers (due to the expanding KV cache).
- Training throughput: The parallel scan keeps training efficient, and because compute scales linearly rather than quadratically with sequence length, Mamba's throughput advantage over Transformers grows as sequences get longer.
However, Transformers maintain advantages in certain areas. The attention mechanism's ability to perform direct pairwise comparisons between any two tokens gives it an edge on tasks requiring precise retrieval from context (like in-context learning and information extraction). Mamba's fixed-size state, while much more efficient, may not preserve all information from very long contexts as faithfully as attention.
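A back-of-envelope calculation makes the inference-memory gap concrete. The model sizes below are hypothetical, chosen only to show the scaling: a Transformer's KV cache grows linearly with context length, while an SSM's recurrent state does not.

```python
# Back-of-envelope inference-memory comparison. All sizes are
# hypothetical and chosen only to illustrate the scaling behavior.
n_layers, d_model = 32, 4096
d_state = 16          # SSM state size per channel
bytes_per = 2         # fp16

def kv_cache_bytes(seq_len):
    # Transformer: keys + values for every past token, in every layer.
    return n_layers * seq_len * 2 * d_model * bytes_per

# SSM: fixed-size recurrent state, independent of sequence length.
ssm_state = n_layers * d_model * d_state * bytes_per

for seq_len in (1_000, 100_000):
    print(f"{seq_len:>7} tokens: KV cache {kv_cache_bytes(seq_len)/1e9:.2f} GB, "
          f"SSM state {ssm_state/1e6:.1f} MB")
```

At 100x the context length, the KV cache is 100x larger while the SSM state is unchanged, which is exactly the constant-per-step property noted above.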
Hybrid Architectures
The most promising direction may be hybrid models that combine SSM layers with attention layers. AI21's Jamba model demonstrated this approach, interleaving Mamba layers with attention layers. The attention layers provide precise retrieval capability while the Mamba layers handle the bulk of sequence processing efficiently.
This pairing gives each mechanism the role it is best at: SSM layers carry the bulk of the computation cheaply, while a small number of attention layers supply precise retrieval where it matters most. Other hybrids, such as Zamba and various research prototypes, are exploring different ratios and combinations of SSM and attention layers.
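As a sketch of what interleaving looks like, the snippet below builds a layer stack with one attention layer per eight layers. The ratio is purely illustrative, not any particular model's recipe.

```python
# Illustrative hybrid stack: mostly SSM layers, periodic attention layers.
# The 1-in-8 ratio is an example choice, not a specific model's design.
n_layers, attn_every = 24, 8

stack = [
    "attention" if (i % attn_every == attn_every - 1) else "mamba"
    for i in range(n_layers)
]
print(stack.count("mamba"), stack.count("attention"))   # 21 mamba, 3 attention
```

Shifting `attn_every` trades retrieval precision against the linear-time efficiency of the SSM layers, which is the design axis these hybrids are exploring.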
The Broader Landscape
Mamba is not the only alternative to standard attention. Several other architectures are competing in this space:
- RWKV: An architecture that combines elements of RNNs and Transformers, offering linear-time computation with competitive quality.
- Linear attention: Variants that approximate standard attention with linear complexity using kernel methods.
- RetNet: Microsoft's architecture that can operate in parallel (training), recurrent (inference), or chunked modes.
Key Takeaway
State space models represent the most credible challenge to Transformer dominance. Mamba's combination of linear-time complexity, competitive quality, and hardware-aware implementation makes it a serious contender. The future likely belongs to hybrid architectures that combine the strengths of both approaches.
