Sparse Attention
Attention mechanisms that attend to only a subset of positions rather than all positions, reducing the O(n²) cost of standard attention in sequence length n to sub-quadratic or linear.
Patterns
Local/sliding window: attend only to nearby tokens.
Strided: attend to every Nth token.
Global tokens: special tokens that attend to (and are attended to by) every position.
Learned sparsity: let the model learn which positions to attend to.
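The sliding-window pattern above can be sketched as a boolean mask, as in this minimal NumPy example (the function name and window convention are illustrative, not from any particular library):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: each query attends to keys
    # within `window` positions on either side. A causal variant
    # would keep only the left half of the band.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(8, 2)
# Each row allows at most 2 * window + 1 positions, so the cost of
# masked attention is O(seq_len * window) rather than O(seq_len**2).
```

In practice the mask is not materialized densely; efficient kernels iterate only over the banded region, which is where the sub-quadratic cost comes from.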
Implementations
Longformer, BigBird, and Mistral's sliding-window attention use sparse patterns to handle longer sequences efficiently. These patterns can be combined with FlashAttention-style kernels for additional speedups.
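Longformer and BigBird combine a local window with a handful of global tokens. A hedged sketch of that combined mask, assuming a hypothetical helper rather than either library's actual API:

```python
import numpy as np

def local_plus_global_mask(seq_len, window, global_idx):
    # Start from a sliding-window band: queries see keys within
    # `window` positions on either side.
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    # Global tokens (e.g. a [CLS]-like token) attend to every
    # position and are attended to by every position.
    mask[global_idx, :] = True
    mask[:, global_idx] = True
    return mask

mask = local_plus_global_mask(seq_len=16, window=2, global_idx=[0])
```

With a fixed window and a constant number of global tokens, the number of allowed pairs grows linearly in sequence length, which is the key property these models exploit.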