Sparse Attention
Attention mechanisms that attend to only a subset of positions rather than all positions, reducing the O(n²) cost of standard attention in sequence length n to sub-quadratic or linear.
Patterns
Local/sliding window: attend only to nearby tokens.
Strided: attend to every Nth token.
Global tokens: special tokens that attend to (and are attended to by) every position.
Learned sparsity: let the model learn which positions to attend to.
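The sliding-window pattern above can be sketched as a boolean mask, as in this minimal NumPy example (the function name and window convention are illustrative, not from any particular library):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: each query attends to keys
    # within `window` positions on either side. A causal variant
    # would keep only the left half of the band.
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = sliding_window_mask(8, 2)
# Each row allows at most 2 * window + 1 positions, so the cost of
# masked attention is O(seq_len * window) rather than O(seq_len**2).
```

In practice the mask is not materialized densely; efficient kernels iterate only over the banded region, which is where the sub-quadratic cost comes from.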
Implementations
Longformer, BigBird, and Mistral's sliding-window attention use sparse patterns to handle longer sequences efficiently. These patterns can be combined with FlashAttention-style kernels for additional speedups.
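Longformer and BigBird combine a local window with a handful of global tokens. A hedged sketch of that combined mask, assuming a hypothetical helper rather than either library's actual API:

```python
import numpy as np

def local_plus_global_mask(seq_len, window, global_idx):
    # Start from a sliding-window band: queries see keys within
    # `window` positions on either side.
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    # Global tokens (e.g. a [CLS]-like token) attend to every
    # position and are attended to by every position.
    mask[global_idx, :] = True
    mask[:, global_idx] = True
    return mask

mask = local_plus_global_mask(seq_len=16, window=2, global_idx=[0])
```

With a fixed window and a constant number of global tokens, the number of allowed pairs grows linearly in sequence length, which is the key property these models exploit.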