AI Glossary

Interleaved Attention

Alternating between local and global attention patterns in transformer layers for efficiency.

Overview

Interleaved attention is a transformer design pattern that alternates between different attention mechanisms across layers — typically mixing local (sliding window) attention with global (full sequence) attention. This allows the model to efficiently process long sequences while maintaining the ability to attend to distant tokens when needed.
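The local/global distinction comes down to which keys each query position may attend to. A minimal sketch (illustrative only, not any specific model's implementation; the function names and the `window` parameter are assumptions) showing how an alternating layer pattern and its causal attention masks could be constructed:

```python
import numpy as np

def layer_pattern(n_layers, global_every=4):
    """Mark every `global_every`-th layer as global; the rest use local attention."""
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(n_layers)]

def attention_mask(seq_len, kind, window=4):
    """Boolean causal mask: local layers only see the last `window` tokens."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i                   # no attending to future tokens
    if kind == "global":
        return causal
    return causal & (i - j < window)  # restrict to a sliding window

pattern = layer_pattern(8, global_every=4)
# Three local layers followed by one global layer, repeated.
```

At inference time the same idea applies to the KV cache: local layers only need to retain the last `window` keys and values, which is where much of the memory saving comes from.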

Key Details

Gemma 2 alternates sliding-window and global attention layers in a 1:1 pattern, while hybrid models such as Jamba interleave a small number of full-attention layers among efficient SSM (Mamba) layers. This achieves near-full-attention quality at a fraction of the computational cost. The pattern leverages the observation that most attention heads primarily attend to nearby tokens, with only a few heads requiring long-range connections.
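The cost saving can be estimated with simple arithmetic. This sketch (the sequence length, window size, layer count, and 3-local-to-1-global ratio are illustrative assumptions, not figures from any particular model) compares attention score computations, which scale with query positions times visible key positions:

```python
def attn_cost(seq_len, window=None):
    """Approximate attention score count for one layer."""
    if window is None:                       # global: each token sees all positions
        return seq_len * seq_len
    return seq_len * min(window, seq_len)    # local: at most `window` keys per token

seq_len, window, n_layers, global_every = 8192, 1024, 32, 4

full = n_layers * attn_cost(seq_len)         # every layer global
mixed = sum(
    attn_cost(seq_len, None if (i + 1) % global_every == 0 else window)
    for i in range(n_layers)
)
print(f"interleaved / full cost: {mixed / full:.3f}")  # prints 0.344
```

With these assumed numbers, interleaving cuts attention compute to roughly a third of full attention, and the gap widens as sequence length grows relative to the window size.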

Related Concepts

attention mechanism, sparse attention, state space model


Last updated: March 5, 2026