Attention Sink
The phenomenon where LLMs disproportionately attend to the first few tokens regardless of their content.
Overview
Attention sink is a phenomenon observed in large language models in which the attention mechanism assigns disproportionately high attention scores to the first few tokens of the sequence, regardless of their semantic relevance. Because the softmax in attention forces each query's attention weights to sum to 1, a query with nothing useful to attend to must still place its probability mass somewhere; the model learns to 'dump' this excess mass onto the initial tokens, which under causal masking are visible to every later position.
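The normalization constraint can be seen in a minimal sketch (the scores below are illustrative, not taken from any real model): even when position 0 carries no relevant content, softmax must distribute a full unit of attention mass, so a learned bias toward the first token absorbs a large share of it.

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Toy attention scores from one query over 6 key positions.
# The semantically relevant key (index 4) scores highest, but the model
# has also learned a moderately high score for position 0 (the sink).
scores = np.array([2.0, -1.0, -1.0, -1.0, 3.0, -1.0])
weights = softmax(scores)

print(weights.round(3))  # position 0 takes a large share despite carrying no content
print(weights.sum())     # softmax forces the weights to sum to exactly 1
```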
Key Details
Understanding attention sinks has practical implications for efficient LLM serving. The StreamingLLM approach exploits the effect by caching only the initial attention-sink tokens plus a rolling window of recent tokens, which bounds memory while allowing generation to continue indefinitely and supports streaming applications that would otherwise be limited by the context window size.
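The eviction policy can be sketched as follows; this is a simplified position-level illustration of the idea, not the StreamingLLM implementation, and the sink count and window size here are arbitrary choices for the example.

```python
def kept_positions(seq_len, n_sink=4, window=8):
    """Token positions a sink-aware rolling cache would retain:
    the first n_sink tokens (attention sinks) plus the most recent
    `window` tokens. All other positions are evicted."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # everything still fits
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

# Cache size stays bounded at n_sink + window no matter how long
# generation runs, instead of growing linearly with sequence length.
print(kept_positions(10))   # short sequence: keep all positions
print(kept_positions(100))  # sinks [0..3] plus recent window [92..99]
```

Keeping the sink tokens matters: evicting them would remove the positions onto which the model dumps its excess attention mass, which is what degrades quality under a plain sliding window.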