Self-Attention
The mechanism in transformers by which each token in a sequence computes attention scores against every other token (including itself), determining how strongly it attends to each position.
How It Works
Each token is projected into three vectors: Query (Q), Key (K), and Value (V). Attention weights are computed as softmax(QK^T / sqrt(d_k)), where d_k is the key dimension; dividing by sqrt(d_k) keeps the dot products from growing large and saturating the softmax. These weights determine how much information each token receives from every other token, and the output is the weighted sum of the Value vectors.
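The projection-score-sum pipeline above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation; the weight matrices and dimensions are arbitrary assumptions for the demo.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (n, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # project each token to Q, K, V
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # (n, n) scaled dot-product logits
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v                        # weighted sum of Value vectors

# Toy example: 4 tokens, model width 8 (hypothetical sizes for illustration).
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
x = rng.standard_normal((n, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): one output vector per input token
```

Note that the output has the same sequence length as the input: each token's output row is a mixture of all tokens' Value vectors, mixed according to that token's attention weights.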
Multi-Head Attention
Instead of a single attention computation, the model runs several attention 'heads' in parallel, each of which can learn to attend to a different type of relationship (syntactic, semantic, positional). The head outputs are concatenated and passed through a final linear projection.
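The concatenate-and-project step can be sketched as follows. This is a hedged toy version: real implementations batch all heads into one tensor operation rather than looping, and the weight shapes here (two heads splitting a width-8 model) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v) triples; w_o: final output projection."""
    outputs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v)        # one head, shape (n, d_head)
    # Concatenate head outputs along the feature axis, then project back.
    return np.concatenate(outputs, axis=-1) @ w_o

rng = np.random.default_rng(1)
n, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads                        # each head works in a smaller subspace
x = rng.standard_normal((n, d_model))
heads = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
w_o = rng.standard_normal((d_model, d_model))
out = multi_head_attention(x, heads, w_o)
print(out.shape)  # (4, 8): same shape as single-head, but mixing two subspaces
```

Because each head projects into a d_model / n_heads subspace, the total cost is roughly the same as one full-width head, while the heads remain free to specialize.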
Complexity
Standard self-attention is O(n^2) in sequence length, since every token scores against every other token, which is why long-context processing is expensive. Efficient variants include FlashAttention (an IO-aware exact algorithm), sparse attention, and linear-attention approximations.
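To see why the quadratic term bites, consider the (n, n) score matrix alone. The arithmetic below is a back-of-the-envelope sketch (fp32, one head, one layer, no batching), not a measurement of any particular system.

```python
def score_matrix_bytes(n, bytes_per_float=4):
    # Materializing the full (n, n) attention score matrix costs n*n floats.
    return n * n * bytes_per_float

for n in (1_000, 10_000, 100_000):
    gb = score_matrix_bytes(n) / 1e9
    print(f"n={n:>7,}: {gb:>6.2f} GB for the fp32 score matrix")
```

A 100x longer sequence needs 10,000x the memory for scores, which is exactly the matrix that IO-aware methods like FlashAttention avoid materializing in full.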