Self-Attention
The mechanism in transformers by which each token in a sequence computes attention scores against every other token (including itself), determining how strongly it attends to each position.
How It Works
Each token is projected into three vectors: Query (Q), Key (K), and Value (V). Attention weights are computed as softmax(QK^T / sqrt(d_k)), where d_k is the key dimension; dividing by sqrt(d_k) keeps the dot products from growing large and saturating the softmax. These weights determine how much information each token receives from every other token, and the output is the weighted sum of the Value vectors.
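The projection-score-sum pipeline above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation; the weight matrices and dimensions are arbitrary assumptions for the demo.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (n, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # project each token to Q, K, V
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # (n, n) scaled dot-product logits
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v                        # weighted sum of Value vectors

# Toy example: 4 tokens, model width 8 (hypothetical sizes for illustration).
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
x = rng.standard_normal((n, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): one output vector per input token
```

Note that the output has the same sequence length as the input: each token's output row is a mixture of all tokens' Value vectors, mixed according to that token's attention weights.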
Multi-Head Attention
Instead of a single attention computation, the model runs several attention 'heads' in parallel, each of which can learn to attend to a different type of relationship (syntactic, semantic, positional). The head outputs are concatenated and passed through a final linear projection.
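The concatenate-and-project step can be sketched as follows. This is a hedged toy version: real implementations batch all heads into one tensor operation rather than looping, and the weight shapes here (two heads splitting a width-8 model) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v) triples; w_o: final output projection."""
    outputs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v)        # one head, shape (n, d_head)
    # Concatenate head outputs along the feature axis, then project back.
    return np.concatenate(outputs, axis=-1) @ w_o

rng = np.random.default_rng(1)
n, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads                        # each head works in a smaller subspace
x = rng.standard_normal((n, d_model))
heads = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
w_o = rng.standard_normal((d_model, d_model))
out = multi_head_attention(x, heads, w_o)
print(out.shape)  # (4, 8): same shape as single-head, but mixing two subspaces
```

Because each head projects into a d_model / n_heads subspace, the total cost is roughly the same as one full-width head, while the heads remain free to specialize.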
Complexity
Standard self-attention is O(n^2) in sequence length, since every token scores against every other token, which is why long-context processing is expensive. Efficient variants include FlashAttention (an IO-aware exact algorithm), sparse attention, and linear-attention approximations.
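To see why the quadratic term bites, consider the (n, n) score matrix alone. The arithmetic below is a back-of-the-envelope sketch (fp32, one head, one layer, no batching), not a measurement of any particular system.

```python
def score_matrix_bytes(n, bytes_per_float=4):
    # Materializing the full (n, n) attention score matrix costs n*n floats.
    return n * n * bytes_per_float

for n in (1_000, 10_000, 100_000):
    gb = score_matrix_bytes(n) / 1e9
    print(f"n={n:>7,}: {gb:>6.2f} GB for the fp32 score matrix")
```

A 100x longer sequence needs 10,000x the memory for scores, which is exactly the matrix that IO-aware methods like FlashAttention avoid materializing in full.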