Attention Head
An individual attention computation within a multi-head attention layer; each head learns to focus on a different type of relationship in the input.
What Each Head Learns
Different attention heads specialize in different patterns: syntactic relationships (subject-verb agreement), positional patterns (attending to nearby tokens), semantic relationships (coreference resolution), or factual recall.
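One way to surface positional specialization is to summarize each head's attention weights as a mean relative offset: a head that mostly attends to the previous token will score near -1, while a diffuse head scores near 0. The sketch below is illustrative; `mean_attention_offset` is a hypothetical helper, not part of any library, and the two hand-built attention maps stand in for weights extracted from a real model.

```python
import numpy as np

def mean_attention_offset(weights):
    """Average relative position (j - i) each head attends to.

    weights: (num_heads, seq_len, seq_len) attention probabilities;
    entry [h, i, j] is how much head h at query position i attends to key j.
    """
    num_heads, seq_len, _ = weights.shape
    # offsets[i, j] = j - i, the relative distance from query to key
    offsets = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    # Each of the seq_len rows sums to 1, so divide by seq_len to average
    return (weights * offsets[None]).sum(axis=(1, 2)) / seq_len

seq_len = 6
# A "previous-token" head: position i attends entirely to i-1
prev = np.zeros((seq_len, seq_len))
prev[0, 0] = 1.0  # first token has no predecessor; attend to itself
for i in range(1, seq_len):
    prev[i, i - 1] = 1.0
# A diffuse head: uniform attention over all positions
uniform = np.full((seq_len, seq_len), 1.0 / seq_len)

offsets_per_head = mean_attention_offset(np.stack([prev, uniform]))
print(offsets_per_head)  # first entry near -1, second near 0
```

Running this kind of summary over a trained model's heads is one simple probe for the positional patterns described above.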
Multi-Head Attention
A transformer layer typically has 12-128 attention heads running in parallel. Each operates on a lower-dimensional projection of the input (the head dimension is usually the model dimension divided by the number of heads). Their outputs are concatenated and projected back to the full dimension.
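The project-split-attend-concatenate flow can be sketched in a few lines of NumPy. This is a minimal illustration of the standard mechanism, not a production implementation: the weight matrices are random stand-ins, and details like batching, masking, and bias terms are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Minimal multi-head self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # each head works in a smaller subspace

    def project_and_split(w):
        # Project to d_model, then slice into per-head chunks of size d_head
        return (x @ w).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = map(project_and_split, (w_q, w_k, w_v))  # (heads, seq, d_head)

    # Scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    per_head = softmax(scores) @ v                       # (heads, seq, d_head)

    # Concatenate heads back to d_model, then apply the output projection
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 64, 8, 10
w = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(rng.normal(size=(seq_len, d_model)), *w, num_heads)
print(out.shape)  # (10, 64): same shape in and out
```

Because every head attends over the full sequence but in its own low-dimensional subspace, the heads can specialize independently while the total computation stays comparable to a single full-width attention.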
Head Pruning
Research shows that many attention heads can be removed after training without significantly hurting performance, suggesting considerable redundancy among heads. This insight is exploited for model compression and inference-efficiency improvements.
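A common way to simulate pruning is to zero out selected heads' outputs before they are concatenated, so the pruned heads contribute nothing downstream. The helper below is a hypothetical sketch of that masking step, assuming the per-head outputs have already been computed with shape (num_heads, seq_len, d_head).

```python
import numpy as np

def mask_heads(per_head, keep):
    """Zero out pruned heads before concatenation.

    per_head: (num_heads, seq_len, d_head) per-head attention outputs
    keep:     boolean array of shape (num_heads,); False means pruned
    """
    return per_head * keep[:, None, None].astype(per_head.dtype)

rng = np.random.default_rng(1)
per_head = rng.normal(size=(8, 10, 16))
keep = np.array([True] * 6 + [False] * 2)  # prune the last two heads
masked = mask_heads(per_head, keep)
print(bool(masked[6:].any()))  # False: pruned heads contribute nothing
```

In practice one would measure task performance as heads are masked, keep only the heads that matter, and then physically shrink the projection matrices to realize the speedup.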