Attention Head
One of the parallel attention mechanisms inside a multi-head attention layer; each head learns to focus on a different aspect of the input, such as syntax, semantics, or position.
Multi-Head Attention
Instead of a single attention function, transformers run multiple attention heads in parallel (typically 12-128). Each head has its own query, key, and value projections, so different heads can attend to different patterns simultaneously.
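A minimal NumPy sketch of the mechanism described above: the model dimension is split across heads, each head runs scaled dot-product attention independently, and the per-head outputs are concatenated and projected back. Function and variable names here are illustrative, not from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split the model dimension into per-head slices.
    q = (x @ Wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    out = weights @ v                                     # (heads, seq, d_head)
    # Concatenate heads and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq, d_model)
    return out @ Wo, weights

rng = np.random.default_rng(0)
d_model, n_heads, seq = 16, 4, 5
x = rng.standard_normal((seq, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y, w = multi_head_attention(x, *Ws, n_heads)
print(y.shape, w.shape)  # (5, 16) (4, 5, 5)
```

Note that the per-head attention weights `w` expose one (seq, seq) pattern per head, which is what researchers inspect when studying head specialization.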
What Heads Learn
Research shows that different heads specialize: some track syntactic relationships (e.g., subject-verb dependencies), others handle coreference (pronoun resolution), positional patterns, or semantic similarity. This specialization emerges naturally during training rather than being designed in.
Head Pruning
Not all heads are equally important: studies show that many can be removed after training with minimal loss in accuracy. Head pruning exploits this as a model compression technique, reducing computation and memory while maintaining quality.
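One common way to probe head importance is to gate each head's output with a binary mask, zeroing pruned heads before the output projection and measuring how much the result changes. The sketch below (all names hypothetical, assuming the same head-splitting layout as standard multi-head attention) shows the masking mechanics, not a full pruning pipeline.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_head_mask(x, Wq, Wk, Wv, Wo, n_heads, head_mask):
    """head_mask: (n_heads,) of 0/1 gates; a 0 removes that head entirely."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ Wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores, axis=-1) @ v          # (heads, seq, d_head)
    out = out * head_mask[:, None, None]        # zero out pruned heads
    return out.transpose(1, 0, 2).reshape(seq, d_model) @ Wo

rng = np.random.default_rng(0)
d_model, n_heads, seq = 16, 4, 5
x = rng.standard_normal((seq, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
full = attention_with_head_mask(x, *Ws, n_heads, np.ones(n_heads))
pruned = attention_with_head_mask(x, *Ws, n_heads, np.array([1.0, 1.0, 0.0, 1.0]))
# Compare `full` and `pruned` to estimate head 2's contribution.
```

In practice, heads whose removal barely moves the task metric are candidates for permanent pruning, after which the projection matrices can be physically shrunk to realize the compute savings.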