AI Glossary

Attention Head

An individual attention mechanism within a multi-head attention layer. Each head learns to focus on a different aspect of the input, such as syntax, semantics, or position.

Multi-Head Attention

Instead of a single attention function, transformers run multiple attention heads in parallel (typically 12 to 128 per layer, depending on model size). Each head has its own query, key, and value projections, allowing the layer to attend to different patterns simultaneously; the head outputs are concatenated and passed through an output projection.
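The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: weights are random (a trained model would learn them), and masking and biases are omitted. All names here (`multi_head_attention`, `Wq`, `Wk`, `Wv`, `Wo`) are illustrative, not from any particular library.

```python
import numpy as np

def multi_head_attention(x, num_heads, rng):
    """Minimal multi-head self-attention sketch (no masking, no biases)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # One set of query/key/value projections per head (randomly initialized).
    Wq = rng.standard_normal((num_heads, d_model, d_head)) / np.sqrt(d_model)
    Wk = rng.standard_normal((num_heads, d_model, d_head)) / np.sqrt(d_model)
    Wv = rng.standard_normal((num_heads, d_model, d_head)) / np.sqrt(d_model)
    Wo = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    head_outputs = []
    for h in range(num_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(d_head)               # (seq_len, seq_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        head_outputs.append(weights @ v)                 # (seq_len, d_head)
    # Concatenate all heads, then mix them with the output projection.
    return np.concatenate(head_outputs, axis=-1) @ Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                  # 4 tokens, model dim 8
out = multi_head_attention(x, num_heads=2, rng=rng)
print(out.shape)                                 # (4, 8): same shape as input
```

Note that each head works in a reduced dimension (`d_model // num_heads`), so the total cost is comparable to a single full-width attention function.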

What Heads Learn

Research shows that different heads specialize: some track syntactic relationships (such as subject-verb agreement), others handle coreference (pronoun resolution), positional patterns, or semantic similarity. This specialization is not designed in; it emerges naturally during training.

Head Pruning

Not all heads are equally important. Studies show many heads can be removed with minimal performance loss. Head pruning is a model compression technique that reduces computation while maintaining quality.
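One simple way to see what pruning a head means: zero out that head's output before the final projection, so it contributes nothing downstream. The sketch below assumes per-head outputs are already computed; the names (`attention_with_head_mask`, `head_mask`, `Wo`) are illustrative, and real pruning methods would choose which heads to drop based on importance scores rather than by hand.

```python
import numpy as np

def attention_with_head_mask(heads, head_mask, Wo):
    """Combine per-head outputs, zeroing pruned heads before the output projection.

    heads: (num_heads, seq_len, d_head) per-head attention outputs
    head_mask: (num_heads,) of 1.0 (keep) or 0.0 (prune)
    """
    masked = heads * head_mask[:, None, None]        # pruned heads contribute nothing
    seq_len = heads.shape[1]
    concat = masked.transpose(1, 0, 2).reshape(seq_len, -1)
    return concat @ Wo

rng = np.random.default_rng(1)
num_heads, seq_len, d_head = 4, 3, 2
heads = rng.standard_normal((num_heads, seq_len, d_head))
Wo = rng.standard_normal((num_heads * d_head, num_heads * d_head))

full = attention_with_head_mask(heads, np.ones(num_heads), Wo)
pruned = attention_with_head_mask(heads, np.array([1.0, 0.0, 1.0, 0.0]), Wo)
print(np.allclose(full, pruned))  # False: output changes, ideally only slightly
```

In practice the pruned heads' projection weights can be deleted entirely, which is where the compute and memory savings come from; the mask formulation is just a convenient way to experiment with which heads matter.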


Last updated: March 5, 2026