Attention Head
One of the parallel attention mechanisms inside a multi-head attention layer; each head learns to focus on a different aspect of the input, such as syntax, semantics, or position.
Multi-Head Attention
Instead of a single attention function, transformers run multiple attention heads in parallel (typically 12-128). Each head has its own query, key, and value projections, so different heads can attend to different patterns simultaneously.
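A minimal NumPy sketch of the mechanism described above: the model dimension is split across heads, each head runs scaled dot-product attention independently, and the per-head outputs are concatenated and projected back. Function and variable names here are illustrative, not from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split the model dimension into per-head slices.
    q = (x @ Wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    out = weights @ v                                     # (heads, seq, d_head)
    # Concatenate heads and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq, d_model)
    return out @ Wo, weights

rng = np.random.default_rng(0)
d_model, n_heads, seq = 16, 4, 5
x = rng.standard_normal((seq, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y, w = multi_head_attention(x, *Ws, n_heads)
print(y.shape, w.shape)  # (5, 16) (4, 5, 5)
```

Note that the per-head attention weights `w` expose one (seq, seq) pattern per head, which is what researchers inspect when studying head specialization.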
What Heads Learn
Research shows that different heads specialize: some track syntactic relationships (e.g., subject-verb dependencies), others handle coreference (pronoun resolution), positional patterns, or semantic similarity. This specialization emerges naturally during training rather than being designed in.
Head Pruning
Not all heads are equally important: studies show that many can be removed after training with minimal loss in accuracy. Head pruning exploits this as a model compression technique, reducing computation and memory while maintaining quality.
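One common way to probe head importance is to gate each head's output with a binary mask, zeroing pruned heads before the output projection and measuring how much the result changes. The sketch below (all names hypothetical, assuming the same head-splitting layout as standard multi-head attention) shows the masking mechanics, not a full pruning pipeline.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_head_mask(x, Wq, Wk, Wv, Wo, n_heads, head_mask):
    """head_mask: (n_heads,) of 0/1 gates; a 0 removes that head entirely."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ Wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores, axis=-1) @ v          # (heads, seq, d_head)
    out = out * head_mask[:, None, None]        # zero out pruned heads
    return out.transpose(1, 0, 2).reshape(seq, d_model) @ Wo

rng = np.random.default_rng(0)
d_model, n_heads, seq = 16, 4, 5
x = rng.standard_normal((seq, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
full = attention_with_head_mask(x, *Ws, n_heads, np.ones(n_heads))
pruned = attention_with_head_mask(x, *Ws, n_heads, np.array([1.0, 1.0, 0.0, 1.0]))
# Compare `full` and `pruned` to estimate head 2's contribution.
```

In practice, heads whose removal barely moves the task metric are candidates for permanent pruning, after which the projection matrices can be physically shrunk to realize the compute savings.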