AI Glossary

Attention Head

An individual attention computation within a multi-head attention layer, each head learning to focus on different types of relationships in the input.
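The computation a single head performs is scaled dot-product attention over learned projections of the input. Below is a minimal NumPy sketch (the matrix names `w_q`, `w_k`, `w_v` and the toy dimensions are illustrative, not from any particular library):

```python
import numpy as np

def attention_head(x, w_q, w_k, w_v):
    """One attention head: scaled dot-product attention.

    x: (seq_len, d_model) token representations
    w_q, w_k, w_v: (d_model, d_head) learned projections
    Returns: (seq_len, d_head) per-token outputs.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # (seq_len, d_head) each
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)             # (seq_len, seq_len)
    # Row-wise softmax: each token's attention weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 4 tokens, model width 8, head width 2.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
out = attention_head(x,
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)))
```

The attention weight matrix (`seq_len` × `seq_len`) is where the patterns described below live: a "positional" head puts its weight near the diagonal, a "syntactic" head links related words.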

What Each Head Learns

Different attention heads specialize in different patterns: syntactic relationships (subject-verb agreement), positional patterns (attending to nearby tokens), semantic relationships (coreference resolution), or factual recall.

Multi-Head Attention

A transformer layer typically runs anywhere from 8 to 128 attention heads in parallel, depending on model size. Each head operates on a lower-dimensional projection of the input (commonly the model dimension divided by the number of heads). The heads' outputs are concatenated and projected back to the full model dimension.
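The split-compute-concatenate-project pattern can be sketched in a few lines of NumPy. This is a simplified illustration, not a production implementation (real libraries batch all heads into single matrix multiplies):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v) tuples, one per head.
    Each head attends in its own d_head-dimensional subspace;
    outputs are concatenated and mapped back to d_model by w_o."""
    outs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        a = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        outs.append(a @ v)                       # (seq_len, d_head)
    concat = np.concatenate(outs, axis=-1)       # (seq_len, n_heads * d_head)
    return concat @ w_o                          # (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads                      # heads split the model width
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
w_o = rng.normal(size=(n_heads * d_head, d_model))
y = multi_head_attention(rng.normal(size=(seq_len, d_model)), heads, w_o)
```

Note the design choice: because each head works in a `d_model / n_heads` subspace, adding heads does not increase the layer's total compute, it only partitions it.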

Head Pruning

Research shows that many attention heads can be removed without significantly hurting performance, suggesting redundancy. This insight is used for model compression and efficiency improvements.


Last updated: March 5, 2026