AI Glossary

Attention Score

A numerical weight that determines how much a token attends to (or focuses on) another token in the attention mechanism of transformers.

Computation

Attention scores are computed as the dot product of a query vector and a key vector, divided by the square root of the key dimension (for numerical stability), then normalized via softmax so that each token's weights sum to 1.
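The computation above can be sketched in plain Python. This is a minimal illustration of scaled dot-product attention scores, not any particular library's implementation; the function names are chosen here for clarity.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(queries, keys):
    """Scaled dot-product attention scores.

    queries, keys: lists of vectors (lists of floats), all of dimension d_k.
    Returns one row of weights per query; each row sums to 1.
    """
    d_k = len(keys[0])
    rows = []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        rows.append(softmax(scores))
    return rows

queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_scores(queries, keys)  # 2 rows, one per query
```

Each row of `weights` is a probability distribution over the keys: the softmax guarantees the entries are positive and sum to 1, which is what makes them interpretable as "how much" one token attends to another.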

Interpretation

Higher attention scores mean the model considers that token more relevant when processing the current token. Visualizing attention patterns can reveal what relationships the model has learned, though attention weights are not a complete explanation of model behavior.

Causal vs. Bidirectional

In causal models (e.g., GPT), each token can only attend to itself and previous tokens. In bidirectional models (e.g., BERT), tokens can attend to all positions. This fundamental choice shapes what tasks the model can perform: causal attention suits autoregressive generation, while bidirectional attention suits understanding tasks where the full input is available at once.
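Causal attention is typically implemented by masking out future positions before the softmax, so they receive exactly zero weight. A minimal sketch (function name chosen here for illustration):

```python
import math

def causal_attention_weights(scores, position):
    """Softmax over scores[0..position]; future positions get zero weight.

    scores: raw (pre-softmax) attention scores for one query token
            against every position in the sequence.
    position: index of the query token; positions after it are masked.
    """
    visible = scores[:position + 1]
    m = max(visible)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in visible]
    total = sum(exps)
    # Weights over visible tokens, padded with zeros for masked (future) ones.
    return [e / total for e in exps] + [0.0] * (len(scores) - position - 1)

# Raw scores for the token at position 1 in a 4-token sequence.
raw = [0.5, 1.0, 2.0, -0.3]
weights = causal_attention_weights(raw, position=1)  # positions 2, 3 masked
```

In practice the mask is applied by setting future scores to a large negative value (effectively negative infinity) before a single softmax over the whole row, which produces the same result; a bidirectional model simply skips the mask.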


Last updated: March 5, 2026