If you have spent any time studying transformers, you have encountered the terms "self-attention" and "cross-attention." Both use the same fundamental Query-Key-Value (QKV) computation, but they serve very different purposes. Self-attention lets a sequence examine itself, while cross-attention lets one sequence examine another. This distinction is crucial for understanding how modern AI architectures like GPT, BERT, and encoder-decoder models actually work.

The QKV Framework: A Quick Refresher

Before diving into the differences, let us revisit the shared foundation. Every attention computation involves three components:

  • Query (Q): Represents what we are looking for -- the current position asking "what should I pay attention to?"
  • Key (K): Represents what each position offers -- like labels on file folders
  • Value (V): Represents the actual content to retrieve -- the information inside those folders

The attention score between a query and a key determines how much of the corresponding value gets included in the output. Mathematically, it is the same operation regardless of whether we are doing self-attention or cross-attention:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

The critical difference lies in where Q, K, and V come from.
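Before looking at where the inputs come from, here is a minimal NumPy sketch of the formula itself. The shapes are arbitrary, chosen only to make the point that the number of queries and the number of keys need not match:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))    # 6 key positions, same d_k
V = rng.normal(size=(6, 16))   # values can have a different dimension
out = attention(Q, K, V)
print(out.shape)               # (4, 16): one output vector per query position
```

The same function serves both self-attention and cross-attention; only the provenance of `Q`, `K`, and `V` changes.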

Self-Attention: A Sequence Talks to Itself

In self-attention, all three matrices -- Q, K, and V -- are derived from the same input sequence. Each token generates its own query, key, and value by multiplying the token's embedding with learned weight matrices W_Q, W_K, and W_V.

Consider the sentence: "The cat sat on the mat." In self-attention, each word computes attention scores with every other word in the same sentence. The word "cat" might attend strongly to "sat" (its verb) and "The" (its determiner), while "mat" might attend to "on" and "the."

Self-attention is like a group discussion where every participant listens to everyone else in the same room, building a richer understanding of the collective conversation.
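Concretely, self-attention derives all three matrices from the same input. A small sketch with six token embeddings (standing in for "The cat sat on the mat") and randomly initialized projections in place of learned weights:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, d_k = 6, 32, 8                # 6 tokens, embedding dim 32

X = rng.normal(size=(n, d_model))         # token embeddings for one sentence
W_Q = rng.normal(size=(d_model, d_k))     # learned projections (random here)
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # all three come from the same X

scores = Q @ K.T / np.sqrt(d_k)           # (6, 6): every token vs. every token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                         # (6, 8): contextualized token vectors
```

The (6, 6) weight matrix is where "cat attends to sat" would show up: row i gives token i's attention distribution over the whole sentence, including itself.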

Why Self-Attention Is Powerful

Self-attention captures intra-sequence relationships regardless of distance. In an RNN, the relationship between the first word and the hundredth word requires information to flow through 99 sequential steps, degrading along the way. In self-attention, every pair of positions is directly connected in a single computation step.

This property makes self-attention especially effective for:

  • Coreference resolution: Connecting pronouns to their antecedents across long distances
  • Syntactic structure: Learning grammatical relationships between words
  • Semantic similarity: Identifying thematically related parts of a text

Key Takeaway

In self-attention, Q, K, and V all come from the same sequence. This enables every position to directly attend to every other position, capturing long-range dependencies in a single step.

Cross-Attention: Bridging Two Sequences

In cross-attention, the queries come from one sequence while the keys and values come from a different sequence. This is the mechanism that allows information to flow between the encoder and decoder in transformer models.

In a translation task, for example, the decoder generates queries from the partially generated French output, while the keys and values come from the English input processed by the encoder. At each step, the decoder asks: "Given what I am generating right now, which parts of the English input should I focus on?"

How Cross-Attention Differs Structurally

The structural difference is straightforward:

  • Self-attention: Q = X * W_Q, K = X * W_K, V = X * W_V (X is the same for all three)
  • Cross-attention: Q = Y * W_Q, K = X * W_K, V = X * W_V (queries from Y, keys and values from X)

Here, X is the encoder output (or source sequence) and Y is the decoder input (or target sequence). The dimension of the keys and queries must match for the dot product to work, but the sequences X and Y can have different lengths.
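The two bullet points above differ by a single line of code. A sketch with a source sequence of length 10 and a target sequence of length 7 (lengths chosen only to show they can differ):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_k = 32, 8
m, n = 10, 7                        # source length 10, target length 7

X = rng.normal(size=(m, d_model))   # encoder output (source sequence)
Y = rng.normal(size=(n, d_model))   # decoder states (target sequence)

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = Y @ W_Q                         # queries come from Y
K, V = X @ W_K, X @ W_V             # keys and values come from X

scores = Q @ K.T / np.sqrt(d_k)     # (7, 10): each target position scores every source position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                   # (7, 8): one output per *target* position
```

Note that the output has one row per query, i.e. per target position; the source length only determines how many candidates each query scores.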

Where Each Type Appears in Transformers

The original transformer architecture uses both types of attention in specific locations:

Encoder Block

Contains self-attention only. Each layer of the encoder applies self-attention so that every input token can attend to every other input token. This builds a rich, contextual representation of the source sequence.

Decoder Block

Contains both self-attention and cross-attention. First, masked self-attention allows each position in the output to attend to previous output positions (but not future ones, to preserve autoregressive generation). Then, cross-attention connects the decoder to the encoder's output, enabling the decoder to draw from the source sequence.
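The masking step can be sketched in a few lines: future positions get a score of negative infinity before the softmax, which turns into exactly zero attention weight after it. (The scores here are random stand-ins for a real QK^T product.)

```python
import numpy as np

n = 5
scores = np.random.default_rng(3).normal(size=(n, n))  # stand-in for Q K^T / sqrt(d_k)

# Causal mask: position i may attend only to positions j <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
scores = np.where(mask, -np.inf, scores)          # future positions get -inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# exp(-inf) = 0, so each row attends only to itself and earlier positions.
```

Row 0 ends up attending only to itself, row 1 to positions 0 and 1, and so on, which is what preserves the autoregressive property during training.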

Encoder-Only Models (BERT)

Use only self-attention. Since BERT processes the entire input bidirectionally and does not generate output autoregressively, it has no need for cross-attention. Every token attends to every other token in the input.

Decoder-Only Models (GPT)

Use only masked self-attention. Models like GPT have no encoder to cross-attend to. Each token attends only to itself and all preceding tokens, which enables autoregressive text generation.

Key Takeaway

Encoder-only models use bidirectional self-attention, decoder-only models use masked (causal) self-attention, and encoder-decoder models use all three: encoder self-attention, decoder masked self-attention, and encoder-decoder cross-attention.

Cross-Attention Beyond Text

Cross-attention is not limited to text-to-text tasks. It has become the standard mechanism for connecting different modalities or information streams:

  • Text-to-image generation: In models like Stable Diffusion, the U-Net denoising network uses cross-attention to condition image generation on text prompts. The queries come from image features, while keys and values come from text embeddings.
  • Visual question answering: The model cross-attends from question tokens to image regions, focusing on the relevant parts of the image for each aspect of the question.
  • Speech recognition: The decoder cross-attends from text tokens to audio features, aligning the generated transcript with the spoken words.
  • Retrieval-augmented generation: Some RAG architectures use cross-attention to integrate retrieved document passages with the generation process.

Cross-attention is the universal bridge in modern AI. Whenever a model needs to combine information from two different sources, cross-attention is typically the mechanism that connects them.

Practical Considerations and Trade-offs

When designing or fine-tuning models, understanding the computational implications of each attention type matters:

Self-attention has complexity O(n^2) where n is the sequence length. For long sequences, this can be expensive. However, because no attention score depends on any other, all n^2 pairwise computations can run in parallel, unlike an RNN's inherently sequential steps.

Cross-attention has complexity O(n * m) where n is the query sequence length and m is the key/value sequence length. If the encoder output is cached (as it typically is during inference), cross-attention can be very efficient because the keys and values do not change as the decoder generates tokens.

This caching property is why encoder-decoder models can sometimes be more efficient at inference time than decoder-only models for tasks like translation, even though they have more total parameters. The encoder runs once, and its output is reused at every decoder step through cross-attention.
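The caching pattern described above can be sketched as follows. The encoder's keys and values are projected once; each decoding step then computes only a single new query against the cached matrices. (The decoder states here are random stand-ins for real hidden states.)

```python
import numpy as np

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
d_model, d_k, m = 32, 8, 10
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Encoder runs once; its keys and values are cached for the whole generation.
encoder_out = rng.normal(size=(m, d_model))
K_cache = encoder_out @ W_K           # computed once, reused every step
V_cache = encoder_out @ W_V

outputs = []
for step in range(3):                 # three decoding steps, for illustration
    y = rng.normal(size=(1, d_model)) # current decoder state (stand-in)
    q = y @ W_Q                       # only the new query is computed per step
    w = softmax_rows(q @ K_cache.T / np.sqrt(d_k))
    outputs.append(w @ V_cache)       # (1, d_k) cross-attention output
```

Per step, the work is O(m) against the cached source, rather than re-projecting the entire source sequence each time.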

Understanding when to use self-attention versus cross-attention is fundamental to designing effective transformer architectures. Self-attention builds rich internal representations, while cross-attention enables information flow between different processing stages or modalities. Together, they form the computational backbone of modern AI.