Transformers process all tokens in a sequence simultaneously through self-attention, which gives them a massive speed advantage over sequential models like RNNs. But this parallelism comes with a fundamental problem: without some additional mechanism, a transformer cannot distinguish "The dog chased the cat" from "The cat chased the dog." The self-attention operation treats its inputs as an unordered set, not a sequence. Positional encoding is the solution, injecting information about token positions so the model understands word order.
Why Position Matters
In natural language, the order of words carries enormous meaning. "Alice loves Bob" and "Bob loves Alice" contain the exact same words but describe completely different situations. An RNN naturally captures this because it processes tokens one at a time, with the hidden state encoding the history of what came before. But a transformer's self-attention is permutation-equivariant: if you shuffle the input tokens, the output simply shuffles correspondingly. The attention weights between tokens do not change.
This means that without positional information, the sentence "I ate lunch then dinner" would produce the same internal representations as "dinner then ate I lunch." Clearly, some mechanism for encoding position is essential.
Self-attention sees words as a bag of tokens. Positional encoding is what transforms that bag into a meaningful sequence.
Sinusoidal Positional Encoding
The original transformer paper by Vaswani et al. introduced sinusoidal positional encoding, using sine and cosine functions of different frequencies to create unique position vectors:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Here, pos is the position in the sequence, i indexes pairs of dimensions (so dimensions 2i and 2i+1 share a frequency), and d_model is the embedding dimension. Each pair of the positional encoding uses a sinusoid with a different frequency, ranging from very high frequency (short wavelength) for the first dimensions to very low frequency (long wavelength) for the last dimensions.
The resulting positional encoding vector is added to the token embedding, not concatenated. This means the model must learn to disentangle positional information from semantic information within the same vector space.
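The construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation; the function name and shapes are chosen here for clarity.

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Build the (max_len, d_model) sinusoidal positional encoding matrix."""
    pos = np.arange(max_len)[:, None]        # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=64)
# The matrix is added to token embeddings, not concatenated:
# x = token_embeddings + pe[:seq_len]
```

Note that position 0 always encodes to alternating zeros and ones (sin 0 and cos 0), and every entry stays within [-1, 1], keeping the positional signal comparable in scale to the token embeddings.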
Why Sinusoids?
The choice of sinusoidal functions was deliberate and has elegant mathematical properties:
- Unique positions: Each position gets a unique encoding vector, so the model can distinguish any two positions
- Relative positions: The encoding of position pos+k can be expressed as a linear function of the encoding at position pos, allowing the model to learn to attend to relative positions
- Bounded values: Sine and cosine are bounded between -1 and 1, keeping the positional signal from overwhelming the token embeddings
- Generalization: The model can theoretically generalize to sequence lengths longer than those seen during training
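The relative-position property in particular is easy to verify numerically: for any offset k, the (sin, cos) pair at position pos+k is a fixed 2D rotation of the pair at position pos, with a rotation angle that depends only on k. A small sketch (dimension pair and positions chosen arbitrarily for illustration):

```python
import numpy as np

d_model, pair = 64, 4                      # examine dimension pair (8, 9)
w = 10000.0 ** (-2 * pair / d_model)       # frequency of this pair
pos, k = 17, 5

def pe_pair(p):
    """The (sin, cos) values of one dimension pair at position p."""
    return np.array([np.sin(w * p), np.cos(w * p)])

# Rotation matrix that depends only on the offset k, not on pos:
b = w * k
M = np.array([[np.cos(b),  np.sin(b)],
              [-np.sin(b), np.cos(b)]])

# M @ pe_pair(pos) equals pe_pair(pos + k), by the angle-addition identities
matches = np.allclose(M @ pe_pair(pos), pe_pair(pos + k))
```

Because M is independent of pos, a learned attention projection can implement "look k tokens back" as a single linear map, which is the property the original paper highlights.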
Key Takeaway
Sinusoidal positional encodings use waves of different frequencies across dimensions to create unique, bounded position vectors that encode both absolute and relative position information.
Learned Positional Embeddings
An alternative to fixed sinusoidal encodings is to learn the positional embeddings from data, just like token embeddings. BERT, GPT-2, and many other models take this approach. A learnable embedding matrix of shape (max_length, d_model) is initialized randomly, and each position index maps to a row of this matrix.
The advantages of learned embeddings include:
- The model can adapt positional representations to the specific task and data distribution
- No assumption about the nature of positional relationships is baked in
- Implementation is straightforward: it is just another embedding lookup
However, learned embeddings have a significant limitation: they have a fixed maximum length. If the model is trained with a maximum sequence length of 512, it has no embedding for position 513. This makes it unable to generalize to longer sequences at inference time, unlike sinusoidal encodings which can in principle be computed for any position.
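Both the simplicity and the fixed-length limitation are visible in a minimal sketch. The names and initialization scale here are illustrative (a small random init in the spirit of GPT-2-style models), not a specific model's code:

```python
import numpy as np

rng = np.random.default_rng(0)
max_length, d_model = 512, 64

# A learned positional embedding is just a trainable lookup table,
# initialized randomly like the token embedding matrix.
pos_embedding = rng.normal(scale=0.02, size=(max_length, d_model))

def add_positions(token_embeddings: np.ndarray) -> np.ndarray:
    seq_len = token_embeddings.shape[0]
    if seq_len > max_length:
        # The table simply has no row for position max_length or beyond.
        raise ValueError(f"sequence length {seq_len} exceeds max_length {max_length}")
    return token_embeddings + pos_embedding[:seq_len]

x = rng.normal(size=(100, d_model))
out = add_positions(x)   # fine: 100 <= 512
```

A sequence of length 513 would raise an error at the lookup, which is exactly the generalization failure described above: unlike sinusoids, there is nothing to compute for an unseen position.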
In practice, the original transformer paper found that sinusoidal and learned positional encodings performed comparably, suggesting that the specific form of encoding matters less than the fact that positional information is included at all.
Rotary Position Embedding (RoPE)
The most influential positional encoding innovation in recent years is Rotary Position Embedding (RoPE), proposed by Su et al. and adopted by LLaMA, Mistral, and many other modern LLMs. RoPE takes a fundamentally different approach: instead of adding positional information to the embeddings, it encodes position by rotating the query and key vectors in attention.
The core idea is elegant. Pairs of dimensions in the query and key vectors are treated as 2D coordinates and rotated by an angle proportional to the position:
f(x, pos) = R(theta_i * pos) x
where R(.) is a 2D rotation matrix and theta_i = 10000^(-2i/d_model) is the frequency assigned to the i-th dimension pair
When computing the dot product between a query at position m and a key at position n, the rotation angles subtract, making the attention score depend only on the relative position (m - n) rather than the absolute positions.
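This relative-position property can be checked directly. The sketch below rotates consecutive dimension pairs of a query and key by position-dependent angles (a simplified single-vector version, not a full attention implementation) and confirms that the resulting dot product depends only on the offset m - n:

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by angles theta_i * pos."""
    d = x.shape[0]
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    ang = theta * pos
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Two (m, n) pairs with the same relative offset m - n = 3:
s1 = rope_rotate(q, 10) @ rope_rotate(k, 7)
s2 = rope_rotate(q, 103) @ rope_rotate(k, 100)
# s1 and s2 are equal: the score depends only on m - n
```

Algebraically this works because rotations compose: R(theta*m)q . R(theta*n)k = q . R(theta*(n-m))k, so the absolute positions cancel and only the difference survives.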
Why RoPE Works So Well
RoPE has several compelling properties that explain its widespread adoption:
- Relative position encoding: Attention scores naturally depend on relative distance, matching linguistic intuition
- Decay with distance: The dot product between rotated vectors naturally decays with increasing relative distance, providing an inductive bias toward local attention
- Extensibility: With techniques like NTK-aware scaling and YaRN, RoPE can be extended to longer sequences than those seen during training
- Efficiency: Rotation can be implemented efficiently without additional parameters
ALiBi: Attention with Linear Biases
ALiBi (Attention with Linear Biases), proposed by Press et al., takes yet another approach. Instead of modifying the embeddings or the Q/K vectors, ALiBi adds a linear bias directly to the attention scores. The bias is proportional to the distance between query and key positions, and each attention head uses a different slope.
attention_score = q_i . k_j - m * |i - j|
The penalty m * |i - j| increases linearly with distance, naturally encouraging the model to attend more to nearby tokens. Different heads get different slopes, so some heads can attend locally while others maintain broader attention patterns.
ALiBi's key advantage is its ability to extrapolate to longer sequences. Models trained with ALiBi on sequences of length 1024 can often perform well on sequences of length 2048 or beyond without any fine-tuning, because the linear bias naturally extends to any distance.
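A minimal sketch of the bias matrix makes the mechanism concrete. The geometric slope schedule below (2^(-8h/n_heads) for head h) follows the scheme described in the ALiBi paper for head counts that are powers of two; the function name is illustrative:

```python
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Distance-proportional bias added to attention scores, one slope per head."""
    # Slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    positions = np.arange(seq_len)
    dist = np.abs(positions[:, None] - positions[None, :])  # |i - j|
    return -slopes[:, None, None] * dist   # shape (n_heads, seq_len, seq_len)

bias = alibi_bias(n_heads=8, seq_len=16)
# Applied before the softmax, for head h:
# scores = q @ k.T / np.sqrt(d) + bias[h]
```

Because the bias is zero on the diagonal and grows linearly with distance, each head softly down-weights far-away tokens, with steep-slope heads attending locally and shallow-slope heads keeping a broad view; and since the formula is defined for any distance, it extends to sequence lengths never seen in training.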
Key Takeaway
Modern positional encoding methods like RoPE and ALiBi have largely replaced the original sinusoidal approach. RoPE encodes relative position through rotation, while ALiBi uses distance-based attention biases. Both enable better length generalization than fixed positional embeddings.
The Future of Position in Transformers
Positional encoding remains an active area of research. As context windows expand from thousands to millions of tokens, the ability to handle position information at scale becomes increasingly critical. Techniques like YaRN extend RoPE to handle much longer contexts, while ring attention and context parallelism address the engineering challenges of processing very long sequences across multiple GPUs.
Some researchers are exploring whether position information can be made more dynamic, varying based on the content rather than just the position index. Others are investigating whether the distinction between absolute and relative position encoding can be unified into a single, more flexible framework.
What is clear is that positional encoding, despite being one of the simpler components of the transformer, has an outsized impact on model capabilities, especially on the ability to handle long documents, maintain coherence over extended generation, and generalize to sequence lengths beyond training.
