Positional Encoding
A mechanism that injects information about token positions into transformer models, since the attention mechanism itself has no inherent sense of order.
Why It's Needed
Unlike RNNs, which process tokens sequentially, transformers process all tokens in parallel, and self-attention by itself treats its input as an unordered set. Without positional encoding, 'the cat sat on the mat' and 'mat the on sat cat the' would yield the same token representations, just in a different order.
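To make this order-blindness concrete, here is a minimal NumPy sketch (a toy single head with identity Q/K/V projections, not a real model) showing that self-attention without positional information is permutation-equivariant: shuffling the input tokens merely shuffles the outputs.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections, for illustration only."""
    scores = x @ x.T / np.sqrt(x.shape[-1])                                # (seq, seq) similarities
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                                                     # weighted sum of token vectors

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))   # 6 "word" embeddings, 8-dim, no position information
perm = rng.permutation(6)          # a scrambled word order

out_original = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])

# Each token gets exactly the same representation regardless of where it appears.
print(np.allclose(out_original[perm], out_shuffled))   # True
```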
Types
Sinusoidal (original): Fixed sine/cosine patterns at different frequencies; requires no training.
Learned: A trained embedding for each absolute position (used in BERT and GPT-2).
RoPE (Rotary): Encodes relative positions by rotating query and key vectors, enabling length generalization (used in LLaMA and Mistral).
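A minimal sketch of the sinusoidal scheme from the original Transformer paper, assuming an even d_model: each pair of dimensions oscillates at a different frequency, so every position receives a unique fixed pattern with no trained parameters.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = positions / (10000 ** (dims / d_model))      # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cosine
    return pe

# Typically added to the token embeddings before the first transformer layer.
pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
print(pe.shape)   # (128, 64)
```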
Context Length Extension
RoPE and ALiBi (Attention with Linear Biases) help models generalize to sequences longer than those seen during training; ALiBi does so by skipping positional embeddings and instead penalizing attention scores in proportion to the query-key distance. These are key techniques behind extending LLM context windows beyond their training length.
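A minimal single-head sketch of the RoPE idea (standard 10000-base frequencies assumed; the function below is illustrative, not a library API): queries and keys are rotated by position-dependent angles, so their dot product depends only on the relative offset between positions, which is the property that length-extension methods build on.

```python
import numpy as np

def rope_rotate(x: np.ndarray, position: int) -> np.ndarray:
    """Rotate consecutive (even, odd) dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = 10000 ** (-np.arange(0, d, 2) / d)     # one frequency per dimension pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    rotated = np.empty_like(x)
    rotated[0::2] = x_even * cos - x_odd * sin     # 2-D rotation of each pair
    rotated[1::2] = x_even * sin + x_odd * cos
    return rotated

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 64))

# Same relative offset (3 positions apart) at two different absolute positions:
score_near = rope_rotate(q, 5) @ rope_rotate(k, 2)
score_far = rope_rotate(q, 105) @ rope_rotate(k, 102)
print(np.isclose(score_near, score_far))   # True: the score depends only on the offset
```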