RoPE (Rotary Position Embedding)
A positional encoding method that encodes position information through rotation of the query and key vectors, enabling relative position awareness and length generalization.
How It Works
RoPE applies rotation matrices to query and key vectors before computing attention. Each pair of embedding dimensions is treated as a 2D subspace and rotated by an angle proportional to the token's position, with a different frequency per pair. This naturally encodes relative positions: the dot product between a rotated query and key depends only on the distance between the two tokens, not their absolute positions.
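A minimal NumPy sketch of the rotation described above, followed by a check of the relative-position property (function names and shapes here are illustrative, not from any particular library):

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply RoPE to vectors x of shape (seq_len, d); d must be even.

    Each consecutive pair of dimensions is rotated in 2D by
    position * inv_freq, with a different frequency per pair.
    """
    d = x.shape[-1]
    # One frequency per dimension pair, decaying geometrically
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))   # (d/2,)
    angles = np.outer(positions, inv_freq)                # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                   # split into pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                  # standard 2D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: the query-key dot product depends only
# on the distance between positions, not on the positions themselves.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)
s1 = float(rope_rotate(q[None], [3])[0] @ rope_rotate(k[None], [1])[0])
s2 = float(rope_rotate(q[None], [10])[0] @ rope_rotate(k[None], [8])[0])
# s1 and s2 agree because both pairs are 2 positions apart
```

Because the rotation is applied independently to queries and keys, it composes cleanly with KV caching: cached keys are already rotated for their fixed positions.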
Why It's Popular
RoPE supports length generalization: models can handle sequences longer than those seen during training, particularly when combined with the scaling extensions below. It is also computationally efficient and compatible with KV caching. It is used in LLaMA, Mistral, Qwen, and most modern LLMs.
Extensions
YaRN and NTK-aware scaling extend RoPE to very long contexts (e.g., going from 4K to 128K tokens). These techniques rescale the rotation frequencies so that positions in the extended range map back into the range the model was trained on, interpolating low frequencies while largely preserving high ones.
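The frequency rescaling can be sketched as below for the NTK-aware variant, which raises the RoPE base rather than uniformly interpolating positions. The function name and the exact exponent form are a common formulation, presented here as an illustrative assumption rather than any library's exact API:

```python
import numpy as np

def ntk_scaled_inv_freq(d, base=10000.0, scale=32.0):
    """NTK-aware frequency scaling for RoPE context extension.

    scale = target_context / trained_context (e.g. 128K / 4K = 32).
    Raising the base leaves the highest frequency unchanged while
    shrinking the lowest frequency by exactly 1/scale, so local
    (high-frequency) position information is preserved and only the
    long-range (low-frequency) components are stretched.
    """
    new_base = base * scale ** (d / (d - 2))
    return 1.0 / (new_base ** (np.arange(0, d, 2) / d))

d = 64
orig = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
scaled = ntk_scaled_inv_freq(d, scale=32.0)
# scaled[0] == orig[0] == 1.0 (highest frequency untouched)
# scaled[-1] == orig[-1] / 32 (lowest frequency stretched by the scale factor)
```

YaRN refines this idea by interpolating frequencies non-uniformly across the spectrum and adjusting attention temperature, but the core mechanism is the same frequency rescaling shown here.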