Context Length Extrapolation
Techniques that allow models to handle longer sequences at inference time than they saw during training.
Overview
Context length extrapolation enables language models to process sequences longer than those seen during training. Since training on very long sequences is computationally expensive, models are often trained on shorter contexts and then extended through various techniques at inference time.
Techniques
- RoPE scaling: Adjusting rotary position embedding frequencies (NTK-aware scaling, YaRN).
- ALiBi: Linear attention biases that naturally extrapolate to longer sequences.
- Positional interpolation: Compressing position indices to fit within the trained range.
- Ring attention: Distributing long contexts across multiple devices.

These techniques have enabled models trained on 4K or 8K contexts to handle 100K+ tokens at inference.
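To make the RoPE-based techniques concrete, here is a minimal NumPy sketch contrasting positional interpolation (dividing position indices by the extension factor) with NTK-aware scaling (raising the rotary base so low frequencies stretch while high frequencies stay nearly intact). The function names, the `scale` parameter, and the specific base adjustment `base * scale**(dim/(dim-2))` are illustrative assumptions, not a particular library's API:

```python
import numpy as np

def rope_inv_freq(dim: int, base: float = 10000.0) -> np.ndarray:
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/dim)
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0,
                scale: float = 1.0, ntk_aware: bool = False) -> np.ndarray:
    """Rotation angles for each (position, frequency) pair.

    scale = L_inference / L_train, the context extension factor
    (hypothetical parameter for this sketch).
    - Positional interpolation: divide positions by `scale` so they
      stay within the trained range.
    - NTK-aware scaling: raise the base instead, stretching low
      frequencies while leaving high frequencies nearly unchanged.
    """
    if ntk_aware:
        # One common NTK-aware adjustment: base' = base * scale^(dim/(dim-2))
        base = base * scale ** (dim / (dim - 2))
        return np.outer(positions, rope_inv_freq(dim, base))
    return np.outer(positions / scale, rope_inv_freq(dim, base))

# Extend a model trained on 4K context to 16K (scale = 4).
pos = np.arange(16384)
pi_angles = rope_angles(pos, dim=64, scale=4.0)                 # interpolation
ntk_angles = rope_angles(pos, dim=64, scale=4.0, ntk_aware=True)

# Interpolation maps position 16383 onto the angles of original
# position 16383 / 4 = 4095.75, inside the trained 0..4095 range.
assert np.allclose(pi_angles[16383],
                   rope_angles(np.array([4095.75]), dim=64)[0])
```

The trade-off the sketch illustrates: interpolation compresses all frequencies uniformly, which can blur fine-grained positional distinctions, while NTK-aware scaling preserves the high-frequency components that encode nearby-token relationships.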