Every conversation with an AI assistant has an invisible boundary: the context window. This is the maximum amount of text -- measured in tokens -- that the model can process in a single interaction. Everything you type, every system instruction, and every response the model generates must fit within this window. Understanding context windows is essential for using LLMs effectively and choosing the right model for your task.

What Is a Context Window?

A context window is the total number of tokens an LLM can process at once. One token is roughly 3/4 of an English word, so a 100K token context window can handle approximately 75,000 words -- about the length of a novel.
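The words-to-tokens arithmetic above can be sketched in a few lines. This is only a heuristic -- real tokenizers (BPE and friends) vary by language and content -- and the 0.75 words-per-token ratio is the rule of thumb from the text, not an exact figure:

```python
# Rough sketch: estimating whether a text fits in a context window,
# using the heuristic that one token is ~3/4 of an English word.
# Real tokenizers give different counts; this is an approximation.

def estimate_tokens(text: str) -> int:
    """Estimate token count from word count using the ~0.75 words/token rule."""
    words = len(text.split())
    return round(words / 0.75)

def fits_in_window(text: str, window_tokens: int = 100_000) -> bool:
    return estimate_tokens(text) <= window_tokens

novel = "word " * 75_000           # ~75,000 words, novel-length
print(estimate_tokens(novel))      # → 100000
print(fits_in_window(novel))       # → True
```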

The context window includes everything: the system prompt, the conversation history, any documents you paste in, and the model's responses. When you hit the limit, something has to give -- typically the application truncates the oldest content, and the model loses access to it.
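The oldest-content-first truncation described above can be sketched as a simple trimming function. Token counts here are illustrative (a real client would count with the model's tokenizer), and real systems often pin the system prompt rather than dropping it:

```python
# Minimal sketch of context-window truncation: when the running token
# total exceeds the window, keep the newest turns and drop the oldest.

def trim_to_window(messages, max_tokens):
    """messages: list of (role, token_count) tuples, oldest first.
    Returns the newest suffix of messages that fits in max_tokens."""
    kept, total = [], 0
    for role, tokens in reversed(messages):   # walk newest-first
        if total + tokens > max_tokens:
            break                             # oldest turns fall off here
        kept.append((role, tokens))
        total += tokens
    return list(reversed(kept))               # restore chronological order

history = [("system", 50), ("user", 400), ("assistant", 600), ("user", 300)]
print(trim_to_window(history, 1000))  # the two oldest turns are dropped
```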

Think of the context window as the model's working memory. Everything it can "remember" and reason about must fit within this window. Once information falls outside the window, it is gone.

How Context Windows Have Grown

The expansion of context windows has been one of the most dramatic improvements in LLMs:

  • GPT-2 (2019): 1,024 tokens (~750 words)
  • GPT-3 (2020): 2,048-4,096 tokens
  • Claude 1 (2023): 100,000 tokens (~75,000 words)
  • GPT-4 Turbo (2023): 128,000 tokens
  • Claude 3 (2024): 200,000 tokens
  • Gemini 1.5 Pro (2024): 1,000,000-2,000,000 tokens

This represents a 1,000x improvement in just five years. The practical implications are enormous: models can now process entire codebases, book-length documents, and hours of transcript in a single interaction.

Key Takeaway

Context windows have grown from ~1K tokens to over 1M tokens in five years. Larger context windows enable entirely new use cases like full codebase analysis, book-length document processing, and extended multi-turn conversations.

Why Context Windows Are Hard to Expand

The fundamental challenge is that standard self-attention has quadratic complexity: O(n^2) in sequence length. Doubling the context window quadruples the computation and memory required. This creates two specific bottlenecks:
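The quadratic claim is easy to see by counting attention-score entries: every token scores against every other token, so the score matrix has n x n entries. A toy calculation:

```python
# Illustrating O(n^2) attention cost: doubling the sequence length
# quadruples the number of entries in the n x n attention score matrix.

def attention_entries(n: int) -> int:
    """Number of pairwise attention scores for a sequence of n tokens."""
    return n * n

for n in (1_000, 2_000, 4_000):
    print(f"{n:>6} tokens -> {attention_entries(n):>12,} score entries")

print(attention_entries(2_000) / attention_entries(1_000))  # → 4.0
```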

Memory: The KV Cache Problem

During generation, the model must store key-value pairs for every previous token in every attention layer. For a 70B parameter model with a 128K context window, the KV cache alone can require 40+ GB of GPU memory. This is often the binding constraint on context length.
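The 40+ GB figure can be reproduced with a back-of-the-envelope calculation. The dimensions below are assumptions loosely modeled on a Llama-2-70B-class model (80 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16 values); actual architectures differ:

```python
# Back-of-the-envelope KV-cache size for a long-context sequence.
# Assumed dimensions are illustrative, not taken from any one model card.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Factor of 2 at the front: one key and one value vector
    per token, per KV head, per layer. bytes_per_elem=2 assumes fp16."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size / 1e9:.1f} GB")  # → 41.9 GB for a single 128K-token sequence
```

Note that without grouped-query attention (i.e. with 64 full KV heads instead of 8), the same calculation gives roughly 8x more -- which is why KV-cache reduction techniques matter so much for long context.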

Compute: Attention Over Long Sequences

Each new token generated must attend to all previous tokens, so as the context grows, every generation step becomes more expensive. Attending over 100K tokens of context costs roughly 100x more compute per generated token than attending over 1K tokens.

Techniques for Longer Context

Researchers have developed several approaches to extend context windows:

  • RoPE scaling: Extending rotary position embeddings to handle longer sequences than those seen during training, using techniques like NTK-aware scaling and YaRN
  • Flash Attention: Hardware-aware implementation that dramatically reduces memory usage by avoiding materializing the full attention matrix
  • Grouped Query Attention: Sharing key-value heads to reduce KV cache size
  • Ring Attention: Distributing long-context computation across multiple GPUs by passing KV blocks in a ring
  • Sliding window attention: Limiting attention to a local window with global attention at specific positions
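One of the techniques above -- sliding window attention -- can be sketched as a mask. Each token attends only to the previous `window` tokens, making per-token cost O(window) instead of O(n). The boolean-mask convention here is illustrative; real implementations fuse this into the attention kernel:

```python
# Sketch of a sliding-window attention mask: mask[i][j] is True
# when token i is allowed to attend to token j. Causal (no future
# tokens) and limited to the most recent `window` positions.

def sliding_window_mask(n: int, window: int):
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]

mask = sliding_window_mask(5, window=2)
for row in mask:
    print("".join("X" if m else "." for m in row))
```

Each row has at most `window` True entries regardless of sequence length, which is where the cost saving comes from; the "global attention at specific positions" variant additionally whitelists a few columns for every row.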

The "Lost in the Middle" Problem

Having a large context window does not guarantee the model will use all of it effectively. Research has shown that LLMs exhibit a "lost in the middle" phenomenon: they recall information placed at the beginning and end of the context much better than information in the middle.

This has practical implications for how you structure prompts and documents. Important information should be placed at the beginning or end of the context, not buried in the middle. Some applications mitigate this by re-ranking or repeating key information.
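One mitigation mentioned above -- repeating key information -- can be sketched as a prompt-assembly helper. The structure is purely illustrative; the point is that the critical instruction appears at both ends, where recall is strongest:

```python
# Sketch of "lost in the middle" mitigation: place the critical
# instruction first AND last, with bulk documents in the middle.

def build_prompt(instruction: str, documents: list[str]) -> str:
    parts = [instruction]        # critical info at the start...
    parts.extend(documents)      # ...bulk content in the middle...
    parts.append(instruction)    # ...and the instruction repeated at the end
    return "\n\n".join(parts)

prompt = build_prompt("Answer citing sources.", ["doc one", "doc two"])
print(prompt)
```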

Key Takeaway

Even with large context windows, LLMs recall information at the start and end better than the middle. Structure your prompts accordingly, placing critical information at the beginning or end.

Context Window vs. RAG

With context windows reaching millions of tokens, an important question arises: do we still need Retrieval-Augmented Generation (RAG)? The answer is yes, but the relationship is evolving.

Long context windows are better when you need the model to understand the full document and reason across different parts. RAG is better when you have a very large knowledge base (millions of documents), need the most up-to-date information, or want to attribute answers to specific sources.

In practice, the best systems often combine both: using RAG to retrieve relevant documents and then processing them within a long context window for comprehensive understanding. Context windows and RAG are complementary, not competing, approaches.
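The hybrid pattern described above can be sketched as a two-stage pipeline. `search_index` and `call_model` are hypothetical stand-ins for a real vector store and LLM API, not actual library calls:

```python
# Hedged sketch of RAG + long context: retrieval narrows millions of
# documents down to dozens, then all of them go into one long prompt
# so the model can reason across sources with attribution.

def answer_with_rag_long_context(question, search_index, call_model, top_k=20):
    docs = search_index(question, top_k)    # RAG step: retrieve relevant docs
    context = "\n\n".join(f"[{i}] {d}" for i, d in enumerate(docs))
    prompt = (
        "Use the numbered sources below to answer, citing them by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return call_model(prompt)               # long-context step: reason over all
```

The numbered `[i]` labels support the attribution use case mentioned above: the model can point back at a specific retrieved source.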