KV Cache
A memory optimization for transformer inference that stores previously computed key and value matrices, avoiding redundant computation when generating tokens one at a time.
Why It's Needed
During autoregressive generation, each new token attends to all previous tokens. Without caching, the model would recompute keys and values for every previous token at every step, making each generation step O(n^2) in the attention layers.
How It Works
After computing keys and values for each token, store them in a cache (one per layer). When generating the next token, compute Q, K, and V only for that token, append the new K and V to the cache, and attend to the full cached K/V. This makes each generation step O(n) instead of O(n^2).
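The step above can be sketched with a minimal single-head toy loop in NumPy. The cache shapes, weight matrices, and dimension sizes here are illustrative, not any particular model's configuration:

```python
import numpy as np

def attention(q, K, V):
    # q: (d,), K and V: (t, d) -> attention output for one query token
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.empty((0, d))  # cached keys: grows one row per token
V_cache = np.empty((0, d))  # cached values: grows one row per token

for x in rng.normal(size=(5, d)):           # 5 toy token embeddings
    q = x @ Wq                              # Q for the new token only
    K_cache = np.vstack([K_cache, x @ Wk])  # append new K to cache
    V_cache = np.vstack([V_cache, x @ Wv])  # append new V to cache
    out = attention(q, K_cache, V_cache)    # attend over all cached K/V
    # each step does O(t) work here, versus O(t^2) for a full recompute
```

Note that only one row of K and V is computed per step; everything else is read from the cache.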
Memory Challenge
The KV cache grows linearly with sequence length and batch size. For long-context models, the KV cache can consume more GPU memory than the model weights. Techniques like GQA (Grouped-Query Attention), PagedAttention (vLLM), and KV cache quantization address this.
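The linear growth can be made concrete with a back-of-the-envelope size formula: 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. The model parameters below are hypothetical round numbers for a 7B-class model, chosen only to illustrate the scale and the GQA saving:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x accounts for storing both keys and values; fp16 = 2 bytes/element
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical config: 32 layers, head_dim 128, 32k context, batch 1, fp16.
# Full multi-head attention: one KV head per query head (32 total).
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=32768, batch=1)

# GQA with 8 KV heads shared across query heads shrinks the cache 4x.
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32768, batch=1)

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
# prints "MHA: 16.0 GiB, GQA: 4.0 GiB"
```

At these assumed sizes the cache alone rivals the ~14 GiB of fp16 weights for a 7B model, which is why cache-reduction techniques matter at long context.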