KV Cache
A memory optimization for transformer inference that stores previously computed key and value matrices, avoiding redundant computation when generating tokens one at a time.
Why It's Needed
During autoregressive generation, each new token attends to all previous tokens. Without caching, the model would recompute keys and values for every previous token at every step, making each generation step O(n^2) in the attention layers.
How It Works
After computing keys and values for each token, store them in a cache (one per layer). When generating the next token, compute Q, K, and V only for that token, append the new K and V to the cache, and attend to the full cached K/V. This makes each generation step O(n) instead of O(n^2).
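The step above can be sketched with a minimal single-head toy loop in NumPy. The cache shapes, weight matrices, and dimension sizes here are illustrative, not any particular model's configuration:

```python
import numpy as np

def attention(q, K, V):
    # q: (d,), K and V: (t, d) -> attention output for one query token
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.empty((0, d))  # cached keys: grows one row per token
V_cache = np.empty((0, d))  # cached values: grows one row per token

for x in rng.normal(size=(5, d)):           # 5 toy token embeddings
    q = x @ Wq                              # Q for the new token only
    K_cache = np.vstack([K_cache, x @ Wk])  # append new K to cache
    V_cache = np.vstack([V_cache, x @ Wv])  # append new V to cache
    out = attention(q, K_cache, V_cache)    # attend over all cached K/V
    # each step does O(t) work here, versus O(t^2) for a full recompute
```

Note that only one row of K and V is computed per step; everything else is read from the cache.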
Memory Challenge
The KV cache grows linearly with sequence length and batch size. For long-context models, the KV cache can consume more GPU memory than the model weights. Techniques like GQA (Grouped-Query Attention), PagedAttention (vLLM), and KV cache quantization address this.
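The linear growth can be made concrete with a back-of-the-envelope size formula: 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch size × bytes per element. The model parameters below are hypothetical round numbers for a 7B-class model, chosen only to illustrate the scale and the GQA saving:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x accounts for storing both keys and values; fp16 = 2 bytes/element
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical config: 32 layers, head_dim 128, 32k context, batch 1, fp16.
# Full multi-head attention: one KV head per query head (32 total).
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=32768, batch=1)

# GQA with 8 KV heads shared across query heads shrinks the cache 4x.
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=32768, batch=1)

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
# prints "MHA: 16.0 GiB, GQA: 4.0 GiB"
```

At these assumed sizes the cache alone rivals the ~14 GiB of fp16 weights for a 7B model, which is why cache-reduction techniques matter at long context.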