AI Glossary

Paged Attention

A memory-management technique for LLM inference that stores the KV cache in fixed-size blocks and maps them with a per-sequence table, analogous to virtual memory pages.

Overview

Paged Attention, introduced by the vLLM project, applies operating system virtual memory concepts to manage the key-value (KV) cache during LLM inference. Instead of pre-allocating contiguous memory for each sequence's KV cache, it stores cache in non-contiguous memory blocks (pages) mapped by a page table.
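The page-table idea above can be sketched in a few lines. This is an illustrative model, not vLLM's actual API: `BlockAllocator`, `Sequence`, and the block size constant are hypothetical names chosen for the example.

```python
# Minimal sketch of paged KV-cache bookkeeping (illustrative, not vLLM's code).
# The KV cache is split into fixed-size blocks; each sequence keeps a "block
# table" mapping logical block indices to physical block IDs, so a sequence's
# cache need not occupy contiguous memory.

BLOCK_SIZE = 16  # tokens per block (a common vLLM default)

class BlockAllocator:
    """Pool of physical KV-cache blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block ID
        self.num_tokens = 0

    def append_token(self):
        # Grow on demand: a new block is allocated only when the last one fills,
        # so at most one partially used block exists per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_location(self, token_idx):
        # Translate a token position to (physical block, in-block offset),
        # exactly like a page-table lookup.
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```

Because blocks are allocated lazily, the only wasted memory is the unused tail of each sequence's final block, rather than a large contiguous region reserved up front for the maximum possible sequence length.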

Key Details

This eliminates both fragmentation and the waste from over-allocating for the maximum possible sequence length, raising KV-cache memory utilization from roughly 20-40% to near 100%. The result is 2-4x higher serving throughput on the same hardware. Paged Attention also enables efficient memory sharing for techniques like parallel sampling and beam search, where multiple sequences share a common prefix: the shared prefix blocks are referenced by every sequence's block table rather than copied. It has been widely adopted in production LLM serving systems.
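Prefix sharing is typically implemented with reference counting: a physical block is freed only when no sequence's block table points to it. The sketch below is a simplified illustration under that assumption; `SharedBlockPool` and its methods are hypothetical names, and real systems add copy-on-write when a shared block is modified.

```python
# Illustrative sketch of prefix sharing via reference counts (not vLLM's code).
# Sequences that share a prompt prefix point their block tables at the same
# physical blocks; a block returns to the free pool only when its reference
# count drops to zero.

class SharedBlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}  # physical block ID -> number of referencing sequences

    def allocate(self):
        block_id = self.free.pop()
        self.refcount[block_id] = 1
        return block_id

    def share(self, block_id):
        # Another sequence references the same physical block: no copy is made.
        self.refcount[block_id] += 1
        return block_id

    def release(self, block_id):
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:
            del self.refcount[block_id]
            self.free.append(block_id)

# Parallel sampling: two continuations fork from one shared prompt prefix.
pool = SharedBlockPool(num_blocks=8)
prefix = [pool.allocate(), pool.allocate()]        # blocks holding the prompt's KV cache
sample_a = [pool.share(b) for b in prefix]         # both samples reuse the
sample_b = [pool.share(b) for b in prefix]         # same physical blocks
```

With n parallel samples, the prompt's KV cache is stored once instead of n times, which is where the memory savings for parallel sampling and beam search come from.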

Related Concepts

kv cache, vllm, inference optimization


Last updated: March 5, 2026