vLLM
An open-source library for fast LLM inference and serving, using PagedAttention to efficiently manage GPU memory and increase throughput.
Key Innovation: PagedAttention
vLLM uses PagedAttention, an attention algorithm that manages the KV cache the way an operating system manages virtual memory: the cache is split into fixed-size blocks that need not be contiguous, and a per-sequence block table maps logical blocks to physical blocks in GPU memory. This eliminates the memory waste caused by pre-allocating a contiguous buffer for each request's maximum sequence length, enabling 2-4x higher serving throughput than systems without paged memory management.
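The block-table idea can be sketched in a few lines of Python. This is an illustrative toy, not vLLM's actual internals: the class names, the pool size, and the print statements are assumptions made for the example; only the block size of 16 tokens matches vLLM's default.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)


class BlockAllocator:
    """Hands out physical block IDs from a fixed pool, reusing freed ones."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's tokens and its logical -> physical block table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.num_tokens = 0
        self.block_table: list[int] = []  # logical index -> physical block id

    def append_token(self) -> None:
        # A new physical block is allocated only when the current one fills,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Finished sequences return their blocks to the shared pool.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # 40 tokens need ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))       # 3
seq.release()
print(len(allocator.free_blocks))  # 64 -- all blocks reclaimed
```

Because blocks are allocated on demand and returned on completion, memory that a naive implementation would reserve up front for worst-case sequence lengths stays available for other requests.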
Features
- Continuous batching (requests join and leave the running batch dynamically)
- Tensor parallelism (splitting a model across multiple GPUs)
- Support for 100+ model architectures
- OpenAI-compatible API server
- Quantization support (AWQ, GPTQ, FP8)
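Continuous batching is the scheduling half of vLLM's throughput story, and it is easy to sketch. The following is a hedged simplification, not vLLM's scheduler: the Request class, run_scheduler function, and the one-token-per-step model are all assumptions made for illustration.

```python
from collections import deque


class Request:
    """A toy generation request that finishes after `tokens_needed` steps."""

    def __init__(self, req_id: str, tokens_needed: int):
        self.req_id = req_id
        self.tokens_needed = tokens_needed
        self.tokens_done = 0

    @property
    def finished(self) -> bool:
        return self.tokens_done >= self.tokens_needed


def run_scheduler(incoming: deque, max_batch: int = 2) -> list[str]:
    """Run decode steps until all requests finish; return completion order."""
    running: list[Request] = []
    completed: list[str] = []
    while incoming or running:
        # Admit waiting requests whenever a batch slot opens up -- this is
        # what distinguishes continuous batching from static batching, which
        # would wait for the entire batch to drain first.
        while incoming and len(running) < max_batch:
            running.append(incoming.popleft())
        # One decode step: every running request emits one token.
        for req in running:
            req.tokens_done += 1
        # Retire finished requests immediately, freeing their slots.
        for req in [r for r in running if r.finished]:
            running.remove(req)
            completed.append(req.req_id)
    return completed


# "a" finishes after one step, so "c" joins "b" in the batch at step 2
# instead of waiting for the whole batch to drain.
order = run_scheduler(deque([Request("a", 1), Request("b", 3), Request("c", 2)]))
print(order)  # ['a', 'b', 'c']
```

Short requests stop occupying batch slots the moment they finish, so GPU utilization stays high even when sequence lengths vary widely across requests.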
Usage
vLLM has become a de facto standard for self-hosted LLM serving. It powers inference at many AI companies and is a common recommendation for deploying open-source LLMs in production.