AI Glossary

vLLM

An open-source library for fast LLM inference and serving, using PagedAttention to efficiently manage GPU memory and increase throughput.

Key Innovation: PagedAttention

vLLM uses PagedAttention, which manages the KV cache in fixed-size blocks, much as an operating system manages virtual memory in pages. Because blocks are allocated on demand rather than reserved up front for the maximum sequence length, memory waste from over-allocation and fragmentation is largely eliminated, enabling 2-4x higher throughput than naive implementations.
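The bookkeeping idea can be illustrated with a minimal sketch. This is an assumption-laden toy, not vLLM's actual code: each request's logical token positions map through a block table to physical blocks drawn from a shared free pool, so a request only ever holds as many blocks as it has tokens, and finished requests return their blocks for reuse.

```python
# Toy sketch of paged KV-cache bookkeeping (illustrative only, not vLLM code).
# Tokens are appended into fixed-size blocks taken from a shared free pool,
# instead of reserving max_seq_len worth of cache per request up front.

BLOCK_SIZE = 16  # tokens per block; vLLM's default block size is also 16

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> tokens written so far

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        length = self.lengths.get(req_id, 0)
        if length % BLOCK_SIZE == 0:  # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())  # grab one block on demand
        self.lengths[req_id] = length + 1

    def free(self, req_id):
        # A finished request returns all of its blocks to the pool.
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):               # 20 tokens need ceil(20/16) = 2 blocks
    cache.append_token("req-0")
print(len(cache.block_tables["req-0"]))  # 2 blocks held, not a max-length slab
cache.free("req-0")
print(len(cache.free_blocks))            # all 8 blocks back in the pool
```

The point of the indirection is that blocks need not be contiguous, so short and long requests can share one pool without external fragmentation.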

Features

Continuous batching (dynamically add/remove requests), tensor parallelism (split across GPUs), support for 100+ model architectures, OpenAI-compatible API server, and quantization support (AWQ, GPTQ, FP8).
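Continuous batching is the feature that most directly drives throughput, so a small sketch may help. This models only the scheduling idea, not vLLM's actual scheduler: after every decode step, finished sequences leave the batch and waiting requests are admitted immediately, rather than the whole batch draining before new work starts.

```python
# Toy sketch of continuous (iteration-level) batching -- an illustration of
# the idea, not vLLM's scheduler. Finished sequences free their batch slot
# after each decode step, and waiting requests fill the slot right away.
from collections import deque

def continuous_batching(requests, max_batch=2):
    """requests: list of (name, tokens_to_generate). Returns the batch
    composition at each decode step."""
    waiting = deque(requests)
    running = {}   # name -> tokens still to generate
    trace = []
    while waiting or running:
        while waiting and len(running) < max_batch:  # admit work every step
            name, n = waiting.popleft()
            running[name] = n
        trace.append(sorted(running))
        for name in list(running):                   # one decode step for all
            running[name] -= 1
            if running[name] == 0:
                del running[name]                    # slot frees immediately
    return trace

steps = continuous_batching([("A", 3), ("B", 1), ("C", 2)])
print(steps)  # [['A', 'B'], ['A', 'C'], ['A', 'C']]
```

Note that "C" starts decoding as soon as "B" finishes, while "A" is still mid-generation; static batching would have left that slot idle until the whole batch completed.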

Usage

vLLM has become a de facto standard for self-hosted LLM serving. It powers inference at many AI companies and is a commonly recommended serving solution for open-source LLM deployments.
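Because the server speaks the OpenAI API, existing client code can point at a vLLM deployment unchanged. The sketch below only builds the JSON payload for a chat completion in the OpenAI request shape; the endpoint URL and model name are illustrative assumptions, and no request is actually sent.

```python
# Build an OpenAI-style chat-completion payload, as a vLLM server's
# OpenAI-compatible endpoint would accept (e.g. POST /v1/chat/completions).
# The model name and endpoint below are assumptions for illustration.
import json

VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed local server

def chat_request(model, user_message, max_tokens=64):
    """Return the JSON body for an OpenAI-format chat completion request."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    })

payload = chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
print(json.loads(payload)["model"])  # meta-llama/Llama-3.1-8B-Instruct
```

In practice this payload would be POSTed to the server with any HTTP client, or the official OpenAI client library would be configured with the server's base URL.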

Last updated: March 5, 2026