vLLM
An open-source library for fast LLM inference and serving, using PagedAttention to efficiently manage GPU memory and increase throughput.
Key Innovation: PagedAttention
vLLM uses PagedAttention, an attention algorithm that manages the KV cache the way an operating system manages virtual memory: the cache is split into fixed-size blocks that need not be contiguous, and a per-sequence block table maps logical blocks to physical blocks in GPU memory. This eliminates the memory waste caused by pre-allocating a contiguous buffer for each request's maximum sequence length, enabling 2-4x higher serving throughput than systems without paged memory management.
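The block-table idea can be sketched in a few lines of Python. This is an illustrative toy, not vLLM's actual internals: the class names, the pool size, and the print statements are assumptions made for the example; only the block size of 16 tokens matches vLLM's default.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)


class BlockAllocator:
    """Hands out physical block IDs from a fixed pool, reusing freed ones."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's tokens and its logical -> physical block table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.num_tokens = 0
        self.block_table: list[int] = []  # logical index -> physical block id

    def append_token(self) -> None:
        # A new physical block is allocated only when the current one fills,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Finished sequences return their blocks to the shared pool.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()


allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # 40 tokens need ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))       # 3
seq.release()
print(len(allocator.free_blocks))  # 64 -- all blocks reclaimed
```

Because blocks are allocated on demand and returned on completion, memory that a naive implementation would reserve up front for worst-case sequence lengths stays available for other requests.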
Features
- Continuous batching (requests join and leave the running batch dynamically)
- Tensor parallelism (splitting a model across multiple GPUs)
- Support for 100+ model architectures
- OpenAI-compatible API server
- Quantization support (AWQ, GPTQ, FP8)
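Continuous batching is the scheduling half of vLLM's throughput story, and it is easy to sketch. The following is a hedged simplification, not vLLM's scheduler: the Request class, run_scheduler function, and the one-token-per-step model are all assumptions made for illustration.

```python
from collections import deque


class Request:
    """A toy generation request that finishes after `tokens_needed` steps."""

    def __init__(self, req_id: str, tokens_needed: int):
        self.req_id = req_id
        self.tokens_needed = tokens_needed
        self.tokens_done = 0

    @property
    def finished(self) -> bool:
        return self.tokens_done >= self.tokens_needed


def run_scheduler(incoming: deque, max_batch: int = 2) -> list[str]:
    """Run decode steps until all requests finish; return completion order."""
    running: list[Request] = []
    completed: list[str] = []
    while incoming or running:
        # Admit waiting requests whenever a batch slot opens up -- this is
        # what distinguishes continuous batching from static batching, which
        # would wait for the entire batch to drain first.
        while incoming and len(running) < max_batch:
            running.append(incoming.popleft())
        # One decode step: every running request emits one token.
        for req in running:
            req.tokens_done += 1
        # Retire finished requests immediately, freeing their slots.
        for req in [r for r in running if r.finished]:
            running.remove(req)
            completed.append(req.req_id)
    return completed


# "a" finishes after one step, so "c" joins "b" in the batch at step 2
# instead of waiting for the whole batch to drain.
order = run_scheduler(deque([Request("a", 1), Request("b", 3), Request("c", 2)]))
print(order)  # ['a', 'b', 'c']
```

Short requests stop occupying batch slots the moment they finish, so GPU utilization stays high even when sequence lengths vary widely across requests.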
Usage
vLLM has become a de facto standard for self-hosted LLM serving. It powers inference at many AI companies and is a common recommendation for deploying open-source LLMs in production.