Continuous Batching
An inference optimization that admits new requests into a running batch as soon as slots free up and evicts requests the moment they complete.
Overview
Continuous batching (also called iteration-level scheduling) is an LLM serving optimization where new requests are added to the processing batch as soon as existing requests complete, rather than waiting for the entire batch to finish. This dramatically improves GPU utilization and throughput.
Key Details
In static batching, short sequences must wait for the longest sequence in the batch to finish, wasting compute on slots that have already produced their final token. Continuous batching instead allows sequences to enter and leave the batch at each generation step, so freed slots are reused immediately. Combined with PagedAttention, continuous batching can improve LLM serving throughput by 10-20x compared to naive approaches. It is implemented in vLLM, TGI (Text Generation Inference), and TensorRT-LLM.
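The scheduling idea can be sketched as a simple simulation. This is a minimal illustration of iteration-level scheduling, not the actual scheduler of vLLM, TGI, or TensorRT-LLM; the `Request` class, `max_batch_size` parameter, and per-step token counting are all hypothetical simplifications (one token generated per request per step, no prefill or memory modeling).

```python
from collections import deque

class Request:
    """Hypothetical request: tracks how many tokens it still needs to generate."""
    def __init__(self, rid, tokens_to_generate):
        self.rid = rid
        self.remaining = tokens_to_generate

def continuous_batching(requests, max_batch_size):
    """Run a decode loop that admits waiting requests whenever a slot
    frees up, instead of waiting for the whole batch to drain."""
    waiting = deque(requests)
    running = []
    completed = []
    steps = 0
    while waiting or running:
        # Iteration-level scheduling: fill any free slots before each step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request generates one token.
        for req in running:
            req.remaining -= 1
        steps += 1
        # Evict finished requests immediately; their slots are free next step.
        for req in [r for r in running if r.remaining == 0]:
            running.remove(req)
            completed.append(req.rid)
    return steps, completed

# Four requests needing 2, 8, 3, and 1 tokens, batch capacity 2.
reqs = [Request(i, n) for i, n in enumerate([2, 8, 3, 1])]
steps, order = continuous_batching(reqs, max_batch_size=2)
print(steps, order)  # → 8 [0, 2, 3, 1]
```

For comparison, a static batcher with the same capacity would run [2, 8] for max(2, 8) = 8 steps and then [3, 1] for max(3, 1) = 3 steps, 11 steps total; the continuous scheduler finishes the same work in 8 steps because short requests vacate their slots as soon as they complete.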