Continuous Batching
An inference optimization that admits new requests into a running batch as soon as slots free up and evicts requests the moment they complete.
Overview
Continuous batching (also called iteration-level scheduling) is an LLM serving optimization where new requests are added to the processing batch as soon as existing requests complete, rather than waiting for the entire batch to finish. This dramatically improves GPU utilization and throughput.
Key Details
In static batching, short sequences must wait for the longest sequence in the batch to finish, wasting compute on slots that have already produced their final token. Continuous batching instead allows sequences to enter and leave the batch at each generation step, so freed slots are reused immediately. Combined with PagedAttention, continuous batching can improve LLM serving throughput by 10-20x compared to naive approaches. It is implemented in vLLM, TGI (Text Generation Inference), and TensorRT-LLM.
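The scheduling idea can be sketched as a simple simulation. This is a minimal illustration of iteration-level scheduling, not the actual scheduler of vLLM, TGI, or TensorRT-LLM; the `Request` class, `max_batch_size` parameter, and per-step token counting are all hypothetical simplifications (one token generated per request per step, no prefill or memory modeling).

```python
from collections import deque

class Request:
    """Hypothetical request: tracks how many tokens it still needs to generate."""
    def __init__(self, rid, tokens_to_generate):
        self.rid = rid
        self.remaining = tokens_to_generate

def continuous_batching(requests, max_batch_size):
    """Run a decode loop that admits waiting requests whenever a slot
    frees up, instead of waiting for the whole batch to drain."""
    waiting = deque(requests)
    running = []
    completed = []
    steps = 0
    while waiting or running:
        # Iteration-level scheduling: fill any free slots before each step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request generates one token.
        for req in running:
            req.remaining -= 1
        steps += 1
        # Evict finished requests immediately; their slots are free next step.
        for req in [r for r in running if r.remaining == 0]:
            running.remove(req)
            completed.append(req.rid)
    return steps, completed

# Four requests needing 2, 8, 3, and 1 tokens, batch capacity 2.
reqs = [Request(i, n) for i, n in enumerate([2, 8, 3, 1])]
steps, order = continuous_batching(reqs, max_batch_size=2)
print(steps, order)  # → 8 [0, 2, 3, 1]
```

For comparison, a static batcher with the same capacity would run [2, 8] for max(2, 8) = 8 steps and then [3, 1] for max(3, 1) = 3 steps, 11 steps total; the continuous scheduler finishes the same work in 8 steps because short requests vacate their slots as soon as they complete.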