Speculative Decoding
Accelerating LLM inference by using a small draft model to propose multiple tokens that are then verified in parallel by the main model.
Overview
Speculative decoding speeds up autoregressive LLM inference by using a smaller, faster draft model to generate several candidate tokens ahead, then verifying them in parallel with the larger target model. Accepted tokens are kept; at the first rejection, a replacement token is sampled from a corrected version of the target model's distribution, which preserves the target model's output distribution exactly.
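The draft-then-verify loop above can be sketched in a few lines of Python. This is a toy illustration, not a real implementation: `draft_probs` and `target_probs` stand in for the two models' next-token distributions over a tiny vocabulary, and the bonus token normally sampled when all drafts are accepted is omitted for brevity. The acceptance rule (accept token x with probability min(1, p(x)/q(x)), else resample from the normalized residual max(0, p − q)) is the standard one from the speculative sampling literature.

```python
import random

random.seed(0)

VOCAB = [0, 1, 2, 3]

def draft_probs(context):
    # Toy stand-in for the small draft model's next-token distribution q.
    return [0.4, 0.3, 0.2, 0.1]

def target_probs(context):
    # Toy stand-in for the large target model's next-token distribution p.
    return [0.5, 0.25, 0.15, 0.10]

def sample(probs):
    return random.choices(VOCAB, weights=probs, k=1)[0]

def speculative_step(context, k=4):
    """One round: draft k tokens, then accept/reject against the target.

    Accept token x with probability min(1, p[x] / q[x]); on the first
    rejection, resample from the normalized residual max(0, p - q) and
    stop. This keeps the output distributed exactly as the target model.
    """
    # Draft phase: the small model proposes k tokens autoregressively.
    drafted = []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        x = sample(q)
        drafted.append((x, q))
        ctx.append(x)

    # Verify phase: in practice one batched forward pass of the target
    # model scores all k positions at once; here we loop for clarity.
    accepted = []
    ctx = list(context)
    for x, q in drafted:
        p = target_probs(ctx)
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
            ctx.append(x)
        else:
            # Correction step: sample from the residual distribution.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            corrected = [r / total for r in residual] if total > 0 else p
            accepted.append(sample(corrected))
            break
    return accepted

print(speculative_step([], k=4))
```

Each call returns between 1 and k tokens, so every target forward pass yields at least one token and often several.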
Key Details
Because the target model can verify multiple tokens in a single forward pass (which takes roughly the same time as generating one token), accepted speculative tokens are essentially 'free.' This technique can achieve 2-3x speedups without changing the output distribution. Variants include self-speculative decoding (using early exit from the same model as the draft) and Medusa (adding lightweight prediction heads to the target model). It is most effective when the draft model's proposals closely match the target model's, yielding high acceptance rates.
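The dependence on acceptance rate can be made concrete. Under the simplifying assumption that each drafted token is accepted independently with a constant probability alpha, the expected number of tokens produced per target forward pass with draft length k is (1 − alpha^(k+1)) / (1 − alpha), a standard result from the speculative decoding papers. A quick calculation (the function name is ours):

```python
def expected_tokens(alpha, k):
    """Expected tokens per target forward pass, assuming each of k
    drafted tokens is accepted independently with probability alpha:
    (1 - alpha**(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: {expected_tokens(alpha, k=4):.2f} tokens/pass")
```

At alpha = 0.8 and k = 4 this gives about 3.4 tokens per verification pass, consistent with the 2-3x speedups reported once drafting overhead is accounted for.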
Related Concepts
inference optimization • model distillation • draft models