AI Glossary

Speculative Decoding

Accelerating LLM inference by using a small draft model to propose multiple tokens that the main model then verifies in parallel.

Overview

Speculative decoding speeds up autoregressive LLM inference by using a smaller, faster draft model to generate several candidate tokens ahead, then verifying them in parallel with the larger target model. Accepted tokens are kept; rejected ones are regenerated from the target model's distribution.
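The draft-then-verify loop can be sketched as below. This is a minimal, greedy-verification sketch with hypothetical toy "models" (simple deterministic functions standing in for real LLMs): the draft model proposes k tokens, the target model checks each one in order, and generation falls back to the target's token at the first mismatch (real implementations verify all k drafts in a single batched forward pass).

```python
VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_model(context):
    # Hypothetical cheap draft model: fast next-token guess from context length.
    return VOCAB[len(context) % len(VOCAB)]

def target_model(context):
    # Hypothetical large target model: agrees with the draft most of the
    # time, but diverges when the context length is a multiple of 3.
    return VOCAB[len(context) % len(VOCAB)] if len(context) % 3 else VOCAB[0]

def speculative_step(context, k=4):
    """One speculative decoding step with greedy verification.

    Draft k candidate tokens, then verify them left to right against the
    target model. Returns all accepted tokens plus one target-model token:
    either the correction at the first mismatch, or a bonus token when
    every draft is accepted.
    """
    # Phase 1: the draft model proposes k tokens autoregressively.
    drafts, ctx = [], list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        drafts.append(tok)
        ctx.append(tok)

    # Phase 2: the target model verifies the drafts.
    # (In practice this is one batched forward pass, not a loop.)
    accepted, ctx = [], list(context)
    for tok in drafts:
        expected = target_model(ctx)
        if tok == expected:
            accepted.append(tok)      # draft accepted, effectively free
            ctx.append(tok)
        else:
            accepted.append(expected)  # correct the first rejected draft
            return accepted
    accepted.append(target_model(ctx))  # bonus token after full acceptance
    return accepted
```

Each step therefore emits at least one token (the target model's own) and up to k+1 tokens when the draft model is accurate, which is where the speedup comes from.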

Key Details

Because the target model can verify multiple tokens in a single forward pass (which takes roughly the same time as generating one token), accepted speculative tokens are essentially "free." This technique can achieve 2-3x speedups without changing the output distribution. Variants include self-speculative decoding (using early exit from the same model as the draft) and Medusa (adding lightweight prediction heads to the target model). It is particularly effective when the draft model's proposals have a high acceptance rate.
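The claim that the output distribution is unchanged rests on a rejection-sampling acceptance rule: a draft token x sampled from the draft distribution q is accepted with probability min(1, p(x)/q(x)), and on rejection a token is resampled from the normalized residual max(0, p - q). The sketch below checks this empirically with hypothetical toy distributions p and q over a three-token vocabulary (the names p, q, and speculative_sample are illustrative, not from any library):

```python
import random

random.seed(0)

# Hypothetical toy next-token distributions over a 3-token vocabulary.
p = {"a": 0.6, "b": 0.3, "c": 0.1}   # target model's distribution
q = {"a": 0.3, "b": 0.5, "c": 0.2}   # draft model's distribution

def sample(dist):
    tokens = list(dist)
    return random.choices(tokens, weights=[dist[t] for t in tokens])[0]

def speculative_sample():
    """Draw one token via speculative sampling.

    Sample x from the draft distribution q, accept it with probability
    min(1, p(x)/q(x)); on rejection, resample from the residual
    max(0, p - q), normalized. The result is distributed exactly as p.
    """
    x = sample(q)
    if random.random() < min(1.0, p[x] / q[x]):
        return x
    residual = {t: max(0.0, p[t] - q[t]) for t in p}
    return sample(residual)

# Empirical check: the accept/reject scheme reproduces p, not q.
n = 100_000
counts = {t: 0 for t in p}
for _ in range(n):
    counts[speculative_sample()] += 1
freqs = {t: counts[t] / n for t in p}
```

Since min(p, q)(x) + max(0, p - q)(x) = p(x) for every token, the combined accept/resample procedure yields exactly the target distribution, so the speedup costs nothing in output quality.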

Related Concepts

speculative decoding · inference optimization · model distillation


Last updated: March 5, 2026