Speculative Decoding
An inference-acceleration technique in which a smaller, faster "draft" model proposes multiple tokens that are then verified in parallel by the larger "target" model.
How It Works
The small draft model generates several candidate tokens quickly. The large target model then checks all candidates in a single parallel forward pass, which is much cheaper than generating them one at a time. Candidates that match what the target model would have produced are accepted; at the first mismatch, the target model's own token is substituted and drafting resumes from that point, so the output is what the target model alone would have generated.
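The loop above can be sketched with toy models. This is a minimal greedy variant, not a production implementation: `draft_model` and `target_model` are assumed here to be plain callables that map a token list to the single next token, and the "parallel" verify pass is simulated position by position.

```python
def speculative_decode(draft_model, target_model, prompt, k=4, max_new=8):
    """Greedy speculative decoding sketch (toy models, hypothetical API)."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # Draft phase: the small model cheaply proposes k candidate tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_model(seq + draft))
        # Verify phase: the target model checks every candidate. A real
        # implementation scores all k positions in ONE forward pass; here
        # we simulate that by comparing position by position.
        n_ok = 0
        for i in range(k):
            if target_model(seq + draft[:i]) == draft[i]:
                n_ok += 1
            else:
                break
        seq.extend(draft[:n_ok])
        # The verify pass also yields the target's next token for free:
        # the correction at the first mismatch, or a bonus token when
        # every draft was accepted.
        seq.append(target_model(seq))
    return seq[:len(prompt) + max_new]
```

Note that the result depends only on the target model: a bad draft model only slows things down (fewer tokens accepted per pass), it never changes the output.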
Benefits
Typically a 2-3x speedup with output identical to running the target model alone, since the accept/reject rule preserves the target model's distribution. Works with any draft/target pair that shares a tokenizer, and is most effective when the draft model reliably predicts common continuations, so that long runs of draft tokens are accepted per verify pass.