AI Glossary

Speculative Decoding

An inference acceleration technique where a smaller, faster model drafts multiple tokens that are then verified in parallel by the larger model.

How It Works

A small 'draft' model generates several candidate tokens quickly and cheaply. The large 'target' model then verifies all candidates in a single parallel forward pass, which is much faster than generating them one at a time. Accepted tokens are kept; at the first rejection, the target model supplies its own corrected token and drafting resumes from that point. Because acceptance is done against the target model's distribution, the final output matches what the target alone would have produced.
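The loop above can be sketched in a few lines. This is a simplified greedy variant (full speculative sampling uses rejection sampling over probability distributions to preserve the target's distribution exactly); `target` and `draft` are stand-in callables that map a token prefix to a next token, and all names here are illustrative:

```python
def speculative_decode(target, draft, prefix, k, n_tokens):
    """Greedy speculative decoding sketch.

    draft proposes k tokens sequentially (cheap); target checks all k
    positions in one batched pass (simulated here by a list comprehension).
    """
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        # 1. Draft model proposes k candidate tokens, one after another.
        candidates, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            candidates.append(t)
            ctx.append(t)
        # 2. Target model scores every candidate position "in parallel":
        #    one forward pass over all k prefixes instead of k serial passes.
        target_preds = [target(out + candidates[:i]) for i in range(k)]
        # 3. Keep the longest agreeing prefix; on the first mismatch,
        #    append the target's own token and start a new drafting round.
        for cand, pred in zip(candidates, target_preds):
            if cand == pred:
                out.append(cand)
            else:
                out.append(pred)  # target's correction at the rejection point
                break
    return out[len(prefix):][:n_tokens]
```

With toy deterministic "models" (e.g. the target always emits `(last + 1) % 10` and the draft agrees except after a 4), most rounds accept all k drafted tokens, which is where the speedup comes from.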

Benefits

Typically a 2-3x speedup with identical output quality, since the target model's distribution is preserved by the acceptance step. Works with any draft/target pair that shares a tokenizer and vocabulary. It is particularly effective when the draft model is good at predicting common continuations, because long runs of drafted tokens get accepted in a single verification pass.
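The achievable speedup depends on the per-token acceptance rate and the draft model's relative cost. A rough back-of-envelope model, along the lines of the analysis in the original speculative decoding papers (the parameter names `alpha`, `k`, and `c` are illustrative):

```python
def expected_speedup(alpha, k, c):
    """Estimate speculative-decoding speedup over plain autoregression.

    alpha: probability each drafted token is accepted (assumed i.i.d.)
    k:     number of tokens drafted per round
    c:     draft-model cost relative to one target forward pass
    """
    # Expected tokens emitted per round: geometric series 1 + alpha + ... + alpha^k.
    tokens_per_round = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Cost per round: k cheap draft steps plus one target verification pass.
    cost_per_round = k * c + 1
    return tokens_per_round / cost_per_round
```

For example, with an 80% acceptance rate, 4 drafted tokens per round, and a draft model costing 5% of the target, this estimate lands in the commonly reported 2-3x range.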

Last updated: March 5, 2026