Knowledge Distillation
Training a smaller student model to replicate the behavior of a larger teacher model.
Overview
Knowledge distillation is a technique where a smaller 'student' model is trained to mimic the outputs of a larger, more capable 'teacher' model. Rather than training on hard labels alone, the student also learns from the teacher's softened probability distributions (logits divided by a temperature T > 1 before the softmax), which carry richer information about inter-class relationships than one-hot labels do.
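To make the temperature idea concrete, here is a minimal sketch in pure Python. The class names and logit values are illustrative, not from any particular model: raising the temperature spreads probability mass across classes, exposing the teacher's view of which wrong answers are more plausible.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by temperature T."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, car]: the teacher
# is confident the image is a cat, but 'dog' is far more plausible
# than 'car' -- information a one-hot label would discard.
teacher_logits = [8.0, 4.0, -2.0]

hard = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
# 'dog' rises from roughly 2% of the mass at T=1 to roughly 25% at T=4,
# while the ranking cat > dog > car is preserved.
```

The student is trained against distributions like `soft`, so it learns the cat/dog similarity structure rather than just the single correct label.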
Key Details
Distillation enables deploying smaller, faster models that retain much of the teacher's performance. In the LLM era, distillation is used to create efficient models from frontier models — for example, training a 7B-parameter model on outputs from a 70B-parameter model. Variants include response distillation (matching the teacher's outputs), feature distillation (matching intermediate representations), and online distillation (training teacher and student simultaneously rather than distilling from a frozen teacher). Note: many LLM providers prohibit using their model outputs to train competing models in their terms of service, which restricts this kind of distillation.
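Response distillation is commonly trained with a weighted combination of a soft-target term (KL divergence between softened teacher and student distributions) and an ordinary hard-label cross-entropy term. A minimal sketch, following the standard formulation with its T² scaling of the soft term; the weighting `alpha` and the logit values are illustrative assumptions:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Combined loss: alpha * T^2 * KL(teacher || student at temperature T)
    plus (1 - alpha) * cross-entropy against the true hard label."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student))
    # Hard-label term uses the student's ordinary (T=1) distribution.
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce

# Illustrative check: a student matching the teacher incurs a lower
# loss than one that disagrees on the top class.
teacher = [8.0, 4.0, -2.0]
loss_matched = distillation_loss(teacher, teacher, hard_label=0)
loss_diverged = distillation_loss([0.0, 4.0, -2.0], teacher, hard_label=0)
```

The T² factor keeps the gradient magnitudes of the soft term comparable across temperatures; in practice the same loss is computed over batches of logits with an autodiff framework rather than plain Python lists.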