Knowledge Distillation
A model compression technique in which a smaller 'student' model is trained to mimic the behavior of a larger 'teacher' model, transferring the teacher's knowledge into a more efficient form.
How It Works
The student learns from the teacher's soft probability outputs rather than hard labels. The teacher's logits are typically passed through a softmax with a raised temperature, which flattens the distribution and exposes relative confidences across incorrect classes. These soft targets carry rich information about the teacher's learned representations -- for example, that a '7' looks somewhat like a '1' but not at all like a '0'.
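The loss described above can be sketched in plain NumPy. This is a minimal illustration of the classic formulation (a weighted sum of a temperature-scaled KL term and ordinary cross-entropy); the specific logits, temperature, and weighting are illustrative choices, not values from the source.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative confidence across the wrong classes.
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    # Soft term: KL divergence between the temperature-softened teacher
    # and student distributions, scaled by T^2 to keep gradient
    # magnitudes comparable across temperatures.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    soft = np.sum(p_t * (np.log(p_t) - np.log(p_s))) * temperature ** 2
    # Hard term: ordinary cross-entropy against the ground-truth label.
    hard = -np.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1 - alpha) * hard

# Toy example: logits over classes ['0', '1', '7'] for an image of a '7'.
# The teacher assigns some mass to '1' and almost none to '0'.
teacher = np.array([0.1, 2.0, 5.0])
student = np.array([0.0, 1.0, 3.0])
loss = distillation_loss(student, teacher, hard_label=2)
```

The student minimizes this combined loss, so it is pulled toward both the correct label and the teacher's full output distribution.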
Applications
Deploying large model capabilities on mobile devices. Reducing inference costs in production. Creating efficient specialized models from general-purpose LLMs. DistilBERT achieves 97% of BERT's performance with 40% fewer parameters.
Distillation in LLMs
Many smaller open-source LLMs are distilled from larger models. Rather than matching logits directly, the process typically involves generating training data (prompts and completions) from the teacher model and fine-tuning the student on it with a standard language-modeling objective. This is a key technique behind the rapid proliferation of capable small models.
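The generate-then-fine-tune pipeline above can be sketched as follows. This is a hypothetical outline: `teacher_generate` stands in for a call to a real teacher model (here a stub so the data shape is visible), and the example prompts are invented.

```python
def teacher_generate(prompt: str) -> str:
    # Placeholder for querying the large teacher model (assumption);
    # in practice this would be an inference call returning a completion.
    return f"Answer to: {prompt}"

def build_distillation_dataset(prompts):
    # Pair each prompt with the teacher's completion. The student is then
    # fine-tuned on these pairs with ordinary next-token cross-entropy,
    # so the teacher's behavior becomes the student's training signal.
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

dataset = build_distillation_dataset(["What is 2+2?", "Define entropy."])
```

The resulting dataset is ordinary supervised fine-tuning data; the "distillation" lies in the fact that the labels come from the teacher rather than from humans.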