Model Distillation
A technique where a smaller 'student' model is trained to mimic the behavior of a larger 'teacher' model, achieving comparable performance with fewer parameters.
How It Works
Train or obtain a large, high-quality teacher model. Run the teacher over the training data to generate soft probability distributions (soft labels), typically with a raised softmax temperature that exposes the smaller probabilities. Train a smaller student model to match these soft distributions, usually alongside the standard cross-entropy loss on the ground-truth hard labels. The soft labels contain richer information than hard labels because they encode how the teacher ranks every class, not just the correct one.
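The training objective above can be sketched in pure Python. This is a minimal illustration, not a production implementation: the function names (`softmax`, `distillation_loss`) and the example hyperparameters (temperature 2.0, blend weight `alpha=0.5`) are assumptions chosen for clarity; real systems typically use an ML framework and tune both values.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; T > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend of (a) cross-entropy against the teacher's soft labels,
    computed at a raised temperature, and (b) standard cross-entropy
    against the ground-truth hard label."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Soft loss: how far the student's distribution is from the teacher's.
    soft_loss = -sum(t * math.log(s)
                     for t, s in zip(soft_teacher, soft_student))
    # Hard loss: ordinary cross-entropy on the true label at T = 1.
    hard_probs = softmax(student_logits)
    hard_loss = -math.log(hard_probs[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

A student whose logits agree with the teacher's gets a lower loss than one that disagrees, which is the training signal that drives the student toward the teacher's behavior.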
Why Soft Labels Help
A teacher classifying a dog photo might output: dog 0.90, wolf 0.06, cat 0.04. These soft probabilities teach the student about inter-class similarities that the hard label (just 'dog') doesn't convey: dogs look more like wolves than like cats. Hinton and colleagues called this 'dark knowledge'; matching it transfers the teacher's learned similarity structure to the student.
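The effect of temperature on these soft labels can be seen directly. The logits below are hypothetical values chosen so that the T = 1 distribution roughly matches the dog/wolf/cat example; raising the temperature spreads probability mass onto the secondary classes, making the dark knowledge easier for the student to learn from.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes [dog, wolf, cat].
logits = [6.0, 3.0, 2.0]

p_sharp = softmax(logits)                   # T = 1: nearly all mass on "dog"
p_soft = softmax(logits, temperature=3.0)   # T = 3: wolf and cat become visible
```

At T = 1 the distribution is dominated by "dog"; at T = 3 the relative ordering is unchanged (dog > wolf > cat) but "wolf" and "cat" carry noticeably more probability, giving the student a stronger signal about which wrong answers are less wrong.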
Applications
Distillation enables deploying LLMs on mobile devices, reducing API serving costs, and creating fast models for real-time applications. DistilBERT, for example, retained 97% of BERT's language-understanding performance with 40% fewer parameters while running about 60% faster. Many production LLMs are distilled from larger models.