Knowledge Distillation
Training a smaller, efficient model to replicate the behavior of a larger model, transferring learned knowledge while dramatically reducing computational requirements.
Process
The teacher model produces logits on training data, which are converted into soft probability distributions via a temperature-scaled softmax. The student model trains to match these soft targets. Higher temperatures soften the distribution, revealing how the teacher ranks the incorrect classes and so conveying more information than a hard label. The distillation loss is often combined with standard hard-label cross-entropy training.
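The process above can be sketched as a single loss function. This is a minimal pure-Python illustration, not a production implementation; the weighting scheme (`alpha`) and temperature value are common conventions, and the `T**2` factor is the standard correction that keeps the soft-target gradients comparable in magnitude across temperatures.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields a softer distribution.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    # Soft term: KL divergence between temperature-softened teacher
    # and student distributions.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    # Hard term: standard cross-entropy against the true label (T=1).
    hard = -math.log(softmax(student_logits)[hard_label])
    # T**2 rescales the soft term so its gradients match the hard term's scale.
    return alpha * (T ** 2) * soft + (1 - alpha) * hard
```

When the student already matches the teacher exactly, the soft term vanishes and only the hard-label cross-entropy remains.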
Types
Response-based: Match teacher outputs. Feature-based: Match intermediate representations. Relation-based: Match relationships between examples. Self-distillation: A model distills knowledge from its own deeper layers.
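Of the variants above, feature-based distillation is the easiest to sketch in isolation: the student's intermediate representation is compared against the teacher's, with a learned linear projection bridging the two feature widths when they differ. The function below is a hypothetical illustration in pure Python (a real system would use a tensor library and train the projection jointly with the student).

```python
def feature_hint_loss(student_feat, teacher_feat, projection):
    # Project the student's feature vector into the teacher's feature space.
    # `projection` is a matrix with one row per teacher dimension.
    projected = [sum(w * s for w, s in zip(row, student_feat)) for row in projection]
    # Mean-squared error between the projected student and teacher features.
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_feat)) / len(teacher_feat)
```

Response-based distillation, by contrast, needs no projection, since both models share the same output space.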
Results
DistilBERT: 60% of BERT's size and roughly 60% faster, while retaining 97% of its language-understanding performance. Distilled LLMs can run on mobile devices. Many production AI systems serve distilled models for cost-effective inference, making distillation a key technique for deploying AI at scale.