AI Glossary

Knowledge Distillation

Training a smaller, efficient model to replicate the behavior of a larger model, transferring learned knowledge while dramatically reducing computational requirements.

Process

The teacher model produces logits on the training data, which a temperature-scaled softmax converts into soft probability distributions. The student model trains to match these soft targets. The temperature controls how much information the soft labels convey: a higher temperature flattens the distribution, exposing the teacher's relative confidence across incorrect classes. This soft-target loss is often combined with standard hard-label training.
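The process above can be sketched in a few lines of numpy. This is a minimal illustration, not a production recipe; the function name `distillation_loss` and the default values `T=4.0` and `alpha=0.7` are assumptions chosen for the example, and the `T**2` scaling follows the common convention for keeping gradient magnitudes comparable across temperatures.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature T > 1 softens the distribution, revealing the
    # teacher's relative confidence among non-top classes.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=4.0, alpha=0.7):
    """Weighted sum of a soft-target KL term and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so the term's magnitude stays comparable as T varies.
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    soft_loss = (T ** 2) * kl
    # Standard cross-entropy against the ground-truth hard label (T = 1).
    hard_loss = -np.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Note how the temperature matters: for teacher logits `[5.0, 2.0, 0.5]`, the top-class probability drops from roughly 0.94 at `T=1` to roughly 0.56 at `T=4`, so the student sees much more of the teacher's ranking over the remaining classes.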

Types

Response-based: Match teacher outputs. Feature-based: Match intermediate representations. Relation-based: Match relationships between examples. Self-distillation: A model distills knowledge from its own deeper layers.
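The response-based variant is what the loss sketch above captures; feature-based distillation instead compares intermediate hidden states. A minimal sketch, assuming hypothetical hidden sizes (teacher 768-dim, student 256-dim) and a learned linear projection `W` that maps the student's features into the teacher's space so the two can be compared with a mean-squared-error term:

```python
import numpy as np

def feature_distillation_loss(student_feat, teacher_feat, W):
    # Project the smaller student representation into the teacher's
    # feature space, then penalize the squared difference.
    projected = W @ student_feat
    return np.mean((projected - teacher_feat) ** 2)

# Illustrative shapes only; in practice W is trained jointly with the student.
rng = np.random.default_rng(0)
student_feat = rng.standard_normal(256)
teacher_feat = rng.standard_normal(768)
W = rng.standard_normal((768, 256)) * 0.01
loss = feature_distillation_loss(student_feat, teacher_feat, W)
```

In training, this term is added to the response-based loss at one or more chosen layers, giving the student a richer signal than final outputs alone.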

Results

DistilBERT is roughly 60% of BERT's size while retaining about 97% of its performance. Distilled LLMs can run on mobile devices, and many production AI systems serve distilled models for cost-effective inference, making distillation a key technique for deploying AI at scale.

Last updated: March 5, 2026