Model Distillation
A technique where a smaller 'student' model is trained to mimic the behavior of a larger 'teacher' model, achieving comparable performance with fewer parameters.
How It Works
Train or obtain a large, high-quality teacher model. Run the teacher over the training data to generate soft probability distributions (soft labels), typically with a raised softmax temperature that exposes the smaller probabilities. Train a smaller student model to match these soft distributions, usually alongside the standard cross-entropy loss on the ground-truth hard labels. The soft labels contain richer information than hard labels because they encode how the teacher ranks every class, not just the correct one.
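The training objective above can be sketched in pure Python. This is a minimal illustration, not a production implementation: the function names (`softmax`, `distillation_loss`) and the example hyperparameters (temperature 2.0, blend weight `alpha=0.5`) are assumptions chosen for clarity; real systems typically use an ML framework and tune both values.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; T > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Blend of (a) cross-entropy against the teacher's soft labels,
    computed at a raised temperature, and (b) standard cross-entropy
    against the ground-truth hard label."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Soft loss: how far the student's distribution is from the teacher's.
    soft_loss = -sum(t * math.log(s)
                     for t, s in zip(soft_teacher, soft_student))
    # Hard loss: ordinary cross-entropy on the true label at T = 1.
    hard_probs = softmax(student_logits)
    hard_loss = -math.log(hard_probs[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

A student whose logits agree with the teacher's gets a lower loss than one that disagrees, which is the training signal that drives the student toward the teacher's behavior.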
Why Soft Labels Help
A teacher classifying a dog photo might output: dog 0.90, wolf 0.06, cat 0.04. These soft probabilities teach the student about inter-class similarities that the hard label (just 'dog') doesn't convey: dogs look more like wolves than like cats. Hinton and colleagues called this 'dark knowledge'; matching it transfers the teacher's learned similarity structure to the student.
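The effect of temperature on these soft labels can be seen directly. The logits below are hypothetical values chosen so that the T = 1 distribution roughly matches the dog/wolf/cat example; raising the temperature spreads probability mass onto the secondary classes, making the dark knowledge easier for the student to learn from.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes [dog, wolf, cat].
logits = [6.0, 3.0, 2.0]

p_sharp = softmax(logits)                   # T = 1: nearly all mass on "dog"
p_soft = softmax(logits, temperature=3.0)   # T = 3: wolf and cat become visible
```

At T = 1 the distribution is dominated by "dog"; at T = 3 the relative ordering is unchanged (dog > wolf > cat) but "wolf" and "cat" carry noticeably more probability, giving the student a stronger signal about which wrong answers are less wrong.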
Applications
Distillation enables deploying LLMs on mobile devices, reducing API serving costs, and creating fast models for real-time applications. DistilBERT, for example, retained 97% of BERT's language-understanding performance with 40% fewer parameters while running about 60% faster. Many production LLMs are distilled from larger models.