Knowledge Distillation
A model compression technique in which a smaller 'student' model is trained to mimic the behavior of a larger 'teacher' model, transferring the teacher's knowledge into a more efficient form.
How It Works
The student learns from the teacher's soft probability outputs rather than hard labels. The teacher's logits are typically passed through a softmax with a raised temperature, which flattens the distribution and exposes relative confidences across incorrect classes. These soft targets carry rich information about the teacher's learned representations -- for example, that a '7' looks somewhat like a '1' but not at all like a '0'.
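The loss described above can be sketched in plain NumPy. This is a minimal illustration of the classic formulation (a weighted sum of a temperature-scaled KL term and ordinary cross-entropy); the specific logits, temperature, and weighting are illustrative choices, not values from the source.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative confidence across the wrong classes.
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    # Soft term: KL divergence between the temperature-softened teacher
    # and student distributions, scaled by T^2 to keep gradient
    # magnitudes comparable across temperatures.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    soft = np.sum(p_t * (np.log(p_t) - np.log(p_s))) * temperature ** 2
    # Hard term: ordinary cross-entropy against the ground-truth label.
    hard = -np.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1 - alpha) * hard

# Toy example: logits over classes ['0', '1', '7'] for an image of a '7'.
# The teacher assigns some mass to '1' and almost none to '0'.
teacher = np.array([0.1, 2.0, 5.0])
student = np.array([0.0, 1.0, 3.0])
loss = distillation_loss(student, teacher, hard_label=2)
```

The student minimizes this combined loss, so it is pulled toward both the correct label and the teacher's full output distribution.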
Applications
Deploying large model capabilities on mobile devices. Reducing inference costs in production. Creating efficient specialized models from general-purpose LLMs. DistilBERT achieves 97% of BERT's performance with 40% fewer parameters.
Distillation in LLMs
Many smaller open-source LLMs are distilled from larger models. Rather than matching logits directly, the process typically involves generating training data (prompts and completions) from the teacher model and fine-tuning the student on it with a standard language-modeling objective. This is a key technique behind the rapid proliferation of capable small models.
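The generate-then-fine-tune pipeline above can be sketched as follows. This is a hypothetical outline: `teacher_generate` stands in for a call to a real teacher model (here a stub so the data shape is visible), and the example prompts are invented.

```python
def teacher_generate(prompt: str) -> str:
    # Placeholder for querying the large teacher model (assumption);
    # in practice this would be an inference call returning a completion.
    return f"Answer to: {prompt}"

def build_distillation_dataset(prompts):
    # Pair each prompt with the teacher's completion. The student is then
    # fine-tuned on these pairs with ordinary next-token cross-entropy,
    # so the teacher's behavior becomes the student's training signal.
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

dataset = build_distillation_dataset(["What is 2+2?", "Define entropy."])
```

The resulting dataset is ordinary supervised fine-tuning data; the "distillation" lies in the fact that the labels come from the teacher rather than from humans.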