Pruning
A model compression technique that removes unnecessary weights, neurons, or layers from a trained neural network to reduce size and improve inference speed.
How It Works
After training, many weights are near zero and contribute little to predictions. Pruning removes these weights (sets them to zero or removes entire neurons/attention heads). The model is then optionally fine-tuned to recover any lost performance.
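The remove-small-weights step can be sketched in a few lines of NumPy; this is a minimal illustration of magnitude-based pruning on a single weight matrix (the function name `magnitude_prune` and the toy matrix are invented for the example):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude
    (unstructured magnitude pruning)."""
    flat = np.abs(weights).flatten()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_pruned = magnitude_prune(W, 0.5)
print(np.mean(W_pruned == 0))  # fraction of weights zeroed: 0.5
```

In a real pipeline the mask would be applied to every layer and the model fine-tuned afterward so the remaining weights can compensate.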
Types
Unstructured pruning: Removes individual weights, producing sparse matrices that need specialized kernels or hardware to actually run faster.
Structured pruning: Removes entire neurons, channels, or attention heads, shrinking the dense matrices themselves, so speedups come for free on standard hardware.
Magnitude pruning: A selection criterion usable with either approach — removes the smallest-magnitude weights (or units) on the assumption they contribute least.
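To contrast with the unstructured case, structured pruning actually shrinks the weight matrix. A minimal sketch using NumPy, dropping the output neurons whose weight rows have the smallest L2 norm (the function name `prune_neurons` and the toy shapes are illustrative):

```python
import numpy as np

def prune_neurons(W, frac):
    """Structured pruning: drop the fraction of output neurons (rows of W)
    with the smallest L2 norm, returning a genuinely smaller matrix."""
    norms = np.linalg.norm(W, axis=1)
    n_keep = W.shape[0] - int(frac * W.shape[0])
    # keep the largest-norm rows, preserving their original order
    keep = np.sort(np.argsort(norms)[-n_keep:])
    return W[keep]

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))       # 8 output neurons, 4 inputs
W_small = prune_neurons(W, 0.25)  # drop the 2 weakest neurons
print(W_small.shape)  # (6, 4)
```

Because the result is a smaller dense matrix, downstream layers must also drop the corresponding input dimensions — which is why structured pruning is easier to accelerate but harder to apply.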
Results
Models can often be pruned by 50-90% with minimal accuracy loss, especially when fine-tuned after pruning. Combined with quantization, pruning can shrink models by 10-20x, enabling deployment on edge devices.