AI Glossary

Pruning

A model compression technique that removes unnecessary weights, neurons, or layers from a trained neural network to reduce size and improve inference speed.

How It Works

After training, many weights are near zero and contribute little to predictions. Pruning eliminates these weights, either by setting them to zero or by removing entire structures such as neurons or attention heads. The model is then optionally fine-tuned to recover any lost accuracy.
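The core idea can be sketched in a few lines of NumPy. This is a minimal, illustrative magnitude-pruning routine (the function name `magnitude_prune` is hypothetical, not from any library): it zeroes out the fraction of weights with the smallest absolute values, leaving the rest untouched.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    # Keep weights at or above the magnitude threshold; zero the rest.
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

# Toy weight matrix standing in for one layer of a trained network.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))

pruned = magnitude_prune(w, sparsity=0.5)
print(f"zeroed fraction: {np.mean(pruned == 0):.2f}")
```

In a real workflow this would be applied per layer to the trained model's parameters, typically followed by a short fine-tuning pass to recover accuracy.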

Types

Unstructured pruning: Removes individual weights, producing sparse matrices.

Structured pruning: Removes entire neurons, channels, or attention heads, which is easier to accelerate on standard hardware.

Magnitude pruning: Removes the smallest-magnitude weights.
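To contrast with the unstructured case, here is a small sketch of structured pruning under an assumed criterion (the helper name `structured_prune_neurons` is hypothetical): it drops the output neurons (rows of a weight matrix) with the smallest L2 norms, yielding a genuinely smaller dense matrix rather than a sparse one.

```python
import numpy as np

def structured_prune_neurons(weights: np.ndarray, n_remove: int) -> np.ndarray:
    """Remove the n_remove output neurons (rows) with the smallest L2 norm."""
    norms = np.linalg.norm(weights, axis=1)
    keep = np.argsort(norms)[n_remove:]   # indices of neurons to keep
    return weights[np.sort(keep)]         # preserve original row ordering

rng = np.random.default_rng(1)
w = rng.normal(size=(6, 4))               # 6 output neurons, 4 inputs each

smaller = structured_prune_neurons(w, n_remove=2)
print(smaller.shape)  # (4, 4)
```

Because the result is a smaller dense matrix, it speeds up inference on ordinary hardware, whereas unstructured sparsity only pays off with sparse-matrix kernels or hardware support.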

Results

Models can often have 50-90% of their weights pruned with minimal loss in accuracy. Combined with quantization, pruning can shrink models by 10-20x, enabling deployment on edge devices.

Last updated: March 5, 2026