AI Glossary

Weight Quantization

Reducing the numerical precision of model weights (e.g., from 32-bit to 4-bit) to decrease model size and increase speed.

Overview

Weight quantization converts model parameters from higher-precision formats (FP32, FP16) to lower-precision formats (INT8, INT4, or even lower). This dramatically reduces model size (4x for FP32 to INT8, 8x for FP32 to INT4) and speeds up inference, both by cutting the memory bandwidth needed to load weights and by enabling efficient integer arithmetic.
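The idea can be sketched with the simplest scheme: symmetric per-tensor absmax quantization to INT8. This is a minimal illustration in NumPy, not any particular library's implementation; the function names are made up for the example.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization of FP32 weights to INT8.

    The scale maps the largest-magnitude weight to 127, so every
    quantized value fits in the signed 8-bit range [-127, 127].
    """
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # 16 KiB of FP32 weights
q, scale = quantize_int8(w)                       # 4 KiB of INT8 plus one scale
w_hat = dequantize_int8(q, scale)
# rounding error per weight is at most half a quantization step
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Storing INT8 codes plus a single FP32 scale is what gives the 4x size reduction; the dequantized weights differ from the originals by at most half a quantization step.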

Techniques

Post-training quantization (PTQ): quantize a trained model using a small calibration dataset; no retraining required.
Quantization-aware training (QAT): simulate quantization effects during training so the model learns to compensate, giving better accuracy at low bit widths.
GPTQ and AWQ: popular post-training methods for LLM quantization; GGUF is a common file format for distributing quantized models.
QLoRA: combines a 4-bit quantized base model with LoRA fine-tuning.
Modern quantization makes it possible to run 70B-parameter models on consumer GPUs.
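Low-bit methods such as those above typically quantize weights in small groups, each with its own scale, so a single outlier only distorts its own group. This is a hedged sketch of group-wise symmetric 4-bit quantization, illustrating the grouping idea only, not any specific format's bit packing; the names are illustrative.

```python
import numpy as np

def quantize_4bit_groups(w: np.ndarray, group_size: int = 64):
    """Group-wise symmetric 4-bit quantization: each run of
    `group_size` weights shares one scale (absmax / 7), and codes
    lie in the signed 4-bit range [-8, 7]."""
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_4bit_groups(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Expand 4-bit codes back to approximate FP32 weights."""
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scales = quantize_4bit_groups(w)
w_hat = dequantize_4bit_groups(q, scales)
```

At group size 64 with one FP16 scale per group, storage works out to roughly 4 + 16/64 ≈ 4.25 bits per weight, about a 7.5x reduction from FP32, which is how 70B-parameter models fit on a single consumer GPU.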


Last updated: March 5, 2026