What is Quantization?
Quantization is a model compression technique that reduces the numerical precision of a neural network's weights and activations. Instead of storing each number as a high-precision 32-bit floating-point value (FP32), quantization converts them to lower-precision formats like 16-bit (FP16), 8-bit integers (INT8), or even 4-bit integers (INT4).
The result is a model that is dramatically smaller, faster to run, and requires less memory -- all with surprisingly little loss in accuracy. A model quantized from FP32 to INT8, for example, is roughly four times smaller and can run two to four times faster, while typically losing less than 1% accuracy.
Think of it like reducing the resolution of an image. A photograph at 4K resolution contains enormous detail but takes up a lot of storage space. Downscale it to 1080p and you lose some fine detail, but the image is still perfectly usable and the file is much smaller. Quantization does the same thing for the numbers inside a neural network: it reduces precision while preserving the information that matters most.
Quantization has become essential as AI models grow ever larger. A model like Llama 2 with 70 billion parameters requires about 280 gigabytes of memory in FP32 format (70 billion parameters at 4 bytes each) -- far too much for most consumer hardware. Quantized to INT4, the same model fits in about 35 gigabytes, making it possible to run on a single high-end GPU or even consumer-grade hardware. Quantization is what makes it possible to bring powerful AI to phones, laptops, and edge devices.
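The memory figures above follow from simple arithmetic: bytes = parameters × bits ÷ 8. A quick sketch (the 70-billion-parameter count comes from the text; the helper function name is my own):

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Gigabytes needed just to store num_params weights at a given bit width.
    Ignores activations, KV cache, and framework overhead."""
    return num_params * bits_per_weight / 8 / 1e9

params = 70_000_000_000  # a 70B-parameter model, as in the Llama 2 example

fp32_gb = weight_memory_gb(params, 32)  # 280.0 GB
fp16_gb = weight_memory_gb(params, 16)  # 140.0 GB
int8_gb = weight_memory_gb(params, 8)   # 70.0 GB
int4_gb = weight_memory_gb(params, 4)   # 35.0 GB
```

Halving the bit width halves the storage, which is why each step down the precision ladder doubles the compression.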
Why Shrink Models?
The need for quantization stems from a fundamental tension in modern AI: the most capable models are also the most expensive to run. Training a large language model might require thousands of GPUs for months, but deploying it for inference also demands significant resources. Every user query requires the model to perform billions of multiplications, each involving a weight and an activation value.
Memory is a primary bottleneck. A 175-billion-parameter model in FP32 requires about 700 gigabytes just to store the weights. That exceeds the memory of even the most expensive data center GPUs. Loading these weights from memory to the compute units is often the slowest part of inference -- the processor spends more time waiting for data than actually computing. Smaller numbers mean less data to move, which directly translates to faster inference.
Cost is a practical constraint. Running AI inference in the cloud costs money for every second of GPU time. If quantization makes your model four times faster, your inference cost drops by roughly 75%. For companies serving millions of AI requests per day, this can save millions of dollars per year. Quantization is not just a technical optimization -- it is an economic necessity.
Edge deployment requires small models. Smartphones, tablets, IoT devices, and embedded systems have strict constraints on memory, power, and compute. Running AI locally on these devices (instead of sending data to the cloud) has huge benefits for privacy, latency, and reliability. But fitting a useful model into 2 gigabytes of mobile RAM requires aggressive compression, and quantization is the primary tool for achieving it.
Environmental impact is another consideration. Large models consume enormous amounts of electricity for inference. Quantization reduces the energy needed per prediction, which matters when you multiply by billions of predictions per day across the industry. Making AI more efficient is not just good engineering -- it is responsible engineering.
How It Works: FP32 to INT8
To understand quantization, you need to understand how numbers are stored in computers. FP32 (32-bit floating point) uses 32 bits to represent each number, allowing for extremely fine-grained values with about 7 decimal digits of precision. This is the standard format used during model training because the optimization process (gradient descent) requires high precision to make small, careful weight adjustments.
INT8 (8-bit integer) uses only 8 bits per number and can represent just 256 different values (from -128 to 127 for signed integers, or 0 to 255 for unsigned). That is far fewer values than FP32, but it turns out to be enough for inference. The key insight is that neural networks are remarkably robust to small perturbations in their weights -- they do not need seven decimal digits of precision to make good predictions.
The quantization process works by mapping the range of FP32 values to INT8 values. First, you determine the minimum and maximum weight values in a given layer. Then you linearly scale that range to fit within the INT8 range of -128 to 127. Each FP32 weight is rounded to its nearest INT8 equivalent. The scaling factor and zero point are stored so that the INT8 values can be converted back (approximately) to FP32 during computation.
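The mapping just described can be sketched in a few lines of plain Python. This is a minimal asymmetric (min/max) scheme for illustration only; real frameworks vectorize the math and add refinements such as per-channel scales:

```python
def quantize(weights, qmin=-128, qmax=127):
    """Map FP32 weights onto the signed INT8 grid; return ints plus the
    scale and zero point needed to convert back."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (qmax - qmin)        # FP32 units per integer step
    zero_point = round(qmin - lo / scale)    # the integer that represents 0.0
    q = [max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximately recover the original FP32 values."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.62, 0.0, 0.31, 1.4, -1.1]      # illustrative layer weights
q, scale, zero_point = quantize(weights)
recovered = dequantize(q, scale, zero_point)  # each value within one step
```

Note that zero maps back to exactly zero (the point of storing a zero point), and every other value comes back within one quantization step of the original -- that rounding gap is the precision lost.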
There are two main approaches to quantization. Post-Training Quantization (PTQ) takes a fully trained FP32 model and converts it to lower precision without any retraining. It is fast and easy -- you can quantize a model in minutes -- but may cause accuracy loss, especially for very aggressive quantization (e.g., INT4). PTQ works well for INT8 quantization of most models.
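A PTQ pass over a trained model can be as simple as the sketch below: each layer gets its own scale from its observed weight range, and no gradient updates are involved. The layer names and weight values are hypothetical, and this uses a symmetric per-layer scheme for brevity:

```python
def ptq_int8(model_weights):
    """Per-layer symmetric INT8 post-training quantization.
    Returns, for each layer, the quantized integers and the scale
    needed to dequantize them. No retraining occurs."""
    quantized = {}
    for name, weights in model_weights.items():
        # Symmetric scheme: scale chosen so the largest magnitude maps to 127.
        scale = max(abs(w) for w in weights) / 127 or 1.0  # guard all-zero layer
        q = [round(w / scale) for w in weights]
        quantized[name] = (q, scale)
    return quantized

# Hypothetical trained model: a dict of layer name -> FP32 weights.
model = {
    "layer1": [0.5, -1.27, 0.03],
    "layer2": [2.0, -0.4],
}
qmodel = ptq_int8(model)
```

The entire conversion is a single pass over the weights, which is why PTQ takes minutes rather than the hours or days that retraining would.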
Quantization-Aware Training (QAT) simulates the effects of quantization during training. The model learns to compensate for the reduced precision, developing weights that are more robust to rounding errors. QAT requires more effort (you need to retrain or fine-tune the model) but generally produces better results, especially for lower-bit quantization. When quantizing to INT4 or INT2, QAT is often necessary to maintain acceptable accuracy.
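One common way to simulate quantization during training is "fake quantization" with a straight-through estimator: the forward pass sees rounded weights, but the gradient updates the full-precision weight as if the rounding were not there. A toy single-weight sketch (the model, loss, and grid are my own illustration, not a real QAT pipeline):

```python
def fake_quantize(w, scale=0.1):
    """Snap w to the nearest representable value on a fixed grid."""
    return round(w / scale) * scale

def train_step(w, x, target, lr=0.1):
    """One SGD step on loss = (w_q * x - target)^2.
    Straight-through estimator: the forward pass uses the quantized
    weight w_q, but d(w_q)/dw is treated as 1, so the update lands
    on the full-precision 'shadow' weight w."""
    w_q = fake_quantize(w)
    pred = w_q * x
    grad = 2 * (pred - target) * x
    return w - lr * grad

w = 0.33  # full-precision weight, maintained throughout training
for _ in range(20):
    w = train_step(w, x=1.0, target=0.5)
# After training, the quantized weight sits on the grid value
# nearest the target -- the model has learned around the rounding.
```

The key design choice is keeping two views of every weight: a full-precision copy that accumulates small gradient updates, and a quantized copy used for the forward pass, so the model experiences inference-time rounding while still being trainable.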
Mixed-precision quantization applies different precision levels to different parts of the model. Sensitive layers (like the first and last layers, or attention layers) might remain in FP16 or even FP32, while less sensitive layers are quantized to INT8 or INT4. This approach captures most of the efficiency gains while minimizing the accuracy impact by keeping critical computations at higher precision.
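A mixed-precision plan is essentially a table mapping layers to bit widths. The layer names and parameter counts below are hypothetical, chosen to show the pattern of keeping the sensitive first and last layers at FP16 while quantizing the bulk aggressively:

```python
# Hypothetical per-layer precision assignments (bits per weight).
LAYER_BITS = {
    "embedding": 16,  # first layer: sensitive, kept at FP16
    "attention": 8,   # moderately sensitive: INT8
    "ffn": 4,         # bulk of the parameters: aggressive INT4
    "lm_head": 16,    # last layer: sensitive, kept at FP16
}

# Hypothetical parameter counts per layer group.
LAYER_PARAMS = {
    "embedding": 130_000_000,
    "attention": 2_200_000_000,
    "ffn": 4_400_000_000,
    "lm_head": 130_000_000,
}

def total_gb(params, bits):
    """Total weight storage in GB for a given per-layer bit assignment."""
    return sum(params[name] * bits[name] for name in params) / 8 / 1e9

mixed_gb = total_gb(LAYER_PARAMS, LAYER_BITS)
fp32_gb = total_gb(LAYER_PARAMS, {name: 32 for name in LAYER_PARAMS})
# The mixed plan is several times smaller than uniform FP32, because the
# FFN layers holding most of the parameters carry the lowest precision.
```

Because the largest layer groups dominate the total, giving only them low precision captures most of the savings even though some layers stay at 16 bits.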
Trade-offs
Quantization is not free. There is always a trade-off between model size, inference speed, and accuracy. Understanding these trade-offs is essential for making good decisions about when and how to quantize.
Accuracy loss is the primary concern. Every time you reduce precision, you introduce rounding errors that slightly change the model's predictions. For INT8 quantization of well-trained models, the accuracy loss is typically less than 1% -- often so small it is within measurement noise. For INT4 quantization, the loss can be 1-3%, which is still acceptable for many applications. For INT2 and binary (1-bit) quantization, the loss is significant and these techniques are only practical for specialized models.
The type of model matters. Larger models are generally more robust to quantization because their redundancy provides a buffer against precision loss. A 70-billion-parameter model quantized to INT4 often performs better than a 7-billion-parameter model at FP32, because the larger model simply has more knowledge encoded in its weights. This is why quantization and model scaling go hand in hand.
Hardware support is a practical consideration. Not all processors can execute INT8 or INT4 operations efficiently. Modern GPUs (NVIDIA A100, H100), Apple Silicon (M1/M2/M3), and specialized AI accelerators (Google TPU) have dedicated hardware units for low-precision arithmetic. On hardware without this support, quantized models might not run any faster than their FP32 counterparts because the processor lacks the optimized instruction set.
Task sensitivity varies. Some tasks are more sensitive to quantization than others. Simple classification and detection tasks are generally robust to aggressive quantization. Tasks requiring precise numerical reasoning, subtle language understanding, or fine-grained generation may degrade more noticeably. Always benchmark your specific use case before and after quantization to ensure the trade-off is acceptable.
In the broader landscape of model compression, quantization works alongside other techniques like pruning (removing unnecessary weights), knowledge distillation (training a smaller model to mimic a larger one), and low-rank factorization (decomposing weight matrices into smaller ones). These techniques can be combined for even greater compression, often achieving 10-20x size reduction with minimal accuracy loss.
Key Takeaway
Quantization is one of the most important practical techniques for making AI accessible and affordable. By reducing the numerical precision of model weights from FP32 to INT8 or INT4, you can shrink models by 4-8x, speed up inference by 2-4x, and dramatically reduce memory and energy requirements -- all with minimal accuracy loss.
The two main approaches -- Post-Training Quantization for quick-and-easy compression and Quantization-Aware Training for maximum quality -- give you flexibility to balance effort and results. Mixed-precision approaches offer the best of both worlds by keeping sensitive layers at higher precision.
As models continue to grow, quantization becomes not just useful but essential. Without it, the largest models would be confined to the most expensive data centers. With it, powerful AI can run on consumer GPUs, mobile devices, and embedded systems. Quantization is the bridge between state-of-the-art research models and real-world deployment.
The next time you run a large language model locally on your laptop or use an AI feature on your phone, remember: quantization is what makes that possible. It is the engineering that turns a 140-gigabyte model into one that fits in your pocket.
Next: What is Regularization? →