A 70 billion parameter language model typically requires 140 GB of memory when stored in 16-bit floating point -- far more than any single consumer GPU can handle. Quantization is the technique that makes these models accessible by representing weights with fewer bits, reducing memory requirements by 2-4x while maintaining most of the model's quality. It is the key technology enabling LLMs to run on laptops, gaming GPUs, and edge devices.
How Quantization Works
Neural network weights are normally stored as 16-bit floating point (FP16) or 32-bit floating point (FP32) numbers. Quantization converts these to lower precision formats like 8-bit integers (INT8) or 4-bit integers (INT4). The process maps a range of floating point values to a smaller set of discrete integer values.
For example, INT8 quantization maps FP16 weights to integers between -128 and 127, cutting memory per weight from 2 bytes to 1 byte. INT4 goes further, using only the 16 values from -8 to 7, or 0.5 bytes per weight. In both cases a small scale factor is stored alongside the integers so they can be mapped back to approximate floating point values at inference time.
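As a sketch, symmetric quantization (one common flavor of this mapping) derives the scale from the largest absolute weight, then rounds each weight to the nearest integer step. A minimal NumPy illustration:

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int):
    """Map floats to signed integers in [-2**(bits-1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for INT8, 7 for INT4
    scale = np.abs(weights).max() / qmax          # float value of one integer step
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the integers back to approximate floating point values."""
    return q.astype(np.float32) * scale

w = np.array([0.42, -1.30, 0.07, 0.95], dtype=np.float32)
q8, s8 = quantize_symmetric(w, bits=8)   # [41, -127, 7, 93]
q4, s4 = quantize_symmetric(w, bits=4)   # coarser: [2, -7, 0, 5]
```

Dequantizing `q8` recovers the original weights to within about 0.002 here; the INT4 version is noticeably coarser, which is exactly the precision-for-memory trade the rest of this article is about.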
The memory savings are straightforward:
- FP16: 2 bytes per parameter (70B model = 140 GB)
- INT8: 1 byte per parameter (70B model = 70 GB)
- INT4: 0.5 bytes per parameter (70B model = 35 GB)
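The arithmetic behind these figures is just parameter count times bytes per parameter (weights only; a running model also needs memory for the KV cache and activations):

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight-storage memory in GB (using 1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

for fmt, size in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{fmt}: {model_memory_gb(70e9, size):.0f} GB")  # 140, 70, 35
```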
Quantization is like compressing an image from 24-bit color to 8-bit color. You lose some precision, but the overall picture remains recognizable and useful.
Key Takeaway
Quantization reduces model memory by 2-4x by using lower-precision number representations. INT4 quantization can fit a 70B parameter model in 35 GB, making it accessible on high-end consumer GPUs.
Popular Quantization Methods
GPTQ
GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization method that uses a small calibration dataset to minimize the quantization error. It processes weights layer by layer, adjusting remaining weights to compensate for errors introduced by quantizing earlier weights. GPTQ is optimized for GPU inference and typically produces INT4 models with minimal quality loss.
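The error-compensation idea can be shown with a deliberately simplified sketch: quantize one weight at a time and fold each rounding error into the next not-yet-quantized weight. Real GPTQ operates column-by-column over a whole weight matrix and uses second-order (Hessian) information to decide how to spread the error, which this toy version omits:

```python
import numpy as np

def quantize_with_compensation(row: np.ndarray, step: float) -> np.ndarray:
    """Round each weight to a multiple of `step`, folding the rounding
    error into the next not-yet-quantized weight (error feedback)."""
    q = np.empty_like(row)
    err = 0.0
    for i, w in enumerate(row):
        w_adj = w + err                  # compensate the accumulated error
        q[i] = np.round(w_adj / step) * step
        err = w_adj - q[i]               # pass the new error forward
    return q

row = np.array([0.3, 0.3, 0.3, 0.3])
naive = np.round(row / 0.25) * 0.25      # plain rounding: [0.25] * 4
comp = quantize_with_compensation(row, 0.25)  # [0.25, 0.25, 0.5, 0.25]
```

Plain rounding sends every 0.3 to 0.25 and loses 0.2 of the row's total; the compensated version bumps one weight up to 0.5 and keeps the total within 0.05, illustrating why compensating later weights reduces overall quantization error.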
GGUF (formerly GGML)
GGUF is the format used by llama.cpp for CPU and mixed CPU/GPU inference. It supports a wide range of quantization levels from Q2 (2-bit) to Q8 (8-bit), with popular choices being Q4_K_M and Q5_K_M that provide good quality-size trade-offs. GGUF's strength is its ability to run on CPUs, making LLMs accessible even without a GPU.
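GGUF's quantization formats are block-based: Q4_0, for instance, stores each block of 32 weights as 4-bit integers plus one FP16 scale, which is why its effective cost is 4.5 bits per weight rather than exactly 4. A simplified sketch of the scheme (the real on-disk nibble packing differs):

```python
import numpy as np

BLOCK = 32  # Q4_0-style: each block of 32 weights shares one scale

def q4_block_quantize(weights: np.ndarray):
    """4-bit quantization with one scale per 32-weight block."""
    blocks = weights.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                     # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

# 32 four-bit values (16 bytes) plus one FP16 scale (2 bytes) per block:
bytes_per_weight = (BLOCK * 0.5 + 2) / BLOCK      # 0.5625 bytes = 4.5 bits
```

Per-block scales are what make this robust: an outlier weight only coarsens its own 32-weight block instead of the entire tensor.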
AWQ (Activation-Aware Weight Quantization)
AWQ observes that not all weights are equally important for model quality. It identifies "salient" weights (those corresponding to large activations) and protects them from aggressive quantization while quantizing less important weights more aggressively. This achieves better quality than uniform quantization at the same bit width.
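A toy illustration of the activation-aware idea (the tiny matrix and the square-root scaling rule are illustrative assumptions, not AWQ's actual search procedure): scaling a salient input channel up before quantization gives its weights a finer effective grid once the scale is folded back out, because W @ x equals (W * s) @ (x / s).

```python
import numpy as np

def quantize_int4(w: np.ndarray) -> np.ndarray:
    """Plain per-tensor INT4 quantize/dequantize round trip."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

# Toy layer: input channel 2 sees activations ~100x larger than the others.
W = np.array([[ 0.9 , -0.2 ,  0.1 ],
              [-0.8 ,  0.3 , -0.1 ],
              [ 0.7 , -0.25,  0.15]])
x_mag = np.array([0.1, 0.1, 10.0])   # per-channel activation magnitude

s = np.sqrt(x_mag)                   # bigger activations -> bigger scale
s /= s.mean()                        # normalize scales around 1
W_awq = quantize_int4(W * s) / s     # scale up, quantize, fold scale back
```

The output error `|(W_q - W) @ x|` drops roughly fivefold here versus quantizing W directly, because the salient channel, whose errors get multiplied by the large activation, is protected at the cost of the channels that barely matter.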
AQLM and QuIP
These methods push quantization below 4 bits using codebook-based approaches. They achieve reasonable quality at 2-3 bits per weight, enabling a 70B model to fit in under 25 GB.
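The codebook idea in miniature: store one small shared table of representative values and, for each weight, only the index of its nearest entry. AQLM and QuIP actually use learned, multi-dimensional codebooks (and, in AQLM's case, sums of several codes), so this scalar version just illustrates why storage drops to a few bits per weight:

```python
import numpy as np

def codebook_quantize(weights: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each weight, the index of its nearest codebook entry."""
    return np.abs(weights[:, None] - codebook[None, :]).argmin(axis=1)

codebook = np.array([-0.6, -0.2, 0.2, 0.6])   # 4 entries -> 2 bits per index
w = np.array([-0.55, 0.18, 0.31, -0.05])
idx = codebook_quantize(w, codebook)          # indices: [0, 2, 2, 1]
recon = codebook[idx]                         # [-0.6, 0.2, 0.2, -0.2]
```

With a 4-entry codebook each weight costs 2 bits; the codebook itself is shared across many weights, so its overhead is negligible.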
Quality Impact of Quantization
The impact on model quality depends on the quantization level:
- INT8: Typically less than 1% quality degradation on benchmarks. Nearly indistinguishable from FP16 in practice.
- INT4 (well-calibrated): 1-3% quality degradation. Still very usable for most applications.
- INT3 and below: 5-10%+ quality degradation. Noticeable but potentially acceptable for some use cases.
Larger models are generally more robust to quantization than smaller ones. Under a fixed memory budget, this favors quantizing a bigger model: a 70B model at INT4 (35 GB) typically outperforms a 34B model at FP16 (68 GB) while using roughly half the memory.
Practical Quantization Guide
Choosing the right quantization approach depends on your hardware and requirements:
- NVIDIA GPU with 24 GB VRAM: Use GPTQ or AWQ INT4 models. A 13B model fits easily; a 34B model fits with careful memory management.
- Apple Silicon Mac with 32-64 GB unified memory: Use GGUF format with llama.cpp or Ollama. Q4_K_M provides excellent quality. A 70B Q4 model can run on a 64 GB M-series Mac.
- CPU-only system: Use GGUF with llama.cpp. Expect slower generation but functional results. Q4_K_M for 7B-13B models works on most modern systems.
- Server deployment: Use FP8 or INT8 with TensorRT-LLM for the best speed-quality balance.
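A rough sizing check along these lines (the helper and its 20% overhead factor for KV cache and activations are assumptions, a loose rule of thumb rather than a guarantee, and real requirements grow with context length):

```python
def fits_in_vram(n_params_billion: float, bits: float, vram_gb: float,
                 overhead: float = 1.2) -> bool:
    """Rough check: weight storage plus a 20% allowance for KV cache
    and activations against available memory."""
    weight_gb = n_params_billion * bits / 8   # 1e9 params at 8 bits = 1 GB
    return weight_gb * overhead <= vram_gb

fits_in_vram(13, bits=4, vram_gb=24)   # 13B INT4 ~ 6.5 GB -> True
fits_in_vram(70, bits=4, vram_gb=24)   # 70B INT4 ~ 35 GB -> False
fits_in_vram(70, bits=4, vram_gb=64)   # fits in 64 GB unified memory -> True
```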
Key Takeaway
Quantization has democratized LLM access. INT4 quantization lets you run 70B models on consumer hardware with minimal quality loss, while INT8 is nearly lossless. Match your quantization method (GPTQ for GPU, GGUF for CPU/Mac) to your hardware.
Quantization continues to improve. New techniques push the boundary of how aggressively models can be compressed while maintaining quality, and hardware manufacturers are building native support for low-precision computation. The trend is clear: running powerful LLMs locally will only become easier and more accessible.
