AI Glossary

INT4/INT8 Quantization

Representing model weights or activations as 4-bit or 8-bit integers instead of 16/32-bit floats, dramatically reducing model size and inference cost.
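The core idea can be sketched as symmetric per-tensor quantization: pick a scale factor from the largest magnitude, map each float to a small integer, and multiply back by the scale to recover an approximation. This is an illustrative sketch, not any specific framework's implementation (real systems often use per-channel scales and calibration):

```python
# Minimal sketch of symmetric per-tensor INT8 quantization.
# Illustrative only; production frameworks use per-channel scales,
# calibration data, and zero-points for asymmetric schemes.

def quantize_int8(values):
    """Map floats to int8 range [-127, 127] using one scale factor."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [x * scale for x in q]

weights = [0.42, -1.37, 0.08, 2.15, -0.91]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, approx))
```

The rounding error per value is bounded by half the scale step, which is why quality loss stays small when the weight distribution is well behaved.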

INT8

Reduces model size by 4x vs FP32, with minimal quality loss for most models. Standard for production inference. Supported by TensorRT, ONNX Runtime, and most serving frameworks.

INT4

Reduces model size by 8x vs FP32. Quality loss is more noticeable than with INT8 but often acceptable. GPTQ and AWQ are popular INT4 quantization methods, and GGUF is a widely used file format for distributing 4-bit quantized models. Enables running 70B parameter models on consumer GPUs.
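The 8x figure follows from bit widths alone: two 4-bit values pack into one byte, versus four bytes per FP32 weight. A small sketch of the packing arithmetic (the layout here is illustrative and not the actual GPTQ or AWQ storage scheme, which also stores scales and group metadata):

```python
# Sketch of 4-bit packing: two unsigned 4-bit values (0-15) per byte.
# Illustrative layout only; real INT4 formats also store per-group
# scale factors, so effective bits-per-weight is slightly above 4.

def pack_int4(values):
    """Pack a list of 4-bit values into bytes, low nibble first."""
    packed = bytearray()
    for i in range(0, len(values), 2):
        lo = values[i] & 0x0F
        hi = (values[i + 1] & 0x0F) if i + 1 < len(values) else 0
        packed.append(lo | (hi << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    """Recover the original 4-bit values from packed bytes."""
    out = []
    for b in packed:
        out.append(b & 0x0F)
        out.append(b >> 4)
    return out[:count]

vals = [3, 15, 0, 7, 9]
packed = pack_int4(vals)           # 5 values fit in 3 bytes

# Size math for a 70B-parameter model:
fp32_gb = 70e9 * 4 / 1e9           # 280 GB
int4_gb = 70e9 * 0.5 / 1e9         # 35 GB
```

At roughly 35 GB of weights, a 70B model fits across one or two high-memory consumer GPUs, which is what makes INT4 attractive for local inference.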
Last updated: March 5, 2026