AI Glossary

Inference Cost

The computational expense of running a trained AI model to generate predictions or responses.

Overview

Inference cost refers to the computational resources (GPU time, memory, energy) required to generate outputs from a trained model. For LLMs, inference cost is usually measured per token and varies significantly with model size, hardware, and optimization techniques. For widely deployed models, cumulative inference costs typically exceed the one-time cost of training over the model's lifetime.
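Per-token pricing can be made concrete with a small calculator. The function and the prices below are illustrative assumptions, not any real vendor's rates; most LLM APIs bill input and output tokens at different per-1,000-token prices.

```python
# Hypothetical per-token cost estimator. The price values are
# made-up placeholders, not real API pricing.

def inference_cost(prompt_tokens: int, output_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request when billed per 1,000 input/output tokens."""
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# 500 prompt tokens plus 200 generated tokens at the assumed rates:
cost = inference_cost(500, 200, price_in_per_1k=0.001, price_out_per_1k=0.002)
print(f"${cost:.4f}")  # → $0.0009
```

Output tokens are often priced higher than input tokens because generation is sequential (one forward pass per token), while the prompt can be processed in a single batched pass.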

Optimization

Reducing inference cost typically involves one or more of the following techniques:

- Quantization: running arithmetic at lower precision (INT8, INT4) to cut memory and compute.
- Batching: processing multiple requests simultaneously to improve hardware utilization.
- Caching: reusing previously computed representations (e.g., the KV cache for attention).
- Distillation: training a smaller model to mimic a larger one, then serving the smaller model.
- Speculative decoding: using a fast draft model to propose tokens that the large model verifies.
- Hardware: deploying purpose-built inference accelerators.

The gap between training and inference efficiency drives much of the innovation in production AI systems.
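As a minimal sketch of the first technique, symmetric INT8 quantization maps floating-point weights onto integers in [-127, 127] using a single scale factor. This is pure-Python for illustration; production systems quantize whole tensors on the accelerator, often per-channel rather than with one global scale.

```python
# Minimal sketch of symmetric INT8 quantization (illustrative only).

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 1.0]
q, s = quantize_int8(w)
approx = dequantize(q, s)  # close to w, at 1/4 the storage of float32
```

The storage saving (8 bits vs. 32) reduces memory bandwidth, which is usually the bottleneck in LLM decoding, at the price of small rounding error in each weight.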


Last updated: March 5, 2026