AI Glossary

Model Serving

The infrastructure and process of deploying trained ML models to handle real-time or batch prediction requests in production environments.

Serving Patterns

Real-time: REST/gRPC APIs that return predictions in milliseconds (product recommendations, chatbots).
Batch: processing large datasets offline (daily report generation, bulk scoring).
Streaming: processing continuous data flows (fraud detection, IoT).
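The real-time and batch patterns can be contrasted in a minimal sketch. The `predict` function below is a stand-in scoring function, not any real framework's API; the point is only that real-time serving wraps one prediction per request while batch serving scores a whole dataset offline.

```python
def predict(features):
    """Stand-in model: score a single feature vector."""
    return sum(features) / len(features)

def handle_realtime_request(features):
    """Real-time pattern: one request in, one prediction out, low latency."""
    return {"prediction": predict(features)}

def run_batch_job(dataset):
    """Batch pattern: score an entire dataset offline, e.g. a nightly job."""
    return [predict(row) for row in dataset]
```

A streaming system applies the same `predict` call, but to records arriving continuously from a queue or event stream rather than from a request or a file.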

Infrastructure

NVIDIA Triton, TensorFlow Serving, TorchServe, vLLM (for LLMs), and managed services like SageMaker Endpoints, Vertex AI, and Replicate. For LLMs specifically, vLLM and TGI (Text Generation Inference) are standard.
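Most of these servers expose a similar HTTP interface. As one concrete example, TensorFlow Serving's REST predict API takes a JSON body with an `instances` key; the host and model name (`my_model`) below are placeholders, and this sketch only builds the request rather than sending it.

```python
import json

def build_predict_request(host, model_name, instances):
    """Build the URL and JSON body for a TF Serving REST predict call.

    TF Serving's REST API listens on port 8501 by default and expects
    a body of the form {"instances": [...]}.
    """
    url = f"http://{host}:8501/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return url, body

url, body = build_predict_request("localhost", "my_model", [[1.0, 2.0]])
print(url)  # http://localhost:8501/v1/models/my_model:predict
```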

Optimization

Common techniques include model quantization (reducing numeric precision), request batching (grouping requests to improve hardware utilization), caching (storing predictions for frequent inputs), GPU sharing, and autoscaling to handle variable load.
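Two of these techniques can be illustrated in pure Python. This is a toy sketch: real quantization happens tensor-wide inside frameworks, and real prediction caches sit in front of the model server, but the ideas are the same.

```python
from functools import lru_cache

def quantize_int8(weights):
    """Quantization sketch: map floats to int8 [-127, 127] with one scale.

    Trades precision for a ~4x smaller footprint versus float32.
    """
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in q_weights]

@lru_cache(maxsize=1024)
def cached_predict(features):
    """Caching sketch: repeated inputs skip the (stand-in) model entirely.

    `features` must be hashable, e.g. a tuple rather than a list.
    """
    return sum(features) / len(features)
```

Dequantized weights differ from the originals by at most half a quantization step, which is the precision loss the memory savings are paid for with.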


Last updated: March 5, 2026