Prompt Caching
Storing and reusing computed representations of repeated prompt prefixes to reduce latency and cost.
Overview
Prompt caching stores the transformer's computed key-value (KV) attention states for frequently used prompt prefixes, avoiding redundant computation when the same system prompt or context is reused across requests. This can reduce time-to-first-token by 50-90% and significantly lower API costs, since cached prefix tokens are typically billed at a discounted rate.
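The mechanism can be illustrated with a toy sketch. Here the expensive forward pass over the prefix is replaced by a stand-in function, and the cache is keyed by a hash of the prefix text; all helper names (`encode_prompt`, `_compute_state`) are hypothetical, not from any real serving stack, which would cache actual KV tensors rather than strings.

```python
import hashlib

# Cache of "computed state" keyed by a hash of the prompt prefix.
# A real server caches attention key-value tensors per token block.
_cache: dict[str, str] = {}

def _compute_state(text: str) -> str:
    # Stand-in for the expensive forward pass over a span of tokens.
    return f"kv-state({len(text)} chars)"

def encode_prompt(system_prefix: str, user_suffix: str) -> tuple[str, bool]:
    """Return the state for the full prompt and whether the prefix hit the cache."""
    key = hashlib.sha256(system_prefix.encode()).hexdigest()
    hit = key in _cache
    if not hit:
        _cache[key] = _compute_state(system_prefix)  # pay the prefix cost once
    # On a hit, only the (short) user suffix needs fresh computation.
    return _cache[key] + " + " + _compute_state(user_suffix), hit
```

Two requests sharing the same system prefix differ only in the suffix work: the first call misses and populates the cache, the second hits and skips the prefix computation entirely.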
Implementation
Cloud providers offer prompt caching for API users: Anthropic exposes it explicitly via cache-control breakpoints in the request, while OpenAI applies it automatically to sufficiently long repeated prompt prefixes. Self-hosted servers such as vLLM implement prefix caching for shared prompts. The technique is especially valuable for applications with long system prompts, RAG pipelines that reuse retrieved context, or batch processing with shared instructions.
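As a concrete sketch of the explicit style, the following builds an Anthropic Messages API request body with a cache breakpoint on the long system prompt. No request is sent; the model id and prompt text are illustrative placeholders, and exact field names should be checked against current provider documentation.

```python
# Long shared prefix: the part worth caching across many requests.
long_system_prompt = "You are a support agent. " + "Policy text... " * 200

request_body = {
    "model": "claude-sonnet-4-20250514",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_system_prompt,
            # Mark the prompt up to this block as cacheable; later requests
            # with an identical prefix read it from the provider-side cache.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Only this per-request suffix varies and is computed fresh each time.
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}
```

The key design point is that everything before the `cache_control` marker must be byte-identical across requests for the cache to hit, so variable content (user questions, retrieved documents that change per query) belongs after the breakpoint.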