Prompt Caching
Storing and reusing computed representations of repeated prompt prefixes to reduce latency and cost.
Overview
Prompt caching stores the transformer's computed key-value (KV) attention states for frequently used prompt prefixes, avoiding redundant computation when the same system prompt or context is reused across requests. This can reduce time-to-first-token by 50-90% and significantly lower API costs, since cached prefix tokens are typically billed at a discounted rate.
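The mechanism can be illustrated with a toy sketch. Here the expensive forward pass over the prefix is replaced by a stand-in function, and the cache is keyed by a hash of the prefix text; all helper names (`encode_prompt`, `_compute_state`) are hypothetical, not from any real serving stack, which would cache actual KV tensors rather than strings.

```python
import hashlib

# Cache of "computed state" keyed by a hash of the prompt prefix.
# A real server caches attention key-value tensors per token block.
_cache: dict[str, str] = {}

def _compute_state(text: str) -> str:
    # Stand-in for the expensive forward pass over a span of tokens.
    return f"kv-state({len(text)} chars)"

def encode_prompt(system_prefix: str, user_suffix: str) -> tuple[str, bool]:
    """Return the state for the full prompt and whether the prefix hit the cache."""
    key = hashlib.sha256(system_prefix.encode()).hexdigest()
    hit = key in _cache
    if not hit:
        _cache[key] = _compute_state(system_prefix)  # pay the prefix cost once
    # On a hit, only the (short) user suffix needs fresh computation.
    return _cache[key] + " + " + _compute_state(user_suffix), hit
```

Two requests sharing the same system prefix differ only in the suffix work: the first call misses and populates the cache, the second hits and skips the prefix computation entirely.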
Implementation
Cloud providers offer prompt caching for API users: Anthropic exposes it explicitly via cache-control breakpoints in the request, while OpenAI applies it automatically to sufficiently long repeated prompt prefixes. Self-hosted servers such as vLLM implement prefix caching for shared prompts. The technique is especially valuable for applications with long system prompts, RAG pipelines that reuse retrieved context, or batch processing with shared instructions.
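As a concrete sketch of the explicit style, the following builds an Anthropic Messages API request body with a cache breakpoint on the long system prompt. No request is sent; the model id and prompt text are illustrative placeholders, and exact field names should be checked against current provider documentation.

```python
# Long shared prefix: the part worth caching across many requests.
long_system_prompt = "You are a support agent. " + "Policy text... " * 200

request_body = {
    "model": "claude-sonnet-4-20250514",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_system_prompt,
            # Mark the prompt up to this block as cacheable; later requests
            # with an identical prefix read it from the provider-side cache.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Only this per-request suffix varies and is computed fresh each time.
    "messages": [{"role": "user", "content": "How do I reset my password?"}],
}
```

The key design point is that everything before the `cache_control` marker must be byte-identical across requests for the cache to hit, so variable content (user questions, retrieved documents that change per query) belongs after the breakpoint.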