Caching LLM responses is the single highest-leverage optimization for cost and latency in production AI systems. The cache miss path costs dollars per 1000 requests; the cache hit path costs fractions of a cent. Getting the right caching pattern for your workload is worth serious design attention. This post covers the five patterns we deploy — from trivial exact-match to hierarchical semantic caches — with the pitfalls that can poison responses if you're not careful.
Exact match cache
Hash the entire prompt and use the hash as the cache key. Hit rate is typically 5-15%, which is low because prompts vary in subtle ways. But it's trivial to implement and has zero false-positive risk: a hit means the prompt was byte-for-byte identical.
Best for: deterministic workflows where the same inputs recur (dashboard tiles, cron-based summaries, idempotent tool calls). Not useful for freeform chat or user-entered queries with high variance.
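A minimal sketch of the idea, assuming an in-memory dict as the store and a caller-supplied `call_model` function (both are illustrative stand-ins, not part of any particular stack):

```python
import hashlib


class ExactMatchCache:
    """In-memory exact-match cache keyed by the SHA-256 hash of the prompt."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def make_key(prompt: str) -> str:
        # Any byte-level difference in the prompt produces a different key.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self.make_key(prompt))

    def put(self, prompt: str, response: str) -> None:
        self._store[self.make_key(prompt)] = response


def cached_completion(cache, prompt, call_model):
    """Return the cached response if present; otherwise call the model and store it.

    `call_model` is a hypothetical callable (prompt -> response) supplied by the caller.
    """
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_model(prompt)
    cache.put(prompt, response)
    return response
```

In a real deployment the dict would likely be replaced by a shared store, but the key derivation stays the same.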
Normalized cache
Before hashing, normalize: lowercase, collapse whitespace, sort JSON fields, strip punctuation. Hit rate jumps to 15-25% for most workloads because trivial variations now hit the same key.
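It takes under an hour to implement on top of exact match and gives a noticeable improvement. Risk is low as long as the normalization preserves semantics: lowercasing and stripping punctuation are safe for natural-language questions, but they can merge distinct prompts that contain code, case-sensitive identifiers, or numbers such as version strings.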
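The normalization steps listed above could look roughly like this. The function names are hypothetical, and the punctuation rule uses Python's ASCII `string.punctuation` set, which is an assumption about what "strip punctuation" should cover:

```python
import json
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(prompt: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = prompt.lower()
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def normalize_json(payload) -> str:
    """Serialize structured input with sorted keys so field order never changes the key."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))
```

The normalized string, not the raw prompt, is what gets hashed. Note that `normalize_text("v1.5")` and `normalize_text("v15")` produce the same output, which is the kind of collision to watch for before enabling punctuation stripping on prompts that carry numbers or code.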
Template-based cache
For structured queries, parse out the template and the variables. Cache at the template level and substitute the variables per request. 'Summarize customer {X} profile' with X changing across requests can share one cached entry for the template, with the specifics filled in for each request.
Hit rates of 30-50% for heavily templated systems (support bots, internal tools, copilots with structured inputs). Requires query parsing or structured input channels — doesn't work for freeform prompts.
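One possible reading of this pattern, sketched under assumptions: the templates are known in advance and matched with regular expressions, and the cached value is a piece of text with `$X`-style placeholders. The registry, the `TemplateCache` class, and the cached "plan" text are all illustrative, not a description of a specific system.

```python
import re
from string import Template

# Hypothetical registry of known query templates; named groups capture the variables.
TEMPLATES = {
    "summarize_customer": re.compile(r"^Summarize customer (?P<X>.+) profile$"),
}


def parse_template(query: str):
    """Return (template_id, variables) for the first matching template, else None."""
    for template_id, pattern in TEMPLATES.items():
        match = pattern.match(query)
        if match:
            return template_id, match.groupdict()
    return None


class TemplateCache:
    """Caches one entry per template and fills in variables at lookup time."""

    def __init__(self):
        self._store = {}  # template_id -> text containing $placeholders

    def put(self, template_id: str, cached_text: str) -> None:
        self._store[template_id] = cached_text

    def lookup(self, query: str):
        parsed = parse_template(query)
        if parsed is None:
            return None  # freeform prompt: this layer does not apply
        template_id, variables = parsed
        cached_text = self._store.get(template_id)
        if cached_text is None:
            return None
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        return Template(cached_text).safe_substitute(variables)
```

What exactly is safe to share across variable values (a plan, a prompt scaffold, a response skeleton) depends on the workload; content that depends on the specific customer still has to be generated or fetched per request.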
Semantic cache
Embed the query; find cached queries with cosine similarity > threshold (we use 0.97 typically); return the cached response. Hit rates of 40-60% in typical chat/support workloads.
Biggest win, biggest risk. Two queries with high similarity can still have different correct answers — 'What is our return policy?' and 'What is our return policy in California?' are similar but answer differently. Guardrails: high similarity threshold, eval coverage for cache hits, quick kill-switch if issues surface. See semantic caching post.
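A rough sketch of the lookup, assuming a caller-supplied `embed` function (hypothetical; any embedding model could sit behind it) and a linear scan over cached vectors. A production system would more likely use a vector index, but the threshold logic and the kill switch are the same:

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a cached response when the nearest cached query exceeds the threshold."""

    def __init__(self, embed, threshold: float = 0.97, enabled: bool = True):
        self._embed = embed          # hypothetical callable: query -> list of floats
        self.threshold = threshold   # 0.97 is the value quoted in the text
        self.enabled = enabled       # kill switch: set to False to bypass the cache
        self._entries = []           # list of (vector, response)

    def put(self, query: str, response: str) -> None:
        self._entries.append((self._embed(query), response))

    def get(self, query: str):
        if not self.enabled:
            return None
        vector = self._embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self._entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None
```

Logging `best_score` alongside each hit makes it easier to build the eval coverage mentioned above, since borderline hits can be sampled and reviewed.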
Hierarchical cache
Try exact match first (cheapest), then template, then semantic. Each layer protects the next: exact and template hits are safe; semantic gets fewer chances to produce wrong answers because most hits are resolved earlier.
Hit rates of 50-70% in practice, with most hits landing on the safer layers. Deployment complexity is higher but the economics justify it for high-volume systems.
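The layering can be sketched as an ordered list of lookup functions tried in turn. The layer names and the `(name, lookup)` tuple shape are assumptions for illustration; each `lookup` is any callable that returns a response or `None`:

```python
from collections import Counter


def hierarchical_lookup(query, layers, hit_counts=None):
    """Try each (name, lookup) layer in order and return the first hit.

    Returns (layer_name, response), or (None, None) if every layer misses.
    If `hit_counts` (a Counter) is given, the serving layer is tallied in it.
    """
    for name, lookup in layers:
        response = lookup(query)
        if response is not None:
            if hit_counts is not None:
                hit_counts[name] += 1
            return name, response
    return None, None


# Illustrative wiring with plain dicts standing in for the real caches.
exact_store = {"what is our return policy?": "30 days."}
semantic_store = {"return policy": "30 days."}
demo_layers = [
    ("exact", exact_store.get),
    ("template", lambda query: None),  # placeholder layer that always misses
    ("semantic", semantic_store.get),
]
demo_counts = Counter()
demo_result = hierarchical_lookup("what is our return policy?", demo_layers, demo_counts)
```

Tallying hits per layer shows how much traffic actually reaches the semantic layer, which is the number that matters for judging false-positive exposure.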
Cache poisoning and multi-tenancy
Cache keys must include tenant_id. Always. Two tenants asking 'What's our HR policy?' must get different answers. Semantic caches that ignore the tenant dimension become cross-tenant leaks. See multi-tenancy post.
Additional key components: user role (if it affects answers), model version, and any generation config that changes the output (temperature, system prompt version). Leaving any of these out creates subtle bugs that are hard to trace later.
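One way to make these components hard to forget is to require them as a single scope object when building the key. This is a sketch: `CacheScope` and `scoped_key` are hypothetical names, and the field list mirrors the components named above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CacheScope:
    """Everything besides the prompt that can change the correct answer."""
    tenant_id: str
    user_role: str
    model_version: str
    temperature: float
    system_prompt_version: str


def scoped_key(scope: CacheScope, prompt: str) -> str:
    """Hash the scope and prompt together so no component can be silently omitted."""
    payload = json.dumps(
        {"scope": asdict(scope), "prompt": prompt},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

For semantic caches a hashed key is not enough on its own, because lookups are by similarity rather than by key. There the same scope would need to partition the index (or filter candidates) so that a query is only ever compared against entries from the same tenant and configuration.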
Invalidation — the hard problem
When the underlying data changes, caches go stale. RAG caches are especially exposed: if the knowledge base has been updated, cached answers based on the old content are wrong.
Patterns: TTL-based expiry (simple, blunt); event-driven invalidation (precise, requires instrumentation on data sources); versioned cache keys (rebuild cache on data version bump). We default to versioned keys for RAG caches and TTL for general semantic caches.
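A small sketch combining two of these patterns, versioned keys and TTL expiry. The class name and the idea of passing `data_version` explicitly are assumptions; the injectable `clock` is there only to make expiry testable:

```python
import time


class VersionedTTLCache:
    """Entries are keyed by (key, data_version) and also expire after a TTL."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so tests can control time
        self._store = {}      # (key, data_version) -> (response, expires_at)

    def put(self, key: str, data_version: str, response: str) -> None:
        expires_at = self._clock() + self._ttl
        self._store[(key, data_version)] = (response, expires_at)

    def get(self, key: str, data_version: str):
        entry = self._store.get((key, data_version))
        if entry is None:
            return None  # never cached, or cached under an older data version
        response, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[(key, data_version)]  # lazily evict expired entries
            return None
        return response
```

With this shape, bumping the data version invalidates everything at once without touching the store: old entries simply stop being read. They still occupy memory until they expire or are evicted, so the TTL doubles as garbage collection.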
Measuring cache effectiveness
Track per-cache: hit rate, bytes stored, inferred cost savings vs no-cache, user-visible latency improvement, any detected stale-response incidents. Caches that don't pay back are surprisingly common — a 5% hit rate rarely justifies the complexity.
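A minimal sketch of per-cache counters, assuming costs are supplied by the caller as average cost per miss and per hit (the field and method names are hypothetical). Latency tracking is omitted to keep the example short:

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Counters for one cache; costs are passed in rather than measured here."""
    hits: int = 0
    misses: int = 0
    bytes_stored: int = 0
    stale_incidents: int = 0

    def record_lookup(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def record_store(self, response: str) -> None:
        # Size of the stored response in bytes, assuming UTF-8 encoding.
        self.bytes_stored += len(response.encode("utf-8"))

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def estimated_savings(self, cost_per_miss: float, cost_per_hit: float) -> float:
        """Inferred savings vs. no cache: each hit avoided one miss-path call."""
        return self.hits * (cost_per_miss - cost_per_hit)
```

Comparing `estimated_savings` against the cost of running the cache (storage, embedding calls for semantic layers, engineering time) is what tells you whether a given layer pays for itself.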