eazyware
Engineering·November 11, 2024·11 min read

LLM caching patterns: from exact-match to semantic caches

Exact-match, normalized, template, semantic, and hierarchical caches — when each fits and how to implement without poisoning responses.

Kushal R.
Engineering lead

Caching LLM responses is the single highest-leverage optimization for cost and latency in production AI systems. The cache miss path costs dollars per 1000 requests; the cache hit path costs fractions of a cent. Getting the right caching pattern for your workload is worth serious design attention. This post covers the five patterns we deploy — from trivial exact-match to hierarchical semantic caches — with the pitfalls that can poison responses if you're not careful.

Five caching patterns
Figure: LLM caching, five patterns
1. Exact match: key = hash(full prompt); hit rate 5-15%; simple, low yield
2. Normalized: lowercase, dedupe whitespace; hit rate 15-25%; 10 min to implement
3. Template-based: cache templates, fill variables; hit rate 30-50%; structured queries only
4. Semantic: embed query, similarity > 0.97; hit rate 40-60%; needs eval guard
5. Hierarchical: exact → template → semantic; hit rate 50-70%; best cost / latency
Cache keys: tenant_id (always), user role (if it matters), model version
Poisoning risks: wrong answers served to wrong tenants; test isolation before shipping the semantic layer
Exact match is trivial, semantic is powerful but risky, and hierarchical combines all three for the best hit rate. Always include tenant_id in cache keys.

Exact match cache

Hash the entire prompt; use as cache key. Hit rate is typically 5-15% — low, because prompts vary in subtle ways. But it's trivial to implement and has zero false-positive risk.

Best for: deterministic workflows where the same inputs recur (dashboard tiles, cron-based summaries, idempotent tool calls). Not useful for freeform chat or user-entered queries with high variance.
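
A minimal sketch of the exact-match pattern, assuming an in-memory dict as the store and a hypothetical `call_model` callable standing in for the real LLM client. For brevity the key here covers only the prompt; the tenant and config scoping discussed later in this post would be added to it in practice.

```python
import hashlib
from typing import Callable, Dict


class ExactMatchCache:
    """In-memory exact-match cache keyed by a SHA-256 hash of the full prompt."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(prompt: str) -> str:
        # Any byte-level difference in the prompt yields a different key.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_call(self, prompt: str, call_model: Callable[[str], str]) -> str:
        key = self.key_for(prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Miss: pay for the model call once, then remember the response.
        self.misses += 1
        response = call_model(prompt)
        self._store[key] = response
        return response
```

Because the key is a hash of the exact bytes, "a" and "A" are separate entries, which is why the hit rate stays low on freeform input.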

Normalized cache

Before hashing, normalize: lowercase, collapse whitespace, sort JSON fields, strip punctuation. Hit rate jumps to 15-25% for most workloads because trivial variations now hit the same key.

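This takes under an hour to implement on top of exact match and gives a noticeable improvement. The risk is low as long as normalization preserves semantics; lowercasing or stripping punctuation can change meaning for inputs such as code or case-sensitive identifiers, so exclude those from this layer.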
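
The normalization steps listed above could be sketched as follows. The function names are illustrative, and `string.punctuation` covers ASCII punctuation only, so this is a starting point under those assumptions and not a complete normalizer.

```python
import hashlib
import json
import re
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(prompt: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    lowered = prompt.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()


def normalize_json(payload: dict) -> str:
    """Serialize with sorted keys so field order does not change the cache key."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def normalized_key(prompt: str) -> str:
    """Hash the normalized prompt; trivial variants now share one key."""
    return hashlib.sha256(normalize_text(prompt).encode("utf-8")).hexdigest()
```

With this in place, "Hello,  world!" and "hello world" map to the same key.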

Template-based cache

For structured queries, parse out the template and variables, then cache at the template level with variable substitution. For example, requests of the form 'Summarize customer {X} profile', with X changing across requests, can share one cached template-level result and fill in the specifics for each request.

Hit rates of 30-50% for heavily templated systems (support bots, internal tools, copilots with structured inputs). Requires query parsing or structured input channels — doesn't work for freeform prompts.
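
One possible reading of this pattern, sketched under assumptions: a hypothetical registry of regex templates extracts the variables, a response template is cached once per template id, and the variables are substituted on each hit. The `TEMPLATES` registry, the `build_template` callable, and the `$customer` placeholder syntax are all illustrative and not taken from a specific system.

```python
import re
from string import Template
from typing import Callable, Dict, Optional, Tuple

# Hypothetical registry: template id -> pattern with named groups for the variables.
TEMPLATES: Dict[str, "re.Pattern[str]"] = {
    "summarize_customer": re.compile(
        r"^summarize customer (?P<customer>.+?) profile$", re.IGNORECASE
    ),
}


def parse_template(prompt: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (template_id, variables) if the prompt matches a known template."""
    text = prompt.strip()
    for template_id, pattern in TEMPLATES.items():
        match = pattern.match(text)
        if match:
            return template_id, match.groupdict()
    return None


class TemplateCache:
    """Caches one response template per template id and fills variables on each hit."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_build(
        self, prompt: str, build_template: Callable[[str], str]
    ) -> Optional[str]:
        parsed = parse_template(prompt)
        if parsed is None:
            return None  # Freeform prompt: this layer does not apply.
        template_id, variables = parsed
        if template_id not in self._store:
            # Built once per template, not once per variable value.
            self._store[template_id] = build_template(template_id)
        return Template(self._store[template_id]).safe_substitute(variables)
```

Returning `None` for unparsed prompts lets the caller fall through to another cache layer or to the model.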

Semantic cache

Embed the query; find cached queries with cosine similarity > threshold (we use 0.97 typically); return the cached response. Hit rates of 40-60% in typical chat/support workloads.

Biggest win, biggest risk. Two queries with high similarity can still have different correct answers — 'What is our return policy?' and 'What is our return policy in California?' are similar but answer differently. Guardrails: high similarity threshold, eval coverage for cache hits, quick kill-switch if issues surface. See semantic caching post.
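
A minimal sketch of the lookup described above, assuming a hypothetical `embed` callable that returns a vector for a query and a linear scan over stored entries. A production system would more likely use a vector index; the `enabled` flag is an illustrative stand-in for the kill-switch.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a cached response when a stored query is similar enough to the new one."""

    def __init__(
        self, embed: Callable[[str], Sequence[float]], threshold: float = 0.97
    ) -> None:
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []
        self.enabled = True  # Kill-switch: set to False to bypass the cache.

    def store(self, query: str, response: str) -> None:
        self._entries.append((list(self._embed(query)), response))

    def lookup(self, query: str) -> Optional[str]:
        if not self.enabled or not self._entries:
            return None
        vector = self._embed(query)
        # Linear scan for the most similar stored query.
        best_score, best_response = max(
            ((cosine_similarity(vector, v), r) for v, r in self._entries),
            key=lambda pair: pair[0],
        )
        return best_response if best_score > self._threshold else None
```

The threshold only bounds similarity, not correctness: the return-policy example above could still clear 0.97 with some embedding models, which is why eval coverage on cache hits matters.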

Hierarchical cache

Try exact match first (cheapest), then template, then semantic. Each layer protects the next: exact and template hits are safe; semantic gets fewer chances to produce wrong answers because most hits are resolved earlier.

Hit rates of 50-70% in practice, with most hits landing on the safer layers. Deployment complexity is higher but the economics justify it for high-volume systems.
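
The layered lookup can be sketched as a loop over ordered lookup functions. The layer names and the `call_model` and `on_miss` callables are assumptions for illustration; each lookup returns `None` on a miss.

```python
from typing import Callable, List, Optional, Tuple

Lookup = Callable[[str], Optional[str]]


def hierarchical_get(
    prompt: str,
    layers: List[Tuple[str, Lookup]],
    call_model: Callable[[str], str],
    on_miss: Optional[Callable[[str, str], None]] = None,
) -> Tuple[str, str]:
    """Try each layer in order; return (response, name of the layer that answered)."""
    for name, lookup in layers:
        cached = lookup(prompt)
        if cached is not None:
            return cached, name
    # Every layer missed: call the model and let the caller populate the caches.
    response = call_model(prompt)
    if on_miss is not None:
        on_miss(prompt, response)
    return response, "miss"
```

Returning the layer name alongside the response makes it straightforward to report hit rates per layer and to confirm that most hits land on the safer ones.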

Cache poisoning and multi-tenancy

Cache keys must include tenant_id. Always. Two tenants asking 'What's our HR policy?' must get different answers. Semantic caches that ignore the tenant dimension become cross-tenant leaks. See multi-tenancy post.

Additional cache key components: user role (if it affects answers), model version, and generation config that changes outputs (temperature, system prompt version). Leaving any of these out creates subtle bugs that are hard to trace later.
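
A sketch of a scoped key builder that refuses to produce a key without a tenant. The `CacheScope` type and its field names are illustrative assumptions covering the components listed above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CacheScope:
    """Everything besides the prompt that can change the correct answer."""

    tenant_id: str
    model_version: str
    system_prompt_version: str
    temperature: float
    user_role: str = ""  # Leave empty when role does not affect answers.


def scoped_key(scope: CacheScope, prompt: str) -> str:
    if not scope.tenant_id:
        # Fail loudly: a key without a tenant is a cross-tenant leak waiting to happen.
        raise ValueError("tenant_id is required in every cache key")
    payload = json.dumps({"scope": asdict(scope), "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

For a semantic cache, hashing alone is not enough: the similarity search itself has to be restricted to entries that share the same scope, for example by keeping a separate index per scope.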

Invalidation — the hard problem

When the underlying data changes, caches go stale. RAG caches are especially exposed: if the knowledge base has been updated, cached answers based on the old content are wrong.

Patterns: TTL-based expiry (simple, blunt); event-driven invalidation (precise, requires instrumentation on data sources); versioned cache keys (rebuild cache on data version bump). We default to versioned keys for RAG caches and TTL for general semantic caches.
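
Two of these patterns, TTL expiry and versioned keys, can be sketched briefly. The `data_version` string is assumed to come from whatever tracks knowledge base updates, and the `:kb=` key format is illustrative.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Entries expire a fixed number of seconds after they are written."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # Injectable so expiry can be tested without sleeping.
        self._store: Dict[str, Tuple[float, str]] = {}

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # Lazily evict the expired entry.
            return None
        return value


def versioned_key(base_key: str, data_version: str) -> str:
    """Bumping data_version makes every old entry unreachable without deleting it."""
    return f"{base_key}:kb={data_version}"
```

Versioned keys leave old entries in the store until they are evicted some other way, so pairing them with a TTL or a size limit keeps storage bounded.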

Measuring cache effectiveness

Track per-cache: hit rate, bytes stored, inferred cost savings vs no-cache, user-visible latency improvement, any detected stale-response incidents. Caches that don't pay back are surprisingly common — a 5% hit rate rarely justifies the complexity.
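
A small sketch of per-cache counters for hit rate and inferred savings. The per-request costs are inputs the caller supplies, not measured figures, and bytes stored and latency are left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Counters for one cache layer."""

    hits: int = 0
    misses: int = 0
    stale_incidents: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def estimated_savings(self, miss_cost: float, hit_cost: float) -> float:
        """Inferred savings vs no cache: each hit avoids one miss-path request."""
        return self.hits * (miss_cost - hit_cost)
```

Comparing `estimated_savings` against the cost of running and maintaining the cache gives a direct check on whether a given layer pays back.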

Read next
Semantic caching cut our biggest client's LLM bill 43%
Total cost of ownership for LLM systems
Multi-tenancy for AI applications: isolation patterns