A user asks: 'What's your refund policy?' Another user asks: 'How do refunds work here?' Another: 'Can I get my money back?' All three get the same answer. All three should share one LLM call, not three. That's the one-sentence pitch for semantic caching — and it reliably cuts 30-50% of LLM cost in repetitive workflows, sometimes more. For one of our largest clients, semantic caching reduced their LLM bill 43% in six weeks.

Lookup flow

Embed the query, then find the nearest cached entry by cosine similarity. A hit returns in ~80ms; a miss falls through to the LLM, and the result is cached.

Why exact-match caching fails

Traditional caching matches on an exact key. LLM inputs are natural language, and users rarely phrase the same question the same way twice, so exact-match caching hits maybe 1-5% of requests. Semantic caching matches on meaning via embeddings, hitting 20-50% on typical workloads.
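To make the failure concrete, here is a minimal sketch of an exact-match cache key applied to the three refund questions from the introduction. The `exact_match_key` helper and its normalization (lowercasing, collapsing whitespace) are illustrative assumptions, not a description of any particular cache.

```python
import hashlib


def exact_match_key(query: str) -> str:
    """Build a key the way a traditional cache might: normalize, then hash."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


queries = [
    "What's your refund policy?",
    "How do refunds work here?",
    "Can I get my money back?",
]

# One key per distinct phrasing: the same intent never lands on the same entry.
keys = {exact_match_key(q) for q in queries}
print(f"{len(queries)} queries, {len(keys)} distinct cache keys")
```

Normalization only helps with casing and spacing. It cannot tell that three differently worded questions want the same answer, which is the gap embeddings fill.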

How semantic caching works

On a request: embed the query. Look up the nearest neighbor in the cache. If similarity exceeds a threshold (typically 0.93-0.97 for most embedding models), return the cached response. Otherwise call the LLM and cache the result.
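The request flow above can be sketched in a few lines. This is a minimal in-memory version under stated assumptions: `embed` and `call_llm` are hypothetical callables you would supply (an embedding model and an LLM client), and the linear scan stands in for whatever vector index you actually use.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory semantic cache with a brute-force nearest-neighbor lookup."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed          # hypothetical embedding function
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries: List[Tuple[List[float], str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the nearest cached response if it clears the threshold."""
        query_vec = self.embed(query)
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.entries.append((list(self.embed(query)), response))


def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Serve from cache on a hit; otherwise call the LLM and cache the result."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.store(query, response)
    return response
```

A linear scan is fine for a few thousand entries. Beyond that, the same interface can sit on top of an approximate nearest-neighbor index.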

Key parameters: similarity threshold (higher = fewer false positives, lower hit rate; lower = more hits, more risk of wrong cache hits), cache size (how many entries to store), and TTL (when cached entries expire).
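As an illustration of how the three parameters fit together, the sketch below bundles them into one config object and enforces cache size and TTL in a small entry store. The names (`CacheConfig`, `BoundedTTLStore`) and the default size and TTL values are assumptions chosen for the example, not recommendations from our deployments.

```python
import time
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class CacheConfig:
    similarity_threshold: float = 0.95  # higher: fewer false hits, lower hit rate
    max_entries: int = 10_000           # assumed default; size to your memory budget
    ttl_seconds: float = 24 * 3600      # assumed default; tune to how fast answers go stale


class BoundedTTLStore:
    """Holds (vector, response) entries, capped in size and expired by age."""

    def __init__(self, config: CacheConfig, clock: Callable[[], float] = time.monotonic):
        self.config = config
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: "OrderedDict[int, Tuple[float, List[float], str]]" = OrderedDict()
        self._next_id = 0

    def add(self, vector: Sequence[float], response: str) -> None:
        self._entries[self._next_id] = (self.clock(), list(vector), response)
        self._next_id += 1
        # Enforce the size cap by evicting the oldest insertions first.
        while len(self._entries) > self.config.max_entries:
            self._entries.popitem(last=False)

    def live_entries(self) -> List[Tuple[List[float], str]]:
        """Drop entries older than the TTL and return what is still valid."""
        now = self.clock()
        expired = [
            entry_id
            for entry_id, (created_at, _, _) in self._entries.items()
            if now - created_at > self.config.ttl_seconds
        ]
        for entry_id in expired:
            del self._entries[entry_id]
        return [(vec, resp) for _, vec, resp in self._entries.values()]
```

Eviction here is first-in, first-out for simplicity. A least-recently-used policy would keep popular entries alive longer, at the cost of tracking reads.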

Tuning the threshold

The threshold is where this goes right or wrong. Too high: minimal hit rate, minimal savings. Too low: cache hits on semantically-similar-but-actually-different questions, users get wrong answers.

Start at 0.95 for OpenAI text-embedding-3-large or equivalent. Eval with a curated set of paraphrase pairs (should cache-hit) and distinct-but-similar pairs (should not cache-hit). Tune the threshold until the confusion matrix looks right. Usually lands between 0.93 and 0.96.
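One way to run that eval is to score every labeled pair once, then sweep candidate thresholds and read off the confusion matrix at each. The sketch below assumes you have already computed a cosine similarity per pair; the function names and the "lowest threshold within a false-positive budget" selection rule are illustrative choices, not the only reasonable ones.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# (similarity between the two queries, True if the pair SHOULD share a cache entry)
LabeledScore = Tuple[float, bool]


def confusion_at(threshold: float, scored_pairs: Iterable[LabeledScore]) -> Dict[str, int]:
    """Count true/false positives/negatives if `threshold` were the hit cutoff."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for similarity, should_hit in scored_pairs:
        predicted_hit = similarity >= threshold
        if predicted_hit and should_hit:
            counts["tp"] += 1   # paraphrase correctly served from cache
        elif predicted_hit and not should_hit:
            counts["fp"] += 1   # wrong answer served: the dangerous case
        elif should_hit:
            counts["fn"] += 1   # missed saving
        else:
            counts["tn"] += 1   # distinct question correctly sent to the LLM
    return counts


def pick_threshold(
    scored_pairs: List[LabeledScore],
    candidates: Iterable[float],
    max_false_positives: int = 0,
) -> Optional[float]:
    """Lowest candidate whose false positives fit the budget (maximizes hits)."""
    for threshold in sorted(candidates):
        if confusion_at(threshold, scored_pairs)["fp"] <= max_false_positives:
            return threshold
    return None
```

Because false positives can only fall as the threshold rises, the lowest threshold inside the budget is also the one with the highest hit rate on the eval set.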

What to cache (and what not to)

Cache: factual queries with stable answers (policy questions, product information, definitional queries). Skip caching: personalized responses, time-sensitive queries, queries depending on user state.

Implementation tip: tag every query at the application layer as 'cacheable' or 'not.' Caching without this filter accidentally caches personalized responses and returns them to the wrong user — catastrophic. Explicit cacheability flags prevent this.
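A minimal sketch of that flag, assuming the application layer decides cacheability before the request reaches the cache. The `Query` type and `handle` function are hypothetical names, and `cache_get` / `cache_put` stand in for the semantic lookup and store.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Query:
    text: str
    cacheable: bool  # set explicitly by the application layer, never inferred here


def handle(
    query: Query,
    cache_get: Callable[[str], Optional[str]],
    cache_put: Callable[[str, str], None],
    call_llm: Callable[[str], str],
) -> str:
    """Route a query through the cache only when it is flagged cacheable."""
    if not query.cacheable:
        # Personalized or time-sensitive: never read from or write to the cache.
        return call_llm(query.text)
    cached = cache_get(query.text)
    if cached is not None:
        return cached
    response = call_llm(query.text)
    cache_put(query.text, response)
    return response
```

The important property is that a non-cacheable query skips both the read and the write. Skipping only the read would still leave a personalized response in the cache for someone else to hit.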

Per-tenant caches

In multi-tenant systems, segment caches by tenant. A query from tenant A should never match a cache entry from tenant B. Even if the queries are semantically identical, the responses are tenant-scoped. This feels obvious, but it is frequently skipped, and skipping it causes confusing incidents.
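A simple way to guarantee this is to make the tenant ID part of every lookup and store call, so there is no code path that searches across tenants. The sketch below is an in-memory illustration; `TenantScopedCache` and the `embed` callable are hypothetical.

```python
import math
from collections import defaultdict
from typing import Callable, DefaultDict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TenantScopedCache:
    """Keeps one isolated entry list per tenant; lookups never cross tenants."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed  # hypothetical embedding function
        self.threshold = threshold
        self._by_tenant: DefaultDict[str, List[Tuple[List[float], str]]] = defaultdict(list)

    def store(self, tenant_id: str, query: str, response: str) -> None:
        self._by_tenant[tenant_id].append((list(self.embed(query)), response))

    def lookup(self, tenant_id: str, query: str) -> Optional[str]:
        query_vec = self.embed(query)
        best_score, best_response = -1.0, None
        # Only this tenant's entries are candidates, however similar others are.
        for vec, response in self._by_tenant.get(tenant_id, []):
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

With a shared vector database, the equivalent is a separate namespace or collection per tenant, or a mandatory tenant filter on every query.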

Cache invalidation

When the underlying knowledge changes (document updated, policy revised), cached responses based on old knowledge are stale. Patterns: TTL-based expiration (simplest, guarantees eventual freshness), event-based invalidation (invalidate when source documents change — tight but more complex), or explicit versioning (embed a knowledge-version in cache keys).
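The event-based and versioning patterns can be sketched together. In this illustration each entry records the knowledge version it was built under and the source documents it drew on; `InvalidatingStore` and its method names are hypothetical, and the plain string key stands in for the semantic lookup.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Entry:
    response: str
    knowledge_version: int          # version of the knowledge base at write time
    source_doc_ids: FrozenSet[str]  # documents the response was derived from


class InvalidatingStore:
    def __init__(self) -> None:
        self.knowledge_version = 1
        self._entries: Dict[str, Entry] = {}

    def put(self, key: str, response: str, source_doc_ids: Iterable[str]) -> None:
        self._entries[key] = Entry(response, self.knowledge_version, frozenset(source_doc_ids))

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        # Explicit versioning: entries from an older version never match.
        if entry is None or entry.knowledge_version != self.knowledge_version:
            return None
        return entry.response

    def on_document_changed(self, doc_id: str) -> int:
        """Event-based invalidation: drop every entry built from `doc_id`."""
        stale = [k for k, e in self._entries.items() if doc_id in e.source_doc_ids]
        for key in stale:
            del self._entries[key]
        return len(stale)

    def bump_version(self) -> None:
        """Invalidate everything at once, e.g. after a full re-index."""
        self.knowledge_version += 1
```

Event-based invalidation needs you to know which documents fed each response, which is the extra complexity mentioned above. Version bumps need no such bookkeeping but throw away the whole cache.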

Tools

Redis + custom embedding layer: our most common deployment. Redis for fast key-value, small embedding index on top.
Vector databases: Pinecone, Qdrant with TTL features.
GPTCache: open-source semantic cache purpose-built for LLM applications. Quick to start with and serviceable in production.
Langfuse: has semantic caching as a feature in enterprise tiers.

Real numbers

From six client deployments over 12 months (average):

Cache hit rate: 32% (ranged 18% to 51%).
LLM cost reduction: 43% (ranged 25% to 62%).
Latency improvement on cache hits: 1.8s → 80ms.
Cache infrastructure cost: $100-$500/month for most deployments.
Time to break-even: 2-4 weeks from implementation.

Closing

Semantic caching is near-free money for any AI system with repetitive queries. A first version takes 2-4 weeks to implement. ROI is often >10x in the first year. Combined with multi-model routing, caching covers a huge fraction of the cost-optimization frontier. See the full TCO post for how caching fits the broader cost picture.

