eazyware
Engineering·July 29, 2024·9 min read

Prompt cache warming: getting 90% hit rates from cold start

Provider prompt caches can cut the cost of repeated prefixes by up to 90%. How to warm them intentionally, measure hit rates, and design prompts for maximum cacheability.

Kushal R.
Engineering lead

Prompt caching at the provider level (Anthropic, OpenAI, Google) is the easiest 50-90% input-cost cut available. The catch is cold starts: a fresh deployment or an idle cache pays full price on the first requests while the cache populates. Warming the cache intentionally avoids this. This post covers the specific warming patterns we deploy, how to measure hit rates, and the prompt design principles that maximize cacheability.

Cold vs warmed
[Figure: Prompt cache warming, 90% hit rate pattern. Two panels, "Cold start (no warming)" and "Warmed cache", above a three-step warming strategy: 1. Identify stable prompt prefixes (system, tool defs, examples). 2. Send warm-up requests using these prefixes post-deploy. 3. Monitor the cache_read_tokens metric: track hit rate, alert on drops.]
Cold start pays full input cost; hit rate rises over minutes. Warmed cache achieves 90% hit rate from t=0. Warming strategy: pre-populate with stable prompt prefixes.

How provider caching works

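Providers cache prompt prefixes. When a request's prefix matches a recently cached prefix, the provider bills the cached portion at a fraction of the normal input token price: roughly 10-25% in the best case, with the exact discount depending on provider and model. Anthropic's prompt caching, OpenAI's prompt caching, and Google's context caching all rest on this principle, though the mechanics differ. Anthropic's API uses explicit cache_control markers and charges a premium on the request that writes the cache, while OpenAI caches eligible prefixes automatically.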

Cache lifetimes vary. The default is typically 5-10 minutes of inactivity, after which the entry expires. Each hit within that window extends the lifetime.

Cache hits depend on byte-exact prefix match. If your system prompt has a timestamp that changes every request, you defeat caching. Design for cacheability: stable context at the top, variable content at the bottom.
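To make the timestamp problem concrete, here is a minimal sketch comparing two prompt layouts. The names (`SYSTEM`, `timestamp_first`, `timestamp_last`) are illustrative, and character-level prefix length is only a stand-in for the token-level matching providers actually perform.

```python
from os.path import commonprefix  # works character-wise on any list of strings

# Hypothetical stable instructions; a real system prompt would be far longer.
SYSTEM = "You are a support assistant. Answer using the policy documents below.\n"

def timestamp_first(ts: str, query: str) -> str:
    # Volatile value at the top: requests diverge almost immediately.
    return f"Current time: {ts}\n{SYSTEM}User: {query}"

def timestamp_last(ts: str, query: str) -> str:
    # Stable block first, volatile values after it.
    return f"{SYSTEM}Current time: {ts}\nUser: {query}"

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(commonprefix([a, b]))

# Two requests one second apart, same question.
bad = shared_prefix_len(
    timestamp_first("2024-07-29T09:00:00", "Where is my order?"),
    timestamp_first("2024-07-29T09:00:01", "Where is my order?"),
)
good = shared_prefix_len(
    timestamp_last("2024-07-29T09:00:00", "Where is my order?"),
    timestamp_last("2024-07-29T09:00:01", "Where is my order?"),
)
```

With the timestamp first, the shared prefix ends inside the timestamp itself and never reaches the instructions. With the timestamp last, the whole of `SYSTEM` is shared and therefore cacheable.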

The cold start problem

After a deployment, the cache is empty, so the first N requests pay full input cost. If those requests land in a high-traffic period, you pay roughly 5-10x more per input token during the cold-start window than at steady state.
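The 5-10x figure follows from the discount arithmetic. The sketch below is a simplified model, assuming every input token belongs to the cacheable prefix and ignoring any surcharge on the request that writes the cache. The function names are made up for illustration.

```python
def relative_input_cost(hit_rate: float, cached_price_ratio: float) -> float:
    """Input cost per token relative to the uncached price (1.0 = full price).

    hit_rate: fraction of input tokens served from cache (0..1).
    cached_price_ratio: price of a cached token as a fraction of full price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) + hit_rate * cached_price_ratio

def cold_start_multiplier(steady_hit_rate: float, cached_price_ratio: float) -> float:
    """How many times more a fully cold request costs than a steady-state one."""
    cold = relative_input_cost(0.0, cached_price_ratio)
    warm = relative_input_cost(steady_hit_rate, cached_price_ratio)
    return cold / warm
```

At a 90% steady-state hit rate with cached tokens billed at 10%, steady-state cost is 0.10 + 0.90 × 0.10 = 0.19 of full price, so a cold request costs about 5.3x more. The multiplier approaches 10x only as the hit rate approaches 100%.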

The same happens after quiet periods. A system that idles for 20 minutes and then gets a burst pays cold-start cost on the burst. If your traffic is bursty (common in B2B, where everyone logs in at 9am), you pay cold-start cost daily.

Warming patterns

Post-deploy warming. Immediately after a deploy, fire N requests using your most-common prompt prefix. This populates the cache. When real traffic arrives, hit rate is already at 90%+.

Periodic warming. A background job sends a request every 4-5 minutes, just inside the cache lifetime, to keep the cache alive during low-traffic periods. This is cheap: one request every 5 minutes versus hundreds of full-price requests when the cache dies.
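A keep-alive job of this kind might look like the sketch below. The `KeepAlive` class, its parameters, and the injected `warm` callable are all hypothetical. The 300-second lifetime is an assumption you should replace with your provider's documented value. The job skips the ping when real traffic has touched the cache recently, since that traffic already refreshed it.

```python
import time
from typing import Callable, Optional

class KeepAlive:
    """Decides when a warm-up request is needed to keep a cache entry alive."""

    def __init__(
        self,
        warm: Callable[[], None],      # sends one request with the stable prefix
        ttl_s: float = 300.0,          # assumed cache lifetime in seconds
        margin_s: float = 60.0,        # warm this long before expected expiry
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.warm = warm
        self.ttl_s = ttl_s
        self.margin_s = margin_s
        self.clock = clock
        self.last_hit_at: Optional[float] = None

    def record_traffic(self) -> None:
        # Call this whenever a real request used the prefix.
        self.last_hit_at = self.clock()

    def tick(self) -> bool:
        """Run from a scheduler every few seconds; returns True if it warmed."""
        now = self.clock()
        stale = (
            self.last_hit_at is None
            or now - self.last_hit_at >= self.ttl_s - self.margin_s
        )
        if stale:
            self.warm()
            self.last_hit_at = now
        return stale
```

Calling `tick()` from a cron-style scheduler every 30 seconds or so yields one warm-up request roughly every four minutes during idle periods and none while traffic is flowing.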

Multi-prefix warming. If your system has several distinct prompt prefixes (different features or models), warm each separately. One warming job per prefix.
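Post-deploy warming across several prefixes can be sketched as follows. `WarmTarget`, `warm_all`, and the `send` callable are hypothetical names. `send` is assumed to wrap your provider client, issue one short request with the given prefix, and return the number of cache-read tokens the provider reported. The first request for a prefix normally writes the cache and reports zero reads, so the sketch sends a second request to confirm that the entry is live.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

@dataclass(frozen=True)
class WarmTarget:
    name: str     # label for logs and dashboards
    model: str    # caches are per model, so it is part of the target
    prefix: str   # the stable prompt prefix to warm

# Hypothetical signature: returns cache-read tokens reported for the request.
SendFn = Callable[[WarmTarget], int]

def warm_all(
    targets: Iterable[WarmTarget], send: SendFn, attempts: int = 2
) -> Dict[str, bool]:
    """Warm each prefix and report which ones produced a confirmed cache read."""
    results: Dict[str, bool] = {}
    for target in targets:
        warmed = False
        for _ in range(attempts):
            if send(target) > 0:   # a non-zero read means the cache is live
                warmed = True
                break
        results[target.name] = warmed
    return results
```

A target that never reports a cache read is worth a log line. The usual causes are a prefix below the provider's minimum cacheable length or a prefix that is not actually stable.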

Prompt design for maximum cacheability

Stable content at the top. System instructions, tool definitions, examples, reference documents. These should be byte-identical across requests. See prompt compression post.

Variable content at the bottom. User query, session context, anything that changes per request. This portion doesn't benefit from caching, but it doesn't defeat caching of the stable prefix above it.

Avoid dynamic values in stable sections. A timestamp in the system prompt defeats caching even if everything else is identical. Resist the temptation to add dynamic context unless necessary.

RAG context: tricky. Retrieved chunks vary per query. Options: put RAG context below the stable system prompt (stable part still caches); or use smaller, more stable context (e.g., category-level summaries that don't change per query).
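The ordering rules above can be sketched as a small prompt builder. This is an illustration under assumptions: real provider APIs take structured messages rather than one concatenated string, and `stable_prefix`, `prefix_fingerprint`, and `build_prompt` are invented names. Serializing tool definitions with sorted keys keeps dictionary ordering from changing the bytes between requests.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple

def stable_prefix(system: str, tools: List[Dict[str, Any]], examples: List[str]) -> str:
    # sort_keys and fixed separators make the JSON independent of dict key order.
    tools_json = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return "\n\n".join([system, tools_json, *examples])

def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the stable prefix; log it to spot accidental prefix changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]

def build_prompt(
    system: str,
    tools: List[Dict[str, Any]],
    examples: List[str],
    rag_chunks: List[str],
    user_query: str,
) -> Tuple[str, str]:
    """Return (cacheable prefix, per-request suffix), stable content first."""
    prefix = stable_prefix(system, tools, examples)
    # Retrieved chunks and the user query vary per request, so they go last.
    suffix = "\n\n".join([*rag_chunks, f"User: {user_query}"])
    return prefix, suffix
```

Emitting the fingerprint as a metric tag makes it easy to confirm that two deploys, or two replicas, really are sending the same prefix.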

Measuring cache hit rate

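Providers report cached token counts in response metadata under provider-specific names: Anthropic returns cache_read_input_tokens in usage, and OpenAI returns cached_tokens under usage.prompt_tokens_details. This post refers to the value generically as cache_read_tokens. Divide it by total input tokens to get the hit rate per request. Note that Anthropic's input_tokens counts only the uncached portion, so the total there is input_tokens + cache_creation_input_tokens + cache_read_input_tokens.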
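A per-request hit rate calculation might look like this. The functions take the usage object as a plain dict. The field names reflect the Anthropic Messages and OpenAI Chat Completions responses as I understand them, so check them against the current API reference before relying on this.

```python
from typing import Any, Dict

def hit_rate_anthropic(usage: Dict[str, Any]) -> float:
    """Hit rate from an Anthropic-style usage dict.

    input_tokens excludes cached tokens there, so the three counters are summed.
    """
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0

def hit_rate_openai(usage: Dict[str, Any]) -> float:
    """Hit rate from an OpenAI-style usage dict.

    prompt_tokens already includes the cached tokens.
    """
    total = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    read = details.get("cached_tokens") or 0
    return read / total if total else 0.0
```

Normalizing both shapes into one `(cached_tokens, total_input_tokens)` pair at the client boundary keeps dashboards comparable across providers.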

Track the aggregate hit rate in your observability stack. Dashboard it over time, per endpoint, and per model. Drops in hit rate signal prompt changes or traffic pattern shifts.

Alert when the cache hit rate drops below a threshold (80% is a reasonable bar for systems with stable prompts). Sudden drops indicate that something changed; investigate before the bill explodes.
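One way to implement the aggregate metric and the alert is a token-weighted rolling window, sketched below. `HitRateMonitor` and its defaults are illustrative. In practice you would keep one instance per endpoint and model, and most metrics backends can compute the same ratio from two counters.

```python
from collections import deque
from typing import Deque, Tuple

class HitRateMonitor:
    """Token-weighted cache hit rate over the most recent requests."""

    def __init__(self, window: int = 200, threshold: float = 0.8, min_requests: int = 20) -> None:
        # Each sample is (cached_tokens, total_input_tokens) for one request.
        self.samples: Deque[Tuple[int, int]] = deque(maxlen=window)
        self.threshold = threshold
        self.min_requests = min_requests

    def record(self, cached_tokens: int, total_input_tokens: int) -> None:
        self.samples.append((cached_tokens, total_input_tokens))

    def hit_rate(self) -> float:
        total = sum(t for _, t in self.samples)
        cached = sum(c for c, _ in self.samples)
        return cached / total if total else 0.0

    def should_alert(self) -> bool:
        # Stay quiet until there is enough data to avoid noisy alerts.
        if len(self.samples) < self.min_requests:
            return False
        return self.hit_rate() < self.threshold
```

Weighting by tokens rather than averaging per-request ratios means a few very large uncached prompts move the number in proportion to what they cost.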

Limitations

Caching doesn't speed up generation. It reduces input cost. Latency improvements are often modest (10-20% faster time to first token in some cases), though the gain tends to grow with the length of the cached prefix.

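Small prompts don't benefit. Providers enforce a minimum cacheable prefix length of roughly 1,000 tokens (1,024 on OpenAI and on several Anthropic models, higher on some), and anything shorter is not cached at all. Provider caching targets large stable prefixes.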

Output doesn't cache. Only input does. If you're generating identical outputs repeatedly, add an application-level response cache (exact-match or semantic) on top. See caching patterns post.

When warming is worth the effort

Warming is worth it when traffic is bursty (the cache dies between bursts), when heavy endpoints see multi-minute cold-start windows, or when cost is a meaningful concern (over $5K/month in input token costs).

For steady, always-on traffic, the cache stays warm on its own and warming adds little value.
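A rough break-even check for periodic warming is sketched below, with costs expressed in units of one full-price pass over the prefix. It is a simplified model under stated assumptions: each keep-alive ping is billed as a cache read, output tokens and cache-write surcharges are ignored, and `requests_before_cache_ready` (how many requests in a burst arrive before the first one has repopulated the cache) is something you have to estimate from your own traffic.

```python
import math

def keepalive_cost(
    idle_minutes: float, interval_minutes: float = 5.0, cached_price_ratio: float = 0.1
) -> float:
    """Cost of keep-alive pings across an idle gap, in full-price prefix units."""
    pings = math.ceil(idle_minutes / interval_minutes)
    return pings * cached_price_ratio

def cold_burst_penalty(
    requests_before_cache_ready: int, cached_price_ratio: float = 0.1
) -> float:
    """Extra cost when that many requests miss the cache after the gap."""
    return requests_before_cache_ready * (1.0 - cached_price_ratio)

def keepalive_pays_off(
    idle_minutes: float,
    requests_before_cache_ready: int,
    interval_minutes: float = 5.0,
    cached_price_ratio: float = 0.1,
) -> bool:
    """True when keeping the cache alive is cheaper than absorbing the cold burst."""
    return keepalive_cost(idle_minutes, interval_minutes, cached_price_ratio) < (
        cold_burst_penalty(requests_before_cache_ready, cached_price_ratio)
    )
```

Under these assumptions, an hour of pings costs about 1.2 prefix units. That is more than a single cold miss (0.9) but far less than a 9am burst in which dozens of concurrent requests all miss.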

Read next
LLM caching patterns: from exact-match to semantic caches
Prompt compression techniques that actually save tokens
Total cost of ownership for LLM systems
Tags: caching, cost optimization, prompts