Prompt compression is one of the highest-leverage cost optimizations available to teams running LLMs in production. A system prompt of 2,000 tokens that runs on every request quietly dominates the bill. The techniques in this post routinely cut input tokens 30-70% without meaningful quality loss — and compound with caching and routing to drop overall inference cost by half or more. We use these in client deployments handling millions of daily requests, and this is the honest ranking of what actually works.
System prompt caching (start here)
Before any algorithmic compression, use the prompt caching features offered by major providers. OpenAI, Anthropic, and Google all cache identical prompt prefixes and charge 50-90% less on cache hits. If your system prompt is stable across requests (which it should be), you get immediate savings with zero quality risk.
Implementation: put your stable context at the top — system instructions, tool definitions, shared examples — and variable user content at the bottom. This maximizes prefix reuse. In our deployments, this alone drops the effective cost of a 3,000-token system prompt to ~300-600 equivalent tokens over a day of traffic.
LLMLingua and prompt compression models
LLMLingua and its successors (LongLLMLingua, LLMLingua-2) compress prompts by identifying and removing tokens that contribute little to model output. Compression ratios of 2-10x are achievable on long prompts with bounded quality loss.
When it works: long, repetitive prompts (documentation, multiple examples, verbose instructions). Less effective on dense, information-rich prompts. Always eval-verify — the model's behavior on compressed prompts can diverge subtly on edge cases. See eval infrastructure post.
Selective context and reranking
For RAG systems, aggressive reranking after retrieval delivers both cost savings and quality improvements. Retrieve 20 candidates; rerank to top 3-5; send only those to the LLM. You cut context by 60-75% and often get better answers because the model focuses on the most relevant information.
Our default reranker stack: BM25 retrieval, vector retrieval, cross-encoder reranker (bge-reranker-v2 or Cohere Rerank), final context of 3-5 chunks. See RAG patterns post and hybrid search post for architecture.
Summary hierarchy
Long conversation histories are a classic bloat source. A chat with 50 turns of context sends the full history every turn. Instead: after every N turns, summarize the older portion into a compact summary; keep only the last 5-10 turns verbatim.
This preserves recent context fidelity while compacting older context. Trades some detail for significant token savings. Works well for support bots, longer copilot sessions, any system where conversation history accumulates.
Templates and variable substitution
Many teams have prompts that include redundant boilerplate in each request — restating the task, re-listing examples, re-specifying output format. Extract the stable parts into system prompts (cached); use the user prompt only for variable content.
Audit your prompts periodically. Count tokens in the stable prefix versus variable content. If the prefix is 80% and the variable part is 20%, you're paying for the prefix on every request when you shouldn't need to.
Context window engineering
Not every context needs the full window. A 128K context window is expensive even with caching. Right-size your context to the task. Most production tasks (classification, extraction, Q&A) need 2-8K tokens of context. Reaching for 128K because the model supports it is a cost-and-quality anti-pattern.
The 'lost in the middle' problem is real: models attend less to tokens in the middle of long contexts. Shorter, denser contexts often outperform longer, sparser ones both on cost and quality. See context window engineering for deeper patterns.
Measure before optimizing
Before compressing, instrument. For each endpoint, track: average input tokens per request, average output tokens, cost per 1000 requests. Identify the top 3 endpoints by cost. Those are where compression pays off.
After compressing, eval: run your suite against old and new prompts. Quality delta within acceptable bounds? Ship. Quality regression? Roll back. Compression that degrades quality for a user-visible metric is a bad trade, regardless of savings.
Combining techniques
The multipliers stack. Caching (60% reduction on effective cost) × rerank-to-3 (70% reduction in context sent) × template extraction (10% reduction) compounds to roughly 90% cost reduction on the affected endpoints. Not every endpoint supports all techniques; pick the ones that fit each workload.