eazyware
Engineering·November 18, 2024·10 min read

Prompt compression techniques that actually save tokens

LLMLingua, selective context, summarization layers, and pattern templating. The techniques that cut input tokens 30-60% without quality loss.

Kushal R.
Engineering lead

Prompt compression is one of the highest-leverage cost optimizations available to teams running LLMs in production. A system prompt of 2,000 tokens that runs on every request quietly dominates the bill. The techniques in this post routinely cut input tokens 30-60% without meaningful quality loss, and they compound with caching and routing to drop overall inference cost by half or more. We use these in client deployments handling millions of daily requests, and this is the honest ranking of what actually works.

Technique comparison
Prompt compression techniques: token savings

TECHNIQUE                     SAVINGS             QUALITY IMPACT
System prompt caching         80-90% on repeat    zero (identical output)
LLMLingua compression         40-60%              small (eval-verify)
Selective context (rerank)    30-50%              often improves quality
Summary hierarchy             20-40%              moderate (test)
Template & var replace        10-20%              zero (pure refactor)

Combine: caching + rerank + templating stacks to 70%+ reduction on repeat prompts.
Five compression techniques ranked by savings and quality impact. System prompt caching leads on pure savings with zero quality risk. Combining techniques stacks to 70%+ reduction.

System prompt caching (start here)

Before any algorithmic compression, use the prompt caching features offered by major providers. OpenAI, Anthropic, and Google all cache identical prompt prefixes and charge 50-90% less on cache hits. If your system prompt is stable across requests (which it should be), you get immediate savings with zero quality risk.

Implementation: put your stable context at the top — system instructions, tool definitions, shared examples — and variable user content at the bottom. This maximizes prefix reuse. In our deployments, this alone drops the effective cost of a 3,000-token system prompt to ~300-600 equivalent tokens over a day of traffic.
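
A minimal sketch of the ordering described above, plus the arithmetic behind "equivalent tokens". The names `PromptParts`, `build_prompt`, and `effective_prefix_tokens` are hypothetical, and the hit rate and discount are assumed inputs rather than figures from any provider.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    system_instructions: str  # stable across requests
    tool_definitions: str     # stable across requests
    shared_examples: str      # stable across requests
    user_content: str         # changes on every request


def build_prompt(parts: PromptParts) -> str:
    """Put stable content first so consecutive requests share the longest prefix."""
    stable_prefix = "\n\n".join(
        [parts.system_instructions, parts.tool_definitions, parts.shared_examples]
    )
    return stable_prefix + "\n\n" + parts.user_content


def effective_prefix_tokens(prefix_tokens: int, hit_rate: float, discount: float) -> float:
    """Billable-equivalent prefix tokens, averaged over cache hits and misses.

    hit_rate: fraction of requests served from cache (assumed value).
    discount: fractional price reduction on a cache hit (assumed value).
    """
    return prefix_tokens * (1.0 - hit_rate * discount)
```

With an assumed 95% hit rate and a 90% discount, `effective_prefix_tokens(3000, 0.95, 0.9)` comes to about 435 equivalent tokens, inside the 300-600 range quoted above. The sketch ignores any cache-write surcharge a provider may apply.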

LLMLingua and prompt compression models

LLMLingua and its successors (LongLLMLingua, LLMLingua-2) compress prompts by identifying and removing tokens that contribute little to model output. Compression ratios of 2-10x are achievable on long prompts with bounded quality loss.

When it works: long, repetitive prompts (documentation, multiple examples, verbose instructions). Less effective on dense, information-rich prompts. Always eval-verify — the model's behavior on compressed prompts can diverge subtly on edge cases. See eval infrastructure post.

Selective context and reranking

For RAG systems, aggressive reranking after retrieval delivers both cost savings and quality improvements. Retrieve 20 candidates; rerank to top 3-5; send only those to the LLM. Assuming similar-sized chunks, you cut retrieved context by 75-85% and often get better answers because the model focuses on the most relevant information.

Our default reranker stack: BM25 retrieval, vector retrieval, cross-encoder reranker (bge-reranker-v2 or Cohere Rerank), final context of 3-5 chunks. See RAG patterns post and hybrid search post for architecture.
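
The retrieve-then-rerank flow can be sketched roughly as follows. This is an illustration, not our production stack: `rerank_top_k` is a hypothetical helper, and `overlap_score` is a toy term-overlap scorer standing in for a real cross-encoder so the example stays self-contained.

```python
from typing import Callable, List


def rerank_top_k(
    query: str,
    bm25_hits: List[str],
    vector_hits: List[str],
    score: Callable[[str, str], float],
    k: int = 4,
) -> List[str]:
    # Merge both candidate lists, dropping duplicates while keeping first-seen order.
    candidates = list(dict.fromkeys(bm25_hits + vector_hits))
    # Score every (query, chunk) pair and sort from most to least relevant.
    ranked = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    # Only the top k chunks go into the LLM context.
    return ranked[:k]


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms found in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)
```

In a real deployment the `score` argument would wrap the cross-encoder call; passing it in as a function keeps the merge-and-truncate logic independent of which reranker is used.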

Summary hierarchy

Long conversation histories are a classic bloat source. A chat with 50 turns of context sends the full history every turn. Instead: after every N turns, summarize the older portion into a compact summary; keep only the last 5-10 turns verbatim.

This preserves recent context fidelity while compacting older context. It trades some detail for significant token savings, and it works well for support bots, longer copilot sessions, and any system where conversation history accumulates.
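
One way the rolling summary could be wired up, as a sketch under stated assumptions: `compact_history` and its `keep_last` and `summarize_every` parameters are hypothetical names, and `placeholder_summarize` stands in for what would be an LLM summarization call in practice.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


def compact_history(
    summary: str,
    turns: List[Turn],
    summarize: Callable[[str, List[Turn]], str],
    keep_last: int = 8,
    summarize_every: int = 10,
) -> Tuple[str, List[Turn]]:
    """Fold older turns into the running summary once enough have accumulated."""
    overflow = len(turns) - keep_last
    if overflow < summarize_every:
        return summary, turns  # not enough old turns yet; leave history untouched
    old, recent = turns[:overflow], turns[overflow:]
    return summarize(summary, old), recent


def placeholder_summarize(summary: str, old_turns: List[Turn]) -> str:
    # Hypothetical stand-in: a real system would call an LLM here.
    return f"{summary} [{len(old_turns)} earlier turns summarized]".strip()
```

Each request then sends the summary plus the verbatim recent turns instead of the full history. Batching the summarization (every `summarize_every` turns rather than every turn) keeps the extra summarization calls infrequent.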

Templates and variable substitution

Many teams have prompts that include redundant boilerplate in each request — restating the task, re-listing examples, re-specifying output format. Extract the stable parts into system prompts (cached); use the user prompt only for variable content.

Audit your prompts periodically. Count tokens in the stable prefix versus variable content. If the prefix is 80% of each request and the variable part is 20%, you are paying full price for the same prefix on every request unless it is cached.
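
A rough way to run that audit over a sample of logged prompts. This sketch assumes whitespace splitting as a crude proxy for tokens (swap in your provider's tokenizer for real numbers), and `stable_prefix_share` is a hypothetical helper name.

```python
from typing import List


def stable_prefix_share(prompts: List[str]) -> float:
    """Fraction of all tokens that sit in the prefix shared by every prompt."""
    # Assumption: whitespace tokens approximate model tokens closely enough for an audit.
    tokenized = [p.split() for p in prompts]
    shared = 0
    # Walk token positions in lockstep until the prompts first disagree.
    for column in zip(*tokenized):
        if len(set(column)) != 1:
            break
        shared += 1
    total = sum(len(t) for t in tokenized)
    return shared * len(tokenized) / total if total else 0.0
```

A high share means most of what you send is identical across requests, which makes that endpoint a strong candidate for moving the shared part into a cached system prompt.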

Context window engineering

Not every context needs the full window. A 128K context window is expensive even with caching. Right-size your context to the task. Most production tasks (classification, extraction, Q&A) need 2-8K tokens of context. Reaching for 128K because the model supports it is a cost-and-quality anti-pattern.

The 'lost in the middle' problem is real: models attend less to tokens in the middle of long contexts. Shorter, denser contexts often outperform longer, sparser ones both on cost and quality. See context window engineering for deeper patterns.

Measure before optimizing

Before compressing, instrument. For each endpoint, track: average input tokens per request, average output tokens, cost per 1000 requests. Identify the top 3 endpoints by cost. Those are where compression pays off.
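
The per-endpoint tracking could be sketched like this. The record shape, function names, and prices are all assumptions for illustration; replace the two price constants with your provider's actual rates.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed illustrative prices in dollars per 1M tokens; not any provider's real rates.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


def endpoint_stats(records: Iterable[Tuple[str, int, int]]) -> Dict[str, dict]:
    """records: one (endpoint, input_tokens, output_tokens) tuple per request."""
    totals = defaultdict(lambda: {"requests": 0, "input": 0, "output": 0})
    for endpoint, input_tokens, output_tokens in records:
        t = totals[endpoint]
        t["requests"] += 1
        t["input"] += input_tokens
        t["output"] += output_tokens
    stats = {}
    for endpoint, t in totals.items():
        cost = (t["input"] * INPUT_PRICE_PER_M + t["output"] * OUTPUT_PRICE_PER_M) / 1_000_000
        stats[endpoint] = {
            "avg_input_tokens": t["input"] / t["requests"],
            "avg_output_tokens": t["output"] / t["requests"],
            "cost_per_1000_requests": 1000 * cost / t["requests"],
            "total_cost": cost,
        }
    return stats


def top_by_cost(stats: Dict[str, dict], n: int = 3) -> List[str]:
    """Endpoints with the highest total spend, most expensive first."""
    return sorted(stats, key=lambda e: stats[e]["total_cost"], reverse=True)[:n]
```

Ranking by total cost rather than cost per request is deliberate here: a cheap endpoint with heavy traffic can outspend an expensive one that is rarely called.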

After compressing, eval: run your suite against old and new prompts. Quality delta within acceptable bounds? Ship. Quality regression? Roll back. Compression that degrades quality for a user-visible metric is a bad trade, regardless of savings.
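
The ship-or-roll-back decision can be expressed as a simple gate. This is a sketch with assumptions: exact-match grading is a simplification (real suites often use graders or rubric scores), and `should_ship` and `max_regression` are hypothetical names.

```python
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]  # (input, expected output)


def pass_rate(run: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Share of cases where the system output exactly matches the expected output."""
    passed = sum(1 for case_input, expected in cases if run(case_input) == expected)
    return passed / len(cases)


def should_ship(
    run_old: Callable[[str], str],
    run_new: Callable[[str], str],
    cases: List[EvalCase],
    max_regression: float = 0.01,
) -> bool:
    """Ship only if the compressed prompt loses at most max_regression in pass rate."""
    return pass_rate(run_old, cases) - pass_rate(run_new, cases) <= max_regression
```

Here `run_old` and `run_new` would wrap the same model call with the original and compressed prompts respectively, so the only variable between the two runs is the prompt.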

Combining techniques

The multipliers stack. Caching (60% reduction on effective cost) × rerank-to-3 (70% reduction in context sent) × template extraction (10% reduction) compounds to roughly 90% cost reduction on the affected endpoints. Not every endpoint supports all techniques; pick the ones that fit each workload.
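
The arithmetic behind the stacking claim, as a small sketch. It assumes, as the paragraph above does, that each reduction applies independently to the same cost base; `combined_reduction` is a hypothetical helper name.

```python
from functools import reduce
from typing import Iterable


def combined_reduction(reductions: Iterable[float]) -> float:
    """Each technique leaves (1 - r) of the cost; the remainders multiply."""
    remaining = reduce(lambda acc, r: acc * (1.0 - r), reductions, 1.0)
    return 1.0 - remaining
```

For the figures above, `combined_reduction([0.60, 0.70, 0.10])` leaves 0.4 × 0.3 × 0.9 = 0.108 of the original cost, a reduction of about 89%. If a technique only touches part of the input (for example, reranking only shrinks the retrieved context), the combined figure for the whole request will be smaller.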
