Contextual compression is a retrieval post-processing step that extracts only the relevant portions of retrieved chunks before sending them to the LLM. Instead of shipping 5 complete 1000-token chunks (5000 tokens of context), you ship only the 50-200 tokens actually relevant to the query. It cuts cost, often improves quality, and sidesteps the 'lost in the middle' problem. This post covers the techniques and deployment patterns.

Compression pipeline

Retrieve chunks → compressor (LLM or model) → relevant excerpts → LLM → final answer. Compression happens between retrieval and generation.

Why compress retrieved context?

A chunk retrieved for relevance might be 90% irrelevant. Suppose the user asked about the return policy, and a 1000-token chunk mentions 'return policy' once, in paragraph 3. The other 900 tokens cover shipping, customer service, and product descriptions, so the LLM's cost and attention are wasted on them.

LLMs also suffer from the 'lost in the middle' effect: information in the middle of a long context is used less reliably. Shorter, denser contexts often improve answer quality in addition to saving cost.

Compression techniques

LLM-based extraction. Pass query + chunk to a small LLM prompted to extract only the relevant sentences or paragraphs. Cheap (small model, short output). Reliable for most queries.

Embedding-based extraction. Score each sentence's relevance via embedding similarity to the query; keep top N. Faster, cheaper, slightly lower quality than LLM extraction.
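A minimal sketch of sentence-level scoring, to make the idea concrete. The `embed` function here is a toy bag-of-words stand-in so the example runs without dependencies; it is an assumption, not a recommendation, and in practice you would swap in a real embedding model. The names `embed`, `cosine`, and `top_sentences` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a placeholder for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_sentences(query: str, chunk: str, n: int = 2) -> list[str]:
    """Keep the n sentences most similar to the query, in their original order."""
    # Naive sentence split on ., !, ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    q = embed(query)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(q, embed(sentences[i])),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:n])]
```

Returning the kept sentences in document order, not score order, keeps the excerpt readable for the final LLM.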

LLMLingua-style compression. Token-level compression models identify and remove uninformative tokens. Works on any text; doesn't require query-specific extraction. See prompt compression post.

Hybrid: embedding-based first pass for speed, LLM-based second pass for precision on critical queries.

Cost-benefit math

Typical compression: 70-90% reduction in context tokens sent to the final LLM. For a 5-chunk context of 5000 tokens, compression sends 500-1500 tokens.

Added cost: the compression LLM call itself. For a fast small model, roughly $0.0002 per query. The savings on the main LLM call are typically 10-50x this.

Quality impact: typically positive. Focused context usually produces better answers. Occasionally negative when compression drops relevant nuance; rare with well-tuned prompts.
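The arithmetic above can be worked through in a few lines. This is a sketch under stated assumptions: the main-model price of $2.00 per million input tokens is a hypothetical figure chosen for illustration, while the 5000-token context, the reduction rate, and the $0.0002 compression cost come from the numbers above.

```python
def per_query_economics(context_tokens, reduction, main_price_per_mtok, compressor_cost):
    """Return (tokens_sent, gross_saving, net_saving, payback_ratio), money in dollars."""
    tokens_saved = context_tokens * reduction
    tokens_sent = context_tokens - tokens_saved
    # Saving on the main LLM call from the input tokens no longer sent.
    gross_saving = tokens_saved / 1_000_000 * main_price_per_mtok
    net_saving = gross_saving - compressor_cost
    payback_ratio = gross_saving / compressor_cost
    return tokens_sent, gross_saving, net_saving, payback_ratio


# Assumed inputs: 80% reduction, $2.00 per million input tokens (hypothetical price).
tokens_sent, gross, net, ratio = per_query_economics(5000, 0.8, 2.00, 0.0002)
print(f"tokens sent: {tokens_sent:.0f}, gross: ${gross:.4f}, net: ${net:.4f}, ratio: {ratio:.0f}x")
```

Under these assumed numbers the main-call saving is about $0.008 per query, roughly 40x the compression cost, which sits inside the 10-50x range quoted above. Plug in your own model prices before drawing conclusions.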

Implementation

LangChain and LlamaIndex both have contextual compression primitives. For custom implementations, the LLM prompt is simple: 'Given the query and the chunk, extract only sentences relevant to the query. Preserve key details. Output only the extracted text.'
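For a custom implementation, the prompt above can be wrapped roughly like this. The sketch deliberately avoids any specific vendor SDK: `call_llm` is a hypothetical callable you supply (prompt in, text out), and `compress_chunk` is an illustrative name, not a library function.

```python
from typing import Callable

EXTRACTION_PROMPT = (
    "Given the query and the chunk, extract only sentences relevant to the query. "
    "Preserve key details. Output only the extracted text.\n\n"
    "Query: {query}\n\n"
    "Chunk:\n{chunk}"
)


def compress_chunk(query: str, chunk: str, call_llm: Callable[[str], str]) -> str:
    """Build the extraction prompt and return the model's excerpt, whitespace-stripped.

    call_llm is an assumed interface: it takes a prompt string and returns the
    model's text output. Wire it to whichever small model you use.
    """
    prompt = EXTRACTION_PROMPT.format(query=query, chunk=chunk)
    return call_llm(prompt).strip()
```

Injecting the model call as a parameter keeps the compressor testable with a stub and makes it easy to swap models later.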

Run compression in parallel across chunks. Each chunk is independent; parallel compression cuts wall time. For latency-sensitive systems, this matters.
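One way to sketch the parallel step, assuming the compressor is an I/O-bound call (such as an HTTP request to a model API) so that threads are sufficient. `compress_all` and its `compress` argument are hypothetical names; dropping empty excerpts is an added design choice, not something the pattern requires.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def compress_all(
    query: str,
    chunks: list[str],
    compress: Callable[[str, str], str],
    max_workers: int = 8,
) -> list[str]:
    """Compress every chunk concurrently; results keep the input order."""
    if not chunks:
        return []
    workers = min(max_workers, len(chunks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the inputs.
        results = list(pool.map(lambda c: compress(query, c), chunks))
    # Drop chunks for which the compressor found nothing relevant.
    return [r for r in results if r]
```

With this shape, wall time is roughly that of the slowest single compression call instead of the sum of all of them.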

Cache compressed outputs. Same query + same chunk → same compression. Cache hit rate on repeated queries is high. See caching patterns post.
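A minimal in-memory version of that cache might look like the following. It is a sketch, assuming the compressor is deterministic enough that reusing its output is acceptable; `CompressionCache` is a hypothetical class, and a production system would more likely use a shared store with expiry.

```python
import hashlib
from typing import Callable


class CompressionCache:
    """Memoize compress(query, chunk) results, keyed by a hash of both inputs."""

    def __init__(self, compress: Callable[[str, str], str]):
        self._compress = compress
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str, chunk: str) -> str:
        h = hashlib.sha256()
        h.update(query.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
        h.update(chunk.encode("utf-8"))
        return h.hexdigest()

    def __call__(self, query: str, chunk: str) -> str:
        key = self._key(query, chunk)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._compress(query, chunk)
        self._store[key] = result
        return result
```

If you change the compression prompt or model, the cached excerpts no longer reflect current behavior, so include a version string in the key or clear the cache.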

Pitfalls

Over-compression. Aggressive prompts drop context the LLM actually needed. Calibrate: err toward preserving more when uncertain. Eval-guided tuning of compression prompts is critical.

Ignoring query specificity. Different queries need different context. A specific fact query needs just the fact; an analytical query might need the surrounding reasoning. The compression prompt should account for query type.
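A rough illustration of query-aware prompting. Everything here is an assumption made for the sketch: the two prompt texts, the marker words, and the keyword-based `classify_query` heuristic, which is deliberately crude (a small classifier or the LLM itself would usually do this better).

```python
import re

FACT_PROMPT = (
    "Given the query and the chunk, extract only the sentences that state the "
    "requested fact. Output only the extracted text."
)
ANALYTICAL_PROMPT = (
    "Given the query and the chunk, extract the relevant sentences along with the "
    "surrounding reasoning needed to interpret them. Output only the extracted text."
)
# Hypothetical marker words; tune or replace for your own traffic.
ANALYTICAL_MARKERS = {"why", "how", "compare", "explain", "tradeoff", "tradeoffs"}


def classify_query(query: str) -> str:
    """Label a query 'analytical' if it contains a marker word, else 'fact'."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return "analytical" if tokens & ANALYTICAL_MARKERS else "fact"


def build_prompt(query: str, chunk: str) -> str:
    """Pick the compression instructions that match the query type."""
    kind = classify_query(query)
    instructions = ANALYTICAL_PROMPT if kind == "analytical" else FACT_PROMPT
    return f"{instructions}\n\nQuery: {query}\n\nChunk:\n{chunk}"
```

The heuristic will misfire on queries such as "How many days do I have?", which is one more reason to check the routing against your eval set.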

Eval coverage. Compression changes model behavior in subtle ways. Run your eval suite against compressed vs uncompressed contexts. Look for regressions in specific question categories.
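A sketch of the per-category comparison, assuming your eval harness can emit one record per question with a category label and a correctness flag for each condition. The record fields (`category`, `baseline_correct`, `compressed_correct`) and the function name are hypothetical.

```python
from collections import defaultdict


def regressions_by_category(results, tolerance=0.0):
    """Return {category: accuracy_delta} for categories where compression hurts.

    accuracy_delta is compressed accuracy minus uncompressed accuracy; only
    categories whose delta is below -tolerance are reported.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [count, baseline_correct, compressed_correct]
    for r in results:
        t = totals[r["category"]]
        t[0] += 1
        t[1] += int(r["baseline_correct"])
        t[2] += int(r["compressed_correct"])

    report = {}
    for category, (n, base, comp) in totals.items():
        delta = (comp - base) / n
        if delta < -tolerance:
            report[category] = delta
    return report
```

Breaking results out by category matters because an aggregate score can stay flat while one question type quietly gets worse.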

When compression earns its complexity

High-volume systems. Compression at 1000 queries/day saves tokens but the engineering effort doesn't pay back. At 100K queries/day, savings compound to meaningful dollars.
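To see why volume dominates, here is the payback arithmetic as a sketch. Both inputs are hypothetical: a net saving of $0.008 per query and a one-off engineering cost of $10,000 are illustrative figures only, so substitute your own.

```python
def monthly_net_saving(queries_per_day: float, net_saving_per_query: float, days: int = 30) -> float:
    """Dollars saved per month at a given query volume."""
    return queries_per_day * days * net_saving_per_query


def months_to_payback(engineering_cost: float, queries_per_day: float, net_saving_per_query: float) -> float:
    """Months of savings needed to recover a one-off engineering cost."""
    return engineering_cost / monthly_net_saving(queries_per_day, net_saving_per_query)


# Assumed figures: $0.008 net saving per query, $10,000 engineering cost.
for volume in (1_000, 100_000):
    saving = monthly_net_saving(volume, 0.008)
    payback = months_to_payback(10_000, volume, 0.008)
    print(f"{volume:>7} queries/day: ${saving:,.0f}/month, payback in {payback:.1f} months")
```

Under these assumptions the low-volume system takes years to pay back the build, while the high-volume one recovers it within the first month.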

Long-context retrievers. If you're retrieving chunks of 2K+ tokens, the compression ROI is high. For short chunks (200-500 tokens), compression adds overhead without clear benefit.

Quality-sensitive systems. 'Lost in the middle' reduction alone can justify compression even when cost savings are modest.

Related patterns

Reranking (see reranking post) is a different approach to similar problems: it selects the most relevant chunks instead of compressing all of them. Compression can complement reranking: rerank to the top 5, then compress each. Together they can cut context by 90% or more, typically without hurting answer quality.
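The rerank-then-compress combination can be sketched as a single function. The `score` and `compress` callables are hypothetical stand-ins for whatever reranker and compressor you use; nothing here is tied to a particular library.

```python
from typing import Callable


def rerank_then_compress(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    compress: Callable[[str, str], str],
    top_k: int = 5,
) -> str:
    """Keep the top_k chunks by reranker score, compress each, and join the excerpts."""
    # Highest-scoring chunks first; sorted() is stable, so ties keep retrieval order.
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]
    excerpts = [compress(query, c) for c in ranked]
    # Skip chunks where the compressor returned nothing relevant.
    return "\n\n".join(e for e in excerpts if e)
```

Reranking first means the compressor only runs on chunks that are likely to matter, which keeps the added compression cost proportional to `top_k` and not to the size of the initial retrieval.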

Continue the thread.

Prompt compression techniques that actually save tokens

Six RAG patterns that actually work in production

Chunking strategies: the unglamorous key to RAG quality
