eazyware
Engineering·June 17, 2024·10 min read

Contextual compression: cutting retrieved context in half

Retrieved chunks are often 80% irrelevant text. Contextual compression extracts only the relevant sentences before they reach the LLM, which improves both cost and quality.

Kushal R.
Engineering lead

Contextual compression is a retrieval post-processing step that extracts only the relevant portions of retrieved chunks before sending them to the LLM. Instead of shipping 5 complete 1000-token chunks (5000 tokens of context), you ship only the 50-200 tokens actually relevant to the query. This cuts cost, improves quality, and sidesteps the 'lost in the middle' problem. This post covers the techniques and deployment patterns.

Compression pipeline
[Figure: Contextual compression, extract before send. Retrieved chunks (10 chunks × 500 tokens = 5,000 tokens) pass through a per-chunk compressor that extracts only the relevant sentences, producing 1,800 compressed tokens (a 64% cut) with a higher signal ratio.]

Implementation approaches shown in the figure:
1. Small LLM extractor: a Haiku/mini-class model extracts relevant sentences per chunk. Fast and cheap.
2. Sentence-level reranker: score each sentence by query relevance, keep the top N.
3. LLMLingua-style compression: automated token-level pruning.
Tradeoff: an extra call versus fewer tokens to the final model. Pays off above roughly 2K input tokens.
Retrieved chunks → compressor (LLM or model) → relevant excerpts → LLM → final answer. Compression happens between retrieval and generation.

Why compress retrieved context?

A chunk retrieved for relevance might still be 90% irrelevant. Take a 1,000-token chunk that mentions 'return policy' once in paragraph 3, retrieved because the user asked about the return policy. The other 900 tokens cover shipping, customer service, and product descriptions. Both LLM cost and attention are wasted on those 900 tokens.

LLMs also suffer from the 'lost in the middle' effect: information in the middle of long contexts is less reliably used. Shorter, denser contexts often improve answer quality in addition to saving cost.

Compression techniques

LLM-based extraction. Pass query + chunk to a small LLM prompted to extract only the relevant sentences or paragraphs. Cheap (small model, short output). Reliable for most queries.

Embedding-based extraction. Score each sentence's relevance via embedding similarity to the query; keep top N. Faster, cheaper, slightly lower quality than LLM extraction.

LLMLingua-style compression. Token-level compression models identify and remove uninformative tokens. Works on any text; doesn't require query-specific extraction. See prompt compression post.

Hybrid: embedding-based first pass for speed, LLM-based second pass for precision on critical queries.
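The embedding-based option above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `embed` function here is a stand-in bag-of-words vector (an assumption made so the example runs without a model), and `top_sentences` is a hypothetical helper name. In practice you would swap `embed` for a real sentence-embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in "embedding": a bag-of-words count vector (assumption for the sketch).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_sentences(query: str, chunk: str, n: int = 2) -> str:
    """Keep the n sentences most similar to the query, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    q = embed(query)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(q, embed(sentences[i])),
        reverse=True,
    )
    keep = sorted(ranked[:n])  # restore document order for readability
    return " ".join(sentences[i] for i in keep)
```

Restoring the original sentence order after ranking is a design choice: it keeps excerpts readable when the kept sentences refer to each other.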

Cost-benefit math

Typical compression: 70-90% reduction in context tokens sent to the final LLM. For a 5-chunk context of 5000 tokens, compression sends 500-1500 tokens.

Added cost: the compression LLM call itself. For a fast small model, roughly $0.0002 per query. The savings on the main LLM call are typically 10-50x this.
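To make the arithmetic concrete, here is a small worked calculation. The $2.00 per million input tokens price for the main model is an assumed figure for illustration only, and `net_savings_per_query` is a hypothetical helper. The 5,000-token context, the 80% reduction, and the $0.0002 compression cost come from the numbers above.

```python
def net_savings_per_query(
    context_tokens: int,
    reduction: float,
    main_price_per_mtok: float,
    compression_cost: float,
) -> float:
    """Dollars saved per query: main-model input savings minus the compression call."""
    tokens_saved = context_tokens * reduction
    gross = tokens_saved / 1_000_000 * main_price_per_mtok
    return gross - compression_cost


# 5,000-token context, 80% reduction, ASSUMED $2.00 per million input tokens,
# and the ~$0.0002 compression call mentioned in the text.
saving = net_savings_per_query(5_000, 0.80, 2.00, 0.0002)
print(f"${saving:.4f} saved per query")
```

Under these assumed prices the gross saving is $0.008 per query, about 40x the compression call, which sits inside the 10-50x range. The net figure scales linearly with volume: roughly $7.80 per day at 1,000 queries and $780 per day at 100K queries.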

Quality impact: typically positive. Focused context usually produces better answers. Occasionally negative when compression drops relevant nuance; rare with well-tuned prompts.

Implementation

LangChain and LlamaIndex both have contextual compression primitives. For custom implementations, the LLM prompt is simple: 'Given the query and the chunk, extract only sentences relevant to the query. Preserve key details. Output only the extracted text.'
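For a custom implementation, the prompt above can be wrapped roughly as follows. This is a sketch under stated assumptions: `complete` is a hypothetical callable that sends a prompt to whatever small model you use and returns its text, and the `NO_RELEVANT_CONTENT` sentinel is an addition for illustration, not part of the original prompt or of any library.

```python
from typing import Callable

# Hypothetical sentinel the model is asked to emit when nothing in the chunk is relevant.
NO_CONTENT = "NO_RELEVANT_CONTENT"


def build_prompt(query: str, chunk: str) -> str:
    # Prompt wording follows the text; the sentinel instruction is an assumption.
    return (
        "Given the query and the chunk, extract only sentences relevant to the query. "
        "Preserve key details. Output only the extracted text. "
        f"If nothing is relevant, output {NO_CONTENT}.\n\n"
        f"Query: {query}\n\nChunk:\n{chunk}"
    )


def compress_chunk(query: str, chunk: str, complete: Callable[[str], str]) -> str:
    """Return the relevant excerpt of `chunk`, or "" if the model finds none."""
    excerpt = complete(build_prompt(query, chunk)).strip()
    return "" if excerpt == NO_CONTENT else excerpt
```

Giving the model an explicit way to say "nothing here" helps avoid the failure mode where it paraphrases an irrelevant chunk just to produce some output.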

Run compression in parallel across chunks. Each chunk is independent, so parallel calls cut wall-clock time to roughly that of the slowest single call. For latency-sensitive systems, this matters.
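One way to parallelize is a thread pool, since compression calls are typically network-bound. In this sketch `compress_chunk(query, chunk)` is a hypothetical callable standing in for your compressor, and `max_workers=8` is an arbitrary assumed default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def compress_all(
    query: str,
    chunks: List[str],
    compress_chunk: Callable[[str, str], str],
    max_workers: int = 8,
) -> List[str]:
    """Compress every chunk concurrently; results keep the input order."""
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=min(max_workers, len(chunks))) as pool:
        # pool.map preserves input order even when calls finish out of order.
        excerpts = list(pool.map(lambda c: compress_chunk(query, c), chunks))
    # Drop chunks where the compressor found nothing relevant.
    return [e for e in excerpts if e]
```

Threads are a reasonable fit here because each call mostly waits on I/O. If your client library is async, `asyncio.gather` would serve the same purpose.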

Cache compressed outputs. Same query + same chunk → same compression. Cache hit rate on repeated queries is high. See caching patterns post.
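A minimal in-memory version of this cache might look like the following. `CompressionCache` is a hypothetical class, not a library API, and the light query normalization (lowercasing, collapsing whitespace) is an assumption you may or may not want. A production system would more likely use a shared store with expiry.

```python
import hashlib
from typing import Callable, Dict


class CompressionCache:
    """In-memory cache keyed on (query, chunk). Illustrative sketch only."""

    def __init__(self, compress_chunk: Callable[[str, str], str]) -> None:
        self._compress = compress_chunk
        self._store: Dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def _key(query: str, chunk: str) -> str:
        # Normalize the query lightly so trivial variations share an entry (assumption).
        normalized = " ".join(query.lower().split())
        payload = normalized + "\x00" + chunk
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, query: str, chunk: str) -> str:
        key = self._key(query, chunk)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = self._compress(query, chunk)
        self._store[key] = result
        return result
```

Hashing the chunk text rather than a chunk ID means the entry is naturally invalidated when the underlying document changes.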

Pitfalls

Over-compression. Aggressive prompts drop context the LLM actually needed. Calibrate: err toward preserving more when uncertain. Eval-guided tuning of compression prompts is critical.

Ignoring query specificity. Different queries need different context. A specific fact query needs just the fact; an analytical query might need the surrounding reasoning. The compression prompt should account for query type.
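As a rough illustration of a query-aware prompt, the sketch below picks a different extraction instruction per query type. The keyword heuristic in `query_type` is deliberately crude and purely an assumption (for example, it would label "How many days do I have?" as analytical); a small classifier or the LLM itself would be a more realistic choice.

```python
import re

# Hypothetical cue words that suggest the query needs reasoning, not just a fact.
ANALYTICAL_CUES = ("why", "how", "compare", "explain", "tradeoff", "trade-off")

INSTRUCTIONS = {
    "factual": "Extract only the sentences that directly state the answer.",
    "analytical": (
        "Extract the relevant sentences plus the surrounding reasoning they depend on."
    ),
}


def query_type(query: str) -> str:
    # Crude keyword heuristic; an assumption for the sketch, not a recommended classifier.
    words = re.findall(r"[a-z\-]+", query.lower())
    return "analytical" if any(w in ANALYTICAL_CUES for w in words) else "factual"


def compression_instruction(query: str) -> str:
    """Pick the extraction instruction to prepend to the compression prompt."""
    return INSTRUCTIONS[query_type(query)]
```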

Thin eval coverage. Compression changes model behavior in subtle ways. Run your eval suite against compressed vs uncompressed contexts. Look for regressions in specific question categories.
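A per-category comparison can be as simple as the sketch below. It assumes you already have graded results for each eval question under both conditions; the record shape, the `regressions_by_category` name, and the 2-point tolerance are all hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (category, correct_without_compression, correct_with_compression)
Record = Tuple[str, bool, bool]


def regressions_by_category(
    records: Iterable[Record], tolerance: float = 0.02
) -> Dict[str, float]:
    """Return categories whose accuracy drops by more than `tolerance` under compression."""
    totals = defaultdict(lambda: [0, 0, 0])  # n, correct uncompressed, correct compressed
    for category, base_ok, comp_ok in records:
        t = totals[category]
        t[0] += 1
        t[1] += int(base_ok)
        t[2] += int(comp_ok)
    drops = {}
    for category, (n, base, comp) in totals.items():
        delta = (comp - base) / n  # negative means compression hurt this category
        if delta < -tolerance:
            drops[category] = delta
    return drops
```

Breaking results out by category matters because an aggregate score can stay flat while one question type quietly regresses.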

When compression earns its complexity

High-volume systems. Compression at 1000 queries/day saves tokens, but the engineering effort doesn't pay back. At 100K queries/day, savings compound to meaningful dollars.

Long-context retrievers. If you're retrieving chunks of 2K+ tokens, the compression ROI is high. For short chunks (200-500 tokens), compression adds overhead without clear benefit.

Quality-sensitive systems. 'Lost in the middle' reduction alone can justify compression even when cost savings are modest.

Reranking (see reranking post) is a different approach to similar problems: selecting the most relevant chunks instead of compressing all of them. Compression can complement reranking: rerank to top 5, then compress each. Together they can cut context by 90% or more, usually without measurable quality loss; confirm that with your own eval suite.
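The rerank-then-compress combination can be sketched as a short pipeline. Both `score(query, chunk)` and `compress_chunk(query, chunk)` are hypothetical callables standing in for your reranker and compressor; nothing here is tied to a specific library.

```python
from typing import Callable, List


def rerank_then_compress(
    query: str,
    chunks: List[str],
    score: Callable[[str, str], float],
    compress_chunk: Callable[[str, str], str],
    top_k: int = 5,
) -> str:
    """Keep the top_k chunks by reranker score, compress each, and join the excerpts."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]
    excerpts = [compress_chunk(query, c) for c in ranked]
    # Skip chunks where the compressor returned nothing relevant.
    return "\n\n".join(e for e in excerpts if e)
```

Reranking first keeps the number of compression calls bounded by `top_k`, so the compressor's cost does not grow with the size of the initial retrieval.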

Read next
Prompt compression techniques that actually save tokens
Reranking models: the cross-encoder layer that transforms RAG
Six RAG patterns that actually work in production
Tags: contextual compression, RAG, cost optimization