Context windows kept growing — 32K, 128K, 200K, 1M, 2M. Each leap came with a wave of 'you don't need RAG anymore' takes. Each of those takes was wrong. Long context is genuinely useful, but it fails in predictable ways that the benchmark numbers hide. This post is how we actually use long context in production, including when we still reach for retrieval.
The middle-of-context problem
Published benchmarks for long-context models tend to use needle-in-a-haystack tests that are deliberately easy. Retrieval from real documents shows a well-documented 'lost in the middle' pattern: accuracy drops meaningfully for information in the middle third of the context window. This holds across model families and hasn't been fully solved even by frontier 2026 models.
The practical implication: you can't just stuff a 200K-token context window with documents and trust the model to retrieve what matters. You need to either (a) rank documents by relevance and put the most relevant at the boundaries, (b) use structural signals (headers, repetition, explicit reference) to anchor important content, or (c) still use retrieval to select 5-20 high-relevance chunks and leave the rest out.
When long context works well
Single-document deep analysis
Reading a 100-page contract, a 50-page research report, a long codebase file. The document is coherent; you want the model to have full context; chunking would lose cross-references. Long context wins here over chunked RAG, consistently.
Multi-turn conversations
Keeping an entire customer support conversation, code editing session, or tutoring session in context. Context grows during the interaction; retrieval from a previous turn is usually implicit. Long context is the right primitive.
Examples-heavy prompting
When few-shot examples dramatically improve quality, you often want many of them. A 50-example prompt in a 50K token context is a legitimate technique, and long context makes it affordable.
When long context fails
Multi-document synthesis
Comparing 20 documents, extracting common themes, finding contradictions. Naive long context stuffing underperforms RAG-then-synthesize because the middle-of-context problem hits the documents you care about. Mitigation: retrieval ranks; long context for the top-ranked subset.
High-volume bulk processing
Processing 10M documents where each document is independently small. Long context here is just expensive — pay for a smaller context window and parallelize.
Cost-sensitive production at scale
Long context is priced per token. A 100K-token prompt at $3 per million input tokens is $0.30 per call. At even modest volume (10K calls/day) that's $3K/day just in input cost. RAG with a well-tuned retriever often delivers equal quality at $0.02 per call.
The hybrid pattern we use
Most of our production systems use a hybrid: retrieval-first, long-context for the retrieved subset. Retrieve 20-50 candidates with hybrid search, rerank, take the top 5-10, pass them all at full fidelity into the model's context. This gets you the quality benefit of 'the model sees everything relevant' without the cost or middle-of-context issues of stuffing 100 documents in.
Prompt caching (Anthropic's prompt caching, OpenAI's caching) is a huge deal for long-context patterns where the context doesn't change between calls (a large system prompt, a shared knowledge base). Cache hit reduces input cost to ~10% of the uncached rate and TTFB by ~300ms. If your context is stable, use it aggressively.