Every RAG tutorial on the internet shows the same thing: split your docs into chunks, embed them, put them in a vector database, retrieve top-k on each query, feed the results to the LLM, return the answer. This works fine for a demo. It fails for production in ways that are specific, repeatable, and fixable once you know the patterns.

Over the last two years we've deployed RAG in production for dozens of clients — internal knowledge bases, customer support copilots, legal document QA, developer documentation assistants. Across those deployments, six patterns repeatedly make the difference between a RAG system people actually use and one that gets quietly abandoned. This post walks through each one: what it is, when to use it, when not to.

Pipeline

A production RAG pipeline we deploy for most clients: query rewrite → hybrid retrieval → rerank → parent-document fetch → LLM answer. Total latency 2-4s, retrieval quality consistently above 85%.

Pattern 1: Hybrid search (BM25 + vectors)

Pure vector search loses at exact matches. A user types 'error code E-5071' and the vector search retrieves documents that are semantically similar to error codes in general but miss the specific document that mentions E-5071 verbatim. This happens constantly in product documentation, legal contracts, code bases, and anywhere exact terminology matters.

The fix: combine BM25 (classical keyword search) with vector search using reciprocal rank fusion or a weighted sum. BM25 catches exact matches; vector search catches semantic matches; fusion gives you both. On our internal benchmarks, hybrid search outperforms pure vector search on 70% of queries at equal cost. See the dedicated hybrid search deep dive for tuning recommendations.
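As a rough illustration of the fusion step, here is a minimal sketch of reciprocal rank fusion over two ranked lists of document IDs. The document IDs and the function name are hypothetical. The constant k=60 is a commonly used default, not something tuned for any particular corpus.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=50):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in,
    with ranks starting at 1. Scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Hypothetical results for the query "error code E-5071".
bm25_hits = ["doc_e5071", "doc_install", "doc_errors_overview"]
vector_hits = ["doc_errors_overview", "doc_e5071", "doc_troubleshooting"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

The document that both retrievers rank highly ends up first, and documents found by only one retriever still survive into the candidate list. A weighted sum of normalized scores is the alternative; rank fusion avoids having to put BM25 and cosine scores on a common scale.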

Use hybrid whenever your corpus has meaningful exact-match needs: product codes, legal terminology, function names, customer IDs. Skip hybrid for pure semantic search tasks like 'find articles about this theme' where exact matches don't exist.

Pattern 2: Query rewriting

Users type short, ambiguous queries. 'Refund policy' could mean 'what is the refund policy,' 'how do I request a refund,' or 'when did we last update the refund policy.' The embedding of such a short query carries little signal about intent, so it matches poorly against documents that state the policy in full. The fix: before retrieval, use an LLM to rewrite the query into a more complete form.

Implementation: a cheap model (GPT-4o-mini, Claude Haiku) takes the conversation context and the user query, and returns an expanded search query. 'Refund policy' after context might become 'current company refund policy for enterprise customers including eligibility windows and approval process.' This expanded query retrieves much more reliably.
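A minimal sketch of the rewrite step, assuming the model is reached through a plain `complete(prompt) -> str` callable. That callable, the prompt wording, and the function names are all hypothetical; swap in whichever client your provider offers.

```python
from typing import Callable, Sequence

# Hypothetical instruction text; tune the wording for your own domain.
REWRITE_INSTRUCTIONS = (
    "Rewrite the user's latest message as one self-contained search query. "
    "Resolve pronouns and add specifics from the conversation. "
    "Return only the rewritten query."
)


def build_rewrite_prompt(history: Sequence[str], query: str) -> str:
    """Assemble the prompt from prior turns and the latest user message."""
    conversation = "\n".join(history) if history else "(no prior turns)"
    return (
        f"{REWRITE_INSTRUCTIONS}\n\n"
        f"Conversation so far:\n{conversation}\n\n"
        f"Latest message: {query}\n"
        "Rewritten query:"
    )


def rewrite_query(
    history: Sequence[str], query: str, complete: Callable[[str], str]
) -> str:
    """Return the expanded query, or the original if the model returns nothing."""
    rewritten = complete(build_rewrite_prompt(history, query)).strip()
    return rewritten or query


# Stand-in for a real model call, used here only to show the data flow.
def fake_complete(prompt: str) -> str:
    return " current company refund policy for enterprise customers \n"


expanded = rewrite_query(
    ["user: we are on the enterprise plan"], "Refund policy", fake_complete
)
```

Falling back to the original query on an empty response keeps retrieval working when the rewrite call fails or times out.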

Cost: one small LLM call per query. Latency: 100-300ms added. Quality improvement on our benchmarks: 15-25 percentage points on recall. Worth it for any system with significant multi-turn conversation or short queries. Skip for systems where users naturally write full-sentence queries (advanced search interfaces, internal research tools).

Pattern 3: Reranking

Vector search returns top-k by embedding similarity. Embedding similarity is a coarse signal — it correlates with relevance but isn't relevance itself. Rerankers are models specifically trained to score (query, passage) pairs for relevance. Running the top 50 candidates from vector search through a reranker and keeping the top 5 produces dramatically better results than top-5 from vector alone.

Models to use: Cohere Rerank is the commercial standard. Open-source options include BGE-reranker-v2-m3 and Jina Reranker. All three work well; Cohere has better multilingual support, BGE is free and self-hostable.

Cost: a few cents per 1000 queries for commercial rerankers; GPU cost for self-hosted. Latency: 100-400ms. Quality improvement: 20-40 percentage points on precision@5. This is one of the highest-ROI additions to any RAG system.
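The retrieve-50, keep-5 flow described above can be sketched as follows. The scorer is passed in as a callable, since the real one would be a hosted reranker or a self-hosted cross-encoder. The `overlap_score` function here is only a toy stand-in so the example runs without a model, and all names are hypothetical.

```python
from typing import Callable, List, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    keep: int = 5,
) -> List[str]:
    """Score every (query, passage) pair and keep the best `keep` passages."""
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    # Sort on the score only; Python's sort is stable, so passages with equal
    # scores keep the order the first-stage retriever gave them.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words present in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


candidates = [
    "Billing and invoices overview",
    "How to reset your password by email",
    "Password complexity requirements",
    "Email notification settings",
]
top = rerank("reset password email", candidates, overlap_score, keep=2)
```

In a real deployment `score_pair` would usually be batched, since scoring 50 pairs one call at a time is where most of the added latency comes from.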

Pattern 4: Parent-document retrieval

Small chunks embed well and retrieve precisely. Small chunks also give the LLM too little context to answer well. Parent-document retrieval resolves this tension: embed and search over small chunks (say, 200 tokens), but when a chunk matches, retrieve the larger parent document or section (say, 2000 tokens) to feed to the LLM.

Concrete example: for legal contracts, embed each clause as a chunk. When retrieval matches a clause, fetch the full section (intro + all clauses + cross-references) and feed that to the LLM. This gives precise retrieval plus rich context — you get the match granularity of small chunks with the context completeness of large chunks.

Implementation: store a parent_id on each chunk pointing to its larger section. After retrieval, deduplicate by parent_id and fetch the parents. More on chunking strategy in our chunking patterns post.
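A minimal sketch of the deduplicate-then-fetch step, assuming parents live in a simple key-value store. The `Chunk` fields other than `parent_id`, the `parent_store` dictionary, and the `max_parents` cap are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str  # points at the larger section this chunk was cut from
    text: str


def fetch_parents(
    ranked_chunks: Sequence[Chunk],
    parent_store: Dict[str, str],
    max_parents: int = 5,
) -> List[str]:
    """Return unique parent sections in the order their best chunk ranked."""
    seen = set()
    parents = []
    for chunk in ranked_chunks:
        if chunk.parent_id in seen:
            continue  # a higher-ranked chunk already pulled in this parent
        seen.add(chunk.parent_id)
        parents.append(parent_store[chunk.parent_id])
        if len(parents) == max_parents:
            break
    return parents


# Hypothetical contract data: two clauses share one parent section.
parent_store = {
    "sec-7": "Section 7 Termination: intro, clauses 7.1-7.4, cross-references",
    "sec-9": "Section 9 Liability: intro, clauses 9.1-9.3",
}
ranked = [
    Chunk("c-7.2", "sec-7", "Either party may terminate with 30 days notice."),
    Chunk("c-7.3", "sec-7", "Termination for cause takes effect immediately."),
    Chunk("c-9.1", "sec-9", "Liability is capped at fees paid in 12 months."),
]
context_docs = fetch_parents(ranked, parent_store)
```

Deduplicating before the fetch matters for the token budget: without it, two matching clauses from the same section would send that section to the LLM twice.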

Pattern 5: Metadata filtering (pre or post-retrieval)

A user asks about 'the 2024 revenue report.' The corpus has reports from 2019-2025. Pure semantic similarity will retrieve all of them — they're all semantically similar to each other — and the LLM will get confused. The fix: extract structured filters from the query (year = 2024) and apply them as a filter on the vector search.

Two implementation modes: pre-filter applies the metadata constraint before semantic search, narrowing the candidate set. Post-filter runs semantic search over everything, then filters the results. Pre-filter is more accurate, because post-filtering can discard most of the top-k and leave too few results, but it requires your vector DB to support efficient metadata filtering (Pinecone, Qdrant, and Weaviate all do; many others don't). Post-filter works anywhere but wastes retrieval bandwidth.

The filter extraction itself is an LLM call: given the user query, extract structured metadata filters. This is cheap, fast, and transformative for any corpus with meaningful structure (time, author, product, category, tenant).
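The difference between the two filtering modes is easiest to see on a toy corpus. The sketch below uses brute-force dot products over an in-memory list, which is an assumption made for illustration only; a real vector database applies the filter inside its index. The document IDs and vectors are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matches(doc, filters):
    """True when every filter key equals the document's metadata value."""
    return all(doc["metadata"].get(key) == value for key, value in filters.items())


def top_k(docs, query_vec, k):
    return sorted(docs, key=lambda d: dot(d["vec"], query_vec), reverse=True)[:k]


def pre_filter_search(docs, query_vec, filters, k):
    # Narrow the candidate set first, then rank only what is left.
    return top_k([d for d in docs if matches(d, filters)], query_vec, k)


def post_filter_search(docs, query_vec, filters, k):
    # Rank everything, then drop results that fail the filter.
    return [d for d in top_k(docs, query_vec, k) if matches(d, filters)]


docs = [
    {"id": "rev-2022", "metadata": {"year": 2022}, "vec": (0.90, 0.10)},
    {"id": "rev-2023", "metadata": {"year": 2023}, "vec": (0.95, 0.05)},
    {"id": "rev-2024", "metadata": {"year": 2024}, "vec": (0.80, 0.20)},
    {"id": "hr-2024", "metadata": {"year": 2024}, "vec": (0.10, 0.90)},
]
query_vec = (1.0, 0.0)  # stands in for the embedding of "revenue report"
filters = {"year": 2024}  # as extracted from "the 2024 revenue report"

pre = [d["id"] for d in pre_filter_search(docs, query_vec, filters, k=2)]
post = [d["id"] for d in post_filter_search(docs, query_vec, filters, k=2)]
```

With k=2, the post-filter run ranks the 2023 and 2022 reports highest and then discards both, returning nothing, while the pre-filter run returns the 2024 report first. Over-fetching (a larger k before filtering) reduces this effect for post-filtering but does not remove it.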

Pattern 6: Query routing (multi-index)

Different queries need different corpora. A user of a customer support copilot might ask about product features (docs corpus), pricing (policy corpus), or an outage (status corpus). Running every query against every corpus wastes retrieval bandwidth and dilutes results. The pattern: classify the query into a corpus, then retrieve only from that corpus.

Implementation: route each query with a cheap classification step. Simple routing rules (regex, keywords) work for 70% of queries; a small LLM classifier call handles the rest. Quality improvement: 10-30 percentage points on recall for systems with heterogeneous corpora.
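A minimal sketch of rules-first routing with a classifier fallback. The corpus names follow the support-copilot example above; the regex patterns, the `classify` callable, and the default corpus are hypothetical and would need tuning against real traffic.

```python
import re
from typing import Callable

CORPORA = {"docs", "policy", "status"}

# Hypothetical keyword rules, checked in order before any model call.
ROUTING_RULES = [
    (re.compile(r"\b(pricing|price|refund|invoice|billing)\b", re.I), "policy"),
    (re.compile(r"\b(outage|incident|downtime)\b", re.I), "status"),
]


def route_query(
    query: str, classify: Callable[[str], str], default: str = "docs"
) -> str:
    """Pick a corpus by rule when possible, else ask the classifier."""
    for pattern, corpus in ROUTING_RULES:
        if pattern.search(query):
            return corpus  # cheap path: no model call needed
    label = classify(query).strip().lower()
    # Guard against the classifier returning a label we do not index.
    return label if label in CORPORA else default


def fake_classify(query: str) -> str:
    """Stand-in for a small LLM classifier call."""
    return " Docs "


route_pricing = route_query("What is the pricing for the team plan?", fake_classify)
route_sso = route_query("How do I configure SSO?", fake_classify)
```

Validating the classifier output against the known corpus names keeps a malformed model response from turning into a retrieval error.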

This is the foundation of agent-style RAG, where the LLM itself decides which retrieval tool to call. See our agents in production post for the full pattern.

Putting it all together

A production RAG pipeline we've deployed for a large SaaS client combines four of the six patterns: query rewriting, hybrid search, reranking, parent-document retrieval. The other two (metadata filtering, query routing) are used on a subset of queries. The pipeline looks like this:

User query arrives. LLM rewrites for retrieval (100ms).
Rewritten query runs hybrid search: BM25 and vector against the current index in parallel (50ms).
Top 50 candidates combined via reciprocal rank fusion.
Reranker scores all 50 candidates (200ms).
Top 5 chunks kept after reranking; their parent documents fetched (20ms).
Parent documents + original query fed to the answer LLM (1.5-3s).
Response streamed to user.

Total latency: 2-4 seconds. Retrieval quality: consistently above 85% on our eval set. Cost: roughly 2-3x a naive top-k RAG, dominated by reranking and query rewriting LLM calls. Worth every cent.
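As a quick sanity check on the latency figure, the per-stage numbers from the steps above can be summed. This sketch assumes the stages run one after another and that the fusion step, which has no stated latency, is negligible.

```python
# (low, high) latency per stage in milliseconds, taken from the steps above.
STAGE_LATENCY_MS = {
    "query_rewrite": (100, 100),
    "hybrid_search": (50, 50),
    "rerank": (200, 200),
    "parent_fetch": (20, 20),
    "answer_llm": (1500, 3000),
}

total_low = sum(low for low, _ in STAGE_LATENCY_MS.values())
total_high = sum(high for _, high in STAGE_LATENCY_MS.values())

# Everything that is not the final answer generation.
retrieval_overhead = total_low - STAGE_LATENCY_MS["answer_llm"][0]

print(f"end to end: {total_low}-{total_high} ms")
print(f"retrieval stages: {retrieval_overhead} ms")
```

The stages sum to roughly 1.9 to 3.4 seconds, in line with the 2-4 second figure once network overhead is allowed for. The retrieval stages together account for under 400ms; the answer LLM dominates.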

When to use which pattern

If you can only ship one: reranking. If two: reranking + hybrid search. If three: add query rewriting. Past three: parent-document and metadata filtering. Query routing is only worth it when you have genuinely heterogeneous corpora.

Anti-patterns to avoid

Top-k = 20 by default. More retrieved chunks usually hurt, not help. The LLM gets confused by noise. Start at 5 and tune upward only with evidence.
Re-indexing the whole corpus whenever a document changes. Incremental indexing is a solved problem; solve it.
Shipping without evals. Without measurement, every change is a guess. See our eval infrastructure post.
One-size-fits-all chunking. Code needs different chunking than prose. Structured docs need different chunking than free text.
Ignoring metadata. If your docs have useful metadata (timestamps, authors, categories), filtering on it can lift retrieval quality substantially for modest effort.
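On the "shipping without evals" item above: a retrieval eval does not need much machinery to start. Here is a minimal sketch of precision@k and recall@k over a small labeled set. The eval queries, document IDs, and the `retrieve` callable are hypothetical placeholders for your own data and retriever.

```python
from typing import Callable, List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled with relevant documents."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in set(retrieved[:k]) if doc_id in relevant)
    return hits / len(relevant)


def mean_recall(
    eval_set: Sequence[Tuple[str, Set[str]]],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Average recall@k over (query, relevant_ids) pairs."""
    if not eval_set:
        return 0.0
    return sum(recall_at_k(retrieve(q), rel, k) for q, rel in eval_set) / len(eval_set)


# Hypothetical labeled queries and a canned retriever standing in for the real one.
eval_set = [
    ("error code E-5071", {"doc_e5071"}),
    ("refund policy", {"doc_refunds", "doc_enterprise_terms"}),
]
canned_results = {
    "error code E-5071": ["doc_errors_overview", "doc_e5071"],
    "refund policy": ["doc_refunds", "doc_pricing"],
}
score = mean_recall(eval_set, lambda q: canned_results[q], k=5)
```

Running the same eval set before and after each pattern is added is what turns "this feels better" into a measured contribution per pattern.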

What we see most in client engagements

The most common RAG failure we're called in to fix: teams built naive top-k RAG, shipped it, saw 60% retrieval quality on real queries, concluded 'RAG doesn't work,' and started asking about fine-tuning. Fine-tuning does not fix bad retrieval. Adding the six patterns above, in order, takes retrieval quality from 60% to 85%+ in almost every case. That's the intervention that pays off.

Before considering fine-tuning, read our post on when to fine-tune. In 80% of cases, the answer is 'after you've applied these six patterns, not before.'

RAG is 80% retrieval engineering and 20% prompting. Teams invert that ratio and wonder why their RAG is bad.

Closing

These six patterns compose. Each adds 10-25 points of retrieval quality at modest cost. Stack them thoughtfully for your corpus shape. Pair with solid evaluation infrastructure so you can measure each pattern's contribution. For engagements where we build production RAG — see our BrightStack case study for a full example — this stack is the default starting point.

Six RAG patterns that actually work in production
