Multi-hop retrieval handles questions whose answers require connecting information across multiple retrievals. 'Which CEOs of companies I invested in last year have since left?' is not answerable from one retrieval: you first need the list of investments, then who ran each company at the time, then whether that person is still in the role. Single-hop RAG fails here; multi-hop RAG iterates. This post covers the specific patterns that make multi-hop reliable and why naive implementations fail.

The main patterns are query decomposition, iterative retrieval (self-ask), graph traversal, and chain-of-thought with retrieval. Each approach trades complexity for reliability on specific question types.

Why single-hop fails

Single-hop RAG: embed the question, retrieve k chunks, pass to LLM. Works when the answer is in 1-2 chunks.

Multi-hop questions are different. The relevant information is distributed: chunk A contains part 1, chunk B (retrievable only after knowing part 1) contains part 2. Single-hop retrieval returns chunks similar to the question, not chunks containing intermediate facts.

Typical failure mode: retrieval returns only partially relevant chunks, and the LLM hallucinates to fill the gaps. It looks like a hallucination problem; it is actually a retrieval problem.

Multi-hop patterns

Query decomposition. LLM rewrites the complex question into multiple simpler sub-questions. Retrieve for each; aggregate context; LLM synthesizes. Works well when decomposition is reliable. See query rewriting post.
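A minimal sketch of the decompose, retrieve, synthesize flow. The three callables (`decompose`, `retrieve`, `synthesize`) are hypothetical stand-ins for your own LLM and retriever calls, not functions from any particular library.

```python
from typing import Callable, List

def answer_by_decomposition(
    question: str,
    decompose: Callable[[str], List[str]],        # hypothetical: LLM call returning sub-questions
    retrieve: Callable[[str], List[str]],         # hypothetical: retriever returning text chunks
    synthesize: Callable[[str, List[str]], str],  # hypothetical: LLM call producing the final answer
) -> str:
    """Retrieve once per sub-question, merge the chunks, then synthesize."""
    context: List[str] = []
    for sub_question in decompose(question):
        for chunk in retrieve(sub_question):
            if chunk not in context:  # drop chunks already returned by another sub-question
                context.append(chunk)
    return synthesize(question, context)
```

Note that this version treats sub-questions as independent. If the second sub-question can only be written once the first is answered, the iterative pattern is the better fit.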

Iterative retrieval (self-ask). LLM asks itself: 'what do I need to know?' Retrieves. Then: 'given that, what else?' Retrieves again. Continues until LLM signals it has enough. Powerful but expensive (multiple retrievals + LLM calls per question).
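The loop can be sketched roughly as follows. `decide` and `retrieve` are hypothetical callables: `decide` wraps the LLM and returns either an answer or the next thing to look up, and `max_hops` is an assumed safety cap rather than a recommended value.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical contract: decide(question, context) returns
#   (answer, None)      when the model can answer, or
#   (None, next_query)  when it needs another retrieval.
Decide = Callable[[str, List[str]], Tuple[Optional[str], Optional[str]]]

def iterative_answer(
    question: str,
    decide: Decide,
    retrieve: Callable[[str], List[str]],
    max_hops: int = 4,
) -> Tuple[Optional[str], int]:
    """Return (answer, retrievals_used); answer is None if the cap was hit."""
    context: List[str] = []
    for hop in range(max_hops):
        answer, next_query = decide(question, context)
        if answer is not None:
            return answer, hop
        context.extend(retrieve(next_query))
    return None, max_hops
```

Each pass through the loop costs one LLM call plus one retrieval, which is where the expense comes from.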

Graph traversal. If you have a knowledge graph, follow edges. Works cleanly for relational questions. See knowledge graph RAG post.
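As an illustration only, here is edge-following over a toy in-memory graph. The dictionary layout and the relation names in the usage note are assumptions for the example; a real knowledge graph would sit behind a graph store and its query language.

```python
from typing import Dict, List, Tuple

# Toy representation: (node, relation) -> list of target nodes.
Graph = Dict[Tuple[str, str], List[str]]

def follow(graph: Graph, start: List[str], relations: List[str]) -> List[str]:
    """Follow a fixed path of relations from the start nodes, one hop per relation."""
    frontier = list(start)
    for relation in relations:
        next_frontier: List[str] = []
        for node in frontier:
            for target in graph.get((node, relation), []):
                if target not in next_frontier:  # keep order, skip repeats
                    next_frontier.append(target)
        frontier = next_frontier
    return frontier
```

With made-up edges such as `("me", "invested_in")` and `("Acme", "current_ceo")`, calling `follow(graph, ["me"], ["invested_in", "current_ceo"])` walks both hops without any embedding lookup.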

Chain-of-thought with retrieval. Prompt LLM to reason step by step; after each step, retrieve relevant context; feed back in. Variant of self-ask with more explicit reasoning.

Implementation details

Bounded iteration. Cap at 3-5 hops. Unbounded loops produce unreliable results and runaway cost. If a question can't be answered within the cap, return a partial answer with an explanation of what is still missing.

Context accumulation. Each hop adds to the LLM's context. Be careful of context window pressure. Summarize or prune earlier hops when context gets too long.
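One simple pruning policy, sketched under assumptions: keep the most recent hops that fit a budget and drop older ones. Character count is used here as a crude proxy for tokens, and `max_chars` is a hypothetical knob; summarizing the dropped hops instead of discarding them is the heavier alternative.

```python
from typing import List

def prune_context(hops: List[str], max_chars: int) -> List[str]:
    """Keep the newest hop texts whose combined length fits within max_chars."""
    kept: List[str] = []
    used = 0
    for text in reversed(hops):  # walk from newest to oldest
        if used + len(text) > max_chars:
            break
        kept.append(text)
        used += len(text)
    kept.reverse()  # restore chronological order
    return kept
```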

Signal to stop. LLM must recognize when it has enough information. Explicit prompt: 'if you can answer the question, say DONE; otherwise indicate what you still need.' Without this, loops continue accumulating irrelevant retrievals.
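A small parser for that prompt convention might look like this. It assumes the model is told to begin its reply with the literal word DONE when it can answer; the exact reply format is an assumption you would pin down in your own prompt.

```python
from typing import Tuple

def parse_stop_signal(reply: str) -> Tuple[bool, str]:
    """Return (is_done, payload): the answer if done, otherwise the stated need."""
    text = reply.strip()
    if text.upper().startswith("DONE"):
        # Strip the marker and any separator that follows it.
        return True, text[4:].lstrip(" :\n")
    return False, text
```

When `is_done` is false, the payload is what gets turned into the next retrieval query.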

Parallelization. Independent sub-questions can be retrieved in parallel. 'CEO of company X and company Y' is two retrievals that can happen simultaneously. Reduces latency significantly for decomposable questions.
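A sketch of fanning out independent sub-questions with the standard library's thread pool. `retrieve` is a hypothetical I/O-bound retriever call, and `max_workers=8` is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def retrieve_parallel(
    sub_questions: List[str],
    retrieve: Callable[[str], List[str]],  # hypothetical retriever call
    max_workers: int = 8,
) -> Dict[str, List[str]]:
    """Run one retrieval per sub-question concurrently; map question -> chunks."""
    if not sub_questions:
        return {}
    workers = min(max_workers, len(sub_questions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(retrieve, sub_questions))  # map preserves input order
    return dict(zip(sub_questions, results))
```

Threads suit this case because retrieval is usually dominated by network or disk wait rather than CPU.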

Quality considerations

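Error compounds. Each hop has some probability of retrieving the wrong chunks, and every later hop builds on that mistake. Over 3-4 hops, the compounded error is significant. Assuming errors are independent, a system that's 90% reliable per hop is only about 59% reliable over 5 hops (0.9^5 ≈ 0.59).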
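The arithmetic is simple enough to check directly. This sketch assumes each hop succeeds independently with the same probability, which is a simplification: real hops are correlated, since a bad early hop tends to poison later ones.

```python
def chain_reliability(per_hop: float, hops: int) -> float:
    """Probability that every hop succeeds, assuming independent hops."""
    return per_hop ** hops
```

At 90% per hop this gives 0.729 for 3 hops, 0.6561 for 4, and 0.59049 for 5.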

Mitigation: better retrieval per hop. Hybrid retrieval (see hybrid search post), reranking (see reranking post), and context-specific query reformulation all help.

Confidence scores. Track LLM confidence per hop. Low confidence suggests the retrieval failed; trigger alternative retrieval strategy or give up gracefully.
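One way to wire this up, as a sketch. `primary`, `fallback`, and `score` are hypothetical callables, and how `score` is produced (a reranker score, an LLM self-rating, something else) is left open; the 0.5 threshold is a placeholder to tune, not a recommendation.

```python
from typing import Callable, List, Tuple

def retrieve_with_fallback(
    query: str,
    primary: Callable[[str], List[str]],       # hypothetical main retriever
    fallback: Callable[[str], List[str]],      # hypothetical alternative strategy
    score: Callable[[str, List[str]], float],  # hypothetical confidence in [0, 1]
    threshold: float = 0.5,                    # placeholder value, tune on your data
) -> Tuple[List[str], str]:
    """Try the primary retriever, then the fallback, then give up gracefully."""
    chunks = primary(query)
    if score(query, chunks) >= threshold:
        return chunks, "primary"
    chunks = fallback(query)
    if score(query, chunks) >= threshold:
        return chunks, "fallback"
    return [], "gave_up"
```

Returning the strategy label alongside the chunks makes it easy to log how often each path is taken.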

When to invest in multi-hop

Question distribution analysis. Sample 100+ real user questions. Categorize: single-hop, multi-hop, ambiguous. If multi-hop is >20% of traffic, invest. If <5%, don't bother.
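Once the sample is labeled, the decision rule is a few lines. The label strings are assumptions for the example, and the "judgment call" band between 5% and 20% is a gap the thresholds above leave open rather than a rule of its own.

```python
from collections import Counter
from typing import List

def multi_hop_share(labels: List[str]) -> float:
    """Fraction of sampled questions labeled 'multi-hop'."""
    if not labels:
        raise ValueError("need at least one labeled question")
    return Counter(labels)["multi-hop"] / len(labels)

def recommendation(share: float) -> str:
    """Apply the >20% / <5% thresholds; anything between is left to judgment."""
    if share > 0.20:
        return "invest"
    if share < 0.05:
        return "skip"
    return "judgment call"
```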

Domain characteristics. Legal, medical, investigative domains have high multi-hop incidence. Customer support, basic Q&A tend to be single-hop.

Business impact. Multi-hop questions often have higher user value. A user asking a single-hop factoid gets value from plain RAG; a user asking a multi-hop analytical question is likely a higher-value user. Invest accordingly.

Evaluating multi-hop systems

Standard eval sets don't capture multi-hop well. Purpose-built multi-hop datasets such as HotpotQA and 2WikiMultiHopQA are publicly available. Supplement them with your own set built from sampled user questions.

Measure at each hop: retrieval precision (did we get relevant chunks?) and synthesis accuracy (did we use them correctly?). Measure final answer correctness end to end. Debugging is easier with per-hop instrumentation. See eval infrastructure post.
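Per-hop retrieval precision can be computed from logged chunk IDs, as in this sketch. The record shape (`retrieved` and `relevant` keys per hop) is a hypothetical logging format, and the relevant IDs are assumed to come from your labeled eval set.

```python
from typing import Dict, Iterable, List

def hop_precision(retrieved_ids: List[str], relevant_ids: Iterable[str]) -> float:
    """Share of retrieved chunks that are labeled relevant for this hop."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)

def per_hop_report(hops: List[Dict[str, List[str]]]) -> List[float]:
    """One precision value per hop, in hop order."""
    return [hop_precision(h["retrieved"], h["relevant"]) for h in hops]
```

A sharp drop in precision at a particular hop points to where the chain started going wrong.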

