eazyware
Engineering·July 8, 2024·9 min read

HyDE and query expansion: hypothetical documents for retrieval

HyDE generates a hypothetical answer and embeds it for retrieval. Counter-intuitive but effective for vocabulary-mismatch problems.

Kushal R.
Engineering lead

HyDE (Hypothetical Document Embeddings) is one of the most effective RAG techniques published in recent years. The idea is simple: instead of embedding the user's question, have an LLM write a hypothetical answer, then embed that. Documents match documents better than questions match documents. The technique often improves retrieval quality by 10-20% at minor engineering cost. This post covers the mechanics, the variants, and when HyDE is the right tool.

[Figure: HyDE mechanics. The user query ("how do I cancel?", short and vague) goes to an LLM, which generates a hypothetical answer ("To cancel, go to..."). That answer is embedded and used to search the real docs, because a fake answer sits closer to real docs than the question does. Why it works: documents and questions live in different parts of embedding space, and HyDE bridges them. HyDE wins on short queries, vocabulary mismatch, and technical docs with specific terminology. HyDE hurts on domain-specific queries where the LLM's hypothesis is off-base and adds noise.]
Caption: Traditional retrieval sends embed(query) to the vector DB. With HyDE, the LLM generates a hypothetical answer and embed(hypothetical) goes to the vector DB, which aligns better with the document embedding space.

The core idea

Embedding models are trained so that similar texts land close together. A user question and the document that answers it are semantically similar but stylistically very different. A question has interrogative structure; a document has declarative structure. Same topic, different linguistic form.

HyDE exploits this. The LLM generates a plausible answer document based only on the query. This hypothetical answer has the declarative form of a real document, so its embedding lands closer to real documents in embedding space.

Result: retrieval brings back more relevant real documents. The hypothetical answer is often wrong in details (LLM hallucinates without context), but that's fine — we don't show it to the user. We only use its embedding to retrieve real, correct documents.

Implementation

The implementation is simple: make one LLM call with a prompt like 'Given the question, write a short paragraph that would be a reasonable answer. Don't worry about perfect accuracy; give your best guess.' Use a fast, small model.

Embed the output and use that embedding for vector search. The rest of the retrieval pipeline is unchanged.
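The flow described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_llm` is a hypothetical stub for the fast small model, and the in-memory list stands in for a vector DB. All of these names are assumptions made for the example.

```python
import math
import re
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

HYDE_PROMPT = (
    "Given the question, write a short paragraph that would be a reasonable answer. "
    "Don't worry about perfect accuracy; give your best guess.\n\nQuestion: {question}"
)

def hyde_search(query: str, docs: list[str], generate: Callable[[str], str], k: int = 3) -> list[str]:
    """Retrieve by the embedding of a hypothetical answer instead of the query."""
    hypothetical = generate(HYDE_PROMPT.format(question=query))  # never shown to the user
    h_vec = embed(hypothetical)
    ranked = sorted(docs, key=lambda d: cosine(h_vec, embed(d)), reverse=True)
    return ranked[:k]

def fake_llm(prompt: str) -> str:
    # Hypothetical stub; swap in a call to a real fast, small model.
    return "To cancel your subscription, open account settings, choose billing, and select cancel plan."

DOCS = [
    "Subscription billing: open account settings, choose billing, then select cancel plan to end your subscription.",
    "Our office is closed on public holidays.",
    "API rate limits are 100 requests per minute per key.",
]

top = hyde_search("how do I cancel?", DOCS, fake_llm, k=1)
print(top[0])
```

Only the vector passed to the search changes; ranking, filtering, and everything downstream stay as they were.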

Latency: the LLM call adds roughly 100-300 ms. Cost: typically around $0.0001 per query. That is a very small price for the retrieval quality improvement.
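To sanity-check the cost figure against your own setup, the arithmetic is just tokens times price. The token counts and per-million-token prices below are illustrative assumptions, not quotes from any provider; plug in your model's actual numbers.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """USD cost of one HyDE generation call, given per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000

# Assumed values for illustration only: a ~60-token prompt, a ~150-token
# hypothetical answer, and hypothetical small-model prices.
cost = cost_per_query(input_tokens=60, output_tokens=150,
                      usd_per_m_input=0.15, usd_per_m_output=0.60)
monthly = cost * 1_000_000  # at an assumed one million queries per month
print(f"${cost:.6f} per query, ${monthly:.2f} per million queries")
```

Under these assumptions the call lands near the $0.0001 figure; output tokens dominate, so capping the length of the hypothetical is the main cost lever.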

HyDE variants

Multi-HyDE. Generate several hypothetical answers with different angles or phrasings; embed each; query with each; merge results. Recall increases at the cost of more LLM calls. Useful for questions with multiple valid angles.
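One way to do the merge step is reciprocal rank fusion (RRF), sketched below. RRF is a choice made for this example, not something the technique requires, and `k=60` is simply a commonly used constant. `generate_variants` and `retrieve` are hypothetical callables standing in for your LLM and your vector search.

```python
from collections import defaultdict
from typing import Callable

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc ids; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

def multi_hyde(query: str,
               generate_variants: Callable[[str, int], list[str]],
               retrieve: Callable[[str], list[str]],
               n: int = 3, top_k: int = 5) -> list[str]:
    """Generate n hypothetical answers, retrieve with each, and fuse the rankings."""
    hypotheticals = generate_variants(query, n)
    rankings = [retrieve(h) for h in hypotheticals]
    return reciprocal_rank_fusion(rankings)[:top_k]

# Stubs for illustration: canned rankings keyed by hypothetical text.
canned = {"h1": ["a", "b", "c"], "h2": ["b", "a", "d"], "h3": ["b", "c", "a"]}
fused = multi_hyde("q", lambda q, n: ["h1", "h2", "h3"][:n], canned.__getitem__, n=3, top_k=3)
print(fused)
```

Documents that rank well across several hypotheticals rise to the top, which damps the effect of any single off-base hypothetical.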

HyDE + expansion. Generate one hypothetical answer; embed it; also run expansion-based retrieval on the original query. Merge results. Combines strengths of both techniques.

HyDE with structure. For domains where documents have specific structure (code, legal filings, medical records), prompt the LLM to match that structure in the hypothetical. The embedding lands even closer to real docs.
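A structured variant can be as small as a per-domain prompt hint. The sketch below is an assumption about how one might organize it; the domain keys and hint wording are hypothetical and should be tuned to what your documents actually look like.

```python
# Hypothetical per-domain hints; adjust to mirror the shape of your real documents.
STRUCTURE_HINTS = {
    "code": "Answer as a short, commented code snippet in the style of API documentation.",
    "legal": "Answer as a numbered clause in the formal style of a legal filing.",
    "medical": "Answer as a brief clinical note with headed sections.",
}
DEFAULT_HINT = "Answer as a short declarative paragraph."

def build_hyde_prompt(question: str, domain: str = "") -> str:
    """Build a HyDE prompt that asks the LLM to imitate the target document structure."""
    hint = STRUCTURE_HINTS.get(domain, DEFAULT_HINT)
    return f"Write a plausible passage that answers the question. {hint}\n\nQuestion: {question}"

print(build_hyde_prompt("What is the notice period for termination?", "legal"))
```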

Conditional HyDE. Use HyDE only for queries that a router predicts will benefit, and skip it for queries the system already handles well. This saves LLM calls.
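The router does not need to be a model. A rough heuristic sketch, assuming that terse queries and question-shaped queries benefit while document-like phrases do not: the word list and token threshold here are illustrative guesses, and a real router should be validated against your eval set.

```python
import re

# Assumed list of leading words that mark a query as question-shaped.
QUESTION_WORDS = {"how", "what", "why", "when", "where", "who", "which",
                  "can", "does", "do", "is", "are"}

def should_use_hyde(query: str, max_terse_tokens: int = 2) -> bool:
    """Heuristic router: send terse or question-shaped queries through HyDE."""
    tokens = re.findall(r"[a-z0-9']+", query.lower())
    if not tokens:
        return False
    if len(tokens) <= max_terse_tokens:
        return True  # very little signal in the raw query embedding
    return query.strip().endswith("?") or tokens[0] in QUESTION_WORDS

for q in ["cancel", "how do I cancel?", "Refund policy for enterprise customers"]:
    print(q, "->", should_use_hyde(q))
```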

When HyDE helps most

The gap between question form and document form is large. Short user questions against long technical documents is the classic case. HyDE bridges the gap.

Domain-specific vocabulary. If documents use domain jargon users don't (medical codes, legal terminology), the LLM's hypothetical often includes the jargon, improving retrieval.

Terse query patterns. Users who enter one- or two-word queries benefit most from HyDE because the original embedding has very little signal to work with.

When HyDE doesn't help

When the query is already document-like. 'Refund policy for enterprise customers' matches documents well; HyDE adds latency without improvement.

When retrieval quality is already high. If recall@5 is above 0.95, HyDE has little room to help; optimize elsewhere.

When users prefer deterministic retrieval. The hypothetical answer introduces LLM variance into retrieval. For use cases requiring reproducibility (legal, compliance), HyDE can complicate audit trails.

Pitfalls

Not caching the hypothetical. The same query should produce the same hypothetical (or at least the same embedding), so cache the hypothetical keyed by the normalized query. See the caching patterns post.
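A minimal sketch of caching by normalized query follows. The normalization rules (lowercasing, collapsing whitespace, dropping trailing punctuation) are assumptions chosen for illustration, and the in-process dict would typically be replaced by a shared cache in production.

```python
import re
from typing import Callable

def normalize_query(query: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation (assumed rules)."""
    return re.sub(r"\s+", " ", query.strip().lower()).rstrip("?!. ")

class HypotheticalCache:
    """In-memory cache of hypothetical answers keyed by normalized query."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate
        self._store: dict[str, str] = {}
        self.misses = 0

    def get(self, query: str) -> str:
        key = normalize_query(query)
        if key not in self._store:
            self.misses += 1  # only a miss triggers an LLM call
            self._store[key] = self._generate(query)
        return self._store[key]

# Hypothetical stub generator standing in for the LLM call.
cache = HypotheticalCache(lambda q: f"hypothetical for: {q}")
a = cache.get("How do I cancel?")
b = cache.get("  how do I   CANCEL ")
print(a == b, cache.misses)
```

Caching the hypothetical also pins down retrieval for repeated queries, which helps with the reproducibility concern raised earlier.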

Relying on correctness. Never show the hypothetical to users. It's LLM hallucination by design. It's only an embedding-space landmark.

Eval rigor. Measure with and without HyDE on your eval set before rolling out broadly. Most teams see improvement; some see regression in specific domains. Don't assume. See eval post.
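A with-and-without comparison can be as simple as mean recall@k over a labeled eval set. In the sketch below the two retrievers are canned stubs and the eval set is toy data invented purely to exercise the functions; the resulting numbers say nothing about real HyDE performance.

```python
from typing import Callable

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant doc ids that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall(retrieve: Callable[[str], list[str]],
                eval_set: list[tuple[str, set[str]]], k: int = 5) -> float:
    """Average recall@k of a retriever over (query, relevant ids) pairs."""
    return sum(recall_at_k(retrieve(q), rel, k) for q, rel in eval_set) / len(eval_set)

# Toy data (made up for illustration): queries with their relevant doc ids.
eval_set = [("q1", {"d1", "d2"}), ("q2", {"d3"})]
baseline = {"q1": ["d1", "d9", "d8"], "q2": ["d7", "d6", "d5"]}
with_hyde = {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d7", "d6"]}

base_score = mean_recall(baseline.__getitem__, eval_set, k=3)
hyde_score = mean_recall(with_hyde.__getitem__, eval_set, k=3)
print(f"baseline recall@3={base_score:.2f}, HyDE recall@3={hyde_score:.2f}")
```

Slicing the same comparison by domain or query type is what surfaces the regressions mentioned above.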

Query rewriting in general. HyDE is one member of the rewriting family; see the query rewriting post. The two work together: rewrite the query, then apply HyDE, then retrieve. The benefits compound.
