Reranking is the cheapest, highest-ROI RAG optimization available to teams past the prototype stage. Cost of a reranker: fractions of a cent per query. Quality improvement: often 15-30% on downstream answer quality. If you're running RAG without a reranker, you're leaving significant quality on the table. This post covers the reranker options, when each fits, and the deployment patterns that keep latency acceptable.
Why rerank?
Embedding similarity (bi-encoder) is fast but approximate. It misses relevance nuances because each text is encoded independently — the embedding doesn't know which query it's being compared to.
Cross-encoders (rerankers) look at query and document together. They make much better relevance judgments because the model sees the interaction between the two. Trade-off: they are much slower per comparison, so scoring millions of candidates is not feasible. They are only practical over the handful returned by the bi-encoder.
Standard pattern: bi-encoder retrieves 20-50 candidates (fast over large index); cross-encoder reranks to final 3-5 (accurate but slow per pair). Best of both.
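A minimal sketch of the two-stage pattern, assuming documents are already embedded. The function names (`retrieve`, `rerank`, `overlap_score`) are hypothetical, and `overlap_score` is only a toy stand-in for a real cross-encoder so the example runs without a model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             doc_vecs: List[Sequence[float]],
             k: int) -> List[int]:
    """Stage 1: rank all documents by embedding similarity, keep the top k indices."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def rerank(query: str,
           docs: List[str],
           candidate_ids: List[int],
           score_pair: Callable[[str, str], float],
           top_n: int) -> List[Tuple[int, float]]:
    """Stage 2: score each (query, document) pair jointly, keep the top_n."""
    scored = [(i, score_pair(query, docs[i])) for i in candidate_ids]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Tiny worked example with made-up 2-d embeddings.
docs = [
    "reset your password from the login page",
    "password policy for administrators",
    "billing and invoices",
]
doc_vecs = [[0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]

# Embedding similarity alone puts doc 1 ahead of doc 0.
candidates = retrieve([1.0, 0.0], doc_vecs, k=2)
# The pairwise scorer sees query and document together and promotes doc 0.
final = rerank("reset password", docs, candidates, overlap_score, top_n=1)
```

In a real system, `k` would be in the 20-50 range and `score_pair` would call a hosted or self-hosted reranker.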
Reranker options in 2026
Cohere Rerank 3. Commercial API. Consistent quality across domains. ~50ms for 20 candidates. Costs a few cents per thousand queries. Good default for teams that prefer a managed service.
BGE Reranker v2. Open source (BAAI). Self-hostable. Large, base, and minimal variants for different latency budgets. Quality competitive with Cohere at zero per-query cost once hosted.
Voyage Rerank. Commercial API. Strong on technical documents and code. Competitive with Cohere; worth benchmarking on your specific domain.
Cross-encoder from sentence-transformers. Self-hostable, open source. Older but still effective for smaller scale and English-centric use cases.
LLM-as-reranker. For very high-stakes queries, use a frontier LLM to score relevance. Highest quality, highest cost. Reserve for top-K re-scoring in critical paths.
Integration patterns
Sequential: retrieve, rerank, respond. Straightforward; adds reranker latency to critical path. Fine for latency budgets of 500ms+.
Parallel: initiate retrieval and rerank setup concurrently. Saves a small amount of latency. Not worth it for most systems.
Threshold-based: only rerank when retrieval confidence is low. If top candidates have high similarity scores, skip reranking. Saves latency on easy queries.
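One way the threshold-based gate might look, as a sketch. The helper name `should_rerank`, the default `top_n`, and the `skip_threshold` of 0.85 are all illustrative assumptions; the right cutoff depends on your embedding model and should be tuned on your eval set.

```python
from typing import List


def should_rerank(similarities: List[float],
                  top_n: int = 3,
                  skip_threshold: float = 0.85) -> bool:
    """Return True when the top_n retrieval scores are not all confidently high.

    `similarities` are bi-encoder scores for the retrieved candidates.
    `skip_threshold` is an assumed cutoff, not a recommended value.
    """
    top = sorted(similarities, reverse=True)[:top_n]
    if not top:
        return False  # nothing retrieved, so nothing to rerank
    # Rerank only if at least one of the would-be final results looks weak.
    return min(top) < skip_threshold


easy_query = should_rerank([0.93, 0.91, 0.88, 0.40])   # top 3 all high: skip
hard_query = should_rerank([0.93, 0.70, 0.65, 0.60])   # weak runners-up: rerank
```

Log how often the gate skips reranking and spot-check those queries, since a badly chosen threshold silently removes the reranker's benefit.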
How many candidates to rerank?
Retrieve 20-50 candidates to give the reranker room to work. Fewer than 20 and the reranker rarely changes the ranking meaningfully. More than 100 and latency suffers without quality gain.
Rerank down to 3-10 chunks for the LLM context, with 3-5 as a common default. Modern LLMs handle 5-10 chunks well; beyond 10, you risk 'lost in the middle' effects. See the context window engineering post.
Latency management
Cross-encoder inference needs one forward pass per query-document pair, so 50 candidates means 50 model runs. On GPU, this takes 50-200ms depending on model size and batch support. On CPU, 200-1000ms.
Use smaller reranker models for latency-critical paths. bge-reranker-base is 3-5x faster than bge-reranker-large with moderately worse quality.
Batch reranking. Most libraries support batching candidates, so send all 50 pairs in one call. This significantly reduces overhead versus 50 individual calls.
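A rough sketch of batched scoring. `rerank_batched` and `CountingScorer` are hypothetical names, and the scorer is a toy that counts invocations instead of running a model. With a real library you would pass the list of pairs to its batch interface; for example, `CrossEncoder.predict` in sentence-transformers accepts a list of pairs and a `batch_size` argument.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
BatchScorer = Callable[[List[Pair]], List[float]]


def rerank_batched(query: str,
                   docs: List[str],
                   score_batch: BatchScorer,
                   batch_size: int = 32) -> List[int]:
    """Score all (query, doc) pairs in chunks of batch_size; return doc indices, best first."""
    pairs = [(query, doc) for doc in docs]
    scores: List[float] = []
    for start in range(0, len(pairs), batch_size):
        scores.extend(score_batch(pairs[start:start + batch_size]))
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


class CountingScorer:
    """Toy stand-in for a cross-encoder that records how many calls it receives."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, pairs: List[Pair]) -> List[float]:
        self.calls += 1
        # Score = number of shared tokens; a real model would run one batched forward pass.
        return [float(len(set(q.lower().split()) & set(d.lower().split())))
                for q, d in pairs]


docs = [f"filler document {i}" for i in range(49)] + ["how to rotate api keys"]

batched = CountingScorer()
order = rerank_batched("rotate api keys", docs, batched, batch_size=32)

one_by_one = CountingScorer()
rerank_batched("rotate api keys", docs, one_by_one, batch_size=1)
```

Here the batched scorer is invoked twice for 50 candidates, versus 50 times with `batch_size=1`. The per-call overhead you save is what shows up as lower latency.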
Evaluating reranker impact
Retrieval metrics: NDCG@k, recall@k, MRR. Measure each with and without the reranker, on your eval set.
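For reference, here is a small sketch of these three metrics for a single query, assuming binary relevance labels (a document is either relevant or not). The function names and the document IDs are made up for illustration. MRR is the mean of the per-query reciprocal rank across the eval set.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document (0.0 if none is found)."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """NDCG@k with binary gains: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# One query, compared before and after reranking (IDs are illustrative).
relevant = {"d2", "d7"}
before = ["d1", "d2", "d3", "d7", "d5"]   # bi-encoder order
after = ["d2", "d7", "d1", "d3", "d5"]    # reranked order

recall_before = recall_at_k(before, relevant, k=3)   # one of two relevant docs in top 3
recall_after = recall_at_k(after, relevant, k=3)     # both relevant docs in top 3
ndcg_before = ndcg_at_k(before, relevant, k=3)
ndcg_after = ndcg_at_k(after, relevant, k=3)
```

Average each metric over all queries in the eval set, then compare the with-reranker and without-reranker numbers side by side.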
End-to-end metrics matter more. Answer quality on downstream LLM responses, user thumbs-up rate, task completion rate. Reranking that improves retrieval metrics but not end-to-end quality isn't worth the latency. See eval post.
Domain-specific fine-tuning
For highly specialized domains (medical, legal, code), fine-tuning a reranker on domain data can add another 10-20% quality. Requires labeled query-document relevance pairs; can be bootstrapped from LLM judgments.
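Bootstrapping labels from LLM judgments could look roughly like this. Everything here is an assumption for illustration: the `judge` callable stands in for a call to whatever LLM you use, the 0-3 grade scale is invented, and the cutoffs for positives and negatives are arbitrary. The toy judge below uses keyword matching so the sketch runs without an API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical judge: returns a relevance grade from 0 (irrelevant) to 3 (highly relevant).
Judge = Callable[[str, str], int]
Example = Tuple[str, str, int]  # (query, document, binary label)


def build_training_pairs(candidates: Dict[str, List[str]],
                         judge: Judge,
                         positive_min: int = 2,
                         negative_max: int = 0) -> List[Example]:
    """Label (query, doc) pairs with a judge, keeping only confident positives and negatives."""
    examples: List[Example] = []
    for query, docs in candidates.items():
        for doc in docs:
            grade = judge(query, doc)
            if grade >= positive_min:
                examples.append((query, doc, 1))
            elif grade <= negative_max:
                examples.append((query, doc, 0))
            # Grades in between are treated as ambiguous and dropped.
    return examples


def toy_judge(query: str, doc: str) -> int:
    """Stand-in for an LLM call: grade = number of shared tokens, capped at 3."""
    shared = set(query.lower().split()) & set(doc.lower().split())
    return min(len(shared), 3)


candidates = {
    "statute of limitations fraud": [
        "the statute of limitations for fraud claims",  # shares 4 tokens: positive
        "fraud reporting hotline",                      # shares 1 token: ambiguous, dropped
        "office parking rules",                         # shares 0 tokens: negative
    ],
}
training = build_training_pairs(candidates, toy_judge)
```

Spot-check a sample of the judged pairs by hand before training on them, since errors in the judge become errors in the fine-tuned reranker.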
Only worth it at scale. For most teams, off-the-shelf rerankers are good enough. See fine-tuning post.