eazyware
Engineering · July 1, 2024 · 10 min read

Reranking models: the cross-encoder layer that transforms RAG

Cohere Rerank, bge-reranker, Jina Reranker. The 50ms investment that often doubles retrieval quality. When and how to integrate.

Kushal R.
Engineering lead

Reranking is the cheapest, highest-ROI RAG optimization available to teams past the prototype stage. Cost of a reranker: fractions of a cent per query. Quality improvement: often 15-30% on downstream answer quality. If you're running RAG without a reranker, you're leaving significant quality on the table. This post covers the reranker options, when each fits, and the deployment patterns that keep latency acceptable.

Retrieve-then-rerank
[Figure: Reranking, two-stage retrieval]
Query -> vector search (top 50-100 candidates, ~20ms, cheap) -> cross-encoder rerank (query and document scored together, high-accuracy scoring, ~50ms, ~$0.001) -> top 3-5 (high-precision, sent to LLM) -> LLM answer
Reranker options:
Cohere Rerank: hosted, easiest integration, $2/1K requests
bge-reranker-v2: OSS, self-host, high quality, BAAI
Jina Reranker: OSS alternative, multilingual strength
Typical quality uplift: 2x recall@5 over vector-only
Retrieve 20-50 candidates with cheap embedding search; rerank to top 3-5 with cross-encoder. Quality gain dwarfs the small latency cost.

Why rerank?

Embedding similarity (bi-encoder) is fast but approximate. It misses relevance nuances because each text is encoded independently — the embedding doesn't know which query it's being compared to.

Cross-encoders (rerankers) look at query and document together. They make much better relevance judgments because the model sees the interaction between the two. Trade-off: they are much slower per comparison, so scoring millions of candidates is not feasible; they run only over the shortlist returned by the bi-encoder.

Standard pattern: bi-encoder retrieves 20-50 candidates (fast over large index); cross-encoder reranks to final 3-5 (accurate but slow per pair). Best of both.
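The two-stage pattern above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `retrieve_then_rerank`, `toy_bi_score`, and `toy_cross_score` are hypothetical names, and the toy scorers are plain string functions standing in for a real embedding model and a real cross-encoder. In a real system, stage 1 would be an approximate nearest-neighbor index lookup rather than a full sort.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    bi_score: Scorer,
    cross_score: Scorer,
    n_candidates: int = 50,
    top_k: int = 5,
) -> list[str]:
    """Cheap scoring over everything, expensive scoring over a shortlist."""
    # Stage 1: the cheap scorer ranks the whole corpus; keep a shortlist.
    shortlist = sorted(
        corpus, key=lambda doc: bi_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: the expensive scorer sees query and document together,
    # but only for the shortlist.
    reranked = sorted(
        shortlist, key=lambda doc: cross_score(query, doc), reverse=True
    )
    return reranked[:top_k]


# Toy stand-ins (NOT real models): word overlap plays the bi-encoder,
# overlap plus an exact-phrase bonus plays the cross-encoder.
def toy_bi_score(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_cross_score(query: str, doc: str) -> float:
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return toy_bi_score(query, doc) + bonus


corpus = [
    "password policy requires a reset every year",
    "click reset password on the login page",
    "office lunch menu",
    "password manager comparison",
]
top = retrieve_then_rerank(
    "reset password", corpus, toy_bi_score, toy_cross_score,
    n_candidates=3, top_k=1,
)
print(top)
```

Both password documents tie under the cheap scorer; only the second stage, which looks at the query and document together, separates them.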

Reranker options in 2026

Cohere Rerank 3. Commercial API. Consistent quality across domains. ~50ms for 20 candidates. Cost is around $2 per thousand requests, a fraction of a cent per query. Good default for teams that prefer managed.

BGE Reranker v2. Open source (BAAI). Self-hostable. Large, base, and minimal variants for different latency budgets. Quality competitive with Cohere, with no per-query fee once hosted; you pay for the serving hardware instead.

Voyage Rerank. Commercial API. Strong on technical documents and code. Competitive with Cohere; worth benchmarking on your specific domain.

Cross-encoder from sentence-transformers. Self-hostable, open source. Older but still effective for smaller scale and English-centric use cases.

LLM-as-reranker. For very high-stakes queries, use a frontier LLM to score relevance. Highest quality, highest cost. Reserve for top-K re-scoring in critical paths.
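As a rough sketch of the LLM-as-reranker option, the snippet below asks a model for a 0-10 relevance score per document and sorts by it. Everything here is an assumption for illustration: `llm_rerank`, `PROMPT_TEMPLATE`, and the `complete` callable (prompt in, text out) are hypothetical, and `fake_complete` is a stub in place of a real model client.

```python
import re
from typing import Callable, Sequence

# Hypothetical prompt; the wording and the 0-10 scale are assumptions.
PROMPT_TEMPLATE = (
    "Rate how well the document answers the query on a scale of 0 to 10.\n"
    "Reply with the number only.\n\n"
    "Query: {query}\n\nDocument: {doc}\n\nScore:"
)


def llm_rerank(
    query: str,
    docs: Sequence[str],
    complete: Callable[[str], str],
    top_k: int = 3,
) -> list[tuple[float, str]]:
    """Score each document with one LLM call, return the top_k (score, doc)."""
    scored = []
    for doc in docs:
        reply = complete(PROMPT_TEMPLATE.format(query=query, doc=doc))
        match = re.search(r"\d+(?:\.\d+)?", reply)
        # Unparseable replies get the lowest score instead of raising.
        score = min(float(match.group()), 10.0) if match else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Stub standing in for a real LLM client (assumption, for illustration only).
def fake_complete(prompt: str) -> str:
    return "9" if "within 30 days" in prompt else "2"


docs = [
    "Shipping takes 3-5 business days.",
    "Refunds are issued within 30 days of purchase.",
]
ranked = llm_rerank("What is the refund window?", docs, fake_complete, top_k=1)
print(ranked)
```

One call per document is what makes this the highest-cost option, which is why it suits re-scoring a small top-K rather than the full candidate list.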

Integration patterns

Sequential: retrieve, rerank, respond. Straightforward; adds reranker latency to critical path. Fine for latency budgets of 500ms+.

Parallel: initiate retrieval and rerank setup concurrently. Saves a small amount of latency. Not worth it for most systems.

Threshold-based: only rerank when retrieval confidence is low. If top candidates have high similarity scores, skip reranking. Saves latency on easy queries.
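A minimal sketch of the threshold-based pattern, assuming candidates arrive as (document, similarity) pairs. The function name `rerank_if_uncertain`, the `rerank` callable, and the 0.80 cutoff are all assumptions; a usable threshold depends on the embedding model and should be tuned on an eval set.

```python
from typing import Callable, Sequence


def rerank_if_uncertain(
    query: str,
    candidates: Sequence[tuple[str, float]],  # (doc, similarity), any order
    rerank: Callable[[str, list[str]], list[str]],
    top_k: int = 5,
    skip_threshold: float = 0.80,  # assumed value; tune per embedding model
) -> tuple[list[str], bool]:
    """Return (top_k docs, whether the reranker was called)."""
    if not candidates:
        return [], False
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    head = ranked[:top_k]
    # Easy query: every top candidate already clears the similarity bar.
    if all(score >= skip_threshold for _, score in head):
        return [doc for doc, _ in head], False
    # Hard query: pay for the reranker over the full candidate list.
    docs = [doc for doc, _ in ranked]
    return rerank(query, docs)[:top_k], True


# Stub reranker (assumption): reverses the order so its effect is visible.
def fake_rerank(query: str, docs: list[str]) -> list[str]:
    return list(reversed(docs))


easy = rerank_if_uncertain(
    "q", [("a", 0.93), ("b", 0.91), ("c", 0.60)], fake_rerank, top_k=2
)
hard = rerank_if_uncertain(
    "q", [("a", 0.62), ("b", 0.58), ("c", 0.55)], fake_rerank, top_k=2
)
print(easy, hard)
```

Returning the flag alongside the documents makes it easy to log how often the reranker is skipped, which is the number that determines the latency saving.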

How many candidates to rerank?

Retrieve 20-50 candidates to give the reranker room to work. Fewer than 20 and the reranker rarely changes the ranking meaningfully. More than 100 and latency suffers without quality gain.

Rerank to 3-10 for LLM context. Modern LLMs handle 5-10 chunks well; going beyond 10 risks 'lost in the middle' effects, where content in the middle of a long context gets less attention. See the context window engineering post.

Latency management

Cross-encoder inference is one forward pass per query-document pair. For 50 candidates, that is 50 forward passes, which can run one at a time or be grouped into batches. On GPU, this takes 50-200ms depending on model size and batch support. On CPU, 200-1000ms.

Use smaller reranker models for latency-critical paths. bge-reranker-base is 3-5x faster than bge-reranker-large with moderately worse quality.

Batch reranking. Most libraries support batching candidates — send all 50 pairs at once. Significantly reduces overhead vs 50 individual calls.
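The batching idea can be sketched as follows. `score_in_batches` and the `score_batch` callable are hypothetical names; `fake_score_batch` is a stub that records batch sizes in place of a real model. With a real library the same shape applies: for example, `CrossEncoder.predict` in sentence-transformers accepts a list of (query, document) pairs and a `batch_size` argument.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]


def score_in_batches(
    query: str,
    docs: Sequence[str],
    score_batch: Callable[[list[Pair]], list[float]],
    batch_size: int = 16,
) -> list[float]:
    """Score all (query, doc) pairs with a few batched calls, not one per pair."""
    pairs = [(query, doc) for doc in docs]
    scores: list[float] = []
    for start in range(0, len(pairs), batch_size):
        scores.extend(score_batch(pairs[start:start + batch_size]))
    return scores


# Stub model (assumption): records each batch size, returns dummy scores.
batch_sizes: list[int] = []


def fake_score_batch(pairs: list[Pair]) -> list[float]:
    batch_sizes.append(len(pairs))
    return [float(len(doc)) for _, doc in pairs]


docs = [f"candidate {i}" for i in range(50)]
scores = score_in_batches("q", docs, fake_score_batch, batch_size=16)
print(len(scores), batch_sizes)
```

Fifty candidates at a batch size of 16 become four model calls instead of fifty. The right batch size depends on model size and available memory.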

Evaluating reranker impact

Retrieval metrics: NDCG@k, recall@k, MRR. With and without reranker, on your eval set.

End-to-end metrics matter more. Answer quality on downstream LLM responses, user thumbs-up rate, task completion rate. Reranking that improves retrieval metrics but not end-to-end quality isn't worth the latency. See eval post.
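For reference, the retrieval metrics named above can be computed with a few small functions. This sketch uses the standard definitions; the document IDs and relevance labels are made-up example data, not results. MRR over an eval set is the mean of `reciprocal_rank` across queries.

```python
import math
from typing import Mapping, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gains: Mapping[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up example: the same query before and after reranking.
relevant = {"d1", "d2"}
gains = {"d1": 1.0, "d2": 1.0}
vector_only = ["d4", "d2", "d1", "d3"]
reranked = ["d1", "d2", "d4", "d3"]

for name, ranking in [("vector-only", vector_only), ("reranked", reranked)]:
    print(
        name,
        recall_at_k(ranking, relevant, 2),
        reciprocal_rank(ranking, relevant),
        round(ndcg_at_k(ranking, gains, 3), 3),
    )
```

Running both rankings through the same functions on the same eval set gives the with-and-without comparison; the end-to-end metrics still need to be checked separately.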

Domain-specific fine-tuning

For highly specialized domains (medical, legal, code), fine-tuning a reranker on domain data can add another 10-20% quality. Requires labeled query-document relevance pairs; can be bootstrapped from LLM judgments.
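One way the bootstrapping step might look, as a sketch under stated assumptions: a `judge` callable (hypothetical, standing in for an LLM relevance call) returns a score between 0 and 1, confident judgments become positive or negative labels, and ambiguous ones are dropped. The function name and both cutoffs are assumptions, not values from any library.

```python
from typing import Callable, Iterable

LabeledPair = tuple[str, str, int]


def bootstrap_labels(
    pairs: Iterable[tuple[str, str]],
    judge: Callable[[str, str], float],
    pos_min: float = 0.7,  # assumed cutoff for a confident positive
    neg_max: float = 0.3,  # assumed cutoff for a confident negative
) -> list[LabeledPair]:
    """Turn judge scores into (query, doc, label) training examples."""
    labeled: list[LabeledPair] = []
    for query, doc in pairs:
        score = judge(query, doc)
        if score >= pos_min:
            labeled.append((query, doc, 1))
        elif score <= neg_max:
            labeled.append((query, doc, 0))
        # Scores between the cutoffs are ambiguous and are skipped.
    return labeled


# Stub judge (assumption): fixed scores in place of real LLM judgments.
fake_scores = {
    "statute of limitations for contract claims": 0.9,
    "office parking rules": 0.1,
    "overview of civil procedure": 0.5,
}


def fake_judge(query: str, doc: str) -> float:
    return fake_scores[doc]


query = "how long do I have to sue for breach of contract"
training = bootstrap_labels([(query, doc) for doc in fake_scores], fake_judge)
print(training)
```

Dropping the ambiguous middle trades dataset size for label quality; spot-checking a sample of the LLM labels by hand is a reasonable safeguard before training on them.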

Only worth it at scale. For most teams, off-the-shelf rerankers are good enough. See fine-tuning post.

Tags: reranking, cross-encoder, retrieval quality