RAG re-ranking strategies are the set of techniques applied between initial retrieval and LLM generation to reorder candidate passages by true relevance, typically using a cross-encoder, a hosted re-rank API, or an LLM-as-judge pass. They exist because bi-encoder retrieval optimizes for speed at the cost of precision at the top of the result list, and that gap is the largest single cause of mediocre RAG answers in mid-market production systems. Adding a re-ranker is the right move when retrieval quality has plateaued and the latency budget can absorb 200-500ms. It is the wrong move when the index is small, hybrid search is well-tuned, or no one has measured retrieval quality in the first place.
This post is for engineering leads at 50-500 person companies who have a RAG system in production, have hit a quality ceiling that no amount of prompt iteration is fixing, and are weighing whether a re-ranker is worth the latency and cost tax. It covers what re-ranking actually buys you, the realistic options as of mid-2026 (Cohere Rerank 4.0, Voyage rerank-2.5, Jina Reranker v3, BGE re-rankers), how to think about the latency and dollar cost, and the cases where the right answer is to fix retrieval first rather than bolt a re-ranker on top.
Why Bi-Encoder Retrieval Plateaus
The standard RAG retrieval stack uses a bi-encoder: an embedding model that encodes documents at ingestion time and queries at inference time, with relevance computed as cosine similarity in vector space. This is fast because the document embeddings are precomputed, and the query side only requires one forward pass plus an approximate nearest-neighbor search. Bi-encoders are the right primary retrieval mechanism for almost every production RAG system. The reason is operational: precomputed embeddings make sub-50ms retrieval at million-vector scale a solved problem.
The limitation is structural. A bi-encoder represents the query and the document as independent vectors and then compares them. The model never sees the query and the document together; it has to compress everything useful about each into a single fixed-size embedding before similarity is computed. This works well for surfacing topically related content. It works less well for distinguishing the most relevant passage from the third-most relevant passage on the same topic, because the differences that matter are often in token-level interactions the bi-encoder never observes.
The practical result in production: bi-encoder retrieval at top-k = 50 frequently returns the right answer somewhere in the top 20, but not reliably at position 1 or 2. The LLM receives a context window where the most relevant passage is buried under three plausible-but-wrong neighbors, and the generated answer reflects that. Teams diagnose this as a generation problem because the symptom looks like a generation problem. It is usually a ranking problem.
What Re-Ranking Actually Buys You
Re-ranking inserts a second pass between initial retrieval and prompt assembly. The retriever returns a larger candidate set than you would normally pass to the LLM (typically 30-100 candidates instead of the final 5-10), and a more expensive model rescores each candidate against the query to produce the final ordering. The expensive model is usually a cross-encoder, but it can also be a hosted re-rank API or an LLM call.
The relevant measurement is the difference in NDCG@10 or MRR between the bi-encoder ranking and the re-ranked ordering on your own evaluation set. Across the enterprise RAG engagements we have measured directly (n = 14 production deployments at 50-500 person companies between 2024 and early 2026), a well-chosen re-ranker moves NDCG@10 from the 0.55-0.70 range up to the 0.75-0.85 range. That is a meaningful gap, and the range is tight enough across engagements that we treat it as a reliable expectation rather than a best case. It usually translates downstream into noticeably fewer hallucinations and noticeably more "the model answered what I actually asked" feedback from users.
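For teams that have not computed these metrics before, here is a minimal sketch of NDCG@10 and MRR in plain Python. It assumes each query's result list is represented as graded relevance labels in ranked order (0 meaning not relevant), and that the list covers every labeled relevant chunk for that query, which holds when you compare two orderings of the same candidate pool. The function names and the sample labels are ours, chosen for illustration.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ordering divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mrr(rankings: Sequence[Sequence[float]]) -> float:
    """Mean reciprocal rank of the first relevant result across queries."""
    total = 0.0
    for relevances in rankings:
        for rank, rel in enumerate(relevances, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


# Illustrative labels only: the same five candidates before and after re-ranking.
before = [0, 0, 1, 0, 1]
after = [1, 1, 0, 0, 0]
print(round(ndcg_at_k(before), 3))  # 0.544
print(round(ndcg_at_k(after), 3))   # 1.0
```

Run the same functions over every query in the evaluation set and average the results to get the pipeline-level numbers to compare.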
Re-ranking also reconciles the semantic-versus-keyword tension that haunts hybrid search. A typical hybrid setup runs BM25 in parallel with dense retrieval and fuses the two result lists with reciprocal rank fusion or a learned weighting. The fusion gets you a candidate pool that covers both lexical and semantic matches, but the ordering inside that pool is a compromise. A cross-encoder re-ranker reads the query and each candidate together and produces a single relevance score that does not care whether the candidate was surfaced by lexical or semantic match. This is the cleanest way we have found to get hybrid search to work as advertised.
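The fusion step described above can be sketched in a few lines. This is a minimal reciprocal rank fusion, assuming each retriever returns an ordered list of document IDs; the constant k = 60 is the value commonly used for RRF, and the document IDs in the example are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists; each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_hits = ["a", "b", "c"]    # hypothetical lexical results
dense_hits = ["c", "a", "d"]   # hypothetical dense results
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # ['a', 'c', 'b', 'd']
```

The fused list is the candidate pool; the cross-encoder then rescores that pool against the query and replaces the compromise ordering with a single relevance ordering.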
What re-ranking does not buy you:
- It does not fix bad chunking. A re-ranker can only reorder what retrieval surfaces. If your chunks are 1,200 tokens of mixed-topic content, the most relevant chunk is still going to be 1,200 tokens of mixed-topic content after re-ranking.
- It does not fix bad embeddings on out-of-domain content. If the relevant passages never make it into the top-50 candidates because the embedding model has no representation for your jargon, re-ranking the wrong 50 candidates does nothing.
- It does not fix retrieval queries that are genuinely ambiguous. A re-ranker scores relevance against the query as submitted; it cannot resolve the ambiguity the user introduced.
Bi-Encoder vs Cross-Encoder: The Architectural Difference That Matters
The reason re-rankers improve precision is the architectural difference between bi-encoders and cross-encoders. We mention this because the bi-encoder vs cross-encoder distinction is the only piece of theory you actually need to make good decisions about your retrieval stack.
A bi-encoder takes the query and document separately, encodes each into a vector, and compares the vectors. The query encoding and the document encoding are independent. This is what makes pre-computation possible.
A cross-encoder takes the query and document together as a single concatenated input and produces a single relevance score. The model attends across query tokens and document tokens jointly, capturing interaction features that no fixed embedding can express. This is more accurate. It is also far more expensive: you cannot precompute, so every (query, candidate) pair requires a forward pass at inference time.
This is why cross-encoders are used for re-ranking rather than for primary retrieval. Running a cross-encoder over your full million-vector corpus per query is computationally infeasible. Running it over 50 candidates returned by a fast bi-encoder is bounded and predictable. The two architectures are complements, not alternatives.
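To make the division of labor concrete, here is a toy two-stage pipeline. It is a structural sketch only: the vectors are hand-written, and `token_overlap` is a deliberately crude stand-in for a cross-encoder (a real one would be a model call that reads the query and document together). All names here are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, doc_vecs: Dict[str, Vector], top_k: int) -> List[str]:
    """Stage 1 (bi-encoder style): compare independently computed vectors."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:top_k]


def rerank(query: str, candidates: Sequence[str], texts: Dict[str, str],
           score_pair: Callable[[str, str], float], top_n: int) -> List[Tuple[str, float]]:
    """Stage 2 (cross-encoder style): score each (query, document) pair jointly."""
    scored = [(doc_id, score_pair(query, texts[doc_id])) for doc_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def token_overlap(query: str, document: str) -> float:
    """Toy pair scorer: fraction of query tokens that appear in the document."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / len(q) if q else 0.0


doc_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
texts = {"a": "refund policy overview",
         "b": "refund window for annual plans",
         "c": "office holiday schedule"}
candidates = retrieve([1.0, 0.0], doc_vecs, top_k=2)
print(candidates)                                                             # ['a', 'b']
print(rerank("refund window annual", candidates, texts, token_overlap, 1))   # [('b', 1.0)]
```

The expensive pair scorer runs only over the candidates that stage 1 returned, which is what keeps the cost bounded regardless of corpus size.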
The Re-Ranker Options as of Mid-2026
The market has consolidated around four practical paths for mid-market teams, plus LLM-as-judge re-ranking as a fifth, more situational option. We have used all of them in production engagements and give our honest positioning on each.
Cohere Rerank (hosted API). Cohere has been the most-deployed managed re-rank option since the v2 and v3 models landed, and the current production line is Rerank 4.0 (rerank-v4.0-pro and rerank-v4.0-fast); Rerank 3.5 is now the prior generation. Most teams reach for Cohere first because the integration cost is the lowest of any option. You send a query and a list of candidates; you get back scores and an ordering. Latency on 50 candidates of roughly 300-500 tokens each typically lands at 200-400ms in our measurements against the Fast variant.
One pricing caveat worth getting right: Cohere's headline self-serve rate has shifted. Rerank is now offered primarily through Model Vault hourly tiers (for example, on the order of $5/hour for a Rerank 4 Fast Medium deployment), with per-search-unit pricing moved to contact-sales for enterprise contracts. The legacy/standard-tier reference rate of $0.001-$0.003 per search unit is still a useful mental model for sizing budgets, but treat it as a reference point rather than a current quoted price. Confirm with Cohere directly. Definition matters here too: a Cohere search unit is one query against up to 100 documents, and documents longer than 500 tokens are chunked, with each chunk counting as a separate document. If your candidates run long, your search-unit count grows accordingly. See Cohere's Rerank docs for current model versions and the chunking definition. This is the option we usually recommend for teams of 2 ML/AI engineers who need quality improvement without taking on another model to operate.
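The search-unit definition above is easy to misapply, so here is a small estimator. It is a sketch under simplifying assumptions that are ours: every candidate has the same length, query tokens are ignored when chunking, and the 500-token and 100-document limits are taken from the definition as stated in this post. Confirm the exact rules against Cohere's documentation before budgeting.

```python
import math


def search_units_per_query(num_candidates: int, tokens_per_candidate: int,
                           chunk_tokens: int = 500, docs_per_unit: int = 100) -> int:
    """Estimate billed search units for one query under the stated definition."""
    # Each candidate is split into chunks of at most chunk_tokens tokens.
    chunks_per_candidate = max(1, math.ceil(tokens_per_candidate / chunk_tokens))
    billed_documents = num_candidates * chunks_per_candidate
    # One search unit covers up to docs_per_unit billed documents.
    return math.ceil(billed_documents / docs_per_unit)


print(search_units_per_query(50, 400))    # 1
print(search_units_per_query(50, 900))    # 1  (100 billed documents)
print(search_units_per_query(50, 1200))   # 2  (150 billed documents)
```

Under these assumptions, the thing to watch is the product of candidate count and chunks per candidate: the bill steps up each time that product crosses a multiple of 100.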
Voyage AI Rerank. Voyage has been an aggressive challenger in both the embedding and re-rank market. The current production models are rerank-2.5 and rerank-2.5-lite; rerank-2 is now legacy. They have held their own against Cohere on most public retrieval benchmarks we trust, and the pricing remains competitive. The honest positioning: if you are already using Voyage embeddings, the Voyage re-ranker is the path of least resistance and produces tightly compatible scoring. If you are using OpenAI or Cohere embeddings, Cohere Rerank is the safer first choice because Cohere's ecosystem is broader and the model has been pressure-tested in more production systems we have seen. The quality gap is small enough that we would not move a working Cohere deployment to Voyage on quality grounds alone. See Voyage's reranker docs for current model details.
Jina Reranker (open weights, hosted available). Jina publishes open-weight re-ranking models and also hosts them as an API. The current production model is jina-reranker-v3, released October 2025; v2 is now the prior generation. It is competitive on quality and is the most viable open-weight option for teams that want to self-host without the complexity of managing the larger BGE models. We see Jina chosen by teams that need predictable inference cost, want the option to fine-tune later, or have data residency requirements that make a hosted API unworkable. See Jina's reranker page for the current lineup.
BGE re-rankers (open weights, self-hosted). The BAAI General Embedding (BGE) family includes a series of re-rankers that are the strongest open-weight option for teams that have the GPU capacity and ML engineering bench to operate them. The current BAAI recommendations for best performance are bge-reranker-v2-gemma and bge-reranker-v2-minicpm-layerwise, with bge-reranker-v2-m3 still a viable lighter option for teams that want lower memory footprint. BGE re-rankers running on a single A10 or L4 GPU can serve a moderate query volume with 80-200ms per re-rank pass over 50 candidates depending on the variant and batch size. The honest tradeoff: self-hosting BGE makes sense when you are at sufficient query volume that the hosted API cost would exceed the amortized cost of one GPU plus an engineer's attention, and you have the operational capacity for both. For a team of 2 ML/AI engineers, this is rarely the right call. For a team of 6+ with existing model-serving infrastructure, it usually is. See the BAAI org on Hugging Face for current model cards.
LLM-as-judge re-rank. Using a general-purpose LLM in the cheap tier (Claude Haiku 4.5, GPT-5-mini, Gemini Flash, or whichever current cheap-tier model your stack already uses) to score retrieved candidates is an option some teams reach for, especially if they already have an LLM call in the pipeline. The quality can be very good, especially for nuanced relevance judgments where domain understanding matters. The cost is the problem: scoring 50 candidates per query with a cheap-tier LLM still adds meaningful dollars at any non-trivial query volume, and the latency is rarely better than a purpose-built re-ranker. We use LLM-as-judge re-ranking mostly in low-volume internal tools where the per-query cost is irrelevant, not in customer-facing systems.
The Latency Budget Conversation You Need to Have
Before adding a re-ranker, set an explicit end-to-end latency budget for your RAG query and confirm there is room. Most production RAG systems we see have a latency budget the team has never written down. They notice when the system feels slow and look for things to optimize without a target.
Write down the budget. A typical structure for a synchronous user-facing RAG query at mid-market scale:
- Total P95 latency target: 3,000-5,000ms (anything slower starts to feel broken to users).
- Embedding generation for the query: 50-150ms.
- Vector retrieval (bi-encoder, top-50): 50-200ms depending on index size and database.
- Re-ranking pass over 50 candidates: 200-500ms for a managed API; 80-300ms for a self-hosted cross-encoder, with the lower end (80-200ms) achievable on BGE running on an A10 or L4 with batched scoring, and the higher end reflecting larger BGE variants or unbatched serving.
- Prompt assembly and LLM generation: 1,500-3,500ms depending on model and output length.
If the existing system is at P95 = 4,200ms and the user-facing target is 5,000ms, you have roughly 800ms of headroom. A re-ranker fits. If the existing system is already at P95 = 4,800ms, you do not have room for a 400ms re-ranker without either accepting degraded user experience or cutting time from elsewhere — usually by switching to a faster generation model, reducing output length, or running retrieval and re-ranking in parallel with a portion of the prompt assembly.
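The headroom check is simple enough to encode next to the written budget. This sketch treats P95 figures as additive, which is a planning simplification rather than a statistical truth (stage P95s do not sum exactly to the end-to-end P95), so treat the result as a first-pass screen. The function name is ours.

```python
from typing import Tuple


def rerank_fits(current_p95_ms: float, target_p95_ms: float,
                rerank_p95_ms: float) -> Tuple[float, bool]:
    """Return (headroom in ms, whether the re-ranker fits inside it)."""
    headroom = target_p95_ms - current_p95_ms
    return headroom, headroom >= rerank_p95_ms


print(rerank_fits(4200, 5000, 400))  # (800, True)
print(rerank_fits(4800, 5000, 400))  # (200, False)
```

After deployment, replace the estimate with the measured end-to-end P95 rather than the sum of stage figures.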
For asynchronous workflows (research assistants, document-analysis pipelines, batch summarization) the latency budget is much more forgiving and re-ranking is almost always a free addition from a user-experience standpoint.
The Honest Cost Math at Scale
Re-ranking 50 candidates per query is not free. At very low volume the cost is invisible; at production volume it is a budget line.
A rough calculation against the legacy/standard-tier reference rate of $0.001-$0.003 per search unit, sized so that each candidate fits inside Cohere's 500-token-per-document boundary (which keeps one query against 50 candidates inside a single search unit): re-ranking 50 candidates per query puts you at $1-$3 per 1,000 queries. At 100,000 queries per month, that is $100-$300. At 1 million queries per month, that is $1,000-$3,000. If your candidates exceed 500 tokens each, they will be chunked and each chunk counts separately: a 1,200-token candidate becomes three documents for billing purposes, so 50 such candidates become 150 documents, which spills past the 100-document limit into a second search unit and roughly doubles your effective cost. Also note: with Cohere's current Model Vault hourly pricing, a Rerank 4 Fast Medium deployment at the indicative $5/hour rate runs about $3,600/month at full utilization, which becomes the more relevant ceiling for committed-capacity buyers. Verify current pricing with Cohere; the per-unit figures above are useful for ballparking, not for procurement.
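The arithmetic above, written out so you can substitute your own volume and rates. The rates are the reference figures quoted in this post, not current quotes, and the function names are ours. The hourly figure assumes a deployment running 24 hours a day for a 30-day month.

```python
def monthly_per_unit_cost(queries_per_month: int, rate_per_search_unit: float,
                          search_units_per_query: int = 1) -> float:
    """Monthly spend under per-search-unit pricing."""
    return queries_per_month * search_units_per_query * rate_per_search_unit


def monthly_hourly_cost(rate_per_hour: float, hours_per_day: int = 24,
                        days_per_month: int = 30) -> float:
    """Monthly spend for an always-on, hourly-priced deployment."""
    return rate_per_hour * hours_per_day * days_per_month


for volume in (100_000, 1_000_000):
    low = monthly_per_unit_cost(volume, 0.001)
    high = monthly_per_unit_cost(volume, 0.003)
    print(f"{volume:>9,} queries/month: ${low:,.0f}-${high:,.0f}")

print(f"hourly deployment: ${monthly_hourly_cost(5.0):,.0f}/month")
```

Pass a higher `search_units_per_query` when long candidates chunk past the per-unit document limit.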
These numbers are not catastrophic. They are also not negligible at the scale where mid-market RAG systems often sit, and they grow linearly with query volume.
Two levers to manage cost without sacrificing the quality gain:
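- Reduce the candidate set. Re-ranking the top 50 candidates from initial retrieval is the common default; re-ranking the top 20 is often nearly as good and, on a self-hosted re-ranker or any pricing that scales with the number of documents scored, costs less than half as much. Under per-search-unit pricing, where one unit covers up to 100 documents, the dollar saving appears only when long candidates would otherwise chunk past that limit, although the latency saving applies either way. Measure NDCG@10 at top-50 versus top-20 re-rank on your own evaluation set and choose the smallest candidate set that preserves the quality gain.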
- Cache aggressively. Queries with high semantic similarity to recent queries can reuse the re-rank result alongside the cached generation. This intersects with the broader semantic-caching discussion we covered in our piece on LLM API cost optimization in production.
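For the first lever, the selection rule can be made explicit once you have measured NDCG@10 at several candidate-pool sizes. This is a sketch: the tolerance of 0.01 is an arbitrary default we picked for illustration, and the measurements in the example are invented placeholders, not results from any engagement.

```python
from typing import Dict


def smallest_sufficient_pool(ndcg_by_pool_size: Dict[int, float],
                             tolerance: float = 0.01) -> int:
    """Smallest candidate pool whose NDCG@10 is within tolerance of the best."""
    best = max(ndcg_by_pool_size.values())
    return min(size for size, ndcg in ndcg_by_pool_size.items()
               if best - ndcg <= tolerance)


# Placeholder measurements: pool size -> NDCG@10 after re-ranking.
measured = {10: 0.74, 20: 0.80, 50: 0.805, 100: 0.806}
print(smallest_sufficient_pool(measured))  # 20
```

Set the tolerance from what your users can perceive, not from what looks tidy on a chart.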
When Re-Ranking Does Not Help
The cases where adding a re-ranker is the wrong move, in our experience:
- Small indexes. If your index is under 10,000 chunks, top-k retrieval is usually surfacing the same set of relevant candidates regardless of ranking quality. The LLM context window can absorb a less precisely ordered top-10 without a noticeable hit to answer quality. Fix chunking and embeddings first; re-ranking will not be the binding constraint.
- Well-tuned hybrid search on a curated corpus. Teams that have invested in BM25 plus dense retrieval with reciprocal rank fusion and metadata filtering, on a corpus that has been deliberately structured, often find re-ranking adds marginal improvement at meaningful cost. The diminishing returns curve is real.
- No measured retrieval quality baseline. If you have not run NDCG@10 or MRR against an evaluation set on your current retrieval pipeline, you do not yet know whether retrieval is the bottleneck. We have audited several systems where the team was sure retrieval was broken and the actual problem was a chunking strategy that fragmented relevant context across multiple chunks, or an embedding model mismatch with the domain. A re-ranker would have improved nothing.
- Latency budgets that are already exhausted. If your P95 is already at or beyond what users tolerate, adding 300-500ms makes things worse before re-ranking quality has a chance to help. Fix the latency problem first or run re-ranking on a smaller candidate set.
The general principle: re-ranking compounds with a healthy retrieval stack. It does not substitute for one.
How We Sequence This in a Mid-Market Engagement
For teams of 2 ML/AI engineers with a production RAG system showing quality issues, our typical sequence:
- Build or extend the retrieval evaluation set. Aim for 100-200 query-answer pairs that reflect real user intent, with ground-truth relevant chunks identified. We cover this in our piece on building RAG agents for enterprise.
- Measure NDCG@10 and MRR on the current retrieval pipeline. This is the baseline. Without it, no later change can be evaluated honestly.
- Diagnose where the quality is being lost. If recall@50 is already at 0.85+ but NDCG@10 is at 0.55, the ranking is the problem and a re-ranker is the right next step. If recall@50 is at 0.60, retrieval is missing the right candidates and a re-ranker cannot help; the work is in embeddings, chunking, or hybrid search.
- If re-ranking is the right intervention, start with Cohere Rerank 4.0 Fast for the integration speed, validate the quality gain on the evaluation set, then revisit whether self-hosting (Jina Reranker v3 or BGE) makes economic sense at your query volume.
- Measure the latency impact at P50, P95, P99 and confirm the user experience remains within budget.
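The diagnosis step in the sequence above reduces to two numbers per evaluation run. Here is a sketch of recall@k and the branching logic. The 0.85 recall floor comes from the example in the list; the 0.75 NDCG@10 target is our assumption, borrowed from the low end of the post-re-rank range quoted earlier, and both should be tuned to your own system.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 50) -> float:
    """Fraction of the labeled relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def diagnose(recall_at_50: float, ndcg_at_10: float,
             recall_floor: float = 0.85, ndcg_target: float = 0.75) -> str:
    """Decide whether the quality loss is in retrieval or in ranking."""
    if recall_at_50 < recall_floor:
        return "retrieval: work on embeddings, chunking, or hybrid search"
    if ndcg_at_10 < ndcg_target:
        return "ranking: a re-ranker is the right next step"
    return "healthy: look elsewhere in the stack"


print(recall_at_k(["a", "b", "c"], {"a", "d"}))  # 0.5
print(diagnose(0.88, 0.55))  # ranking: a re-ranker is the right next step
print(diagnose(0.60, 0.55))  # retrieval: work on embeddings, chunking, or hybrid search
```

Recall is checked first on purpose: a re-ranker cannot promote a passage that never reached the candidate pool.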
For teams of 6+ with existing model-serving infrastructure, the same sequence applies but the option to start with a self-hosted BGE re-ranker becomes more credible because the operational cost is amortized across infrastructure that already exists.
The supporting decisions about vector database, embedding model, and chunking strategy are covered in our companion piece on RAG architecture and vector database selection for production. The re-ranker sits on top of those decisions; it does not substitute for them.
Frequently Asked Questions
How much improvement should we realistically expect from adding a re-ranker?
Across the enterprise RAG engagements we have measured directly (n = 14 production deployments at 50-500 person companies), NDCG@10 improvements of 0.10-0.20 are typical when moving from bi-encoder-only retrieval to bi-encoder plus a competitive cross-encoder re-ranker. MRR improvements are usually in the same range. Downstream, that translates to fewer hallucinations and noticeably better answer relevance, but the size of the user-perceived improvement depends on how much your generation quality was actually being bottlenecked by ranking. If retrieval was already good enough for the LLM to find the right context, the re-rank gain will be smaller. If retrieval was the bottleneck, the gain will be obvious.
Cohere Rerank versus self-hosted BGE — when does self-hosting pay off?
We use a rough threshold of around 500,000 queries per month as the point where self-hosting BGE on a single GPU starts to be cheaper than a hosted re-rank API. The math, at the legacy per-search-unit reference rate: 500,000 queries per month at $0.002 per search unit (mid-range) is roughly $1,000/month in API spend, assuming candidates fit inside the 500-token chunk boundary so each query equals one search unit. A single A10 or L4 instance on a major cloud runs roughly $400-$700/month on-demand and less reserved, plus engineering attention to operate it. The crossover is fuzzy and depends on candidate length, batch efficiency, and what your team's time is worth. If your candidates run past 1,000 tokens and chunk into three or more documents each, a 50-candidate query exceeds 100 documents and bills as two or more search units, so the crossover drops well below 500,000 queries per month. Below the crossover, the engineering time to operate the GPU and monitor the model is rarely worth the API cost saved. Above it, the math tilts. Teams that anticipate scaling well past a million queries per month often start with Cohere Rerank for the first six months to ship the quality improvement, then migrate to self-hosted once the volume justifies it.
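The crossover can be computed directly once you put a number on the full self-hosting cost. In this sketch the GPU figure sits inside the range quoted above, while the $450/month engineering overhead is purely an illustrative assumption of ours; it is the value that, combined with a $550 GPU, reproduces the 500,000-query threshold at $0.002 per search unit.

```python
def crossover_queries_per_month(self_host_monthly_cost: float,
                                rate_per_search_unit: float,
                                search_units_per_query: int = 1) -> float:
    """Monthly query volume at which hosted API spend equals self-hosting cost."""
    return self_host_monthly_cost / (rate_per_search_unit * search_units_per_query)


gpu_monthly = 550.0            # within the $400-$700 on-demand range
engineering_overhead = 450.0   # hypothetical; set from your own team's time
total = gpu_monthly + engineering_overhead

print(round(crossover_queries_per_month(total, 0.002)))      # 500000
print(round(crossover_queries_per_month(total, 0.002, 2)))   # 250000
```

The second line shows the sensitivity to candidate length: if each query bills as two search units, the crossover volume halves.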
Is a cross-encoder re-ranker actually better than asking an LLM to rank the candidates?
In our measurements, purpose-built cross-encoders match or beat general-purpose LLMs in the cheap tier at relevance ranking for the same latency and at a fraction of the cost. The cases where LLM-as-judge re-ranking wins are nuanced relevance judgments that require domain reasoning rather than semantic matching — "which of these legal clauses is most relevant given the specific contract context" rather than "which passage is most similar to the query." For most enterprise search and document-Q&A use cases, a cross-encoder is the better tool.
Do we need to change our vector database to support re-ranking?
No. Re-ranking sits between retrieval and prompt assembly and is independent of the vector database. Pinecone, Qdrant, Weaviate, pgvector — they all return a candidate list that you then pass to whatever re-ranker you choose. The implementation is a separate API call or model invocation after the vector search returns.
How do we measure retrieval quality without ground-truth labels?
Start with a small hand-labeled evaluation set (50-200 query-answer pairs with relevant chunks identified) built by your subject matter experts. This is the most expensive part of the work and the part teams most often try to skip. For ongoing measurement at scale, LLM-as-judge relevance scoring against retrieved chunks is a reasonable proxy that lets you track NDCG@10 trends on production traffic without building a labeled set for every query. The hand-labeled set remains the source of truth for prompt and model changes; the LLM-as-judge proxy is for production monitoring.
What about reciprocal rank fusion — is that a re-ranking strategy?
Reciprocal rank fusion (RRF) is a fusion strategy for combining multiple ranked lists (typically BM25 and dense retrieval). It is not a re-ranking strategy in the sense this post uses the term, because it does not introduce new relevance information; it just combines existing rankings. RRF is a strong default for hybrid search and complements re-ranking rather than substituting for it. A typical strong stack is BM25 plus dense retrieval, fused with RRF, then re-ranked by a cross-encoder on the fused top-50.
If you have a production RAG system that has plateaued on quality and you want a review of where the bottleneck actually is before you add another component, our AI enablement practice runs retrieval audits for mid-market engineering teams. We will tell you honestly whether a re-ranker is the right next investment or whether the work is somewhere else in the stack.

