The team added a reranker and quality improved in the demo. In production, p99 doubled and the same missing-document complaints remained. The reranker was judging the wrong candidate set.
Rerankers are useful, but they are not a substitute for retrieval. They can reorder evidence. They cannot rank evidence that never arrived.
The framework: retrieve wide enough, rerank narrowly enough
A reranker works best when first-stage retrieval has high recall and too many weak candidates. It is the wrong first fix when filters, chunking, or embedding quality prevent the right documents from entering the candidate set.
- First-stage retrieval should optimize recall.
- Reranking should optimize ordering and precision.
- Context packing should optimize what the model sees.
- Evaluation should measure all three separately.
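The three-layer split above can be sketched as three small functions with different objectives. This is a toy sketch, not a real retrieval stack: the token-overlap scorer stands in for first-stage retrieval, the phrase-containment check stands in for a cross-encoder reranker, and all function names are illustrative.

```python
def first_stage_retrieve(query, corpus, top_k):
    """Recall-oriented stage: cheap token-overlap score, wide top_k."""
    def overlap(doc):
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / max(len(q), 1)
    return sorted(corpus, key=overlap, reverse=True)[:top_k]

def rerank(query, candidates, top_k):
    """Precision-oriented stage: a cross-encoder would score each
    (query, doc) pair; exact-phrase containment is the stand-in here."""
    def score(doc):
        return (query.lower() in doc.lower(),
                len(set(query.split()) & set(doc.split())))
    return sorted(candidates, key=score, reverse=True)[:top_k]

def pack_context(candidates, budget_chars):
    """Packing stage: only what fits the model's context budget."""
    packed, used = [], 0
    for doc in candidates:
        if used + len(doc) > budget_chars:
            break
        packed.append(doc)
        used += len(doc)
    return packed
```

The point of the split is that each stage can be measured and tuned alone: widening `top_k` in the first stage should never be blocked by reranker cost, and shrinking the packed context should never silently shrink recall.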
Candidate size is the cost lever
Reranking 200 chunks costs more than reranking 30. But reranking too few candidates can cap quality. The right number depends on query class, corpus size, latency budget, and model cost.
{
  "query_id": "support-1842",
  "first_stage_top_k": 80,
  "reranked_top_k": 12,
  "expected_doc_found_before_rerank": true,
  "expected_doc_rank_after_rerank": 2,
  "reranker_latency_ms": 94
}
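Records like the one above are what make the candidate-size decision empirical. A minimal aggregation, assuming a log source that yields such records (field names mirror the example; the log source itself is not specified here):

```python
def summarize(records):
    """Aggregate per-query rerank logs: recall before rerank tells you
    whether first_stage_top_k is wide enough; p95 latency tells you
    whether the reranked candidate count fits the latency budget."""
    found = [r["expected_doc_found_before_rerank"] for r in records]
    lat = sorted(r["reranker_latency_ms"] for r in records)
    return {
        "recall_before_rerank": sum(found) / len(found),
        "p95_rerank_latency_ms": lat[int(0.95 * (len(lat) - 1))],
    }
```

If recall before rerank is low, widening `first_stage_top_k` is the fix; if p95 latency is over budget, shrinking the reranked set is. The two knobs fail differently, which is why they are logged separately.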
Where rerankers help
- Hybrid search candidates with mixed lexical and vector signals.
- Long documents where several chunks are semantically close.
- Queries where the top vector result is related but not answer-bearing.
- Enterprise search where precision in the top five matters more than raw top_k.
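For the hybrid case in particular, the candidate set the reranker sees is often a fusion of two rankings. Reciprocal rank fusion is one common way to merge them; the sketch below assumes the two inputs are already-ranked lists of document ids, and uses the conventional k=60 smoothing constant.

```python
def rrf_fuse(lexical_ranking, vector_ranking, k=60):
    """Reciprocal rank fusion: merge a lexical and a vector ranking of
    doc ids into one candidate list for the reranker. Documents ranked
    highly by either list get a high fused score."""
    scores = {}
    for ranking in (lexical_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Fusion deliberately optimizes recall, not ordering: it keeps everything either ranking surfaced, and leaves the precise ordering of the merged set to the reranker.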
Where rerankers hide bad design
If deleted documents appear in candidates, fix freshness. If permission filters remove most candidates, fix filtered retrieval. If exact identifiers fail, fix hybrid search. A reranker can improve ordering, but it should not be the layer that compensates for broken product rules.
The production default
Add reranking after you have recall metrics, query classes, and a latency budget. Measure cost per grounded answer, not only answer quality. The mistake is using a reranker as a bandage for retrieval you have not measured.
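Cost per grounded answer can be a one-line metric once per-query costs and a groundedness label exist. A minimal sketch, assuming each query record carries per-stage costs and a `grounded` flag (all field names are illustrative):

```python
def cost_per_grounded_answer(queries):
    """Total pipeline spend divided by the number of answers that were
    actually grounded in retrieved evidence. A reranker that raises cost
    without raising the grounded count makes this number worse."""
    total_cost = sum(
        q["retrieval_cost"] + q["rerank_cost"] + q["llm_cost"]
        for q in queries
    )
    grounded = sum(q["grounded"] for q in queries)
    return total_cost / grounded if grounded else float("inf")
```

The denominator is what keeps the metric honest: a reranker justified only by average answer-quality scores can still lose here if it adds latency and spend without converting ungrounded answers into grounded ones.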
The runbook I want before this reaches production
Before I trust this design, I want a small runbook that names the failure mode, the owner, and the rollback path. Vector systems fail in ways that look like product quality problems: missing evidence, stale evidence, wrong-tenant evidence, high p99, or answers that cite weak chunks. If the team cannot tell which one happened, the system is not observable enough.
- Define a golden query set with real permissions and expected source documents.
- Track recall, result count, p95, p99, and cost by query class.
- Keep a rollback path for index, embedding model, chunking, and metadata changes.
- Test deleted, restricted, fresh, and re-embedded documents as canaries.
- Review the dashboard after every bulk import, re-embedding job, and index rebuild.
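The canary item above can be mechanized as a small check that runs after every bulk import or rebuild. This is a sketch under assumptions: `search` stands for the production retrieval call returning dicts with an `"id"` field, and each canary names a query, a document id, and whether that document must appear.

```python
def run_canaries(search, canaries):
    """canaries: (query, doc_id, must_appear) triples.
    Deleted and restricted docs are canaries with must_appear=False;
    fresh and re-embedded docs are canaries with must_appear=True.
    Returns the list of violated expectations."""
    failures = []
    for query, doc_id, must_appear in canaries:
        ids = [d["id"] for d in search(query)]
        if (doc_id in ids) != must_appear:
            failures.append(
                (query, doc_id, "missing" if must_appear else "leaked")
            )
    return failures
```

A non-empty failures list maps directly onto the failure modes named earlier: a "leaked" deleted document is a freshness bug, a "leaked" restricted document is a permissions bug, and a "missing" fresh document is an indexing bug.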
The practical standard is simple: a retrieval change should be measurable before it ships, visible while it runs, and reversible when quality drops.