A support assistant failed on the simplest query in the system: an error code. The vector search found conceptually similar troubleshooting pages, and the one page that actually contained the code ranked below them, because semantic similarity did not understand that the literal code was the entire query.
Hybrid search exists because users ask both kinds of questions. Sometimes they want meaning. Sometimes they want the exact string. Production search needs to respect both.
The framework: classify query intent before blending
I do not start hybrid search by choosing a blending formula. I start by classifying failure modes. Product names, SKUs, table names, stack traces, customer IDs, and error codes need lexical precision. Natural-language questions, summaries, and vague descriptions need semantic recall.
- Exact identifiers should be protected from being buried by semantic results.
- Semantic matches should rescue users who do not know the exact wording.
- Rerankers should improve a candidate set, not hide poor retrieval.
- Evaluation needs query classes, not one global score.
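The classification itself can be crude and still pay off. Here is a minimal sketch, written in SQL only so it sits next to the retrieval queries below; the patterns and class names are illustrative assumptions, and in most systems this check lives in application code before any query is built.

-- Illustrative routing check: queries that look like exact identifiers
-- get a lexical-first path. The regexes are assumptions, not a spec.
SELECT CASE
           WHEN query ~ '^[A-Za-z]{2,8}[-_]?[0-9]{2,}$'         THEN 'lexical_first'   -- e.g. ERR-40013, SKU12345
           WHEN query ~ '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-' THEN 'lexical_first'   -- UUID-shaped
           WHEN query ~ '^[0-9]{5,}$'                           THEN 'lexical_first'   -- bare numeric ID
           ELSE 'semantic_first'
       END AS query_class
FROM (VALUES ($1::text)) AS q(query);

The labels only decide which channel gets protected in the blend; both channels still run.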
A practical Postgres shape
In Postgres, hybrid search often means combining full-text search rank (ts_rank over a tsvector) with pgvector distance. The details vary, but the principle is stable: produce candidates from both channels, normalize or fuse their ranks, then evaluate against real queries.
-- Naive blend: take the top candidates from each channel, then sum the
-- two scores. This assumes the scores are on comparable scales, which
-- they usually are not; treat it as a starting shape, not the answer.
WITH lexical AS (
    SELECT id,
           ts_rank_cd(search_vector, plainto_tsquery('english', $1)) AS rank
    FROM docs
    WHERE search_vector @@ plainto_tsquery('english', $1)
    ORDER BY rank DESC   -- keep the best lexical matches, not an arbitrary 100
    LIMIT 100
),
semantic AS (
    SELECT id,
           1.0 / (1.0 + (embedding <=> $2)) AS rank   -- pgvector distance mapped into (0, 1]
    FROM docs
    ORDER BY embedding <=> $2
    LIMIT 100
)
SELECT id
FROM (
    SELECT id, rank, 'lexical' AS source FROM lexical
    UNION ALL
    SELECT id, rank, 'semantic' AS source FROM semantic
) candidates
GROUP BY id
ORDER BY sum(rank) DESC
LIMIT 20;
RRF is a good baseline
Reciprocal rank fusion is popular because it uses relative rank instead of pretending lexical and vector scores are naturally comparable. It is a sane starting point, especially when score scales differ across engines.
It is still not magic. If one channel never retrieves the right document, fusion cannot recover it.
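To make the baseline concrete, here is the same two-channel query reshaped for RRF, assuming the same docs table as above. Each channel contributes 1 / (k + rank) for every document it returns, so only relative position matters; k = 60 is the conventional constant from the original RRF formulation and is a starting value to tune, not a law.

WITH lexical AS (
    SELECT id,
           row_number() OVER (
               ORDER BY ts_rank_cd(search_vector, plainto_tsquery('english', $1)) DESC
           ) AS r
    FROM docs
    WHERE search_vector @@ plainto_tsquery('english', $1)
    ORDER BY r
    LIMIT 100
),
semantic AS (
    SELECT id,
           row_number() OVER (ORDER BY embedding <=> $2) AS r
    FROM docs
    ORDER BY embedding <=> $2
    LIMIT 100
)
SELECT id,
       sum(1.0 / (60 + r)) AS rrf_score   -- 60 is the usual k; tune per query class
FROM (
    SELECT id, r FROM lexical
    UNION ALL
    SELECT id, r FROM semantic
) candidates
GROUP BY id
ORDER BY rrf_score DESC
LIMIT 20;

Documents that appear in both channels rise naturally, and a document found by only one channel still gets a fair position-based score instead of being drowned by the other channel's scale.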
What I measure
- Exact identifier queries: does the exact document rank first?
- Natural language queries: does semantic recall improve over BM25 alone?
- Ambiguous queries: does the reranker choose the right intent?
- No-hit queries: does the system admit uncertainty instead of inventing confidence?
- Latency: does hybrid retrieval plus reranking fit the product p99?
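Keeping those numbers split by class is mostly bookkeeping. A sketch, assuming two hypothetical tables that do not appear above: golden_queries (the labeled query set) and retrieval_results (logged result positions per query run).

-- golden_queries(query_id, query_class, expected_doc_id) and
-- retrieval_results(query_id, doc_id, position) are illustrative names.
SELECT g.query_class,
       count(*) AS queries,
       avg(CASE WHEN r.position = 1  THEN 1.0 ELSE 0.0 END) AS hit_at_1,
       avg(CASE WHEN r.position <= 20 THEN 1.0 ELSE 0.0 END) AS hit_at_20
FROM golden_queries g
LEFT JOIN retrieval_results r
       ON r.query_id = g.query_id
      AND r.doc_id   = g.expected_doc_id
GROUP BY g.query_class
ORDER BY g.query_class;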
The production default
Use hybrid search when the corpus mixes exact identifiers and explanatory prose. Keep separate metrics for lexical-heavy and semantic-heavy queries. The mistake is not adding vector search to BM25. The mistake is assuming one blended score understands user intent.
The runbook I want before this reaches production
Before I trust this design, I want a small runbook that names the failure mode, the owner, and the rollback path. Vector systems fail in ways that look like product quality problems: missing evidence, stale evidence, wrong-tenant evidence, high p99, or answers that cite weak chunks. If the team cannot tell which one happened, the system is not observable enough.
- Define a golden query set with real permissions and expected source documents.
- Track recall, result count, p95, p99, and cost by query class.
- Keep a rollback path for index, embedding model, chunking, and metadata changes.
- Test deleted, restricted, fresh, and re-embedded documents as canaries.
- Review the dashboard after every bulk import, re-embedding job, and index rebuild.
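One possible shape for that golden set, consistent with the evaluation sketch above; column names and types are assumptions, and the point is only that every query carries its class, its permission context, and the document it is expected to surface.

-- Illustrative schema; adjust types to match the real docs table.
CREATE TABLE golden_queries (
    query_id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    query_class     text   NOT NULL,  -- 'exact_identifier', 'natural_language', 'ambiguous', 'no_hit'
    query_text      text   NOT NULL,
    tenant_id       text   NOT NULL,  -- the permission context the query must run under
    expected_doc_id bigint            -- NULL for queries that should return nothing
);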
The practical standard is simple: a retrieval change should be measurable before it ships, visible while it runs, and reversible when quality drops.