A support assistant failed on the simplest query in the system: an error code. The vector search found conceptually similar troubleshooting pages, and the one page that actually contained the code ranked below them, because semantic similarity did not understand that the literal code was the entire query.
Hybrid search exists because users ask both kinds of questions. Sometimes they want meaning. Sometimes they want the exact string. Production search needs to respect both.
The framework: classify query intent before blending
I do not start hybrid search by choosing a blending formula. I start by classifying failure modes. Product names, SKUs, table names, stack traces, customer IDs, and error codes need lexical precision. Natural-language questions, summaries, and vague descriptions need semantic recall.
- Exact identifiers should be protected from being buried by semantic results.
- Semantic matches should rescue users who do not know the exact wording.
- Rerankers should improve a candidate set, not hide poor retrieval.
- Evaluation needs query classes, not one global score.
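The classification itself can be crude and still pay off. Here is a minimal sketch, written in SQL only so it sits next to the retrieval queries below; the patterns and class names are illustrative assumptions, and in most systems this check lives in application code before any query is built.

-- Illustrative routing check: queries that look like exact identifiers
-- get a lexical-first path. The regexes are assumptions, not a spec.
SELECT CASE
           WHEN query ~ '^[A-Za-z]{2,8}[-_]?[0-9]{2,}$'         THEN 'lexical_first'   -- e.g. ERR-40013, SKU12345
           WHEN query ~ '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-' THEN 'lexical_first'   -- UUID-shaped
           WHEN query ~ '^[0-9]{5,}$'                           THEN 'lexical_first'   -- bare numeric ID
           ELSE 'semantic_first'
       END AS query_class
FROM (VALUES ($1::text)) AS q(query);

The labels only decide which channel gets protected in the blend; both channels still run.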
A practical Postgres shape
In Postgres, hybrid search often means combining full-text search rank (ts_rank over a tsvector) with pgvector distance. The details vary, but the principle is stable: produce candidates from both channels, normalize or fuse their ranks, then evaluate against real queries.
-- Naive blend: take the top candidates from each channel, then sum the
-- two scores. This assumes the scores are on comparable scales, which
-- they usually are not; treat it as a starting shape, not the answer.
WITH lexical AS (
    SELECT id,
           ts_rank_cd(search_vector, plainto_tsquery('english', $1)) AS rank
    FROM docs
    WHERE search_vector @@ plainto_tsquery('english', $1)
    ORDER BY rank DESC   -- keep the best lexical matches, not an arbitrary 100
    LIMIT 100
),
semantic AS (
    SELECT id,
           1.0 / (1.0 + (embedding <=> $2)) AS rank   -- pgvector distance mapped into (0, 1]
    FROM docs
    ORDER BY embedding <=> $2
    LIMIT 100
)
SELECT id
FROM (
    SELECT id, rank, 'lexical' AS source FROM lexical
    UNION ALL
    SELECT id, rank, 'semantic' AS source FROM semantic
) candidates
GROUP BY id
ORDER BY sum(rank) DESC
LIMIT 20;
RRF is a good baseline
Reciprocal rank fusion is popular because it uses relative rank instead of pretending lexical and vector scores are naturally comparable. It is a sane starting point, especially when score scales differ across engines.
It is still not magic. If one channel never retrieves the right document, fusion cannot recover it.
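To make the baseline concrete, here is the same two-channel query reshaped for RRF, assuming the same docs table as above. Each channel contributes 1 / (k + rank) for every document it returns, so only relative position matters; k = 60 is the conventional constant from the original RRF formulation and is a starting value to tune, not a law.

WITH lexical AS (
    SELECT id,
           row_number() OVER (
               ORDER BY ts_rank_cd(search_vector, plainto_tsquery('english', $1)) DESC
           ) AS r
    FROM docs
    WHERE search_vector @@ plainto_tsquery('english', $1)
    ORDER BY r
    LIMIT 100
),
semantic AS (
    SELECT id,
           row_number() OVER (ORDER BY embedding <=> $2) AS r
    FROM docs
    ORDER BY embedding <=> $2
    LIMIT 100
)
SELECT id,
       sum(1.0 / (60 + r)) AS rrf_score   -- 60 is the usual k; tune per query class
FROM (
    SELECT id, r FROM lexical
    UNION ALL
    SELECT id, r FROM semantic
) candidates
GROUP BY id
ORDER BY rrf_score DESC
LIMIT 20;

Documents that appear in both channels rise naturally, and a document found by only one channel still gets a fair position-based score instead of being drowned by the other channel's scale.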
What I measure
- Exact identifier queries: does the exact document rank first?
- Natural language queries: does semantic recall improve over BM25 alone?
- Ambiguous queries: does the reranker choose the right intent?
- No-hit queries: does the system admit uncertainty instead of inventing confidence?
- Latency: does hybrid retrieval plus reranking fit the product p99?
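Keeping those numbers split by class is mostly bookkeeping. A sketch, assuming two hypothetical tables that do not appear above: golden_queries (the labeled query set) and retrieval_results (logged result positions per query run).

-- golden_queries(query_id, query_class, expected_doc_id) and
-- retrieval_results(query_id, doc_id, position) are illustrative names.
SELECT g.query_class,
       count(*) AS queries,
       avg(CASE WHEN r.position = 1  THEN 1.0 ELSE 0.0 END) AS hit_at_1,
       avg(CASE WHEN r.position <= 20 THEN 1.0 ELSE 0.0 END) AS hit_at_20
FROM golden_queries g
LEFT JOIN retrieval_results r
       ON r.query_id = g.query_id
      AND r.doc_id   = g.expected_doc_id
GROUP BY g.query_class
ORDER BY g.query_class;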
The production default
Use hybrid search when the corpus mixes exact identifiers and explanatory prose. Keep separate metrics for lexical-heavy and semantic-heavy queries. The mistake is not adding vector search to BM25. The mistake is assuming one blended score understands user intent.
The runbook I want before this reaches production
Before I trust this design, I want a small runbook that names the failure mode, the owner, and the rollback path. Vector systems fail in ways that look like product quality problems: missing evidence, stale evidence, wrong-tenant evidence, high p99, or answers that cite weak chunks. If the team cannot tell which one happened, the system is not observable enough.
- Define a golden query set with real permissions and expected source documents.
- Track recall, result count, p95, p99, and cost by query class.
- Keep a rollback path for index, embedding model, chunking, and metadata changes.
- Test deleted, restricted, fresh, and re-embedded documents as canaries.
- Review the dashboard after every bulk import, re-embedding job, and index rebuild.
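One possible shape for that golden set, consistent with the evaluation sketch above; column names and types are assumptions, and the point is only that every query carries its class, its permission context, and the document it is expected to surface.

-- Illustrative schema; adjust types to match the real docs table.
CREATE TABLE golden_queries (
    query_id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    query_class     text   NOT NULL,  -- 'exact_identifier', 'natural_language', 'ambiguous', 'no_hit'
    query_text      text   NOT NULL,
    tenant_id       text   NOT NULL,  -- the permission context the query must run under
    expected_doc_id bigint            -- NULL for queries that should return nothing
);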
The practical standard is simple: a retrieval change should be measurable before it ships, visible while it runs, and reversible when quality drops.