
pgvector HNSW Tuning: Why Default Settings Quietly Kill Recall

HNSW defaults can look fast while missing useful results. Production tuning needs recall@k, p99, index build time, memory, and filtered result count together.

The most dangerous pgvector dashboard I see is green latency and unmeasured recall. Queries are fast. The assistant answers confidently. Nobody notices that the best chunk is missing until a customer asks why the system ignored the document they just uploaded.

HNSW tuning is not about making vector search fancy. It is about choosing how much graph work you are willing to pay for at build time and query time so that the product returns good-enough results.

The framework: tune for quality first, then cost

I tune HNSW with four numbers on the same page: recall@k, p95/p99 latency, index size, and build time. If you only track latency, the fastest configuration will often be the wrong one.

  • m: graph connectivity. Higher can improve recall but increases index size and build cost.
  • ef_construction: build-time exploration. Higher can improve graph quality but slows index creation.
  • hnsw.ef_search: query-time exploration. Higher often improves recall but increases latency.
  • iterative scans: useful when filters remove too many approximate candidates.
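The build-time knobs above are fixed when the index is created. A minimal sketch, assuming the `document_chunks` table and `embedding` column used in the queries below, and cosine distance (pgvector's defaults are m = 16 and ef_construction = 64):

```sql
-- Build-time parameters are set in the index definition.
-- Values here are illustrative, not a recommendation.
CREATE INDEX CONCURRENTLY document_chunks_embedding_hnsw
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 128);
```

Changing m or ef_construction later means building a new index; only hnsw.ef_search can be adjusted per session or per query.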

Use exact search as the reference

A useful benchmark compares the approximate HNSW result to exact search on the same sample and same filters. Exact search is not the production plan for large tables, but it gives you a truth set.

-- Exact reference for sampled evaluation.
-- Disabling index scans forces a sequential scan, i.e. exact nearest neighbors.
SET enable_indexscan = off;

SELECT id
FROM document_chunks
WHERE organization_id = 42
  AND is_deleted = false
ORDER BY embedding <=> $1
LIMIT 20;

Then test the production path

Run the same query through the HNSW index with realistic filters and concurrency. Change one setting at a time and record recall@20, p99, buffers, and result count.

-- Query-time exploration width (pgvector's default is 40).
SET hnsw.ef_search = 120;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM document_chunks
WHERE organization_id = 42
  AND is_deleted = false
ORDER BY embedding <=> $1
LIMIT 20;
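With both result sets in hand, recall@20 is just set overlap. A sketch, assuming the exact and HNSW ids for one query vector were captured into hypothetical tables `exact_top20` and `hnsw_top20`:

```sql
-- recall@20 = |approximate ∩ exact| / k for the same vector and filters.
-- exact_top20 and hnsw_top20 are illustrative staging tables.
SELECT count(*)::float / 20 AS recall_at_20
FROM hnsw_top20 h
WHERE EXISTS (SELECT 1 FROM exact_top20 e WHERE e.id = h.id);
```

Average this across the golden query set rather than trusting any single query.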

Filtered search changes the tuning target

A setting that works for global search can fail under tenant or ACL filters. If the approximate index finds many candidates that filters later reject, the query may return too few rows or lower-quality rows. That is why pgvector's filtered-search behavior and iterative scan settings matter for production RAG.
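pgvector 0.8.0 added iterative scans for exactly this case: instead of stopping after the initial candidate set, the scan keeps walking the graph until enough rows survive the filter. A sketch of the relevant settings:

```sql
-- Keep scanning until LIMIT rows pass the filter; relaxed_order allows
-- slightly out-of-order results, strict_order preserves distance order.
SET hnsw.iterative_scan = relaxed_order;

-- Bound the extra work so a very selective filter cannot scan forever.
SET hnsw.max_scan_tuples = 20000;
```

Benchmark these under your real filters; they trade latency for result count, and the right cap depends on tenant size distribution.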

The rollout I trust

Using HNSW defaults is not the mistake. The mistake is never proving whether those defaults meet the product's quality bar.

  1. Build a golden query set from real product searches.
  2. Measure exact search on a representative sample.
  3. Benchmark multiple HNSW configurations.
  4. Pick a target recall floor before optimizing latency.
  5. Build new indexes concurrently or in a shadow table.
  6. Cut over with rollback SQL ready.
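Steps 5 and 6 can stay inside Postgres. A sketch, assuming the index names are hypothetical:

```sql
-- Build the candidate configuration alongside the existing index;
-- CONCURRENTLY avoids blocking writes during the build.
CREATE INDEX CONCURRENTLY document_chunks_embedding_hnsw_v2
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 128);

-- Cutover: drop the old index once v2 wins the benchmark.
-- Rollback is the mirror image: drop _v2 and keep the original.
DROP INDEX CONCURRENTLY document_chunks_embedding_hnsw;
```

Keeping both indexes briefly doubles index storage for the table, so check disk headroom before the build.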

The runbook I want before this reaches production

Before I trust this design, I want a small runbook that names the failure mode, the owner, and the rollback path. Vector systems fail in ways that look like product quality problems: missing evidence, stale evidence, wrong-tenant evidence, high p99, or answers that cite weak chunks. If the team cannot tell which one happened, the system is not observable enough.

  • Define a golden query set with real permissions and expected source documents.
  • Track recall, result count, p95, p99, and cost by query class.
  • Keep a rollback path for index, embedding model, chunking, and metadata changes.
  • Test deleted, restricted, fresh, and re-embedded documents as canaries.
  • Review the dashboard after every bulk import, re-embedding job, and index rebuild.

The practical standard is simple: a retrieval change should be measurable before it ships, visible while it runs, and reversible when quality drops.