pgvector Filtered Search: The Production Problem Hiding Behind Good Demos

pgvector works well when you respect the hard part: filtered recall. Tenant filters, permissions, dead tuples, and re-embedding jobs decide whether RAG results stay useful.

The pgvector demo that convinces a team is usually ten lines long. Create an embedding column, add an HNSW index, order by cosine distance, ship a RAG prototype. The query is fast, the answer looks plausible, and the architecture feels refreshingly simple.

Production adds the missing nouns: tenant, user, document status, ACL group, language, freshness, deleted rows, re-embedding version, and a customer who expects the assistant to find the one paragraph they are allowed to see.

That is the real pgvector problem. Storing vectors in Postgres is not the hard part. Keeping filtered retrieval correct after approximate search, churn, and product permissions is the hard part.

The framework: measure the result, not just the query

I treat every pgvector production review as a retrieval quality review first and a database tuning review second.

The question is not only:

  • How fast is the query?
  • How large is the index?
  • How much CPU does it use?

The question is also:

  • Did it return enough rows after filters?
  • Did it return the same useful rows as exact search?
  • Which tenants or permission groups lose recall?
  • Does recall change after deletes, VACUUM, or bulk re-embedding?

If you cannot answer those, pgvector may still be fine, but you are flying by latency alone.

Why filtered recall fails

Approximate vector indexes do not inspect every row. That is why they are fast. With filters, the index may find nearby candidates first, and then Postgres applies tenant, status, permission, or language filters afterward.

If your filter matches 10% of rows and the index initially explores 40 candidates, you should not be surprised when only a handful survive. That may be fine for a recommendations widget. It may be unacceptable for an answer engine that needs evidence.

SELECT id, document_id, body
FROM document_chunks
WHERE organization_id = $1
  AND acl_group_id = ANY($2)
  AND is_deleted = false
ORDER BY embedding <=> $3
LIMIT 12;

This query is not one workload. It is many workloads hiding behind the same SQL shape. Large tenants, small tenants, narrow ACLs, broad ACLs, stale documents, and fresh documents all behave differently.
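One way to see which workload you are actually in is to watch how many index candidates the filters discard. A sketch, assuming the `document_chunks` table above; the tenant id, ACL array, and query vector are placeholders:

```sql
-- Sketch: inspect how many index candidates the filters discard.
-- "Rows Removed by Filter" on the index scan node is post-filtering loss.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, document_id, body
FROM document_chunks
WHERE organization_id = 42                  -- placeholder tenant
  AND acl_group_id = ANY(ARRAY[1, 2])      -- placeholder ACL groups
  AND is_deleted = false
ORDER BY embedding <=> '[0.01, 0.02, 0.03]'  -- placeholder query vector
LIMIT 12;
```

If the query returns fewer than 12 rows, or "Rows Removed by Filter" is large relative to the rows returned, the filters are starving the result set before the model ever sees it.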

Start with an exact reference set

Before tuning HNSW, I build a small evaluation set and compare approximate results to exact results. Exact search gives you the reference. Approximate search gives you the production candidate.

-- Run this on a sampled query set, not on every production request.
BEGIN;
SET LOCAL enable_indexscan = off;

SELECT id
FROM document_chunks
WHERE organization_id = $1
  AND acl_group_id = ANY($2)
  AND is_deleted = false
ORDER BY embedding <=> $3
LIMIT 20;

ROLLBACK;

The exact reference does not need to be cheap. It needs to be honest. Use it to calculate recall@k and result count by tenant, ACL shape, and corpus size.
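Recall@k then reduces to set overlap between the two runs. A sketch, assuming you store both runs in evaluation tables; the table and column names here are illustrative, not from the article:

```sql
-- Hypothetical evaluation tables, one row per (query, returned chunk):
--   eval_exact(query_id, chunk_id)  holds the exact reference top-k.
--   eval_approx(query_id, chunk_id) holds the HNSW results for the same queries.
SELECT e.query_id,
       count(a.chunk_id)::numeric / count(e.chunk_id) AS recall_at_k
FROM eval_exact e
LEFT JOIN eval_approx a
  ON a.query_id = e.query_id
 AND a.chunk_id = e.chunk_id
GROUP BY e.query_id
ORDER BY recall_at_k;
```

Joining the result back to a queries table lets you slice the same number by tenant size, ACL shape, and corpus size, which is where the failures hide.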

Tune HNSW only after you know what is missing

HNSW gives you useful knobs, but knobs without a recall target turn into superstition.

CREATE INDEX document_chunks_embedding_hnsw
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

Higher construction settings can improve recall at build-time and insert-time cost. Query-time search breadth can also improve recall, but it spends CPU and latency. The right value is not universal; it is whatever meets your product recall target under realistic concurrency.
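If the evaluation shows rows missing at build-time defaults, one option is to build a replacement index with higher construction settings alongside the old one and then swap. A sketch; the settings are a starting point, not a recommendation:

```sql
-- Build a higher-recall index without blocking writes, then retire the old one.
CREATE INDEX CONCURRENTLY document_chunks_embedding_hnsw_v2
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 24, ef_construction = 128);

DROP INDEX CONCURRENTLY document_chunks_embedding_hnsw;
```

Budget for the build: higher m and ef_construction mean a longer build and a larger index, which is exactly the rebuild-window cost discussed later.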

SET hnsw.ef_search = 200;

If the product needs stable evidence retrieval, I prefer a slower query that returns the right rows over a fast query that quietly starves the model of context.
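ef_search does not have to be a global setting. Scoping it to a transaction lets evidence-critical queries pay for recall while cheaper lookups keep the default:

```sql
BEGIN;
-- SET LOCAL reverts at COMMIT/ROLLBACK, so only this query pays the cost.
SET LOCAL hnsw.ef_search = 200;

SELECT id, document_id, body
FROM document_chunks
WHERE organization_id = $1
  AND acl_group_id = ANY($2)
  AND is_deleted = false
ORDER BY embedding <=> $3
LIMIT 12;
COMMIT;
```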

Use iterative scans for the problem they actually solve

Iterative scans let a filtered approximate query keep scanning the index until enough rows survive the filters or a configured limit is reached.

SET hnsw.iterative_scan = strict_order;
-- or, when slight ordering looseness is acceptable:
SET hnsw.iterative_scan = relaxed_order;

This is not magic. It is a way to spend more work when filters discard too many candidates. You still need guardrails such as max scan tuples and memory limits. Otherwise a narrow permission filter can turn one user's query into a noisy neighbor for everyone else.
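Recent pgvector versions expose guardrails for exactly this; the values below are illustrative starting points, not recommendations:

```sql
-- Stop an iterative scan after visiting this many tuples.
SET hnsw.max_scan_tuples = 20000;
-- Cap scan memory as a multiple of work_mem.
SET hnsw.scan_mem_multiplier = 2;
```

The point of the cap is asymmetry: a narrow ACL should degrade its own result count, not everyone else's latency.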

Index filters before you blame vectors

If a filter is selective, a normal relational index can be the best vector optimization you make.

CREATE INDEX document_chunks_acl_lookup
ON document_chunks (organization_id, acl_group_id, language)
WHERE is_deleted = false;

For a small number of dominant filter values, partial vector indexes can be practical:

CREATE INDEX document_chunks_en_active_hnsw
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WHERE language = 'en' AND is_deleted = false;

For many values, one partial index per tenant is usually a maintenance trap. Partitioning by tenant tier, region, or corpus class is often cleaner because it reflects the shape of the workload rather than the shape of one emergency query.
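A sketch of that shape, assuming a `tenant_tier` column and a 1536-dimension embedding; neither appears in the queries above, so treat both as illustrative:

```sql
-- Partition by workload shape instead of one partial index per tenant.
CREATE TABLE document_chunks_parted (
  id              bigint,
  organization_id bigint,
  tenant_tier     text,
  acl_group_id    bigint,
  language        text,
  is_deleted      boolean,
  document_id     bigint,
  body            text,
  embedding       vector(1536)
) PARTITION BY LIST (tenant_tier);

CREATE TABLE document_chunks_enterprise PARTITION OF document_chunks_parted
  FOR VALUES IN ('enterprise');
CREATE TABLE document_chunks_standard PARTITION OF document_chunks_parted
  FOR VALUES IN ('standard');

-- Creating the index on the parent builds one HNSW index per partition,
-- so each tier gets an index sized to its own corpus.
CREATE INDEX ON document_chunks_parted
USING hnsw (embedding vector_cosine_ops);
```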

Churn is the quiet recall killer

RAG data is not static. Documents get re-chunked, deleted, re-embedded, restored, and bulk imported. That means normal Postgres maintenance matters:

  • Dead tuples can reduce useful results before cleanup catches up.
  • Bulk re-embedding can change recall and latency at the same time.
  • Index rebuilds can become release events if the index no longer builds within a maintenance window.
  • Autovacuum lag can become a retrieval quality issue, not just a storage issue.
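Postgres already reports the churn signals that matter here; a query against pg_stat_user_tables is enough to start:

```sql
-- Dead tuple ratio and vacuum recency for the retrieval table.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric / nullif(n_live_tup + n_dead_tup, 0), 3)
         AS dead_ratio,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'document_chunks';
```

Trend the dead ratio against recall@k from the evaluation set: if both move together after bulk re-embedding jobs, vacuum lag has become a retrieval problem.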

This is where "it worked in staging" stops helping. Staging rarely has the same delete patterns, tenant skew, or re-embedding backlog as production.

The dashboard I want before scaling pgvector

  • Recall@k by tenant size: global recall hides small-corpus failures.
  • Rows requested versus rows returned: the model cannot use chunks the query did not return.
  • Filter selectivity: track how much each filter narrows the corpus.
  • Dead tuple and vacuum lag: retrieval quality can drift after churn.
  • Index build time and size: the rebuild window is part of the architecture.
  • p95 and p99 under concurrency: single-query latency is a demo metric.
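The filter-selectivity line is cheap to measure directly. A sketch that reports how much each predicate narrows one tenant's corpus, using the same placeholders as the retrieval query:

```sql
-- How much of one tenant's corpus survives each filter stage.
SELECT count(*)                                       AS total_rows,
       count(*) FILTER (WHERE NOT is_deleted)         AS after_deleted_filter,
       count(*) FILTER (WHERE NOT is_deleted
                          AND acl_group_id = ANY($2)) AS after_acl_filter
FROM document_chunks
WHERE organization_id = $1;
```

When after_acl_filter is a sliver of total_rows, you know before tuning any HNSW knob that this tenant needs iterative scans, a bigger ef_search, or a different index shape.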

When pgvector is still the right answer

Keep pgvector when the team benefits from one database and the measured recall is good enough. That often means early or mid-stage RAG, internal assistants, customer support search, product documentation search, and workloads where product truth already lives in Postgres.

The win is not only cost. It is one backup story, one permission model, one transaction boundary, and one place to debug stale document state.

When pgvector has become the wrong simplification

Move the retrieval workload out when vector search needs independent scaling, strict p99, high update cadence, and heavy filters where recall cannot be allowed to drift. At that point, keeping retrieval inside the OLTP database can make both systems worse.

The line is not a fixed number of vectors. The line is when retrieval has its own SLO and its own failure modes. Once that happens, treat it like a serving system.