The first vector search demo usually has one table, one embedding column, and one query. It returns similar chunks quickly, and everyone leaves the meeting thinking the architecture is solved.
Production adds the missing parts: tenants, permissions, deleted documents, language filters, freshness windows, paid-plan boundaries, and compliance rules. The nearest vector is no longer enough. The nearest allowed vector is what matters.
That is where many RAG systems fail quietly. Latency looks fine, but the answers are wrong often enough that users stop trusting them.
The framework: recall after filters
I treat filtered vector search as a product correctness problem before I treat it as an index tuning problem. The question is not whether the database can return ten vectors. The question is whether it can return ten useful vectors after every product rule has been applied.
- Measure recall on the filtered query, not only on global nearest-neighbor search.
- Track result count when filters are selective, because a query that asks for 20 rows can come back with fewer once the filters are applied.
- Split metrics by tenant size, permission set, language, and document state.
- Test exact search as a reference on a controlled sample.
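The checks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production harness: the `Chunk` type, the toy two-dimensional embeddings, and the `returned` list (standing in for whatever the real system answered) are all hypothetical. The key idea is that the reference set is computed by filtering first and then ranking every allowed row exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical row shape; real tables will carry more filter columns.
    id: int
    tenant: str
    is_deleted: bool
    embedding: tuple


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return 1.0 - dot / (norm_a * norm_b)


def exact_filtered_top_k(chunks, query, tenant, k):
    """Reference result: apply the filters first, then rank every allowed chunk."""
    allowed = [c for c in chunks if c.tenant == tenant and not c.is_deleted]
    allowed.sort(key=lambda c: cosine_distance(c.embedding, query))
    return [c.id for c in allowed[:k]]


def filtered_recall(returned_ids, reference_ids):
    """Fraction of the exact filtered top-k that the system actually returned."""
    if not reference_ids:
        return 1.0  # nothing was allowed, so nothing could be missed
    return len(set(returned_ids) & set(reference_ids)) / len(reference_ids)


chunks = [
    Chunk(1, "acme", False, (1.0, 0.0)),
    Chunk(2, "acme", False, (0.9, 0.1)),
    Chunk(3, "acme", True, (1.0, 0.0)),     # deleted: must never be returned
    Chunk(4, "globex", False, (1.0, 0.0)),  # other tenant: must never be returned
    Chunk(5, "acme", False, (0.0, 1.0)),
]
query = (1.0, 0.0)
k = 2

reference = exact_filtered_top_k(chunks, query, "acme", k)
returned = [1]  # hypothetical output of the system under test

recall = filtered_recall(returned, reference)
short_result = len(returned) < k  # flag queries that came back under-filled
print(reference, recall, short_result)
```

Recording `recall` and `short_result` per query, together with the tenant, permission set, language, and document state, makes it possible to split the metrics as the list suggests. Against a real pgvector table, the reference would more likely come from running the same filtered query with the approximate index bypassed on a controlled sample.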
The query shape exposes the product contract
In pgvector, I prefer making every retrieval rule visible in the table. Tenant, ACL, deletion state, language, and embedding version are not metadata decorations. They decide whether the result is allowed to exist.
    -- $1: tenant, $2: caller's ACL group, $3: query embedding
    SELECT id, document_id
    FROM document_chunks
    WHERE organization_id = $1
      AND acl_group_id = $2
      AND is_deleted = false
      AND embedding_model = 'text-embedding-3-small'
    ORDER BY embedding <=> $3  -- <=> is cosine distance in pgvector
    LIMIT 20;
What breaks first
Tenant skew is usually first. One tenant has millions of chunks and another has a few thousand. The same top_k and index settings do not behave the same for both.
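Permissions are second. If candidate generation happens before filtering, a user with limited access can lose their best allowed matches: the candidate list fills up with chunks from documents they cannot see, those chunks are discarded afterward, and the matches they were entitled to never made the list.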
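A toy example makes the permissions failure concrete. It is a sketch under simplifying assumptions: embeddings are one-dimensional numbers, the corpus is five rows, and `post_filter_search` and `pre_filter_search` are hypothetical helpers rather than how any particular database executes a query.

```python
def nearest(items, query, k):
    """Rank (id, acl_group, position) tuples by distance to the query."""
    return sorted(items, key=lambda item: abs(item[2] - query))[:k]


def post_filter_search(items, query, group, k, candidates):
    """Pick candidates across all rows, then drop what the caller cannot see."""
    pool = nearest(items, query, candidates)
    return [item[0] for item in pool if item[1] == group][:k]


def pre_filter_search(items, query, group, k):
    """Restrict to rows the caller can see, then rank only those."""
    allowed = [item for item in items if item[1] == group]
    return [item[0] for item in nearest(allowed, query, k)]


# The three closest chunks all belong to a group the caller is not in.
corpus = [
    (1, "finance", 0.10),
    (2, "finance", 0.11),
    (3, "finance", 0.12),
    (4, "support", 0.30),
    (5, "support", 0.35),
]
query = 0.1

post = post_filter_search(corpus, query, "support", k=2, candidates=3)
pre = pre_filter_search(corpus, query, "support", k=2)
widened = post_filter_search(corpus, query, "support", k=2, candidates=5)

print(post)     # the support user gets nothing back
print(pre)      # the nearest allowed chunks
print(widened)  # a larger candidate pool recovers them, at extra cost
```

The post-filtered call returns an empty list even though two allowed chunks exist, which is why result count belongs next to recall on the dashboard. Widening the candidate pool helps in this toy case, but the pool size needed depends on how selective the permission filter is for each user.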
Freshness is third. Deleted or re-embedded chunks may remain searchable if the vector store and source-of-truth database drift apart.
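Drift of this kind can be caught with a periodic reconciliation pass. The sketch below assumes both sides can be exported as plain dictionaries. The field names (`deleted`, `embedding_version`, `chunk_id`, `document_id`) and the `find_drift` helper are illustrative assumptions, not part of any vector store's API.

```python
def find_drift(source_docs, indexed_chunks):
    """Compare source-of-truth documents with what the vector store still serves.

    source_docs: {document_id: {"deleted": bool, "embedding_version": str}}
    indexed_chunks: list of {"chunk_id", "document_id", "embedding_version"}
    """
    drift = {
        "orphaned": [],                 # chunk points at a document that no longer exists
        "deleted_but_searchable": [],   # document is deleted, chunk is still indexed
        "wrong_embedding_version": [],  # chunk was not re-embedded with the document
    }
    for chunk in indexed_chunks:
        doc = source_docs.get(chunk["document_id"])
        if doc is None:
            drift["orphaned"].append(chunk["chunk_id"])
        elif doc["deleted"]:
            drift["deleted_but_searchable"].append(chunk["chunk_id"])
        elif doc["embedding_version"] != chunk["embedding_version"]:
            drift["wrong_embedding_version"].append(chunk["chunk_id"])
    return drift


source_docs = {
    "doc-a": {"deleted": False, "embedding_version": "v2"},
    "doc-b": {"deleted": True, "embedding_version": "v2"},
}
indexed_chunks = [
    {"chunk_id": 10, "document_id": "doc-a", "embedding_version": "v2"},  # healthy
    {"chunk_id": 11, "document_id": "doc-a", "embedding_version": "v1"},  # stale embedding
    {"chunk_id": 12, "document_id": "doc-b", "embedding_version": "v2"},  # deleted document
    {"chunk_id": 13, "document_id": "doc-c", "embedding_version": "v2"},  # unknown document
]

drift = find_drift(source_docs, indexed_chunks)
print(drift)
```

Any non-empty bucket is a chunk that can still surface in answers when it should not, so the counts are worth alerting on rather than only logging.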
The operational default
Keep the retrieval rules close to the vector row and test recall after those rules. In pgvector, that means schema and indexes. In dedicated vector databases, that means metadata filter design, namespace strategy, and sync discipline.
The mistake is treating metadata filters as a secondary feature. In production RAG, filters are part of the answer.
The runbook I want before this reaches production
Before I trust this design, I want a small runbook that names the failure mode, the owner, and the rollback path. Vector systems fail in ways that look like product quality problems: missing evidence, stale evidence, wrong-tenant evidence, high p99, or answers that cite weak chunks. If the team cannot tell which one happened, the system is not observable enough.
- Define a golden query set with real permissions and expected source documents.
- Track recall, result count, p95 and p99 latency, and cost by query class.
- Keep a rollback path for index, embedding model, chunking, and metadata changes.
- Test deleted, restricted, fresh, and re-embedded documents as canaries.
- Review the dashboard after every bulk import, re-embedding job, and index rebuild.
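One way to make the first items on this list executable is a golden set in which each query carries both the documents it should find and the documents it must never return. The sketch below is an assumption about shape, not a prescribed format: `GoldenQuery`, `evaluate`, `summarize_by_class`, the document names, and the `fake_results` mapping (standing in for real retrieval output) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenQuery:
    name: str
    query_class: str        # e.g. "restricted_user", "deleted_doc"
    expected_docs: set      # documents a correct answer should cite
    forbidden_docs: set = field(default_factory=set)  # deleted or restricted


def evaluate(golden, returned_docs):
    """Score one golden query against the documents the system returned."""
    returned = set(returned_docs)
    if golden.expected_docs:
        recall = len(golden.expected_docs & returned) / len(golden.expected_docs)
    else:
        recall = 1.0  # canary with nothing to find; only leaks matter
    return {
        "name": golden.name,
        "query_class": golden.query_class,
        "recall": recall,
        "result_count": len(returned_docs),
        "leaked": sorted(golden.forbidden_docs & returned),
    }


def summarize_by_class(results):
    """Mean recall and total leaks per query class."""
    totals = {}
    for r in results:
        entry = totals.setdefault(r["query_class"], {"n": 0, "recall": 0.0, "leaks": 0})
        entry["n"] += 1
        entry["recall"] += r["recall"]
        entry["leaks"] += len(r["leaked"])
    return {
        cls: {"mean_recall": e["recall"] / e["n"], "leaks": e["leaks"]}
        for cls, e in totals.items()
    }


golden_set = [
    GoldenQuery("refund policy", "restricted_user", {"doc-policy"}, {"doc-payroll"}),
    GoldenQuery("onboarding steps", "restricted_user", {"doc-onboard-a", "doc-onboard-b"}),
    GoldenQuery("old contract", "deleted_doc", set(), {"doc-old-contract"}),
]
fake_results = {  # placeholder for what the retrieval system returned
    "refund policy": ["doc-policy"],
    "onboarding steps": ["doc-onboard-a"],
    "old contract": ["doc-old-contract"],
}

results = [evaluate(g, fake_results[g.name]) for g in golden_set]
summary = summarize_by_class(results)
print(summary)
```

In this made-up run the restricted-user class loses recall while the deleted-document canary leaks, which are two different failure modes with different owners. Running the same set before and after a bulk import, re-embedding job, or index rebuild gives a direct signal for whether to roll back.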
The practical standard is simple: a retrieval change should be measurable before it ships, visible while it runs, and reversible when quality drops.