Most vector search dashboards start with latency and error rate. That is necessary, but it misses the incidents users actually feel: the system answers from stale evidence, misses the right document, or returns five chunks when it promised twenty.
RAG observability has to connect database behavior to answer quality.
The framework: search, evidence, answer
I split the dashboard into three layers. Search metrics show whether retrieval worked. Evidence metrics show whether the context was usable. Answer metrics show whether the model grounded its response.
- Search: latency, result count, recall@k, filter selectivity, index used.
- Evidence: citation coverage, freshness, permission validity, duplicate chunks.
- Answer: grounded answer rate, refusal accuracy, cost per accepted answer.
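A minimal sketch of how two of these layer metrics can be computed per query. The record shape and field names are my assumptions for illustration, not a fixed schema:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Search layer: fraction of ground-truth documents found in the top-k results."""
    if not relevant_ids:
        return 1.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def citation_coverage(retrieved_ids, cited_ids):
    """Evidence layer: share of retrieved chunks the answer actually cited."""
    if not retrieved_ids:
        return 0.0
    return len(set(cited_ids) & set(retrieved_ids)) / len(retrieved_ids)
```

Aggregating these per query class, rather than globally, is what lets the dashboard show which layer broke.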
The metrics that catch real incidents
- Queries returning fewer than requested top_k after filters.
- Recall drift after re-embedding or index rebuilds.
- Deleted-document or permission-leak canaries.
- Fresh-document hit rate.
- Tenant-level p99 and recall split by corpus size.
- Reranker latency and cost.
- Embedding sync lag and dead-lettered events.
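Two of these checks sketched in Python. Field names and thresholds are assumptions for illustration; tune them to your own baselines:

```python
def topk_shortfall_rate(records, requested_k):
    """Fraction of queries that returned fewer than requested_k results,
    often caused by metadata filters shrinking the candidate set."""
    if not records:
        return 0.0
    short = sum(1 for r in records if r["returned_k"] < requested_k)
    return short / len(records)

def recall_drift(baseline_recall, current_recall, tolerance=0.05):
    """Flag a drop in golden-set recall after a re-embed or index rebuild."""
    return (baseline_recall - current_recall) > tolerance
```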
Postgres signals for pgvector
When pgvector runs inside Postgres, normal database signals still matter. Autovacuum lag, index size, dead tuples, WAL bursts, lock waits, and query plans can all become retrieval quality problems.
```sql
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('document_chunks', 'document_embeddings');
```
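A hedged sketch of turning that query's output into an alert. The 20% dead-tuple threshold is an assumption, not a pgvector recommendation; each row mirrors the columns selected above:

```python
def vacuum_alerts(rows, dead_ratio_threshold=0.2):
    """Flag tables where dead tuples are a large share of live tuples,
    which can inflate index scans and degrade retrieval latency.
    rows: (relname, n_live_tup, n_dead_tup, last_autovacuum) tuples."""
    alerts = []
    for relname, n_live, n_dead, _last_autovacuum in rows:
        ratio = n_dead / max(n_live, 1)
        if ratio > dead_ratio_threshold:
            alerts.append((relname, round(ratio, 2)))
    return alerts
```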
Quality alerts need canaries
Synthetic canaries catch permission and freshness failures before users do. Add a deleted document, a restricted document, and a fresh document to the evaluation set. The system must not retrieve the first two for an unauthorized user, and it must retrieve the fresh one within the sync budget.
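A minimal sketch of evaluating those three canaries. The input shape, mapping canary names to the chunk ids retrieved for each canary's probe query, is hypothetical:

```python
def evaluate_canaries(results_by_canary):
    """Return a list of canary failures; empty means all canaries passed."""
    failures = []
    # Deleted and restricted documents must never appear in results.
    for name in ("deleted_doc", "restricted_doc"):
        if results_by_canary.get(name):
            failures.append(f"{name} leaked into results")
    # The fresh document must appear once the sync budget has elapsed.
    if not results_by_canary.get("fresh_doc"):
        failures.append("fresh_doc missing after sync budget")
    return failures
```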
The production default
Do not scale RAG until you can see quality drift. Latency tells you whether the system is fast. Observability tells you whether it can still be trusted.
The runbook I want before this reaches production
Before I trust this design, I want a small runbook that names the failure mode, the owner, and the rollback path. Vector systems fail in ways that look like product quality problems: missing evidence, stale evidence, wrong-tenant evidence, high p99, or answers that cite weak chunks. If the team cannot tell which one happened, the system is not observable enough.
- Define a golden query set with real permissions and expected source documents.
- Track recall, result count, p95, p99, and cost by query class.
- Keep a rollback path for index, embedding model, chunking, and metadata changes.
- Test deleted, restricted, fresh, and re-embedded documents as canaries.
- Review the dashboard after every bulk import, re-embedding job, and index rebuild.
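The "measurable before it ships" part of that runbook can be sketched as a release gate over the golden query set. The result shape, recall floor, and p99 budget are assumptions for illustration:

```python
import statistics

def release_gate(golden_results, recall_floor=0.85, p99_budget_ms=500.0):
    """Pass only if mean golden-set recall and p99 latency stay in budget.
    golden_results: dicts with 'recall' and 'latency_ms' per golden query."""
    recalls = [r["recall"] for r in golden_results]
    latencies = sorted(r["latency_ms"] for r in golden_results)
    # Nearest-rank p99 approximation; coarse for small golden sets.
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    return statistics.fmean(recalls) >= recall_floor and p99 <= p99_budget_ms
```

Run the same gate before and after every bulk import, re-embedding job, and index rebuild, and the rollback decision becomes mechanical rather than a debate.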
The practical standard is simple: a retrieval change should be measurable before it ships, visible while it runs, and reversible when quality drops.