RAG Quality Metrics: Stop Measuring Only Latency

A fast RAG system can still be wrong. Production teams need recall@k, MRR, answerability, citation coverage, freshness, no-hit rate, and drift signals.

The dashboard said the RAG system was healthy. p95 was low, errors were near zero, and token cost was inside budget. Users still complained that the assistant missed documents they knew existed.

The system was fast. It was not good. That distinction sits at the center of RAG operations.

The framework: retrieval quality is a product metric

I split RAG quality into retrieval, grounding, and answer behavior. Vector database metrics mostly prove that queries ran. They do not prove that the right evidence reached the model. That is what the metrics below measure; a sketch of the first two follows the list.

  • Recall@k: did the correct evidence appear in the retrieved set?
  • MRR: how high did the first useful result rank?
  • nDCG: were better results ranked above weaker ones?
  • Citation coverage: did the final answer cite the retrieved evidence?
  • No-hit rate: did the system correctly refuse when evidence was missing?
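
As a reference point, here is a minimal sketch of the first two metrics over a single ranked result list. The names retrieved_ids and relevant_ids are placeholders for whatever document IDs your retriever and golden set actually produce.

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids) if relevant_ids else 0.0

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant result, 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: the expected runbook ranks third in a top-5 result set.
print(recall_at_k(["a", "b", "runbook-billing-412", "c", "d"],
                  ["runbook-billing-412"], k=5))   # 1.0
print(mrr(["a", "b", "runbook-billing-412", "c", "d"],
          ["runbook-billing-412"]))                # 0.333...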

Build a golden query set

A golden set should include real user questions, expected documents, permission contexts, and query class labels. Do not fill it only with easy semantic queries. Add error codes, customer-specific terms, recent uploads, deleted documents, and questions that should not be answered.

{
  "query": "How do I fix ERR_BILLING_412?",
  "tenant": "acme",
  "acl": "support_agent",
  "expected_documents": ["runbook-billing-412"],
  "query_class": "exact_identifier"
}

Measure before the model answers

Evaluate retrieval before generation. If the right chunk is not in the context window, the language model cannot reliably cite it. This separates search problems from prompting problems.
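
A minimal retrieval-stage harness might look like the sketch below. It assumes the golden set is stored one entry per line (JSONL) in the format shown earlier, and retrieve is a stand-in for your own search call, not a real client.

import json

def evaluate_retrieval(golden_path, retrieve, k=10):
    """Score retrieval alone: no LLM call, no prompt, just search results.

    retrieve(query, tenant, acl, k) is assumed to return ranked document IDs
    from your own search stack; swap in whatever client you actually use.
    """
    with open(golden_path) as f:
        golden = [json.loads(line) for line in f]

    rows = []
    for case in golden:
        expected = set(case["expected_documents"])
        if not expected:
            continue  # no-answer cases feed the no-hit metric, not recall
        retrieved = retrieve(case["query"], case["tenant"], case["acl"], k)
        rows.append({
            "query_class": case["query_class"],
            "recall": len(expected & set(retrieved[:k])) / len(expected),
        })
    return rows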

Watch drift after normal operations

RAG quality changes after bulk re-embedding, metadata migrations, deletes, ACL changes, model upgrades, and index rebuilds. A one-time benchmark at launch is not enough. Track at least the following splits; a sketch of the baseline comparison follows the list.

  • Recall split by tenant and corpus size.
  • Recall split by query class.
  • Fresh-document hit rate.
  • Deleted-document leak checks.
  • Result count below requested top_k.
  • Cost and latency per successful grounded answer.
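
A minimal sketch of the split-and-compare step, assuming per-query rows like those produced by the harness above and a stored baseline of recall by query class. The 0.05 drop threshold is an assumption, not a recommendation.

from collections import defaultdict

def recall_by_class(rows):
    """Average recall per query class from per-query eval rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row["query_class"]] += row["recall"]
        counts[row["query_class"]] += 1
    return {qc: sums[qc] / counts[qc] for qc in sums}

def drift_alerts(current_rows, baseline, max_drop=0.05):
    """Flag query classes whose recall fell more than max_drop vs. baseline."""
    current = recall_by_class(current_rows)
    return [qc for qc, recall in current.items()
            if qc in baseline and baseline[qc] - recall > max_drop]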

The production default

I treat latency as a guardrail and recall as the product metric. Fast retrieval is useful only after the system proves it can find the evidence users expect. The mistake is tuning vector search by p99 alone.
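
Expressed as a release gate, that policy is only a few lines. The thresholds below are illustrative; derive yours from golden-set baselines.

def change_can_ship(metrics,
                    min_recall_at_k=0.90,   # product metric: must hold
                    max_p99_ms=400):        # guardrail: must stay inside budget
    """Gate a retrieval change: recall decides, latency only vetoes."""
    if metrics["recall_at_k"] < min_recall_at_k:
        return False, "recall below product threshold"
    if metrics["p99_ms"] > max_p99_ms:
        return False, "p99 outside latency guardrail"
    return True, "ok"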

The runbook I want before this reaches production

Before I trust this design, I want a small runbook that names the failure mode, the owner, and the rollback path. Vector systems fail in ways that look like product quality problems: missing evidence, stale evidence, wrong-tenant evidence, high p99, or answers that cite weak chunks. If the team cannot tell which one happened, the system is not observable enough.

  • Define a golden query set with real permissions and expected source documents.
  • Track recall, result count, p95, p99, and cost by query class.
  • Keep a rollback path for index, embedding model, chunking, and metadata changes.
  • Test deleted, restricted, fresh, and re-embedded documents as canaries; two are sketched after this list.
  • Review the dashboard after every bulk import, re-embedding job, and index rebuild.
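
Two of those canaries fit in a few lines. Here retrieve is the same stand-in as before, and the probe queries and document IDs are hypothetical.

def deleted_document_leak(retrieve, deleted_ids, probe_queries, k=10):
    """Return any deleted document IDs that still surface for probe queries."""
    leaks = set()
    for query, tenant, acl in probe_queries:
        leaks |= set(retrieve(query, tenant, acl, k)) & set(deleted_ids)
    return sorted(leaks)

def fresh_document_hit(retrieve, query, tenant, acl, fresh_doc_id, k=10):
    """True if a recently ingested document is already retrievable."""
    return fresh_doc_id in retrieve(query, tenant, acl, k)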

The practical standard is simple: a retrieval change should be measurable before it ships, visible while it runs, and reversible when quality drops.