The first RAG bill looked fine. Then the corpus doubled, the team increased top_k, a reranker was added, support uploaded attachments, and nightly re-embedding became normal. The bill did not grow linearly. Neither did latency.
Vector search cost is a system cost, not a line item called storage.
The framework: cost per grounded answer
I do not optimize for cost per vector. I optimize for cost per useful grounded answer. That includes embedding generation, vector storage, index memory, retrieval, reranking, model context, and operational rebuilds. The questions that drive that number:
- How many chunks per document?
- How many dimensions per embedding?
- How many vectors are searched per user request?
- How large is metadata payload?
- How often do chunks get re-embedded?
- How often are indexes rebuilt or compacted?
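To make those questions concrete, here is a minimal back-of-envelope query in the spirit of the storage one below. Every volume and unit price in it is a placeholder assumption, not a measured figure; the point is that the denominator is answers, not vectors.

-- Hedged sketch: cost per grounded answer from assumed monthly totals.
-- Every constant below is a made-up placeholder; substitute your own.
WITH assumptions AS (
  SELECT
    1000000::numeric AS answers_per_month,   -- grounded answers served
    80.00::numeric   AS embedding_usd,       -- refresh plus new-content embedding
    120.00::numeric  AS storage_index_usd,   -- disk plus index memory
    300.00::numeric  AS retrieval_usd,       -- vector search compute
    250.00::numeric  AS reranker_usd,        -- reranker calls
    900.00::numeric  AS context_usd          -- retrieved chunks sent as model context
)
SELECT
  (embedding_usd + storage_index_usd + retrieval_usd + reranker_usd + context_usd)
    / answers_per_month AS usd_per_grounded_answer
FROM assumptions;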
The top_k trap
When quality feels weak, teams often raise top_k. That can help recall, but it also increases vector work, payload transfer, reranker input, and model context size. If the problem is bad chunking or filters, a larger top_k just makes the wrong system more expensive.
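To see the shape of that growth, here is a sketch under assumed rates: roughly 400 tokens per retrieved chunk, a per-candidate reranker price, and a per-input-token model price. None of the rates are real; what matters is that every row of cost scales linearly with top_k.

-- Per-request cost as top_k grows. All unit prices are assumptions.
SELECT
  top_k,
  top_k * 400 AS context_tokens,                    -- ~400 tokens per chunk (assumed)
  round(top_k * 0.0001, 4)         AS rerank_usd,   -- assumed per-candidate rerank price
  round(top_k * 400 * 0.000003, 4) AS context_usd   -- assumed price per input token
FROM generate_series(4, 64, 4) AS top_k;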
Dimensions are not free
Higher-dimensional embeddings can improve quality for some domains, but every vector becomes larger. Index memory, disk, network transfer, and cache behavior all move with dimensionality. Benchmark quality before accepting the cost.
-- Rough raw vector storage intuition for float4 (4-byte) vectors:
-- rows * dimensions * 4 bytes, before indexes and metadata.
-- 10M rows at 1536 dimensions comes out to roughly 57 GB of raw vectors.
SELECT
  10000000::numeric * 1536 * 4 / 1024 / 1024 / 1024 AS raw_gb;
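The same arithmetic across a few common dimensionalities makes the tradeoff visible. The dimension values below are typical embedding model output sizes, not a recommendation.

-- Raw vector storage for 10M rows at common dimensionalities.
SELECT
  dims,
  round(10000000::numeric * dims * 4 / 1024 / 1024 / 1024, 1) AS raw_gb
FROM (VALUES (384), (768), (1536), (3072)) AS d(dims);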
Re-embedding is a recurring bill
Model upgrades, chunking changes, permissions migrations, and content cleanup all create re-embedding work. Treat that as part of the cost model from the beginning. A cheap steady state can become expensive if every quality fix requires regenerating the corpus.
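A quick way to keep that bill visible is to price one full re-embedding pass. The chunk count, tokens per chunk, and per-token price below are all assumptions; swap in your corpus size and your vendor's rate.

-- Hedged sketch of one full re-embedding pass over the corpus.
SELECT
  10000000::numeric                                   AS chunks,             -- corpus size (assumed)
  10000000::numeric * 400                             AS tokens,             -- ~400 tokens per chunk (assumed)
  round(10000000::numeric * 400 * 0.02 / 1000000, 2)  AS usd_per_full_pass;  -- assumed $0.02 per 1M tokens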
The production default
Track cost per successful answer by tenant, query class, and corpus. Keep a dashboard that includes retrieval latency, reranker calls, context tokens, embedding refresh rate, and storage growth. The mistake is budgeting vector search like static file storage.
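In practice that dashboard starts as a query over your own telemetry. The answer_telemetry table and its columns below are hypothetical; the grouping and the per-answer unit economics are the part to copy.

-- Hypothetical dashboard query over a per-answer telemetry log.
SELECT
  tenant_id,
  query_class,
  count(*)                                                   AS answers,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY retrieval_ms) AS retrieval_p95_ms,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY retrieval_ms) AS retrieval_p99_ms,
  sum(rerank_usd + context_usd + embedding_usd) / count(*)   AS usd_per_answer
FROM answer_telemetry
WHERE answered_at >= now() - interval '7 days'
GROUP BY tenant_id, query_class
ORDER BY usd_per_answer DESC;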
The runbook I want before this reaches production
Before I trust this design, I want a small runbook that names the failure mode, the owner, and the rollback path. Vector systems fail in ways that look like product quality problems: missing evidence, stale evidence, wrong-tenant evidence, high p99, or answers that cite weak chunks. If the team cannot tell which one happened, the system is not observable enough.
- Define a golden query set with real permissions and expected source documents (one possible shape is sketched after this list).
- Track recall, result count, p95, p99, and cost by query class.
- Keep a rollback path for index, embedding model, chunking, and metadata changes.
- Test deleted, restricted, fresh, and re-embedded documents as canaries.
- Review the dashboard after every bulk import, re-embedding job, and index rebuild.
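One possible shape for the golden set and its canary check, in SQL. Both tables and all column names are assumptions about your own telemetry, not an existing schema; a nightly job would re-run each golden query against live retrieval and write hits into retrieval_results.

-- Golden queries stored as data, with the permissions they must run under.
CREATE TABLE IF NOT EXISTS golden_queries (
  id            bigserial PRIMARY KEY,
  query_class   text     NOT NULL,   -- e.g. 'support', 'policy'
  tenant_id     bigint   NOT NULL,   -- run with this tenant's permissions
  query_text    text     NOT NULL,
  expected_docs bigint[] NOT NULL    -- documents that must appear in top_k
);

-- Canary check: golden queries whose expected documents stopped appearing
-- in the latest retrieval run.
SELECT g.id, g.query_class
FROM golden_queries g
LEFT JOIN retrieval_results r
  ON r.golden_query_id = g.id
 AND r.doc_id = ANY (g.expected_docs)
GROUP BY g.id, g.query_class
HAVING count(DISTINCT r.doc_id) < cardinality(g.expected_docs);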
The practical standard is simple: a retrieval change should be measurable before it ships, visible while it runs, and reversible when quality drops.