Chunking Is a Database Design Problem, Not a Prompting Trick

Chunk size and overlap decide retrieval quality, citation accuracy, freshness, permissions, and cost. Treat chunks as serving data with ownership.

The team blamed the prompt for bad answers. The real problem was chunking. The right paragraph was split away from the table it explained, citations pointed to useless fragments, and permission changes applied to documents but not to old chunks.

Chunking is not preprocessing trivia. It is the schema of your retrieval system.

The framework: chunks are serving rows

A chunk should carry the data needed to retrieve it, authorize it, cite it, refresh it, and delete it. If those rules live only in the document store, the vector index will drift.

CREATE TABLE rag_chunks (
  chunk_id text PRIMARY KEY,
  document_id bigint NOT NULL,      -- owning document, for refresh and delete
  section_path text NOT NULL,       -- where citations should point
  acl_group_id bigint NOT NULL,     -- chunk-level authorization
  content_hash text NOT NULL,       -- detects stale embeddings on re-import
  chunk_text text NOT NULL,
  embedding vector(1536) NOT NULL,  -- pgvector column
  embedded_at timestamptz NOT NULL  -- when this row was last embedded
);
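The content_hash column is what makes refresh cheap: on re-import, hash the current section text and compare against the stored value before paying for a new embedding. A minimal sketch, assuming SHA-256 of the raw text as the hashing scheme (the article does not prescribe one):

```python
import hashlib


def chunk_needs_reembedding(chunk_text: str, stored_hash: str) -> bool:
    """Return True if the section text changed since the last embedding run.

    Mirrors the content_hash column in the schema above; the function name
    and hashing scheme are illustrative assumptions.
    """
    current = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    return current != stored_hash
```

Rows that fail this check are exactly the ones whose embeddings have drifted from the document store.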

Size is a tradeoff, not a default

Small chunks improve targeting but can lose context. Large chunks preserve context but increase cost and can dilute similarity. Overlap can help continuity, but too much overlap creates duplicate retrieval and higher storage cost.
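The size/overlap tradeoff is easiest to see in a sliding-window chunker. A minimal sketch over a pre-tokenized list (function name and signature are illustrative): step size is `size - overlap`, so every extra token of overlap directly multiplies stored and retrieved duplicates.

```python
def chunk_tokens(tokens: list, size: int, overlap: int) -> list:
    """Split a token list into windows of `size` with `overlap` shared tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail; avoid a duplicate stub
    return chunks
```

With 10 tokens, size 4, overlap 1, this yields three windows that share one boundary token each; raising overlap to 3 would nearly double the rows stored for the same text.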

Parent-child retrieval helps citations

For many domains, I retrieve small chunks and present a larger parent section to the model. That keeps candidate search precise while giving the answer enough context to cite accurately.
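The parent-child step is a small join after the vector search: map each retrieved child chunk to its parent section, deduplicate, and stop at a context budget. A sketch under assumed data shapes (sorted hit list, lookup dicts; all names are illustrative):

```python
def expand_to_parents(child_hits, parent_of, parent_text, limit=3):
    """Map retrieved child chunks to parent sections, deduplicated, best-first.

    child_hits:  [(chunk_id, score), ...] sorted by similarity, best first
    parent_of:   chunk_id -> parent section id (the schema's section_path)
    parent_text: parent section id -> full section text shown to the model
    """
    seen, parents = set(), []
    for chunk_id, _score in child_hits:
        parent_id = parent_of[chunk_id]
        if parent_id in seen:
            continue  # two child hits in one section should not double the context
        seen.add(parent_id)
        parents.append((parent_id, parent_text[parent_id]))
        if len(parents) == limit:
            break
    return parents
```

The search stays precise because similarity is computed on small chunks; the citation stays useful because the model sees, and cites, the whole parent section.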

Chunking must respect permissions

If one document contains sections with different access rules, document-level ACL is not enough. Either split by permission boundary or store chunk-level access metadata. Otherwise vector search can surface text that the user should not see.
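With chunk-level ACL metadata in place (the schema's acl_group_id), authorization becomes a filter on retrieval results. A post-filter sketch, with illustrative names; in production you would usually push this into the vector query itself (a metadata filter or SQL WHERE clause) so an unauthorized-heavy top-k does not come back empty:

```python
def authorized_hits(hits, chunk_acl, user_groups):
    """Drop retrieved chunks the caller is not allowed to see.

    hits:        [(chunk_id, score), ...] from vector search
    chunk_acl:   chunk_id -> acl_group_id (from the rag_chunks row)
    user_groups: set of acl_group_ids the caller belongs to
    """
    return [(cid, score) for cid, score in hits if chunk_acl[cid] in user_groups]
```

The point is that the check keys off the chunk row, not the document: a document-level check would pass every chunk of a mixed-permission document.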

What I test

The mistake is treating chunking as a one-time ETL setting. It is a product-quality lever that needs versioning and evaluation.

  • Does each golden query retrieve the chunk that contains the answer?
  • Does the chunk include enough context to answer without guessing?
  • Do citations point to a useful section?
  • Do deleted or edited sections disappear from retrieval quickly?
  • Does overlap create duplicate context that wastes tokens?
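The first check in that list reduces to a recall number over the golden set. A minimal sketch, assuming each golden query has one expected chunk_id and a dict of retrieval results (both shapes are assumptions):

```python
def golden_recall(retrieved, expected):
    """Fraction of golden queries whose expected chunk appears in results.

    retrieved: query -> list of retrieved chunk_ids
    expected:  query -> the chunk_id that contains the answer
    """
    hits = sum(1 for query, chunk_id in expected.items()
               if chunk_id in retrieved.get(query, []))
    return hits / len(expected)
```

Running this before and after any chunking change turns "does it still work?" into a number you can gate a deploy on.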

The runbook I want before this reaches production

Before I trust this design, I want a small runbook that names the failure mode, the owner, and the rollback path. Vector systems fail in ways that look like product quality problems: missing evidence, stale evidence, wrong-tenant evidence, high p99, or answers that cite weak chunks. If the team cannot tell which one happened, the system is not observable enough.

  • Define a golden query set with real permissions and expected source documents.
  • Track recall, result count, p95 and p99 latency, and cost by query class.
  • Keep a rollback path for index, embedding model, chunking, and metadata changes.
  • Test deleted, restricted, fresh, and re-embedded documents as canaries.
  • Review the dashboard after every bulk import, re-embedding job, and index rebuild.
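The p95/p99 tracking above needs an agreed percentile definition so numbers are comparable across runs. A sketch using the nearest-rank method over one query class's latency samples (method choice and names are mine, not prescribed by the runbook):

```python
import math


def nearest_rank_percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    samples at or below it. Suitable for per-query-class p95/p99 dashboards."""
    ranked = sorted(latencies_ms)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]
```

Whatever definition you pick, use the same one in the dashboard and in the pre-ship evaluation, or a "regression" can be an artifact of interpolation.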

The practical standard is simple: a retrieval change should be measurable before it ships, visible while it runs, and reversible when quality drops.