Normalize vs Denormalize: When Each One Earns Its Keep

Normalization and denormalization are not opposing camps; they are two answers to two different questions. The trick is knowing which question you are actually asking.

The normalize-vs-denormalize argument is mostly about which kind of bug you would rather debug. I have caused plenty of both. Here is the framework I use now.

The default is normalized

Start normalized. The reason is simple: normalized schemas make correctness easy and performance hard, while denormalized schemas make performance easy and correctness hard. Of those two failure modes, the correctness one is much more expensive to recover from.

A normalized schema means every fact lives in exactly one place. Updating a customer's email updates the row in customers, and every order's view of that customer reflects it. No drift, no synchronization, no "why does this report show stale data."

CREATE TABLE customers (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email TEXT NOT NULL UNIQUE,
  display_name TEXT NOT NULL
);

CREATE TABLE orders (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  customer_id BIGINT NOT NULL REFERENCES customers(id),
  total_cents INTEGER NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Queries do JOINs. JOINs are usually cheaper than people think when the indexes are right.
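
For the schema above, "the indexes are right" mostly means indexing the foreign key, which Postgres does not do automatically. A minimal sketch of the hot-path query and a supporting index (the index name and the 20-row limit are my own choices):

CREATE INDEX orders_customer_id_created_at_idx
  ON orders (customer_id, created_at DESC);

-- "Recent orders for a customer" becomes an index scan on orders plus a
-- primary-key lookup into customers, not a sequential scan.
SELECT o.id, o.total_cents, o.created_at, c.display_name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.customer_id = $1
ORDER BY o.created_at DESC
LIMIT 20;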

Signs the JOIN is actually expensive

Denormalize when, and only when, you can show that the JOIN is causing real pain. The signs:

  • The query shape is read-heavy and the same JOIN is the dominant cost in pg_stat_statements (see the sketch after this list). We are not talking about an analyst's report that runs once a day. We are talking about a hot path the application hits thousands of times per minute.
  • The denormalized data does not change frequently. If the customer's display_name changes once a year, copying it into orders is cheap to maintain. If it changes daily, denormalizing creates a synchronization problem.
  • The denormalized field is small. Copying a TEXT label is fine. Copying a JSON blob with thousands of nested fields creates write amplification.
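
The first sign is a measurement, not a feeling. A sketch of the check, assuming the pg_stat_statements extension is enabled (these are the Postgres 13+ column names; older versions call them total_time and mean_time):

-- Statements ranked by total time; the JOIN-heavy hot path should be near
-- the top before denormalization is even on the table.
SELECT calls,
       round(total_exec_time::numeric) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;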

If any of those is false, do not denormalize. Add an index, fix the query, or cache at the application layer instead.

How I denormalize safely

When the case is real, the shape I use is a denormalized column kept in sync by a trigger or by the application:

ALTER TABLE orders
  ADD COLUMN customer_display_name TEXT;

-- Backfill once
UPDATE orders
SET customer_display_name = customers.display_name
FROM customers
WHERE customers.id = orders.customer_id
  AND orders.customer_display_name IS NULL;

-- Keep in sync via trigger
CREATE OR REPLACE FUNCTION sync_customer_display_name()
RETURNS TRIGGER AS $$
BEGIN
  UPDATE orders
  SET customer_display_name = NEW.display_name
  WHERE customer_id = NEW.id
    AND customer_display_name IS DISTINCT FROM NEW.display_name;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customers_after_update_display_name
AFTER UPDATE OF display_name ON customers
FOR EACH ROW EXECUTE FUNCTION sync_customer_display_name();

The trigger keeps the denormalized field consistent without the application having to remember. This is more maintainable than "every UPDATE site in the application code must also update the orders table."
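
The update trigger only covers changes to existing customers. New orders still need the column filled at insert time, either by the application or by a second trigger. A minimal sketch of the trigger version (the function and trigger names are mine):

CREATE OR REPLACE FUNCTION fill_customer_display_name()
RETURNS TRIGGER AS $$
BEGIN
  -- Copy the customer's current display_name onto the new order row.
  SELECT display_name INTO NEW.customer_display_name
  FROM customers
  WHERE id = NEW.customer_id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_before_insert_display_name
BEFORE INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION fill_customer_display_name();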

When to use generated columns

For expressions of the same row's data — formatted timestamps, derived totals — use a generated column instead of denormalization-by-trigger:

ALTER TABLE orders
  ADD COLUMN total_dollars NUMERIC(12,2)
  GENERATED ALWAYS AS (total_cents / 100.0) STORED;

Generated columns are a clean middle ground: they look denormalized to readers and stay correct automatically. They cost some write throughput, but the consistency is maintained by Postgres.
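
A quick illustration of "stays correct automatically" (the order id is made up):

UPDATE orders SET total_cents = 2599 WHERE id = 42;

SELECT total_cents, total_dollars FROM orders WHERE id = 42;
-- total_cents | total_dollars
--        2599 |         25.99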

Materialized views: useful, but commit to the maintenance

For expensive aggregations — "top 100 customers by lifetime value" — a materialized view is the right tool:

CREATE MATERIALIZED VIEW customer_lifetime_value AS
SELECT customer_id, SUM(total_cents) AS lifetime_cents, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;

CREATE UNIQUE INDEX customer_lifetime_value_pk
  ON customer_lifetime_value (customer_id);

The upside: very fast reads. The downside: it is stale until you run REFRESH MATERIALIZED VIEW. Decide up front when the refresh runs and how stale the data is allowed to get. Refreshing with CONCURRENTLY lets the refresh run without blocking reads, but it requires a unique index on the view and is slower.

Where I have seen this go wrong: teams add a materialized view to fix a slow query, then forget to schedule the refresh, then notice three months later that the dashboard has been showing stale data the whole time.
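
One way to make the refresh hard to forget is to schedule it in the database itself. A sketch assuming the pg_cron extension is installed (the job name and the 15-minute cadence are arbitrary):

-- Needs the unique index above; refreshes without blocking readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY customer_lifetime_value;

-- Schedule it so nobody has to remember.
SELECT cron.schedule(
  'refresh-customer-lifetime-value',
  '*/15 * * * *',
  'REFRESH MATERIALIZED VIEW CONCURRENTLY customer_lifetime_value'
);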

A pragmatic rule of thumb

If I am writing the schema fresh, I normalize and add the obvious indexes. I do not denormalize until a real query is slow and a real measurement justifies it. The denormalization should be in the form of a generated column, a trigger-maintained column, or a materialized view — never a column the application has to remember to update.

The systems I have seen that aged worst were the ones that denormalized aggressively early. Every schema change was a hunt to find every place a fact was duplicated. The systems that aged best were normalized with a few targeted denormalizations, each with a one-line comment in the migration explaining the query shape that justified it.