6 min read

The Postgres Migration Review Checklist I Use Before Approving Any Schema Change

Most production schema bugs are caused by migrations that looked fine in review. Here is the checklist that catches the dangerous ones before they merge.

The migrations that cause production incidents almost always looked fine at PR review. The reviewer skimmed the SQL, it was syntactically clean, they approved. The locking implication, the missing FK index, the constraint that would scan a billion rows — none of these are obvious from the SQL itself.

A checklist catches them. Twenty seconds per migration prevents most production migration incidents. Here is the one I use.

1. What lock does this take and for how long?

This is the single most important question. The categories:

  • ACCESS EXCLUSIVE for the duration of a table rewrite: catastrophic. Examples: ALTER COLUMN ... TYPE for an actual type change, ADD COLUMN ... DEFAULT <volatile>, CLUSTER. For busy tables, these are unacceptable directly.
  • ACCESS EXCLUSIVE briefly: usually fine. Examples: RENAME COLUMN, DROP CONSTRAINT, metadata-only ADD COLUMN. Brief locks queue queries but resolve quickly.
  • SHARE UPDATE EXCLUSIVE: non-blocking for most traffic. Examples: most CONCURRENTLY operations.

For any migration on a busy table, the first review question: does this take an exclusive lock for an extended duration? If yes, rewrite as expand-and-contract.

2. Are CONCURRENTLY versions used where possible?

-- Wrong on a busy table:
CREATE INDEX orders_customer_idx ON orders (customer_id);

-- Right:
CREATE INDEX CONCURRENTLY orders_customer_idx ON orders (customer_id);

For any CREATE INDEX, DROP INDEX, or REINDEX on a production table: use CONCURRENTLY. The tradeoff is slower execution, but no exclusive lock.

Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction. Migration tools that wrap migrations in transactions need a way to opt out for these statements.

3. Is the FK index added at the same time as the FK?

-- Wrong:
ALTER TABLE orders ADD COLUMN customer_id BIGINT REFERENCES customers(id);

-- Right:
ALTER TABLE orders ADD COLUMN customer_id BIGINT REFERENCES customers(id);
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);

Postgres does not auto-index the child side of an FK. Without the index, cascading deletes and parent-side lookups are slow.

Missing this index is one of the most common production bugs introduced by migrations.

4. Are constraints added with NOT VALID first?

-- Wrong: validates entire table under lock
ALTER TABLE orders ADD CONSTRAINT orders_total_positive
  CHECK (total_cents >= 0);

-- Right: two-step, no exclusive lock on existing data
ALTER TABLE orders ADD CONSTRAINT orders_total_positive
  CHECK (total_cents >= 0) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_total_positive;

For any ADD CONSTRAINT on a populated table, the two-step pattern. The first step is fast (metadata change). The second step scans the table but does not take an exclusive lock.

5. Does the migration have a rollback?

Migration frameworks usually have an explicit rollback path (down migration in Rails, down script in Flyway). For each migration, ask: "If this migrates production wrong, what undoes it?"

For schema additions (ADD COLUMN, CREATE INDEX), the rollback is clean: drop the addition.

For schema removals (DROP COLUMN, DROP TABLE), the rollback is the addition back, but the data is gone. Make sure the team has a real plan for what "undo" means.

For data migrations (UPDATE ... to change values), the rollback is often impossible without a backup. These migrations are the most dangerous.

6. Will the migration timeout under realistic load?

For migrations that scan or modify many rows, what is the expected duration on production data size?

-- This is fine for 1k rows. Catastrophic for 100M.
UPDATE orders SET status = 'archived' WHERE created_at < '2020-01-01';

For any data migration, the question: how many rows? At what rate? Does the migration tool's timeout match?

If the migration would take more than a few minutes, rewrite as a chunked batch job. The migration framework runs the chunked driver; the actual data work happens incrementally.

7. What happens if the migration is interrupted?

For migrations that complete in a single transaction, interruption rolls back everything. Clean.

For migrations that span multiple transactions (CREATE INDEX CONCURRENTLY, batched updates, multi-step DDL), interruption can leave partial state. The migration tool should detect this and either resume cleanly or require manual cleanup.

For PostgreSQL-specific gotchas:

  • An interrupted CREATE INDEX CONCURRENTLY leaves an INVALID index that has to be dropped.
  • An interrupted ALTER TABLE ... ADD CONSTRAINT NOT VALID is fine; the next attempt either retries or skips.

8. Does it require a follow-up migration?

Some patterns require multi-step deploys:

  • Add column NOT NULL with default → rename old column → drop old column. Three migrations across three deploys.
  • ADD CONSTRAINT NOT VALID → backfill data → VALIDATE. Two migrations.
  • Switch primary key column → re-create FKs → drop old. Three or more migrations.

For each of these, the review should verify: is this migration designed as part of a larger sequence, and does the sequence work?

9. Is the schema change reflected in application code?

A migration that adds a new column requires application changes to read/write that column. If the migration ships before the application change, the application breaks (writes fail because of the unknown column, or use stale schema).

The order matters:

  • Adding columns/tables: deploy schema first, then code that uses it.
  • Removing columns/tables: deploy code that stops using them first, then drop schema.
  • Renaming columns: most painful; requires expand-rename-contract sequence.

The review should call out the deploy order explicitly.

10. Does it have appropriate observability?

For non-trivial migrations:

  • Logging that confirms what was done.
  • Metrics on duration and rows affected.
  • An explicit verification step in the migration (or in deploy automation) that confirms the new state is correct.

For batched data migrations, this is essential. A migration that processes 50M rows should report progress and final count, not just "done."

What I do if any of the above fails

If the migration fails any of these checks, I block the PR with a comment explaining the issue. The author rewrites; we re-review. The cost is one round trip; the benefit is the migration not taking down production.

Most teams I work with adopt this checklist informally — once people have been burned a couple of times by missed checks, the discipline becomes natural. The cost is twenty seconds per migration; the benefit is a class of incidents you never have.

A pragmatic addition

For any migration on a known-busy table:

  • Run it in a clone first.
  • Time the operation.
  • Confirm no exclusive locks held longer than 1 second.
  • Confirm no replication lag spike beyond expected.

This catches issues that the static review cannot — actual locking behavior under realistic data sizes. A few minutes per migration; saves hours of incident response when something goes wrong.