I have written and shipped a lot of "zero-downtime" migrations that turned out not to be. The locking surprises in Postgres are mostly well-documented, but the docs are scattered across multiple chapters and a half-dozen blog posts. Most teams learn the rules by failing in production once.
Here are the three patterns that cover the migrations I do most often. Each one assumes you cannot take downtime.
Pattern 1: Add a column without breaking writes
Adding a NOT NULL column with a default to a large table used to be a guaranteed locking nightmare. Postgres 11+ made it much better, but the rules are still subtle.
Pre-Postgres 11: the new column had to be filled in for every existing row, which required a full table rewrite under an ACCESS EXCLUSIVE lock. On a 50GB table, that was 30 minutes during which every read and write on the table blocked.
Postgres 11+: adding a NOT NULL column with a constant default is metadata-only. The value for existing rows is stored in the catalog (pg_attribute's attmissingval) and applied at read time, so no rows are rewritten and the operation completes almost instantly.
-- Fast on Postgres 11+: metadata-only
ALTER TABLE orders
ADD COLUMN priority INTEGER NOT NULL DEFAULT 0;
Where people still get caught:
- Volatile defaults. DEFAULT clock_timestamp() (or random()) is volatile, so Postgres has to materialize a distinct value for every row, which means a table rewrite. Note that now() is merely stable: it is evaluated once when the ALTER runs and the result is stored, so it does not force a rewrite.
- Adding a CHECK constraint at the same time. The CHECK has to validate every existing row, which requires a full scan.
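A minimal contrast, using hypothetical columns on the orders table from above:

```sql
-- Metadata-only on PG11+: the default is evaluated once and stored
ALTER TABLE orders ADD COLUMN source TEXT NOT NULL DEFAULT 'web';

-- Forces a full table rewrite: clock_timestamp() is volatile, so a
-- distinct value has to be written into every existing row
ALTER TABLE orders
  ADD COLUMN touched_at TIMESTAMPTZ NOT NULL DEFAULT clock_timestamp();
```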
The safe shape: add the column with a constant default (or no default, if the column is nullable), then add CHECK constraints NOT VALID, then validate them in a separate transaction:
-- Step 1: add column (fast on PG11+)
ALTER TABLE orders ADD COLUMN priority INTEGER NOT NULL DEFAULT 0;
-- Step 2: add constraint as NOT VALID (fast)
ALTER TABLE orders
ADD CONSTRAINT orders_priority_range
CHECK (priority BETWEEN 0 AND 10) NOT VALID;
-- Step 3: validate in background (no exclusive lock)
ALTER TABLE orders VALIDATE CONSTRAINT orders_priority_range;
Pattern 2: Change a column type or rename it
A direct ALTER TABLE ... ALTER COLUMN ... TYPE ... rewrites the table unless the conversion is binary-coercible (varchar to text, for example). On a large hot table, a rewrite is unacceptable.
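For contrast, the direct form this pattern replaces (illustrative only; do not run this on a hot table):

```sql
-- Takes ACCESS EXCLUSIVE and rewrites every row; blocks reads and writes
ALTER TABLE events
  ALTER COLUMN created_at TYPE TIMESTAMPTZ
  USING created_at AT TIME ZONE 'UTC';
```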
The pattern is expand-and-contract:
- Expand. Add a new column with the new type. Backfill in chunks. Make the application write both columns.
- Verify. After backfill is complete, write a check that the two columns are consistent. Let it run for a day to catch edge cases.
- Contract. Stop writing the old column. Drop it.
-- Step 1: add new column
ALTER TABLE events ADD COLUMN created_at_utc TIMESTAMPTZ;
-- Step 2: backfill in chunks (run repeatedly with different ranges)
UPDATE events
SET created_at_utc = created_at AT TIME ZONE 'UTC'
WHERE created_at_utc IS NULL
AND id BETWEEN 0 AND 100000;
-- Step 3: deploy application that writes both columns
-- Step 4: verify consistency
SELECT count(*) FROM events
WHERE created_at_utc IS DISTINCT FROM (created_at AT TIME ZONE 'UTC');
-- Step 5: deploy application that reads from new column
-- Step 6: drop old column
ALTER TABLE events DROP COLUMN created_at;
ALTER TABLE events RENAME COLUMN created_at_utc TO created_at;
The whole sequence is a multi-week project on a busy table. The benefit: no downtime, no exclusive locks, no risk of being half-done in a bad way.
For very large tables, the backfill is the bottleneck. Run it in batches small enough not to stress the database (50,000 to 500,000 rows per batch is usually safe). Pause between batches to give vacuum time to clean up.
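One way to script the chunked backfill is a server-side procedure (Postgres 11+, where procedures can commit between batches). The procedure name, batch size, and sleep interval here are illustrative:

```sql
CREATE PROCEDURE backfill_created_at_utc(batch_size integer DEFAULT 100000)
LANGUAGE plpgsql
AS $$
DECLARE
  rows_done bigint;
BEGIN
  LOOP
    UPDATE events
    SET created_at_utc = created_at AT TIME ZONE 'UTC'
    WHERE id IN (
      SELECT id FROM events
      WHERE created_at_utc IS NULL
      LIMIT batch_size
    );
    GET DIAGNOSTICS rows_done = ROW_COUNT;
    EXIT WHEN rows_done = 0;
    COMMIT;                -- release locks; give vacuum a chance to keep up
    PERFORM pg_sleep(1);   -- brief pause between batches
  END LOOP;
END;
$$;

CALL backfill_created_at_utc();
```

Because the procedure commits after each batch, a failure partway through loses only the current batch, and the WHERE created_at_utc IS NULL predicate makes it safe to simply re-run.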
Pattern 3: Add or drop indexes without locking
This is the most common case where the right knowledge prevents an outage:
-- Locks the table for the duration. DO NOT DO THIS on a busy table.
CREATE INDEX orders_customer_id_idx ON orders (customer_id);
-- Does not lock writes. Use this in production.
CREATE INDEX CONCURRENTLY orders_customer_id_idx ON orders (customer_id);
CONCURRENTLY builds the index without an exclusive lock. The cost is that it runs more slowly (reads the table twice) and can fail partway through, leaving an INVALID index that you have to clean up:
-- Find invalid indexes
SELECT indexrelid::regclass
FROM pg_index
WHERE NOT indisvalid;
-- Drop them concurrently too
DROP INDEX CONCURRENTLY name_of_invalid_index;
The same applies to DROP INDEX. The non-concurrent version takes a brief but exclusive lock; on a busy index, that can cause a queue of waiting queries. DROP INDEX CONCURRENTLY avoids this.
Limitation: CREATE INDEX CONCURRENTLY cannot run inside a transaction. It must be its own statement. Many migration tools wrap each migration in a transaction by default; you need a way to opt out for these statements.
The locking that surprises people
A few operations that look harmless but actually take exclusive locks:
- Adding a CHECK constraint without NOT VALID. Validates the entire table under lock.
- Adding a foreign key. Validates the entire table under lock unless added NOT VALID first.
- Renaming a column. Quick metadata change, but takes a brief exclusive lock that can queue queries.
- Adding a UNIQUE constraint without USING INDEX. Builds an index under lock, then enforces uniqueness.
For each of these, there is a non-blocking variant. The pattern is the same: do the validation work up front without blocking writes, then apply the metadata change under a brief lock.
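Two of those non-blocking variants, sketched against the tables from earlier (the constraint names, the customers table, and the external_id column are illustrative):

```sql
-- Foreign key: add unvalidated, then validate without blocking writes
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_id_fkey
  FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_id_fkey;

-- Unique constraint: build the index concurrently, then attach it
CREATE UNIQUE INDEX CONCURRENTLY orders_external_id_key
  ON orders (external_id);
ALTER TABLE orders
  ADD CONSTRAINT orders_external_id_key UNIQUE
  USING INDEX orders_external_id_key;
```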
A migration tool's opinion
Different tools have different defaults. pg_repack (and, in the MySQL world, gh-ost) handles some of these patterns automatically. Built-in tools (Flyway, Liquibase, your ORM's migration framework) usually do not. If your migration tool wraps everything in a transaction by default, learn how to opt out for CONCURRENTLY operations.
The rule I leave behind for any team: every migration is reviewed for locking implications. The reviewer reads the SQL and asks, "What lock does this take, and for how long?" If the answer is "exclusive, until done," and the table is busy, the migration gets rewritten in the expand-and-contract pattern.
What I do for every migration now
- Run it in a clone first. Time the operation. Confirm no exclusive locks held longer than a second.
- For column type changes and bulk data work, plan a multi-week sequence with backfill, dual-write, verify, cutover.
- For indexes, always CONCURRENTLY.
- For constraints, always NOT VALID first, then VALIDATE.
- Document the sequence in the migration directory so the next person on call can reason about it.
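One more habit worth baking into the checklist: set lock_timeout at the top of every migration so a surprise lock conflict fails fast instead of queueing every other query behind it. The values here are examples, not recommendations:

```sql
SET lock_timeout = '2s';          -- give up rather than queue behind a long lock
SET statement_timeout = '15min';  -- cap the migration itself

ALTER TABLE orders ADD COLUMN priority INTEGER NOT NULL DEFAULT 0;
```

A migration that fails on lock_timeout can simply be retried at a quieter moment; a migration that queues behind a long-running transaction takes the table down with it.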
Zero-downtime migration is mostly discipline plus a small set of patterns. The patterns are not hard. The discipline is what separates teams that ship cleanly from teams that schedule maintenance windows for every schema change.