I once took down an API for ninety seconds with a migration that, by itself, would have run in milliseconds. The ALTER TABLE wasn't slow. It was waiting — and while it waited, it made everyone else wait too.
That incident taught me the thing about PostgreSQL locks that the docs explain but you don't really feel until it bites: a blocked DDL statement doesn't just block, it forms a queue, and every query behind it stops.
How a fast statement causes a long outage
My ALTER TABLE needed a strong lock (ACCESS EXCLUSIVE). To get it, it had to wait for a long-running read that already held a weaker lock on the table. Fine — except once my statement was waiting for that lock, every new query that wanted even a read lock lined up behind it.
So the chain was: one slow query held the table, my DDL waited for it, and the entire application waited for my DDL. The migration itself was instant. The queue it created was the outage.
-- See the queue forming: who blocks whom
SELECT blocked.pid AS blocked_pid,
blocking.pid AS blocking_pid,
left(blocked_q.query, 60) AS blocked_query,
left(blocking_q.query, 60) AS blocking_query
FROM pg_locks blocked
JOIN pg_locks blocking
ON blocking.locktype = blocked.locktype
AND blocking.relation = blocked.relation
AND blocking.granted AND NOT blocked.granted
JOIN pg_stat_activity blocked_q ON blocked_q.pid = blocked.pid
JOIN pg_stat_activity blocking_q ON blocking_q.pid = blocking.pid;
Always set a lock timeout for DDL
The single habit that would have saved me: set a short lock_timeout before any migration. If the strong lock can't be acquired quickly, the statement fails fast instead of parking at the head of the queue and freezing everything behind it.
Failing the migration is annoying. Freezing production is a postmortem. I'll take the annoying one.
SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN notes text;
-- If it can't grab the lock in 2s, it errors out instead of queueing the app
Make the lock window tiny
Modern PostgreSQL makes most migrations cheap if you let it. Adding a nullable column is instant. Adding NOT NULL or a default used to rewrite the table; now you add the column, backfill in batches, and add the constraint as NOT VALID, then VALIDATE separately so the heavy check doesn't hold the strong lock.
The pattern is always the same: hold the exclusive lock for as little as possible, and never while doing real work.
- Set lock_timeout (and a modest statement_timeout) before DDL.
- Split rewrites: add column, backfill in batches, validate constraints separately.
- Run migrations when the long-query traffic is quietest.
- Watch the blocking query — kill the holder before it kills you.
What I keep coming back to
None of this is clever. Name the failure mode, capture the before number, change one thing, and compare the same signal afterward. The boring version of database work is the version that actually holds up at 2 AM.