The migration looked harmless. Add a column, rename something, validate a constraint, maybe create an index. Then traffic stopped. The database was not down, CPU was not high, and the slow queries were all normal app queries.
This is the lock queue trap. A DDL statement waits for a strong lock. New queries that conflict with that pending lock line up behind it. The migration is not doing work, but its place in the queue changes the behavior of everything behind it.
PostgreSQL lock incidents are often not about one query being slow. They are about one lock request changing the order of the room.
Find blockers and blocked sessions
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocker.pid AS blocker_pid,
blocker.state AS blocker_state,
blocker.query AS blocker_query
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks
ON blocked_locks.pid = blocked.pid AND NOT blocked_locks.granted
JOIN pg_locks blocker_locks
ON blocker_locks.locktype = blocked_locks.locktype
AND blocker_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocker_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocker_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocker_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocker_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocker_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocker_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocker_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocker_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocker_locks.pid <> blocked_locks.pid
JOIN pg_stat_activity blocker ON blocker.pid = blocker_locks.pid
WHERE blocker_locks.granted;
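On PostgreSQL 9.6 and later, `pg_blocking_pids()` does this lock-conflict matching internally, including sessions queued ahead of the waiter, so a shorter variant is:

```sql
-- pg_blocking_pids(pid) returns the array of backend pids blocking
-- that pid; empty for sessions that are not waiting on a lock.
SELECT blocked.pid   AS blocked_pid,
       blocked.query AS blocked_query,
       blocker.pid   AS blocker_pid,
       blocker.query AS blocker_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocker
  ON blocker.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
```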
Use timeouts on migrations
A migration should fail fast if it cannot get the lock it needs. The dangerous migration is the one that waits quietly in production while traffic queues behind it.
SET lock_timeout = '5s';
SET statement_timeout = '5min';
ALTER TABLE accounts ADD COLUMN risk_score integer;
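If the migration runs inside a transaction, the same timeouts can be scoped to just that transaction with `SET LOCAL`, so they do not leak into whatever the connection runs next:

```sql
BEGIN;
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK, so only
-- this migration fails fast; later statements on the same connection
-- keep their configured defaults.
SET LOCAL lock_timeout = '5s';
SET LOCAL statement_timeout = '5min';
ALTER TABLE accounts ADD COLUMN risk_score integer;
COMMIT;
```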
Prefer phased changes
- Add nullable columns first.
- Backfill in small batches.
- Create indexes concurrently when possible.
- Add constraints as NOT VALID, then validate later.
- Deploy application reads and writes in compatible phases.
- Drop old columns only after the old code path is gone.
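A sketch of those phases for the `accounts.risk_score` column from earlier (the constraint and index names are illustrative):

```sql
-- Phase 1: nullable column. Metadata-only; needs ACCESS EXCLUSIVE
-- but only for an instant once the lock is granted.
ALTER TABLE accounts ADD COLUMN risk_score integer;

-- Phase 2: backfill in small batches. Rerun until it updates 0 rows,
-- so no single transaction holds row locks for long.
UPDATE accounts
SET risk_score = 0
WHERE id IN (
  SELECT id FROM accounts
  WHERE risk_score IS NULL
  LIMIT 1000
);

-- Phase 3: index without blocking writes. CONCURRENTLY cannot run
-- inside a transaction block.
CREATE INDEX CONCURRENTLY accounts_risk_score_idx
  ON accounts (risk_score);

-- Phase 4: add the constraint without scanning the table under lock.
ALTER TABLE accounts
  ADD CONSTRAINT accounts_risk_score_nonneg
  CHECK (risk_score >= 0) NOT VALID;

-- Phase 5: validate later. VALIDATE takes only SHARE UPDATE EXCLUSIVE,
-- so concurrent reads and writes proceed during the scan.
ALTER TABLE accounts
  VALIDATE CONSTRAINT accounts_risk_score_nonneg;
```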
The incident response
If a migration is waiting for a lock and production traffic is queued behind it, cancel the migration first unless you have a very specific reason not to. Then redesign the change so it fails fast, runs in phases, or happens during a real maintenance window.
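The pid from the blocked/blocker query above identifies the session to cancel; `12345` here is a placeholder:

```sql
-- pg_cancel_backend() cancels the backend's current statement. The
-- DDL's pending lock request leaves the queue and traffic drains.
SELECT pg_cancel_backend(12345);

-- If the migration ran inside a transaction that already holds locks,
-- terminate the whole backend so those locks are released too.
SELECT pg_terminate_backend(12345);
```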
The practical standard
The best PostgreSQL performance work is boring in the right way. Name the failure mode, capture the before plan or metric, make one change, and compare the exact same signal afterward. Anything else is just a more confident guess.