Postgres Failover Readiness: The Drill That Tells You If You Are Lying to Yourself

Failover is mostly fine when you do not need it and broken when you do. Here is how to know which you have.

Three teams I have worked with believed they had high availability. None of the three had ever actually failed over in production. When they finally tried — twice as part of a planned drill, once during an actual incident — all three found something broken.

Failover is one of those operational capabilities where confidence and reality often disagree. The only way to know is to test.

What failover actually requires

A functional failover capability has four working parts:

  1. A replica that is current. Lag close to zero, all WAL applied.
  2. A way to promote the replica to primary. Either manual (pg_promote()) or automated (Patroni, AWS RDS Multi-AZ, etc.).
  3. A way to redirect application traffic. DNS, virtual IP, connection-string updates, or a proxy that handles it.
  4. An application that survives the redirect. Reconnects cleanly, retries failed transactions, does not corrupt state during the brief outage.

Most outages I have helped with had two of these four working. The other two were broken in ways nobody had noticed because nobody had tested.

The pre-flight checks

Before any failover, run:

-- On the primary
SELECT * FROM pg_stat_replication;

For each replica, you want:

  • state = 'streaming'
  • sync_state either sync (synchronous setups) or async — not potential
  • replay_lag near zero (under a second is healthy)
  • flush_lsn and replay_lsn close to the primary's current pg_current_wal_lsn()

If any replica is not in this state, failover to it will lose data or fail outright.
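
A single query that surfaces these fields together (a sketch; the exact thresholds are a judgment call):

-- Per-replica health snapshot, run on the primary
SELECT application_name,
       state,
       sync_state,
       replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_bytes_behind
FROM pg_stat_replication;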

A planned failover drill

The minimum drill:

  1. Pick a low-traffic window.
  2. Confirm the replica is healthy (above checks).
  3. Stop writes to the primary (application maintenance mode).
  4. Wait for replication to fully drain (a way to check this is sketched below).
  5. Promote the replica.
  6. Update DNS / connection strings to point at the new primary.
  7. Resume writes.
  8. Verify application traffic is going to the new primary.
  9. Fence the old primary and rebuild it as a replica (pg_rewind or a fresh base backup).

For managed services, steps 5 and 6 are often combined into one click (RDS "Reboot with failover"). For self-managed setups, each step is its own command.
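
For the self-managed path, steps 4 and 5 look roughly like this (a sketch; run each statement on the node named in the comment):

-- Step 4, on the primary, after writes have stopped: note the final WAL position
SELECT pg_current_wal_lsn();

-- Step 4, on the replica: poll until this matches the value above
SELECT pg_last_wal_replay_lsn();

-- Step 5, on the replica: promote it (waits up to 60 seconds by default)
SELECT pg_promote();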

The drill should take 5-15 minutes start to finish. If it takes longer, the incident-time version will be worse.

What goes wrong on first drills

Things I have seen on first-time drills:

Replica was much further behind than the dashboards showed. The dashboard said "<1 second lag" but the actual replay_lag was 30 seconds because the metric was averaged.

Application could not reconnect to the new primary. The connection string was hardcoded to the old hostname. After failover, the application kept trying the old host.

Application reconnected but had cached query plans tied to the old primary. Some ORMs hold server-side prepared statements that no longer exist once the connection drops, so every query fails until they are re-created. The fix was a full process restart.

Sequence drift. The replica's sequences had not been advanced because logical replication does not replicate them. Cutover succeeded but the next insert collided with an existing primary key.
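
The repair is to advance each sequence past its table's current maximum before resuming writes. A sketch, using a hypothetical orders table:

-- Repeat for every serial/identity column; table and sequence names are hypothetical
SELECT setval('orders_id_seq', (SELECT max(id) FROM orders));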

The promotion succeeded but the old primary did not stop. Both sides now think they are primary. Split-brain. Data integrity is at risk until one is forcibly stopped.

The connection pooler did not refresh. PgBouncer was still pointed at the old host. Application was talking to PgBouncer; PgBouncer was talking to nothing.

Each of these is fixable. None of them are fast to figure out during an incident.

Application-side readiness

The application part of failover is often the most under-tested. Three things to verify:

1. The connection logic retries on failure. Many database libraries and poolers can reconnect and retry, but the behavior is rarely on by default and the configuration varies. Confirm that:

  • Connection failures are retried, not surfaced immediately as errors.
  • Idle connections are detected and replaced (not assumed alive forever).
  • The retry budget is reasonable — too short and the application gives up before failover finishes.

2. Transactions that were in flight during failover are handled. Some will fail with errors. The application should treat those as retryable for idempotent transactions and as failed for non-idempotent ones. The distinction is the application's responsibility.

3. Cached state is invalidated on disconnect. Connection pools, prepared statement caches, query result caches — anything that might hold a reference to the old primary needs to be flushed.

The drill exposes which of these is missing.

Post-failover verification

After the failover, run:

-- One-time setup if the drill scratch table does not exist yet
CREATE TABLE IF NOT EXISTS _failover_log (failed_over_at timestamptz, role text);

-- Confirm the new primary is writable
INSERT INTO _failover_log (failed_over_at, role) VALUES (now(), 'new_primary');

-- Confirm the old primary is not writable (or is fenced)
-- (run on the old primary; should return true, meaning standby mode, or fail to connect)
SELECT pg_is_in_recovery();

The _failover_log table is a poor man's verification — write a row, see that it lands. For more rigor, run a sample of representative application transactions and confirm they all succeed.

Replication health on the new topology:

SELECT * FROM pg_stat_replication;

If the old primary is being set up as a new replica, it should appear here within minutes.
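
You can also confirm the attachment from the demoted side. A sketch, run on the old primary once it is back as a standby:

-- Should show status = 'streaming' with the new primary as the sender
SELECT status, sender_host, sender_port FROM pg_stat_wal_receiver;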

What I commit to

For any production database with HA pretensions:

  • Failover drill quarterly. On the actual production cluster, in a planned window.
  • Application-side reconnect drill quarterly. Force-restart the database, confirm the application recovers.
  • Replication lag alert: paged at >5 seconds.
  • Replication state alert: paged at any state other than streaming for >1 minute.
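
Both alerts can be driven off a single polled query. A sketch of what the alerting system would evaluate:

-- Fire if max_replay_lag_seconds > 5 or not_streaming > 0 for a sustained period.
-- Note: zero rows in pg_stat_replication means no replicas at all, which should also page.
SELECT coalesce(max(extract(epoch FROM replay_lag)), 0) AS max_replay_lag_seconds,
       count(*) FILTER (WHERE state <> 'streaming') AS not_streaming
FROM pg_stat_replication;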

This is the discipline that separates teams who have HA from teams who think they have HA. The drills are not free, but they are much cheaper than discovering the failover is broken during an actual outage.