We once had a replication lag spike to 3 hours because someone ran a massive DELETE on 40 million rows without batching. The WAL flood was so intense that the standby couldn't keep up. Nobody noticed for 45 minutes because we weren't monitoring replication lag — we were only checking "is the standby connected?"
Connected is not the same as current. That gap in our monitoring gave us a near-miss: a failover we almost executed against a standby that was 3 hours behind.
How Streaming Replication Works
The primary writes changes to its Write-Ahead Log (WAL). The standby connects to the primary, streams WAL segments, and replays them to update its own data. This happens continuously and (usually) very quickly — under normal conditions, lag should be under a second.
When it's not under a second, you need to know. When it's minutes or hours, you have a serious problem — either a failing standby or one that won't be useful for failover.
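You can see the moving parts directly. Both servers expose WAL positions as LSNs (log sequence numbers), and every lag query that follows is ultimately a comparison between two of them. A quick illustration, assuming PostgreSQL 10+ function names:

-- On the primary: where WAL generation currently stands
SELECT pg_current_wal_lsn();

-- On the standby: how far replay has progressed
SELECT pg_last_wal_replay_lsn();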
Monitoring Lag from the Primary
-- On the primary: current replication state for all standbys
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag,
    sync_state
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;  -- lag columns are NULL when fully caught up
The replay_lag column is the one you actually care about: it measures how far behind the standby is in applying WAL to its data. write_lag and flush_lag capture the earlier stages (WAL written to the standby's OS, then flushed to its disk), which helps you pinpoint whether the delay is in the network, the standby's storage, or replay itself.
For the lag in bytes (useful for understanding how much WAL the standby hasn't processed):
SELECT
    client_addr,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
    replay_lag AS replay_lag_time
FROM pg_stat_replication;
Monitoring from the Standby
Sometimes you don't have easy access to the primary (managed services, network topology). Check from the standby:
-- On the standby
SELECT
    now() - pg_last_xact_replay_timestamp() AS replication_delay,
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS received_lsn,
    pg_last_wal_replay_lsn() AS replayed_lsn,
    pg_wal_lsn_diff(
        pg_last_wal_receive_lsn(),
        pg_last_wal_replay_lsn()
    ) AS lag_bytes;
If received_lsn = replayed_lsn, the standby has replayed everything it received. If the gap is large, the standby is falling behind in applying WAL — this often points to resource constraints on the standby (CPU, disk I/O).
If pg_last_xact_replay_timestamp() is stale by minutes on an active database, the standby is either not receiving WAL at all (disconnected, or the primary has stopped sending) or receiving it but stuck on replay. The lag_bytes gap above separates the two cases.
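Another way to rule out the "not receiving" case is to check whether a WAL receiver process is running at all. A minimal check, assuming PostgreSQL 11 or later (sender_host and sender_port were added in 11):

-- On the standby: is a WAL receiver connected and streaming?
-- Zero rows means the standby isn't connected to the primary at all.
SELECT pid, status, sender_host, sender_port, last_msg_receipt_time
FROM pg_stat_wal_receiver;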
Replication Slots: Powerful but Dangerous
Replication slots guarantee the primary will keep WAL segments until the consumer confirms it has processed them. This sounds great — and it is, for logical replication consumers that might be temporarily offline.
But here's the problem: if a replication slot consumer goes offline and never comes back, the primary keeps accumulating WAL segments indefinitely. I've seen this fill up a disk completely within 24 hours on a busy database.
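Before hunting for the guilty slot, it helps to know how much WAL is actually sitting on disk. A quick check, assuming PostgreSQL 10+ and superuser (or a role granted EXECUTE on pg_ls_waldir):

-- Total WAL currently retained in pg_wal
SELECT pg_size_pretty(sum(size)) AS wal_on_disk
FROM pg_ls_waldir();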
-- Check all replication slots and their lag
SELECT
    slot_name,
    plugin,
    slot_type,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_pretty
FROM pg_replication_slots
ORDER BY lag_bytes DESC;
Any inactive slot (active = false) with a large lag is a disk bomb waiting to go off. Either fix the consumer or drop the slot:
-- Drop an abandoned slot (make sure you really don't need it)
SELECT pg_drop_replication_slot('my_abandoned_slot');
Alert if any slot has lag over 1GB. If it hits 10GB, treat it as an emergency.
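On PostgreSQL 13 and later, you can also put a hard ceiling on slot retention so an abandoned consumer can never fill the disk. A sketch, with an illustrative value matching the emergency threshold above:

-- A slot retaining more WAL than this is invalidated (its wal_status
-- in pg_replication_slots becomes 'lost') instead of eating the disk.
-- '10GB' is illustrative; size it to your disk headroom.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

The trade-off: an invalidated slot forces the consumer to resynchronize from scratch, so treat the cap as a last line of defense, not a replacement for the alerting.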
Logical Replication Monitoring
Logical replication (whether native, or driven by tools like Debezium and pglogical) has its own monitoring view:
-- Logical replication worker status
SELECT
    subname AS subscription_name,
    pid,
    relid::regclass AS table_name,  -- non-NULL only for initial table-sync workers
    received_lsn,
    last_msg_send_time,
    last_msg_receipt_time,
    latest_end_lsn
FROM pg_stat_subscription;
-- Subscription lag has to be measured on the publisher: received_lsn is a
-- position in the publisher's WAL, so comparing it against the subscriber's
-- own pg_current_wal_lsn() mixes LSNs from two different servers.
-- On the publisher:
SELECT
    slot_name,
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
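Logical walsenders also appear in pg_stat_replication on the publisher, with application_name defaulting to the subscription name, so the time-based lag columns from earlier work for them too:

-- On the publisher: logical subscribers listed alongside physical standbys
SELECT application_name, state, replay_lag
FROM pg_stat_replication;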
Managed Service Specifics
The good news: RDS, Aurora, Azure Database for PostgreSQL, and Cloud SQL all give you replication lag metrics through their monitoring dashboards. The bad news: they also abstract away a lot of the control.
Amazon RDS/Aurora:
- CloudWatch metric: ReplicaLag (in seconds)
- Also check AuroraReplicaLag for Aurora replicas
- Use the AWS console or:
aws cloudwatch get-metric-statistics --metric-name ReplicaLag ...
Azure Database for PostgreSQL:
- Monitor > Metrics > Max Lag Across Replicas
- Also available in Log Analytics
Google Cloud SQL:
- Monitoring > PostgreSQL > Replica lag
- Available in Cloud Monitoring as database/postgresql/replication/replica_byte_lag
One thing managed services often don't expose well: replication slot lag. Even on RDS, slots can accumulate WAL. Monitor pg_replication_slots directly even on managed services.
Setting Up Alerting
What to alert on:
- Replication lag > 30 seconds — warning
- Replication lag > 5 minutes — critical; you probably can't fail over safely
- Any inactive replication slot with lag > 500MB — warning
- Standby disconnected for > 30 seconds — critical
-- Simple query to check alert conditions
SELECT
    CASE
        WHEN pg_is_in_recovery() THEN 'standby'
        ELSE 'primary'
    END AS role,
    CASE
        WHEN NOT pg_is_in_recovery() THEN
            -- counts connected-but-not-streaming walsenders; a standby that
            -- has dropped off entirely won't appear here at all, so also
            -- compare the row count against the number of standbys you expect
            (SELECT count(*) FROM pg_stat_replication WHERE state != 'streaming')
        ELSE NULL
    END AS disconnected_standbys,
    CASE
        WHEN pg_is_in_recovery() THEN
            extract(epoch FROM (now() - pg_last_xact_replay_timestamp()))
        ELSE NULL
    END AS standby_lag_seconds;
MonPG monitors replication lag continuously and alerts before it becomes a failover problem. But even if you're doing this manually, the queries above should be in your runbook.
The 3-hour lag incident taught me: "standby is connected" is not a monitoring metric. Lag is the metric. Watch it constantly.