We once had a replication lag spike to 3 hours because someone ran a massive DELETE on 40 million rows without batching. The WAL flood was so intense that the standby couldn't keep up. Nobody noticed for 45 minutes because we weren't monitoring replication lag — we were only checking "is the standby connected?"
Connected is not the same as current. That gap in our monitoring gave us a near-miss: a failover we almost executed against a standby that was 3 hours behind.
How Streaming Replication Works
The primary writes changes to its Write-Ahead Log (WAL). The standby connects to the primary, streams WAL segments, and replays them to update its own data. This happens continuously and (usually) very quickly — under normal conditions, lag should be under a second.
When it's not under a second, you need to know. When it's minutes or hours, you have a serious problem — either a failing standby or one that won't be useful for failover.
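You can see the moving parts directly. Both servers expose WAL positions as LSNs (log sequence numbers), and every lag query that follows is ultimately a comparison between two of them. A quick illustration, assuming PostgreSQL 10+ function names:

-- On the primary: where WAL generation currently stands
SELECT pg_current_wal_lsn();

-- On the standby: how far replay has progressed
SELECT pg_last_wal_replay_lsn();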
Monitoring Lag from the Primary
-- On the primary: current replication state for all standbys
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag,
    sync_state
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;  -- lag columns are NULL when fully caught up
The replay_lag column is the one you actually care about: it measures how far behind the standby is in applying WAL to its data. write_lag and flush_lag capture the earlier stages (WAL written to the standby's OS, then flushed to its disk), which helps you pinpoint whether the delay is in the network, the standby's storage, or replay itself.
For the lag in bytes (useful for understanding how much WAL the standby hasn't processed):
SELECT
    client_addr,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
    replay_lag AS replay_lag_time
FROM pg_stat_replication;
Monitoring from the Standby
Sometimes you don't have easy access to the primary (managed services, network topology). Check from the standby:
-- On the standby
SELECT
    now() - pg_last_xact_replay_timestamp() AS replication_delay,
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS received_lsn,
    pg_last_wal_replay_lsn() AS replayed_lsn,
    pg_wal_lsn_diff(
        pg_last_wal_receive_lsn(),
        pg_last_wal_replay_lsn()
    ) AS lag_bytes;
If received_lsn = replayed_lsn, the standby has replayed everything it received. If the gap is large, the standby is falling behind in applying WAL — this often points to resource constraints on the standby (CPU, disk I/O).
If pg_last_xact_replay_timestamp() is stale by minutes on an active database, the standby is either not receiving WAL at all (disconnected, or the primary has stopped sending) or receiving it but stuck on replay. The lag_bytes gap above separates the two cases.
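Another way to rule out the "not receiving" case is to check whether a WAL receiver process is running at all. A minimal check, assuming PostgreSQL 11 or later (sender_host and sender_port were added in 11):

-- On the standby: is a WAL receiver connected and streaming?
-- Zero rows means the standby isn't connected to the primary at all.
SELECT pid, status, sender_host, sender_port, last_msg_receipt_time
FROM pg_stat_wal_receiver;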
Replication Slots: Powerful but Dangerous
Replication slots guarantee the primary will keep WAL segments until the consumer confirms it has processed them. This sounds great — and it is, for logical replication consumers that might be temporarily offline.
But here's the problem: if a replication slot consumer goes offline and never comes back, the primary keeps accumulating WAL segments indefinitely. I've seen this fill up a disk completely within 24 hours on a busy database.
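Before hunting for the guilty slot, it helps to know how much WAL is actually sitting on disk. A quick check, assuming PostgreSQL 10+ and superuser (or a role granted EXECUTE on pg_ls_waldir):

-- Total WAL currently retained in pg_wal
SELECT pg_size_pretty(sum(size)) AS wal_on_disk
FROM pg_ls_waldir();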
-- Check all replication slots and their lag
SELECT
    slot_name,
    plugin,
    slot_type,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_pretty
FROM pg_replication_slots
ORDER BY lag_bytes DESC;
Any inactive slot (active = false) with a large lag is a disk bomb waiting to go off. Either fix the consumer or drop the slot:
-- Drop an abandoned slot (make sure you really don't need it)
SELECT pg_drop_replication_slot('my_abandoned_slot');
Alert if any slot has lag over 1GB. If it hits 10GB, treat it as an emergency.
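On PostgreSQL 13 and later, you can also put a hard ceiling on slot retention so an abandoned consumer can never fill the disk. A sketch, with an illustrative value matching the emergency threshold above:

-- A slot retaining more WAL than this is invalidated (its wal_status
-- in pg_replication_slots becomes 'lost') instead of eating the disk.
-- '10GB' is illustrative; size it to your disk headroom.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

The trade-off: an invalidated slot forces the consumer to resynchronize from scratch, so treat the cap as a last line of defense, not a replacement for the alerting.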
Logical Replication Monitoring
Logical replication (whether native, or driven by tools like Debezium and pglogical) has its own monitoring view:
-- Logical replication worker status
SELECT
    subname AS subscription_name,
    pid,
    relid::regclass AS table_name,  -- non-NULL only for initial table-sync workers
    received_lsn,
    last_msg_send_time,
    last_msg_receipt_time,
    latest_end_lsn
FROM pg_stat_subscription;
-- Subscription lag has to be measured on the publisher: received_lsn is a
-- position in the publisher's WAL, so comparing it against the subscriber's
-- own pg_current_wal_lsn() mixes LSNs from two different servers.
-- On the publisher:
SELECT
    slot_name,
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
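Logical walsenders also appear in pg_stat_replication on the publisher, with application_name defaulting to the subscription name, so the time-based lag columns from earlier work for them too:

-- On the publisher: logical subscribers listed alongside physical standbys
SELECT application_name, state, replay_lag
FROM pg_stat_replication;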
Managed Service Specifics
The good news: RDS, Aurora, Azure Database for PostgreSQL, and Cloud SQL all give you replication lag metrics through their monitoring dashboards. The bad news: they also abstract away a lot of the control.
Amazon RDS/Aurora:
- CloudWatch metric: ReplicaLag (in seconds)
- Also check AuroraReplicaLag for Aurora replicas
- Use the AWS console or:
aws cloudwatch get-metric-statistics --metric-name ReplicaLag ...
Azure Database for PostgreSQL:
- Monitor > Metrics > Max Lag Across Replicas
- Also available in Log Analytics
Google Cloud SQL:
- Monitoring > PostgreSQL > Replica lag
- Available in Cloud Monitoring as database/postgresql/replication/replica_byte_lag
One thing managed services often don't expose well: replication slot lag. Even on RDS, slots can accumulate WAL. Monitor pg_replication_slots directly even on managed services.
Setting Up Alerting
What to alert on:
- Replication lag > 30 seconds — warning
- Replication lag > 5 minutes — critical; you probably can't fail over safely
- Any inactive replication slot with lag > 500MB — warning
- Standby disconnected for > 30 seconds — critical
-- Simple query to check alert conditions
SELECT
    CASE
        WHEN pg_is_in_recovery() THEN 'standby'
        ELSE 'primary'
    END AS role,
    CASE
        WHEN NOT pg_is_in_recovery() THEN
            -- counts connected-but-not-streaming walsenders; a standby that
            -- has dropped off entirely won't appear here at all, so also
            -- compare the row count against the number of standbys you expect
            (SELECT count(*) FROM pg_stat_replication WHERE state != 'streaming')
        ELSE NULL
    END AS disconnected_standbys,
    CASE
        WHEN pg_is_in_recovery() THEN
            extract(epoch FROM (now() - pg_last_xact_replay_timestamp()))
        ELSE NULL
    END AS standby_lag_seconds;
MonPG monitors replication lag continuously and alerts before it becomes a failover problem. But even if you're doing this manually, the queries above should be in your runbook.
The 3-hour lag incident taught me: "standby is connected" is not a monitoring metric. Lag is the metric. Watch it constantly.