WAL is the write-ahead log. It is also, when things go wrong, the fastest way to fill a disk and force a database into recovery. The standard monitoring story I have inherited at most companies covers WAL volume but misses three signals that actually predict outages.
Here is the layout I have ended up with, and the thresholds I default to.
Signal 1: WAL generation rate
The basic metric: how many WAL bytes per second is the database producing? This is your raw write throughput from a durability point of view. A point-in-time snapshot of the current WAL file and offset:
SELECT pg_walfile_name_offset(pg_current_wal_lsn());
For the delta, sample pg_current_wal_lsn() at fixed intervals and subtract:
-- At time T1
SELECT pg_current_wal_lsn() AS lsn;
-- At time T2 (say, 60 seconds later); 'previous_lsn' stands in for the value captured at T1
SELECT pg_current_wal_lsn() - 'previous_lsn'::pg_lsn AS bytes_in_60s;
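If you want this computed continuously without an external agent, one option is to persist the samples and diff them with a window function. A minimal sketch; the wal_rate_samples table and the once-a-minute cadence are my own convention, not anything built in:

CREATE TABLE IF NOT EXISTS wal_rate_samples (
    sampled_at timestamptz PRIMARY KEY DEFAULT now(),
    lsn        pg_lsn NOT NULL
);

-- Run from cron or your scheduler, e.g. once a minute
INSERT INTO wal_rate_samples (lsn) VALUES (pg_current_wal_lsn());

-- Bytes per second between consecutive samples
SELECT
    sampled_at,
    pg_wal_lsn_diff(lsn, lag(lsn) OVER w)
        / EXTRACT(EPOCH FROM sampled_at - lag(sampled_at) OVER w) AS wal_bytes_per_sec
FROM wal_rate_samples
WINDOW w AS (ORDER BY sampled_at);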
A sustained climb in WAL rate without a corresponding climb in application traffic is usually one of three things: a misbehaving job (mass UPDATEs that the application thinks are idempotent; in Postgres, an UPDATE that changes nothing still writes a new row version and its WAL), a forgotten DDL replay, or a runaway autovacuum (anti-wraparound freezing generates a lot of WAL).
Alert threshold: anything more than 2x the rolling 7-day average for more than 10 minutes deserves attention.
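The baseline can come out of the same samples. A sketch of the comparison, assuming the wal_rate_samples table above; in practice the rolling average usually lives in the metrics system rather than in SQL:

-- Latest rate vs. the 7-day average rate
WITH rates AS (
    SELECT
        sampled_at,
        pg_wal_lsn_diff(lsn, lag(lsn) OVER w)
            / EXTRACT(EPOCH FROM sampled_at - lag(sampled_at) OVER w) AS bps
    FROM wal_rate_samples
    WHERE sampled_at > now() - interval '7 days'
    WINDOW w AS (ORDER BY sampled_at)
)
SELECT
    (SELECT bps FROM rates ORDER BY sampled_at DESC LIMIT 1) AS current_bps,
    avg(bps) AS baseline_bps,
    (SELECT bps FROM rates ORDER BY sampled_at DESC LIMIT 1) > 2 * avg(bps) AS over_threshold
FROM rates;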
Signal 2: Replication lag
WAL is fine when it gets shipped to replicas in time. WAL is a problem when replicas fall behind, because:
- The primary cannot recycle WAL segments that replicas still need.
- Disk usage climbs as a function of the lag.
- A failover during this window means data loss equal to the lag.
The canonical query:
SELECT
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_behind_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_behind_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_behind_bytes
FROM pg_stat_replication;
The three diff columns count how many bytes the primary has written that have not yet been sent to, flushed by, or replayed by the replica. replay_behind_bytes is the user-visible lag; flush_behind_bytes is the durability lag.
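Bytes tell you how far behind; on PostgreSQL 10 and later, pg_stat_replication also exposes interval columns that tell you how long behind, which map more directly onto an RPO conversation:

SELECT
    application_name,
    write_lag,   -- time until the standby has written the WAL
    flush_lag,   -- time until the standby has flushed it (durability lag)
    replay_lag   -- time until it is visible to queries on the standby
FROM pg_stat_replication;
-- These can be NULL when the standby is fully caught up or idle.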
Alert thresholds I use:
- replay_behind_bytes over 100MB on a synchronous replica.
- replay_behind_bytes over 1GB sustained for more than 5 minutes on any replica.
- Any state other than streaming for more than a minute.
Signal 3: WAL retention by replication slots
Replication slots are the mechanism that lets a replica say "do not throw this WAL away yet, I still need it." When a replica disconnects but the slot is not dropped, the primary keeps WAL forever, and the disk fills up.
The query:
SELECT
    slot_name,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
If a slot is active=false and its retained WAL is more than a few hundred MB, that slot is your incident in the making. Either bring the consumer back up or drop the slot.
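Dropping it is one call; the slot name here is a made-up example:

-- Fails if the slot is still in use, which is the safety you want
SELECT pg_drop_replication_slot('decommissioned_replica');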
This is by far the most common WAL-related production incident I have seen — a replica was decommissioned but its slot was not. Six months later, the disk filled.
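On PostgreSQL 13 and later you can also cap the blast radius up front: max_slot_wal_keep_size bounds how much WAL any slot may retain, and pg_replication_slots gains wal_status and safe_wal_size columns that tell you how close each slot is to the cap. The 50GB below is an assumption; size it to your actual disk headroom:

-- Cap slot retention (PostgreSQL 13+); a slot that exceeds the cap is
-- invalidated instead of filling the disk
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

-- wal_status goes reserved -> extended -> unreserved -> lost as a slot
-- approaches and then crosses the cap
SELECT slot_name, active, wal_status, safe_wal_size
FROM pg_replication_slots;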
The forgotten signal: archive failures
If you have archive_mode = on, every WAL segment is supposed to be archived (typically to S3 or a NAS) before it can be recycled. If archiving is failing (wrong credentials, a full destination disk, a network issue), pg_stat_archiver shows it:
SELECT
    archived_count,
    failed_count,
    last_archived_wal,
    last_archived_time,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;
If failed_count is climbing, you have a quiet incident. WAL accumulates on the primary because it cannot be archived, the disk fills, and your backup chain is broken.
Alert: any last_failed_time more recent than last_archived_time.
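As a single boolean for the alert rule, a minimal sketch; the COALESCE covers the happy case where last_failed_time is NULL because archiving has never failed:

SELECT COALESCE(last_failed_time > last_archived_time, false) AS archiver_failing
FROM pg_stat_archiver;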
How I lay out the dashboard
Four panels, in this order:
- WAL generation rate (bytes/sec, 24-hour and 7-day averages overlaid).
- Replication lag for each replica (replay_behind_bytes, with thresholds drawn).
- Replication slots, sorted by retained_wal descending.
- Archive failures (failed_count, with last_failed_time as a separate counter).
The alerts I attach:
- WAL generation rate over 2x baseline → page after 10 minutes.
- Replication lag over threshold → page after 5 minutes.
- Inactive slot retaining more than 500MB → page after 10 minutes.
- Any new archive failure → page immediately.
This layout has caught every WAL-related incident I have seen for the last three years. It is not exotic. It just covers the cases that the default disk_usage_percent alert misses until it is too late.