WAL Monitoring in Postgres: What to Watch Before Disk Becomes the Story

WAL is fine when it is fine and a crisis when it is not. Three signals tell you the difference, and they are not the ones the dashboards usually surface.

WAL is the write-ahead log. It is also, when things go wrong, the fastest way to fill a disk and force a database into recovery. The standard monitoring story I have inherited at most companies covers WAL volume but misses three signals that actually predict outages.

Here is the layout I have ended up with, and the thresholds I default to.

Signal 1: WAL generation rate

The basic metric. How fast is the database producing WAL bytes per second? This is your raw write throughput from a durability point of view.

-- Point-in-time position: the current WAL file name plus the byte offset within it
SELECT pg_walfile_name_offset(pg_current_wal_lsn());

For a rate, sample pg_current_wal_lsn() at an interval and diff the two positions:

-- At time T1
SELECT pg_current_wal_lsn() AS lsn;
-- At time T2 (say, 60 seconds later); substitute the LSN captured at T1
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), 'previous_lsn'::pg_lsn) AS bytes_in_60s;

A sustained climb in WAL rate without a corresponding climb in application traffic usually means one of three things: a misbehaving job (mass UPDATEs the application treats as idempotent, even though Postgres writes a full new row version for each one), a forgotten DDL replay, or a runaway autovacuum (anti-wraparound freezing generates a lot of WAL).

Alert threshold: anything more than 2x the rolling 7-day average for more than 10 minutes deserves attention.
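
If you would rather keep the sampling inside the database than bolt it onto the monitoring agent, here is a minimal sketch; the wal_lsn_samples table and the once-a-minute schedule are my own conventions, not anything Postgres ships:

CREATE TABLE IF NOT EXISTS wal_lsn_samples (
  sampled_at timestamptz PRIMARY KEY DEFAULT now(),
  lsn        pg_lsn NOT NULL
);

-- Run on a schedule, e.g. every 60 seconds via cron or pg_cron
INSERT INTO wal_lsn_samples (lsn) VALUES (pg_current_wal_lsn());

-- WAL bytes per second between the two most recent samples
SELECT pg_wal_lsn_diff(lsn, lag(lsn) OVER (ORDER BY sampled_at))
     / extract(epoch FROM sampled_at - lag(sampled_at) OVER (ORDER BY sampled_at))
       AS wal_bytes_per_sec
FROM wal_lsn_samples
ORDER BY sampled_at DESC
LIMIT 1;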

Signal 2: Replication lag

WAL is fine when it gets shipped to replicas in time. WAL is a problem when replicas fall behind, because:

  • The primary cannot recycle WAL segments that replicas still need.
  • Disk usage climbs as a function of the lag.
  • A failover during this window means data loss equal to the lag.

The canonical query:

SELECT
  application_name,
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_behind_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_behind_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_behind_bytes
FROM pg_stat_replication;

The three columns mean: how many bytes the primary has written that have not yet been sent / flushed / replayed by the replica. replay_behind_bytes is the user-visible lag. flush_behind_bytes is the durability lag.

Alert thresholds I use:

  • replay_behind_bytes over 100MB on a synchronous replica.
  • replay_behind_bytes over 1GB sustained for more than 5 minutes on any replica.
  • Any state other than streaming for more than a minute.
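
Wired into a check, those thresholds come out as something like this sketch, with the 1GB cutoff inlined as raw bytes (alert on raw bytes; pg_size_pretty is only for humans):

SELECT
  application_name,
  state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) > 1073741824  -- 1GB
   OR state <> 'streaming';

The sustained-for-five-minutes condition belongs in the alerting layer; the query reports only the instantaneous state.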

Signal 3: WAL retention by replication slots

Replication slots are the mechanism that lets a replica say "do not throw this WAL away yet, I still need it." When a replica disconnects but the slot is not dropped, the primary keeps WAL forever, and the disk fills up.

The query:

SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

If a slot is active=false and its retained WAL is more than a few hundred MB, that slot is your incident in the making. Either bring the consumer back up or drop the slot.
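
Dropping it is one call. The slot name below is a placeholder, and the operation is destructive: a consumer can never resume from a dropped slot. Postgres also refuses to drop a slot that is still in use, which doubles as a safety check.

SELECT pg_drop_replication_slot('decommissioned_replica');  -- placeholder name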

This is by far the most common WAL-related production incident I have seen — a replica was decommissioned but its slot was not. Six months later, the disk filled.

The forgotten signal: archive failures

If you have archive_mode = on, every WAL segment is supposed to be archived (typically to S3 or a NAS) before being recycled. If archiving is failing (wrong credentials, a full disk on the destination, a network issue), pg_stat_archiver shows it:

SELECT
  archived_count,
  failed_count,
  last_archived_wal,
  last_archived_time,
  last_failed_wal,
  last_failed_time
FROM pg_stat_archiver;

If failed_count is climbing, you have a quiet incident. WAL accumulates on the primary because it cannot be archived, the disk fills, and your backup chain is broken.

Alert: any last_failed_time more recent than last_archived_time.
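
That rule translates directly into SQL. A sketch that returns a row only while archiving is in a failed state (the coalesce covers a cluster that has never archived anything):

SELECT failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver
WHERE last_failed_time > coalesce(last_archived_time, '-infinity');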

How I lay out the dashboard

Four panels, in this order:

  1. WAL generation rate (bytes/sec, 24-hour and 7-day averages overlaid).
  2. Replication lag for each replica (replay_behind_bytes, with thresholds drawn).
  3. Replication slots, sorted by retained_wal descending.
  4. Archive failures (failed_count, with last_failed_time as a separate counter).

The alerts I attach:

  • WAL generation rate over 2x baseline → page after 10 minutes.
  • Replication lag over threshold → page after 5 minutes.
  • Inactive slot retaining more than 500MB → page after 10 minutes.
  • Any new archive failure → page immediately.
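
Three of the four collapse into a single probe if you want one; a sketch, with the thresholds above inlined and the signal labels my own invention. The rate alert needs history, so it stays in the metrics pipeline.

-- One row per firing condition; an empty result means healthy.
SELECT 'replication_lag' AS signal, application_name::text AS detail
FROM pg_stat_replication
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) > 1073741824  -- 1GB
UNION ALL
SELECT 'inactive_slot', slot_name::text
FROM pg_replication_slots
WHERE NOT active
  AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 524288000  -- 500MB
UNION ALL
SELECT 'archive_failure', last_failed_wal
FROM pg_stat_archiver
WHERE last_failed_time > coalesce(last_archived_time, '-infinity');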

This layout has caught every WAL-related incident I have seen for the last three years. It is not exotic. It just covers the cases that the default disk_usage_percent alert misses until it is too late.