
WAL Monitoring in Postgres: What to Watch Before Disk Becomes the Story

WAL problems usually show up as disk problems, and by then it is late. Monitor generation rate, checkpoints, archiving, replication lag, and slot retention before pg_wal owns the incident.

The disk alert said 92%. The database alert said nothing useful. Traffic was normal, CPU was fine, and the application team had already stopped the batch job. The disk kept filling anyway.

The cause was an inactive replication slot retaining WAL. PostgreSQL was doing the safe thing: keeping log files a consumer might still need. The monitoring was doing the unsafe thing: treating WAL as just disk usage.

WAL is not a background detail. It is the write path, the recovery path, the replication path, and often the first place write-heavy systems show pain.

The framework: watch generation, retention, and drainage

I split WAL monitoring into three questions:

  • How fast are we generating WAL?
  • What forces WAL to stay on disk?
  • How fast are replicas, archives, and slots draining it?

Disk usage is the last symptom. These are the earlier signals.

Measure WAL generation rate

On PostgreSQL 14 and later, pg_stat_wal exposes cumulative WAL counters directly.

SELECT
  wal_records,
  wal_fpi,
  pg_size_pretty(wal_bytes) AS wal_written,
  stats_reset
FROM pg_stat_wal;

Take deltas over time. These counters are cumulative since stats_reset, so a single value tells you almost nothing. A sudden jump in WAL bytes per minute usually means an import, an index build, a vacuum-heavy period, or a change in the write path.
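
One way to take those deltas without an external metrics system is to sample the counter into a table and diff consecutive rows. This is a minimal sketch, not a recommended schema: the wal_samples table and its column names are hypothetical, and the sampling schedule is assumed to come from whatever scheduler you already run.

```sql
-- Hypothetical sample table; the name and columns are illustrative only.
CREATE TABLE IF NOT EXISTS wal_samples (
  sampled_at timestamptz NOT NULL DEFAULT now(),
  wal_bytes  numeric     NOT NULL
);

-- Run this on a fixed schedule, for example once a minute.
INSERT INTO wal_samples (wal_bytes)
SELECT wal_bytes FROM pg_stat_wal;

-- WAL generated between consecutive samples, normalized to bytes per minute.
-- The first row has no predecessor, so its rate is NULL.
SELECT
  sampled_at,
  pg_size_pretty(
    (wal_bytes - lag(wal_bytes) OVER w)
    / NULLIF(extract(epoch FROM sampled_at - lag(sampled_at) OVER w)::numeric, 0)
    * 60
  ) AS wal_per_minute
FROM wal_samples
WINDOW w AS (ORDER BY sampled_at)
ORDER BY sampled_at DESC
LIMIT 20;
```

A negative rate means the statistics were reset between two samples, so that row should be discarded. On versions without pg_stat_wal, sampling pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') into the same column gives a comparable monotonic byte position.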

Check replication slots before blaming storage

SELECT
  slot_name,
  slot_type,
  active,
  restart_lsn,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

A replication slot can retain WAL until its consumer catches up. That is the correct behavior for replication safety. It is also how a forgotten logical slot fills a disk.

If a slot is inactive and retained WAL is growing, treat it as an incident before disk is critical. Decide whether the consumer will recover or whether the slot should be dropped.
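
To support that decision, the query below narrows the view to inactive slots and adds the slot health columns. It is a sketch under assumptions: wal_status and safe_wal_size require PostgreSQL 13 or later, the slot name forgotten_slot is a placeholder, and the 50GB cap is an arbitrary illustrative value, not a recommendation.

```sql
-- Inactive slots, largest retention first.
-- safe_wal_size is NULL unless max_slot_wal_keep_size is set.
SELECT
  slot_name,
  slot_type,
  wal_status,
  pg_size_pretty(safe_wal_size) AS safe_wal_size,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

-- Destructive: run only after confirming the consumer is gone for good.
-- 'forgotten_slot' is a placeholder name.
-- SELECT pg_drop_replication_slot('forgotten_slot');

-- Optional guardrail (PostgreSQL 13+): cap how much WAL any slot may retain.
-- A slot that exceeds the cap is invalidated and its consumer must be rebuilt.
-- ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
-- SELECT pg_reload_conf();
```

Dropping a slot releases its WAL at the next checkpoint, but the consumer cannot resume from where it stopped. The cap trades that same loss for a bounded pg_wal, so it is a deliberate choice rather than a free safety net.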

Watch replicas as a drain, not just a status

SELECT
  application_name,
  state,
  sync_state,
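  -- Aliases end in _bytes so they are not confused with the view's own replay_lag interval column.
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS send_lag_bytes,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes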
FROM pg_stat_replication;

A replica can be connected and still falling behind. If replay lag grows while WAL generation is high, the primary may eventually carry more WAL than expected.
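
Byte lag says how much WAL is outstanding; time lag says how stale the replica is. The sketch below shows both built-in sources of time lag, assuming PostgreSQL 10 or later for the interval columns. The first query runs on the primary and the second on the replica.

```sql
-- On the primary: time-based lag as measured by the primary.
-- These columns go NULL once the standby is caught up and the system is idle.
SELECT
  application_name,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

-- On the replica: age of the most recently replayed transaction.
-- This also grows when the primary is simply idle, so pair it with byte lag.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```

Neither number is reliable alone. Alerting on time lag only when byte lag is also non-zero avoids paging on a quiet primary.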

Archive failure is another retention source

SELECT
  archived_count,
  failed_count,
  last_archived_wal,
  last_archived_time,
  last_failed_wal,
  last_failed_time
FROM pg_stat_archiver;

If archiving is enabled and failing, WAL accumulates because PostgreSQL will not recycle a segment until it has been archived successfully. Do not silence this with a no-op archive_command that always reports success unless you have deliberately chosen to break point-in-time recovery.
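
The raw counters are easier to alert on once they are turned into "how long since the last success" and "how many segments are waiting". A sketch, assuming PostgreSQL 12 or later for pg_ls_archive_statusdir() and a role with superuser or pg_monitor privileges; the column aliases are made up for illustration.

```sql
-- Time since the last successful archive, and whether the most recent attempt failed.
SELECT
  failed_count,
  last_failed_wal,
  now() - last_archived_time AS since_last_success,
  (last_failed_time IS NOT NULL
   AND (last_archived_time IS NULL OR last_failed_time > last_archived_time))
    AS currently_failing
FROM pg_stat_archiver;

-- Segments finished but not yet archived (one .ready file per segment).
SELECT count(*) AS segments_waiting
FROM pg_ls_archive_statusdir()
WHERE name LIKE '%.ready';
```

A steadily growing segments_waiting is the archive-side equivalent of retained slot WAL: each waiting segment is a file that cannot be recycled yet.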

max_wal_size is a soft limit

PostgreSQL documentation calls max_wal_size a soft limit. WAL can exceed it under heavy load, failed archiving, or retention requirements such as replication slots. That is why "but max_wal_size is 4GB" is not a disk safety plan.

Use it to shape checkpoint behavior, not as the only guardrail against disk exhaustion.
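
To see how far the soft limit and reality have drifted apart, compare the actual size of pg_wal with the setting. This is a sketch that assumes PostgreSQL 10 or later for pg_ls_waldir() and a role with superuser or pg_monitor privileges; the aliases are illustrative.

```sql
-- Actual WAL on disk versus the configured soft limit.
SELECT
  pg_size_pretty(sum(size)) AS pg_wal_on_disk,
  current_setting('max_wal_size') AS max_wal_size,
  sum(size) > pg_size_bytes(current_setting('max_wal_size')) AS over_soft_limit
FROM pg_ls_waldir();
```

When over_soft_limit is true for more than a short burst, something is holding WAL back, and the slot, replica, and archiver queries are the places to look.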

Checkpoint pressure changes write latency

Frequent checkpoints can create write spikes. When WAL volume forces checkpoints closer together than checkpoint_warning (30 seconds by default), PostgreSQL logs a warning, which is often a sign that max_wal_size is too small for the workload.

SHOW checkpoint_timeout;
SHOW checkpoint_completion_target;
SHOW max_wal_size;

When insert latency gets spiky, look at checkpoints and WAL generation together. The symptom may be commit latency, but the cause can be checkpoint IO.
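
The settings show intent; the checkpoint counters show what actually happened. The sketch below assumes PostgreSQL 16 or earlier, where these counters live in pg_stat_bgwriter. From PostgreSQL 17 they moved to pg_stat_checkpointer, as noted in the comment. The pct_requested alias is made up for illustration.

```sql
-- PostgreSQL 16 and earlier.
-- checkpoints_req counts checkpoints forced by WAL volume or an explicit request;
-- checkpoints_timed counts those triggered by checkpoint_timeout.
SELECT
  checkpoints_timed,
  checkpoints_req,
  round(100.0 * checkpoints_req
        / NULLIF(checkpoints_timed + checkpoints_req, 0), 1) AS pct_requested,
  checkpoint_write_time,  -- milliseconds spent writing files
  checkpoint_sync_time,   -- milliseconds spent syncing files
  stats_reset
FROM pg_stat_bgwriter;

-- PostgreSQL 17 and later expose the same idea as:
-- SELECT num_timed, num_requested, write_time, sync_time, stats_reset
-- FROM pg_stat_checkpointer;
```

As with WAL bytes, these are cumulative, so compare two samples. A high share of requested checkpoints during the spiky window suggests WAL volume, not the timer, is driving checkpoint IO.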

What I alert on

  • WAL bytes generated per minute.
  • Replication slot retained bytes.
  • Inactive slots retaining WAL.
  • Replica replay lag in bytes and time.
  • Archive failures and time since last archive success.
  • Disk free percentage for the WAL volume or managed storage metric.
  • Checkpoint frequency and checkpoint-related write latency.

The pragmatic default

Monitor WAL like a pipeline. Writes generate it. Replicas and archives consume it. Slots can retain it. Checkpoints shape the IO. Disk usage is the final scoreboard, not the early warning.