Replication — Lag, Slots, and Standby Health — MonPG Docs

MonPG Replication page

Streaming replication

MonPG reads pg_stat_replication on the primary. For each connected standby it shows the application name, client address, and state (streaming, catchup, or startup), followed by four lag figures: write lag (bytes the standby has not received yet), flush lag (bytes received but not yet fsync'd), replay lag (bytes flushed but not yet applied to the visible snapshot), and total lag in time, which is the number to use for read-after-write reasoning.
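To make the three byte lags concrete, here is a minimal sketch of how they could be derived from the LSN columns PostgreSQL exposes (write_lsn, flush_lsn, replay_lsn) and the primary's current WAL position. How MonPG actually computes them is not stated here, so treat the helper names lsn_to_int and byte_lags, and the choice of the primary's current LSN as the reference point, as assumptions.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def byte_lags(current_lsn: str, write_lsn: str, flush_lsn: str, replay_lsn: str) -> dict:
    """Split the primary-to-standby gap into the three stages described above.

    Assumption: current_lsn is the primary's current WAL position
    (for example, the result of pg_current_wal_lsn()).
    """
    current, write, flush, replay = (
        lsn_to_int(x) for x in (current_lsn, write_lsn, flush_lsn, replay_lsn)
    )
    return {
        "write_lag_bytes": current - write,   # not yet received by the standby
        "flush_lag_bytes": write - flush,     # received but not yet fsync'd
        "replay_lag_bytes": flush - replay,   # flushed but not yet applied
    }


if __name__ == "__main__":
    print(byte_lags("0/5000000", "0/4000000", "0/3800000", "0/3000000"))
```

The three figures sum to the total byte distance between the primary and what the standby has replayed, which is why a large replay lag with small write and flush lags points at slow apply on the standby rather than at the network.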

Replication slots

Data comes from pg_replication_slots. Two patterns to watch:

Orphaned slots: active=false for long stretches. An inactive slot still pins its WAL, so the WAL cannot be reclaimed and the disk fills. The fix is pg_drop_replication_slot('<name>') after confirming the consumer is really gone.

Unbounded slot growth: slot lag over 10GB. The consumer (a logical subscriber, Debezium, or similar) has fallen behind and WAL is piling up. Either fix the consumer or drop the slot if it is no longer needed.
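The two slot patterns can be sketched as a small classifier. This is an illustration under stated assumptions, not MonPG's implementation: the Slot type, the classify function, the inactive_seconds field (pg_replication_slots does not report how long a slot has been inactive on most versions, so a collector would have to track it across samples), and the one-hour cutoff for "long stretches" are all hypothetical. The query uses pg_wal_lsn_diff and pg_current_wal_lsn, which exist in PostgreSQL 10 and later.

```python
from __future__ import annotations

from dataclasses import dataclass

# Query a collector might run on the primary to measure WAL retained per slot.
SLOT_QUERY = """
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
"""

# The 10GB figure from the text, read here as GiB (assumption).
GROWTH_LIMIT_BYTES = 10 * 1024**3


@dataclass
class Slot:
    slot_name: str
    active: bool
    retained_bytes: int
    inactive_seconds: float = 0.0  # hypothetical: tracked by the collector across samples


def classify(slot: Slot, max_inactive_seconds: float = 3600.0) -> list[str]:
    """Return the patterns a slot matches; an empty list means it looks healthy."""
    findings = []
    if not slot.active and slot.inactive_seconds >= max_inactive_seconds:
        findings.append("orphaned")
    if slot.retained_bytes > GROWTH_LIMIT_BYTES:
        findings.append("unbounded_growth")
    return findings


if __name__ == "__main__":
    print(classify(Slot("debezium_cdc", active=True, retained_bytes=11 * 1024**3)))
```

A slot can match both patterns at once, which is the case most likely to fill the disk: nothing is consuming it and it already retains a large amount of WAL.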

WAL archiver

If WAL archiving is enabled (most managed providers enable it automatically), MonPG tracks archiver state, the last archived WAL segment, and the archive error rate. A failing archiver breaks PITR, and WAL accumulates locally instead of being shipped. The usual cause is a misconfigured archive_command or an archive destination (such as an S3 bucket) that is full or unreachable. A non-zero error rate is the canary; fix it before the local disk fills.
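One plausible way to turn the archiver counters into an error rate is to compare two samples of pg_stat_archiver, whose archived_count and failed_count columns are cumulative. The exact definition MonPG uses is not given here, so the formula and the archive_error_rate helper below are assumptions for illustration.

```python
# Columns that exist in the pg_stat_archiver view.
ARCHIVER_QUERY = """
SELECT archived_count, failed_count, last_archived_wal, last_failed_wal
FROM pg_stat_archiver
"""


def archive_error_rate(prev: dict, curr: dict) -> float:
    """Fraction of archive attempts that failed between two samples.

    Both arguments are rows from ARCHIVER_QUERY as dicts. The counters are
    cumulative, so only the difference between samples is meaningful.
    """
    archived = curr["archived_count"] - prev["archived_count"]
    failed = curr["failed_count"] - prev["failed_count"]
    if archived < 0 or failed < 0:
        return 0.0  # counters went backwards: statistics were reset between samples
    attempts = archived + failed
    if attempts == 0:
        return 0.0  # no archiving activity in this interval
    return failed / attempts


if __name__ == "__main__":
    before = {"archived_count": 10, "failed_count": 0}
    after = {"archived_count": 18, "failed_count": 2}
    print(archive_error_rate(before, after))
```

Working from deltas rather than raw counters matters because failed_count never decreases on its own: a single failure weeks ago would otherwise look like a permanent problem.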

Built-in alerts

Three alerts ship by default: replication lag over 1 minute (replay_lag > 60s); any slot with active=false and more than 1GB of lag; and archiver failed_count > 0 in the last hour. Tune these defaults under Alerts & Check-Up if your workload has different tolerances: analytical replicas often run with much higher acceptable lag than read replicas serving user traffic.
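The default rules, and what tuning them means in practice, can be sketched as follows. The Thresholds type, the fired_alerts function, and the alert names are hypothetical and do not reflect MonPG's configuration format; only the three default values come from the text above, with 1GB read as GiB.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Thresholds:
    max_replay_lag_seconds: float = 60.0    # replay_lag > 60s
    max_inactive_slot_bytes: int = 1024**3  # inactive slot with over 1GB of lag
    max_archiver_failures: int = 0          # failed_count increase in the last hour


def fired_alerts(
    replay_lag_seconds: float,
    inactive_slot_lag_bytes: List[int],
    archiver_failures_last_hour: int,
    thresholds: Optional[Thresholds] = None,
) -> List[str]:
    """Return the names of the alerts that would fire for one observation."""
    t = thresholds or Thresholds()
    alerts = []
    if replay_lag_seconds > t.max_replay_lag_seconds:
        alerts.append("replication_lag")
    if any(lag > t.max_inactive_slot_bytes for lag in inactive_slot_lag_bytes):
        alerts.append("inactive_slot_lag")
    if archiver_failures_last_hour > t.max_archiver_failures:
        alerts.append("archiver_failures")
    return alerts


if __name__ == "__main__":
    # Defaults: 75 seconds of replay lag fires the lag alert.
    print(fired_alerts(75.0, [], 0))
    # An analytical replica tolerating 15 minutes of lag stays quiet.
    print(fired_alerts(75.0, [], 0, Thresholds(max_replay_lag_seconds=900.0)))
```

The second call shows the tuning case from the text: the same observation that pages on a user-facing read replica is normal for an analytical one.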