The most common cause of unexplained Postgres disk growth is a replication slot retaining WAL because its consumer is gone or far behind. The diagnosis is fast: five minutes from "why is the disk growing" to "here is the slot to drop."
Here is the sequence I run.
Step 1: Confirm WAL is the issue
du -sh /var/lib/postgres/16/main/*
If pg_wal/ is the dominant directory and it is large (>>2x max_wal_size), the issue is WAL retention.
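If you cannot shell into the box, the same check works from SQL. A minimal sketch (pg_ls_waldir() requires superuser or the pg_monitor role):
SELECT
pg_size_pretty(sum(size)) AS wal_on_disk,
current_setting('max_wal_size') AS max_wal_size
FROM pg_ls_waldir();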
Step 2: Find the slot retaining the most WAL
SELECT
slot_name,
active,
active_pid,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained,
age(xmin) AS oldest_xid_age
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
The top slot in this list is your candidate. Look at:
- active = true: someone is connected. The slot is doing its job; the consumer is just behind.
- active = false: nobody is connected. The slot is retaining WAL for a consumer that is not there.
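On PostgreSQL 13 and later, two more columns make the risk explicit: wal_status says whether the slot's WAL is still reserved ('lost' means it is already unrecoverable), and safe_wal_size says how much more WAL can be written before the slot crosses max_slot_wal_keep_size. A quick check:
SELECT
slot_name,
active,
wal_status,
pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots;
Note that safe_wal_size is NULL when max_slot_wal_keep_size is -1 (the default, meaning unlimited retention), which is exactly the configuration that lets a slot fill the disk.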
Step 3: If the slot is active, the consumer is behind
For an active slot retaining a lot of WAL, the consumer (a replica or logical subscriber) is falling behind faster than it can catch up. Find it:
SELECT
application_name,
client_addr,
state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_behind
FROM pg_stat_replication;
The replica's replay_behind should roughly match the slot's retained. The diagnostic question is why the replica cannot catch up:
- Network bandwidth saturated: the replica's link cannot pull WAL fast enough. Check network metrics.
- Replica is CPU-bound: applying WAL takes more CPU than the replica has. Upgrade or distribute load.
- Replica has a slow query blocking the WAL apply process (a recovery conflict): kill the slow query. A check for this is sketched after this list.
- WAL generation rate spiked: a runaway job on the primary. Throttle it.
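For the slow-query case, a sketch of the check to run on the replica (the LIMIT is arbitrary): a long-running query can stall WAL apply for up to max_standby_streaming_delay, so the longest-running backends are the first suspects.
SELECT
pid,
now() - query_start AS runtime,
state,
left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST
LIMIT 5;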
Once the consumer catches up, the slot's retained WAL drops to near-zero.
Step 4: If the slot is inactive, the consumer is gone
For active = false, the consumer is not connected. Two possibilities:
1. Temporary disconnect. The replica will come back; we should let the slot continue to retain WAL.
Verify by checking when it was last connected. There is no direct timestamp in pg_replication_slots on this version, but you can correlate with pg_stat_replication history if you have it, or check the replica server itself. (Newer Postgres has a shortcut; see the note at the end of this step.)
2. Permanent abandonment. The replica is gone (decommissioned, deleted, retired). The slot serves no purpose.
If permanent, drop the slot. This is safe to attempt: Postgres refuses to drop a slot whose consumer is still connected, so the call errors out rather than breaking live replication.
SELECT pg_drop_replication_slot('slot_name_here');
WAL retention is released. After the next checkpoint, the WAL files are recycled.
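One note on the "was it temporary?" question above: on PostgreSQL 17 and later (newer than the version in this walkthrough), pg_replication_slots carries an inactive_since timestamp recording when the slot's consumer last disconnected:
SELECT
slot_name,
inactive_since
FROM pg_replication_slots
WHERE NOT active;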
Step 5: Force a checkpoint to reclaim space
Dropping the slot frees retention but does not immediately delete WAL files. Force it:
CHECKPOINT;
Once the checkpoint completes (seconds to a couple of minutes), pg_wal/ shrinks back to normal size: roughly max_wal_size plus a few extra segments, plus whatever wal_keep_size reserves if it is set.
A real diagnostic
Real incident I worked on:
$ du -sh /var/lib/postgres/16/main/*
400G pg_wal/
50G base/
# ... etc
400GB of WAL. Default max_wal_size = 1GB. Something is retaining WAL.
SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
slot_name        | active | retained
-----------------+--------+----------
dev_replica_2024 | f      | 380 GB
prod_replica_us  | t      | 1.2 GB
The abandoned dev_replica_2024 slot was the culprit. Verified the dev replica was gone (the entire dev environment had been torn down 4 months earlier).
SELECT pg_drop_replication_slot('dev_replica_2024');
CHECKPOINT;
Disk usage dropped from 95% to 25% in about 2 minutes. Application stayed up the whole time.
Total time from page to resolution: 12 minutes.
Prevention
The alert that prevents this:
Alert: any pg_replication_slots row with active = false AND retained > 500 MB → page.
This fires weeks before the disk fills. The window between alert and crisis is comfortable; the team has time to verify the consumer is gone and drop the slot deliberately.
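A minimal sketch of that check as a single query; wire it into whatever runs your alerts (a cron job, a postgres_exporter custom query, and so on). The 500 MB threshold matches the alert above and is a judgment call, not a magic number:
SELECT
slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots
WHERE NOT active
AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 500 * 1024 * 1024;
Any row returned is a page.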
Without it, the first signal is the generic disk-usage alert at 90%, by which point you are minutes from the database being unable to commit.
The lesson
Replication slots are a feature that requires accompanying operational hygiene. Most Postgres setups inherit slots without knowing it (especially on managed services with read replicas). The slots work correctly until they don't, and the failure mode is silent for a long time.
For anyone running Postgres in production:
- Know what slots exist. List them (the query after this list). Know what each is for.
- Monitor every slot's retention.
- Drop slots when their consumers are gone, and do it the same day, not when the disk fills.
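The first bullet is one query:
SELECT
slot_name,
slot_type,
database,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots;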
This is fifteen minutes of setup. It prevents the entire class of replication-slot-fills-disk incidents.