Replication slots solve a real problem: ensuring a replica can catch up after a network blip without falling off the back of the WAL. They also produce one of the most common preventable Postgres outages: a forgotten slot retaining WAL until the disk fills.
I have helped four different teams recover from this. The bug is always the same: a replica was decommissioned, the slot was not. Months later, the disk fills.
What replication slots do
Without a slot, the primary recycles WAL segments based on max_wal_size. If a replica is offline when the primary recycles a segment, the replica cannot catch up. It needs a fresh base backup.
With a slot, the primary tracks the replica's position. WAL is retained until the replica acknowledges receipt. The replica can disconnect, come back, and resume cleanly.
This is genuinely useful. Networks blip; replicas restart; clusters reconfigure. Without slots, every minor disconnect can require a full re-sync.
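That position tracking is visible directly on the primary; a quick sketch against pg_replication_slots:

-- Each slot's restart_lsn is the oldest WAL position its consumer still needs
SELECT slot_name, restart_lsn, pg_current_wal_lsn() AS primary_lsn
FROM pg_replication_slots;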
The setup
On the primary:
SELECT pg_create_physical_replication_slot('replica_1');
On the replica, in postgresql.conf (or via pg_basebackup -S replica_1 at setup time):
primary_slot_name = 'replica_1'
The replica connects, identifies itself by slot name, and the primary tracks its position.
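For a brand-new replica, the same wiring can happen at clone time; a sketch, with a hypothetical primary host and replication user:

# -S names the slot; -R writes the connection settings (including primary_slot_name)
# into the new data directory on current versions
pg_basebackup -h primary.example.com -U replicator \
  -D /var/lib/postgresql/data \
  -S replica_1 -R -X stream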
How they fail
The failure mode that bites everyone:
- A replica is decommissioned (server retired, environment torn down, region closed).
- The slot is left behind on the primary.
- WAL accumulates indefinitely because the slot still says "this replica might come back."
- The primary's disk fills.
- The application goes down.
The critical query for monitoring:
SELECT
    slot_name,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
What you want to see: active = true for every slot, with retained near zero (under a few hundred MB).
What is wrong: any slot with active = false and a growing retained value.
The fix when a slot is abandoned
If you confirm a replica is gone for good, drop the slot:
SELECT pg_drop_replication_slot('replica_1');
This releases the retained WAL. Postgres recycles the segments at the next checkpoint, and disk usage drops back to normal.
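If the disk is critically full, you do not have to wait for the next automatic checkpoint; forcing one (superuser, or the pg_checkpoint role on newer versions) recycles the segments immediately:

-- Trigger an immediate checkpoint so freed WAL segments are recycled now
CHECKPOINT;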
The critical word is "confirm." If a replica is just temporarily disconnected, dropping the slot makes its catch-up impossible — it would need a full re-sync. Verify the replica is gone before dropping.
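A minimal pre-drop check, reading straight from pg_replication_slots (the slot name is illustrative):

-- active_pid is NULL when no walsender is currently attached to the slot
SELECT slot_name, active, active_pid
FROM pg_replication_slots
WHERE slot_name = 'replica_1';

Note that active = false only proves nothing is connected right now; whether the replica is gone for good is still a human decision.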
Monitoring is the prevention
The alert that matters:
- Any slot with active = false AND retained > 500MB → page.
- Any slot with active = false AND retained > 1GB → page urgent.
This catches abandonment before it becomes an outage. Most teams have monitoring on disk usage, which catches the problem at the last possible moment. Monitoring on slot retention catches it weeks earlier.
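The first condition translates directly into SQL; a sketch with the threshold in bytes:

-- Inactive slots retaining more than 500MB of WAL
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots
WHERE NOT active
  AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 500 * 1024 * 1024;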
When slots are inappropriate
For short-lived replicas (analytics replicas that come and go, ephemeral test environments), the cost of cleaning up slots may exceed the benefit of guaranteed catch-up. For those, consider not using slots — accept that a disconnect requires a fresh base backup.
# Replica without slot — disconnect means re-syncing
primary_slot_name = ''
This is the right choice for replicas that are easily recreated.
Slot vs WAL archive
A related question: do you need slots if you also have WAL archiving?
WAL archiving ships completed WAL segments to durable storage such as S3. A replica that has been disconnected can replay from the archive when it reconnects. In theory, archiving is a substitute for slots.
In practice, archive replay is slower and more complex than slot-based replication. For long-running replicas, slots are simpler. For occasional replicas or backups, archiving is enough.
Most production setups have both: slots for active replicas, archiving for backup/PITR.
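For reference, the archiving side is a pair of settings; a sketch assuming a hypothetical S3 bucket and the AWS CLI:

# postgresql.conf on the primary (%p = path to the segment, %f = its file name)
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f'

# postgresql.conf on the replica: fetch segments the primary no longer has
restore_command = 'aws s3 cp s3://my-bucket/wal/%f %p'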
What I commit to for slot management
- Naming convention: every slot is named for its consumer (replica_us_west_1, replica_analytics, etc.).
- Documentation: a list of what each slot is for, who owns it, and when it can be dropped.
- Quarterly review: every slot's active state and retained size; anything inactive over 1 hour gets investigated (see the query sketch after this list).
- Pre-decommission ritual: when retiring a replica, the slot is the LAST thing dropped, and it is dropped within minutes of the replica being gone.
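For that review, a sketch that pairs each slot with its live connection so orphans stand out:

-- client_addr is NULL when no consumer is attached to the slot
SELECT s.slot_name,
       s.active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn)) AS retained,
       r.client_addr
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) DESC;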
This is operational discipline, not Postgres tuning. The technology is sound; the human-process side is where the failures come from.
A real story
One of the teams I helped had this exact setup: a slot for a development replica that had been turned off six months earlier. WAL retention was 800GB. Disk was at 95%.
We verified the dev replica was unreachable, then:
SELECT pg_drop_replication_slot('dev_replica');
CHECKPOINT;
Within 10 minutes, disk dropped from 95% to 35%. The application recovered without intervention.
The team also added monitoring on slot retention. Six months later, the same problem started — a different abandoned slot — but the alert fired at 500MB retention, weeks before it would have been a problem. They dropped it in five minutes.
The difference between a quiet outage and a non-event is monitoring.