The server with no free disk is one of the few Postgres outages where the symptom and the root cause are the same. Postgres cannot write WAL → cannot accept commits → application dies. There is no clever workaround. You either get the disk under control or you do not have a database.
I have been on the receiving end of this twice. Here is the playbook that came out of those experiences.
The first 10 minutes
Get a shell on the server and run, in order:
df -h /var/lib/postgresql
du -sh /var/lib/postgresql/*
du -sh /var/lib/postgresql/16/main/*
The df output tells you which filesystem is full. The du breakdown shows what is eating the space. Usual culprits, in rough order of likelihood:
- pg_wal/ — WAL accumulating, not being recycled. Check why.
- base/ — actual table data. Maybe a table grew unexpectedly.
- pg_log/ (or wherever logs go) — log files filled up.
- pg_tblspc/ — extra tablespaces.
If pg_wal is dominant, the next question is why WAL is not being recycled.
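If you still have a working psql session, the same number is visible from inside Postgres. A minimal check, assuming Postgres 12 or later and a role allowed to call pg_ls_waldir() (superuser or pg_monitor):
-- size of pg_wal as Postgres sees it
SELECT count(*) AS segments,
       pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();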
WAL not being recycled — the four reasons
-- 1. Replication slot stuck behind
SELECT slot_name, active, restart_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
If a slot has active = false and is retaining a lot of WAL, the consumer (typically a replica that no longer exists) has stopped reading it and the slot is pinning WAL. Drop the slot to free it:
SELECT pg_drop_replication_slot('slot_name_here');
This is the most common cause. A replica was retired, the slot was forgotten, WAL accumulated for weeks.
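To stop a forgotten slot from taking the disk down again, Postgres 13 and later can cap how much WAL any slot may retain. A sketch; the 10GB limit is an example you would size to your own disk:
-- slots that fall further behind than this are invalidated instead of filling the disk
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
-- an invalidated slot shows wal_status = 'lost' in pg_replication_slots,
-- and its consumer then has to be re-seeded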
-- 2. Archive failures
SELECT failed_count, last_failed_wal, last_failed_time, last_archived_time
FROM pg_stat_archiver;
If last_failed_time is more recent than last_archived_time, the archive command is failing. WAL cannot be recycled until it is archived. Fix the archive destination (S3 credentials, disk full on archive target, network) and Postgres will catch up.
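The fastest way to see the real error is usually to run the archive command by hand. The command itself depends entirely on your setup, so this only shows how to find it and what the placeholders mean:
SHOW archive_command;
-- take the command it prints, substitute %p with pg_wal/<last_failed_wal> and %f with
-- <last_failed_wal>, and run it as the postgres OS user; the error it produces
-- (credentials, permissions, full archive target) is usually clearer than the server log entry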
-- 3. Long-running transactions blocking xmin
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND xact_start < now() - interval '1 hour'
ORDER BY xact_start;
A stuck transaction holds an old xmin, preventing vacuum and WAL recycling. Kill it: SELECT pg_terminate_backend(pid);.
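If several sessions are stuck and you have decided they are all safe to kill, the same filter works inside pg_terminate_backend. A blunt instrument; the one-hour threshold is only an example:
-- terminate every non-idle backend whose transaction has been open for over an hour
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start < now() - interval '1 hour'
  AND pid <> pg_backend_pid();  -- never kill your own session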
-- 4. Checkpoint not running
SHOW max_wal_size;
SELECT * FROM pg_stat_bgwriter;
If max_wal_size is configured very high relative to the disk, WAL can accumulate beyond what the disk can hold. Lower it temporarily (ALTER SYSTEM SET max_wal_size = '4GB'; SELECT pg_reload_conf();) and force a checkpoint: CHECKPOINT;.
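To confirm the checkpoint actually ran, pg_control_checkpoint() reports the last one directly:
-- when did the last checkpoint complete, and where did it leave the redo pointer?
SELECT checkpoint_time, redo_lsn
FROM pg_control_checkpoint();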
Emergency space recovery
If the disk is at 99% and you cannot get even a small write through, you need space immediately. The order of options:
- Truncate logs. PostgreSQL log files are usually under /var/log/postgresql/. truncate -s 0 the largest one (do not delete it; Postgres has the file handle open). See the sketch after this list.
- Move logs to a different volume, if you have one: mv /var/log/postgresql /mnt/other_disk/postgresql && ln -s /mnt/other_disk/postgresql /var/log/postgresql.
- Drop replication slots. Frees WAL immediately.
- Vacuum a heavily bloated table to reclaim space. Vacuuming itself writes WAL, so this only works once you have a few hundred MB free.
- Last resort: extend the volume. If on cloud (RDS, Aurora, GCP), this is one click but takes a few minutes. On bare metal, this is hardware work.
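For the log-truncation step, a minimal shell sketch; the file name here is an example, yours will differ:
# find the biggest log files on the full volume
du -sh /var/log/postgresql/* | sort -rh | head -5
# empty the largest one in place; truncating keeps the inode Postgres is still writing to
truncate -s 0 /var/log/postgresql/postgresql-16-main.log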
Do not delete files in pg_wal directly, even if they look like obviously old WAL. Postgres tracks its WAL segments internally, not by what happens to be on the filesystem; deleting the wrong segment corrupts the cluster. Use pg_archivecleanup if you must, and only if you fully understand the situation.
Once Postgres is breathing again
With free space restored, the immediate crisis is over. The follow-up:
- Confirm vacuum is running. A long-blocked vacuum will take a while to catch up.
- Confirm WAL is recycling. pg_wal/ should stop growing.
- Confirm replication is healthy. Check pg_stat_replication (see the query after this list).
- Confirm the application is reconnecting. Some pools take a minute to retry.
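For the replication check, a query along these lines shows each standby and how far behind it is:
SELECT client_addr, state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS bytes_behind
FROM pg_stat_replication;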
Then the postmortem.
What I check after every incident
The disk-full incident usually has a root cause that was building for weeks. The checks I run for prevention:
- Disk usage trend over the last 30 days. Was the growth linear (predictable) or sudden (a runaway job)? Linear means I can extrapolate to the next outage and plan capacity.
- WAL retention. Is any slot retaining more than 1GB? Set up a per-slot alert.
- Archive failure rate. Is the archiver failing periodically? If yes, the next disk-full is coming.
- Largest growers. What table or index doubled in size in the last week? Sometimes a forgotten cron job inserts more than expected. A snapshot query is sketched after this list.
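For that last check, a snapshot of the biggest relations; comparing against last week means storing a snapshot like this somewhere and diffing it:
-- ten largest tables by total size (including indexes and TOAST)
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;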
Prevention that actually works
The alerts I keep:
- Disk usage above 80% → warn.
- Disk usage above 90% → page now.
- Replication slot retaining >500MB and inactive >1 hour → page.
- Archive failure with last_failed_time newer than last_archived_time → page.
- WAL accumulation rate exceeds capacity to ship to archive → page.
These cover the common paths to disk-full. Each one fires before the database is down, not while the application is already failing.
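The slot and archiver alerts reduce to one polling query that a cron job or exporter can run. Thresholds match the list above; the one-hour inactivity window is easier to track on the monitoring side, so it is left out here:
-- any row returned means an alert should fire
-- (the archiver check assumes at least one successful archive has ever happened)
SELECT 'slot_retention' AS alert, slot_name::text AS detail
FROM pg_replication_slots
WHERE NOT active
  AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 500 * 1024 * 1024
UNION ALL
SELECT 'archiver_failing', last_failed_wal
FROM pg_stat_archiver
WHERE last_failed_time > last_archived_time;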
What you cannot prevent
A single application bug that writes 100x more data than expected can fill a disk in minutes. No amount of monitoring catches it before impact. The defense for that is:
- Disk capacity 2-3x what you think you need.
- Per-workload disk usage attribution (which tables grew the most).
- A retention or partitioning strategy so the worst case is bounded (a sketch follows this list).
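On the partitioning point: dropping a whole partition hands the space straight back to the filesystem, which DELETE plus VACUUM usually does not. A minimal sketch with a hypothetical events table:
-- hypothetical append-only table, partitioned by month
CREATE TABLE events (
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- the retention job later drops the oldest partition; this is instant and frees the disk
DROP TABLE events_2024_01;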
And backups, always. Restore drills, periodically. The day a disk fills and you cannot extend it fast enough, the answer is to restore to bigger storage. That only works if the backups work.
The preparation is more important than the command you run at 2 a.m. I like writing this down because every team I have helped through it had the same realization afterward: most of the work that matters happens months earlier.