The "database is not accepting commands to avoid wraparound data loss" error is one of the few Postgres errors that takes the cluster down hard. It means transaction IDs have come dangerously close to wrapping around, and Postgres has refused new transactions to prevent data loss.
This is a rare, preventable, and recoverable situation. Here is the runbook for when it happens, and the prevention steps to make sure it does not happen again.
What is wraparound
Postgres uses 32-bit transaction IDs, compared modulo 2^31, so the ID space is circular: any snapshot can only distinguish about 2 billion IDs into the past from 2 billion into the future. The MVCC system uses these comparisons to determine row visibility — when unfrozen rows age past that horizon, the comparisons break. Old rows could appear new, current rows could appear old. Data is not actually corrupted on disk, but visibility logic is broken, which to the application is indistinguishable from data loss.
To prevent this, Postgres tracks a per-database datfrozenxid — the oldest transaction ID that may still appear unfrozen in that database. When age(datfrozenxid) climbs toward the horizon, Postgres triggers aggressive autovacuum runs that mark old rows as "frozen" (their visibility no longer depends on their transaction ID).
The escalation:
- At 200 million transactions of age (autovacuum_freeze_max_age, the default): anti-wraparound autovacuum is forced, even on tables autovacuum would otherwise skip.
- Within about 40 million transactions of the shutdown point: warning logs about wraparound.
- A few million transactions short of true wraparound (1 million before Postgres 14, 3 million since): the cluster refuses new transactions to prevent data loss.
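In the logs, the last two stages look roughly like this (database name and counts are placeholders):
WARNING:  database "appdb" must be vacuumed within 39562847 transactions
ERROR:  database is not accepting commands to avoid wraparound data loss in database "appdb"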
The diagnostic
The canonical query:
SELECT
datname,
age(datfrozenxid) AS xid_age,
2147483647 - age(datfrozenxid) AS xids_until_wraparound
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
Reading xid_age:
- Over 100 million: monitor.
- Over 1 billion: investigate immediately.
- Over 1.5 billion: emergency.
- Over 1.9 billion: critical.
The per-table version (relfrozenxid lives in pg_class):
SELECT
n.nspname AS schemaname, c.relname,
age(c.relfrozenxid) AS table_age,
pg_size_pretty(pg_table_size(c.oid)) AS size,
s.last_autovacuum
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relkind IN ('r', 'm', 't')  -- tables, matviews, and toast tables all carry relfrozenxid
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
The biggest tables with the oldest relfrozenxid are the candidates.
What to do when warnings appear
If xid_age is climbing toward the danger zone but the cluster is still up:
1. Identify what is preventing autovacuum from finishing.
-- Long-running transactions hold xmin, blocking vacuum
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
AND state != 'idle'
AND xact_start < now() - interval '5 minutes'
ORDER BY xact_start;
Kill long-running transactions. They prevent vacuum from making progress.
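Both cancellation and termination are plain SQL; a minimal sketch, with the PID as a placeholder taken from the query above:
-- Try cancelling the current query first; escalate to terminating the backend
SELECT pg_cancel_backend(12345);
SELECT pg_terminate_backend(12345);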
-- Replication slots holding xmin
SELECT slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots
WHERE NOT active;
Drop inactive slots that should not be retaining state.
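Dropping a slot is one statement; the slot name here is a placeholder:
SELECT pg_drop_replication_slot('old_reporting_replica');
A dropped slot cannot be resumed — confirm the consumer is really dead first, because any replica that depended on it will need to be re-synced.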
-- Idle in transaction connections
SELECT pid, application_name, query, now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - state_change > interval '5 minutes';
Kill idle-in-transaction connections.
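This can be done in bulk; a sketch that terminates everything idle in transaction for more than five minutes:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - state_change > interval '5 minutes';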
2. Speed up autovacuum.
ALTER SYSTEM SET autovacuum_max_workers = 6;  -- takes effect only on restart
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0;
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 10000;
ALTER SYSTEM SET maintenance_work_mem = '2GB';
SELECT pg_reload_conf();
More workers, less throttling, more memory per worker. Vacuum runs faster. Note that autovacuum_max_workers requires a full restart; the cost and memory settings apply on reload.
3. Manually freeze the worst tables.
VACUUM (FREEZE, VERBOSE) schema.large_table;
This forces a freeze of all tuples in the table. On a multi-TB table this can take hours, but it does not block normal reads and writes (it takes a ShareUpdateExclusiveLock, which conflicts with DDL and other vacuums, not with DML).
Run against the tables with the highest relfrozenxid age first. Each completed freeze advances that table's relfrozenxid, and the database's datfrozenxid — the minimum relfrozenxid across all its tables — follows once the oldest tables are done.
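Progress is visible while a freeze runs; a sketch against pg_stat_progress_vacuum (the percentage column is a derived convenience, not a built-in):
SELECT p.pid, c.relname, p.phase,
round(100.0 * p.heap_blks_scanned / nullif(p.heap_blks_total, 0), 1) AS pct_scanned
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;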
What to do when the cluster is refusing transactions
If you see "database is not accepting commands to avoid wraparound data loss":
1. Stop the application. New transaction attempts will fail. Stop them so the cluster can focus on vacuum.
2. Connect as superuser in single-user mode (last resort). This is documented in the Postgres docs: shut down the cluster, start postgres --single against the affected database, run VACUUM. This bypasses the wraparound check.
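A sketch of that procedure — the data directory and database name are placeholders, and in single-user mode you type SQL directly at the backend prompt:
pg_ctl stop -D /var/lib/postgresql/data -m fast
postgres --single -D /var/lib/postgresql/data appdb
backend> VACUUM;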
In practice, you can usually avoid single-user mode by:
- Stopping the application (so no new transactions).
- Connecting as a superuser (these connections are still allowed even when others are blocked).
- Running VACUUM (FREEZE) against the affected tables.
3. Once xid_age drops, re-enable application traffic.
Prevention
After the emergency, the prevention work:
1. Monitor xid_age proactively. Alert at 500 million age. Investigate every alert.
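A sketch of the check behind that alert — the 500 million threshold is the recommendation above, not a Postgres constant:
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
WHERE age(datfrozenxid) > 500000000;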
2. Tune autovacuum to keep up.
ALTER SYSTEM SET autovacuum_freeze_max_age = 200000000; -- default, do not raise
ALTER SYSTEM SET vacuum_freeze_table_age = 150000000;
ALTER SYSTEM SET vacuum_freeze_min_age = 50000000;
The most important rule: never raise autovacuum_freeze_max_age above the default to "give vacuum a break." That just postpones the problem until it is much worse.
3. Ensure no chronic blockers.
- Set idle_in_transaction_session_timeout = '5min' (see the snippet after this list). Kills connections that hold transactions open without doing work.
- Monitor replication slots; drop inactive ones.
- Document which long-running queries are legitimate; question others.
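Setting the timeout cluster-wide is two statements; '5min' is a judgment call, not a universal default:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();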
4. For very large tables, consider table-level overrides:
ALTER TABLE big_events SET (
autovacuum_vacuum_scale_factor = 0.02,
autovacuum_freeze_min_age = 10000000
);
More aggressive freeze for tables that are hot or large.
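To confirm the overrides took, read the stored options back from pg_class:
SELECT relname, reloptions FROM pg_class WHERE relname = 'big_events';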
A real recovery
A team I helped had a 5TB table that had not been vacuumed in months because of a long-running pg_dump that held an xmin. The cluster's xid_age was at 1.7 billion. They had 300 million transactions before refusal.
The steps:
- Killed the pg_dump (it had been forgotten about).
- Tuned autovacuum: vacuum_cost_delay = 0, max_workers = 8, maintenance_work_mem = 4GB.
- Started VACUUM (FREEZE) big_table; in a screen session.
- Watched xid_age and pg_stat_progress_vacuum to track progress.
The vacuum took 14 hours. The xid_age dropped from 1.7 billion to 200 million by the time it finished. Crisis averted.
The fix afterward:
- Added monitoring on long-running transactions.
- Reduced default autovacuum thresholds for the big table.
- Set idle_in_transaction_session_timeout.
- Quarterly review of replication slots.
They have not had a wraparound scare since.
What I commit to
For any production Postgres:
- Monitor xid_age per database. Alert at 500M.
- Set idle_in_transaction_session_timeout.
- Monitor replication slots; drop inactive ones promptly.
- Keep autovacuum cost delay low (0-2ms).
- For very large tables, table-level autovacuum tuning.
- Annual fire drill: simulate a stuck vacuum, verify the alert fires and the response works.
Wraparound is one of those events that should be impossible if monitoring and tuning are right. The teams that hit it were almost always operating on default settings without the monitoring to catch the slow climb.