Autovacuum usually gets blamed when it runs and ignored when it does not. That is backwards. A healthy autovacuum process is part of normal PostgreSQL performance. The scary version is the one that cannot keep up.
When cleanup falls behind, queries scan through more dead rows, indexes get fatter, statistics drift, and old transactions keep history alive. The symptom might be a slow query, but the cause is often maintenance debt.
I treat autovacuum as a performance system, not a background chore.
Dead tuples are work the query should not have to see
Updates and deletes leave old row versions behind until vacuum can remove them. If a table churns heavily and vacuum lags, normal reads carry extra weight.
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  last_vacuum,
  last_autovacuum,
  last_analyze,
  last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
Stale statistics can be as bad as bloat
Autovacuum also runs ANALYZE. If statistics are stale, the planner may misread selectivity and pick a bad plan. That is why the fix for a slow query is sometimes ANALYZE, not a new index.
ANALYZE orders;
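One quick way to confirm that statistics were the problem, reusing the orders table from the example above (the created_at filter is a hypothetical illustration): run the suspect query under EXPLAIN ANALYZE and compare the planner's row estimate with the actual count.

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM orders
WHERE created_at > now() - interval '1 day';
-- Compare "rows=" (the planner's estimate) with "actual ... rows="
-- in the output. A large gap between the two is the signature of
-- stale statistics; rerun after ANALYZE and the gap should shrink.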
Long transactions block cleanup
A single long-running transaction can prevent vacuum from removing dead tuples that are newer than its snapshot. The app may look idle, but its old transaction keeps the table messy.
SELECT
  pid,
  state,
  now() - xact_start AS xact_age,
  left(query, 180) AS query_sample
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 20;
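When the culprit is an application session left idle inside an open transaction, PostgreSQL can enforce a ceiling server-side. The settings below exist as written; the five-minute value and the example pid are only placeholders.

-- Cancel sessions that sit idle inside an open transaction too long
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();

-- Or end one offending backend by pid, taken from pg_stat_activity
SELECT pg_terminate_backend(12345);  -- 12345 is a placeholder pid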
Per-table tuning beats global panic
The write-heavy table that churns all day does not need the same thresholds as a reference table updated once a month. Tune the tables that create the problem.
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_analyze_scale_factor = 0.01,
  autovacuum_vacuum_cost_limit = 2000
);
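After setting per-table thresholds, it is worth confirming they stuck. Per-table storage parameters are stored in pg_class.reloptions, so a quick catalog query shows every table with an override:

SELECT relname, reloptions
FROM pg_class
WHERE reloptions IS NOT NULL
ORDER BY relname;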
The incident checklist
- Find tables with high dead tuple counts.
- Check old transactions and replication slots.
- Check last autovacuum and last autoanalyze.
- Compare bad query plans before and after ANALYZE.
- Tune the hot table, not the whole cluster first.
- Watch whether vacuum duration and dead tuples actually improve.
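Replication slots deserve the same scrutiny as old transactions: an abandoned slot pins WAL and can hold back the xmin horizon that vacuum cleans up to. A quick look at the catalog:

SELECT
  slot_name,
  active,
  xmin,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
-- An inactive slot with an old xmin or a large retained_wal value
-- is blocking cleanup and should be repaired or dropped.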
The production habit
The teams that get good at PostgreSQL performance do not chase every knob. They turn a vague complaint into a named failure mode, collect one clean measurement, make one change, and then compare the next measurement against the first. That rhythm is slower than guessing for the first hour and much faster by the end of the incident.