Postgres Memory Pressure: Diagnosing OOMs and the Settings That Prevent Them
A Postgres OOM kill is one of the few crashes that is almost always preventable. The pattern is consistent enough to have a checklist.
Notes for the problems that show up after launch: bad plans, awkward migrations, index debt, vacuum pressure, replica lag, and the small decisions that make PostgreSQL easier to operate.
Temp files are hidden disk work. They explain slow sorts, hash joins, and aggregations that look fine until work_mem runs out under real concurrency.
Bad plans usually start with bad row estimates. Fix the first wrong estimate and the rest of the plan often stops looking mysterious.
Slow inserts are rarely just inserts. They are usually index maintenance, constraint and trigger work, WAL/checkpoint pressure, or a transaction pattern that makes every row pay retail.
Slow deletes usually come from the work around the row: foreign keys, triggers, indexes, WAL, vacuum debt, and transaction size. The fix starts by finding what each deleted row has to pay for.
Wraparound emergencies are preventable, but once warnings start you need a calm runbook: identify old XIDs, unblock vacuum, freeze priority tables, and protect availability.
Plan regressions are painful because the SQL did not change. The work is proving the plan changed, finding the estimate or stats shift, and restoring the safe path.
psql is more capable than most people give it credit for. A handful of meta-commands and shortcuts make it the most productive shell for database work.
pg_stat_statements is cumulative since the last reset, and by default its counters survive a clean server restart. If you reset it, you lose the data you would have used to debug the next incident.
A useful Postgres health check is not a wall of green checks. It is a short path from symptom to evidence: sessions, locks, slow SQL, vacuum, replication, and WAL.
auto_explain is for the slow plan you cannot reproduce later. It captures the execution plan when the bad thing actually happens.
pgbench measures Postgres throughput under a synthetic workload. It tells you something useful, but only if you understand what its numbers mean.