
PostgreSQL Performance Triage Runbook

A first-15-minutes PostgreSQL performance runbook using pg_stat_activity, pg_stat_statements, waits, locks, temp files, autovacuum, and recent-change checks.

Use this when PostgreSQL is slow and the cause is not obvious

The goal of the first 15 minutes is to classify the incident before changing settings. Is the database executing expensive SQL, waiting on locks, waiting on I/O, blocked by old transactions, spilling to temp files, or simply overloaded with connections?

Take a snapshot first. Do not reset stats, restart services, or cancel random sessions until you know what evidence you are about to destroy.
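
One cheap way to honor that rule is to copy the activity view into an ordinary table before intervening; incident_activity_snapshot is an illustrative name, not a convention:

-- Point-in-time copy of session state, kept safe from later resets.
CREATE TABLE incident_activity_snapshot AS
SELECT now() AS captured_at, *
FROM pg_stat_activity;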

Step 1: Count active work and waits

-- Group sessions by state and wait event to see where time is going.
SELECT
  state,
  wait_event_type,
  wait_event,
  count(*) AS sessions
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY 1, 2, 3
ORDER BY sessions DESC;

If most sessions are waiting on locks, find the blockers before touching anything else. If most are waiting on the client, check whether the application is actually consuming results and whether the network is healthy. If most are active with no wait event, they are likely burning CPU; move to query workload analysis.
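
If lock waits dominate, pg_blocking_pids() (available since PostgreSQL 9.6) pairs each waiting session with whoever holds it up; a minimal sketch:

-- Pair each lock-waiting session with the session(s) blocking it.
SELECT
  blocked.pid AS blocked_pid,
  left(blocked.query, 120) AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.state AS blocking_state,
  left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));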

Step 2: Find old transactions

-- Oldest open transactions first.
SELECT
  pid,
  state,
  now() - xact_start AS xact_age,
  now() - query_start AS query_age,
  wait_event_type,
  wait_event,
  left(query, 180) AS query_sample
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 20;

Old transactions can hold locks and prevent vacuum from cleaning up dead rows. A session that is idle in transaction may look harmless while it keeps dead rows alive for every other session on the instance.
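
A narrower follow-up lists only the idle-in-transaction sessions, which are the usual offenders:

-- Open transactions doing no work; they still hold locks and pin the vacuum horizon.
SELECT
  pid,
  usename,
  now() - xact_start AS xact_age,
  left(query, 120) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;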

Step 3: Rank query pain by impact

-- Requires the pg_stat_statements extension; heaviest total runtime first.
SELECT
  queryid,
  calls,
  round(total_exec_time::numeric, 2) AS total_ms,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  shared_blks_read,
  temp_blks_written,
  left(query, 180) AS query_sample
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Sort by total time first: a cheap query that runs constantly can cost more overall than a single dramatic slow query. High temp_blks_written flags queries that are spilling sorts and hashes to disk.
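
When the complaint is disk or I/O pressure rather than CPU, re-rank the same view by temp spill; this is the same pg_stat_statements data, just reshaped:

-- Largest temp-file spillers first.
SELECT
  queryid,
  calls,
  temp_blks_written,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  left(query, 180) AS query_sample
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 20;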

Step 4: Check maintenance pressure

-- Tables with the most dead rows, plus their last vacuum and analyze times.
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  last_autovacuum,
  last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

High dead tuples, stale analyze timestamps, and old transactions often explain plan regressions and table scans that appeared suddenly.
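
If a hot table shows dead-tuple buildup and a stale last_autoanalyze, a targeted ANALYZE is cheap and does not block normal reads or writes; the table name below is a placeholder:

-- 'orders' is a placeholder for the table with stale statistics.
ANALYZE VERBOSE orders;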

Choose one intervention

  • Cancel a verified blocker if lock waits are the failure mode.
  • Pause a batch job if write or I/O pressure is causing the incident.
  • Run ANALYZE if stale statistics explain bad plans.
  • Add or test an index only after EXPLAIN confirms the missing access path.
  • Raise memory only for the session or role that needs it, not globally during panic. A sketch of both follows this list.
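
A minimal sketch of the last two interventions, assuming a hypothetical slow query; orders, customer_id, and reporting_user are placeholders, not names from this runbook:

-- Confirm the missing access path before creating an index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

-- Raise work_mem for one role rather than the whole cluster...
ALTER ROLE reporting_user SET work_mem = '256MB';

-- ...or for the current session only.
SET work_mem = '256MB';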