The Postgres Health Check I Run Before Blaming the App

Before the application team and the database team start arguing, run this 90-second sweep. Half the time it ends the discussion.

Application metrics look bad. The on-call engineer is starting to type "the database is slow." The database engineer is opening a ticket that says "please give me a query to look at." Neither side has data yet.

This is the moment for a fast health check. Six categories, six queries. Most of the time, it tells me whether the problem is in the database or somewhere else. Either way, the conversation moves forward.

1. Connection pressure

SELECT
  count(*) AS conns,
  count(*) FILTER (WHERE state = 'active') AS active,
  count(*) FILTER (WHERE state = 'idle') AS idle,
  count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_tx,
  count(*) FILTER (WHERE state = 'idle in transaction (aborted)') AS idle_aborted
FROM pg_stat_activity
WHERE backend_type = 'client backend';

What I am looking for:

  • active climbing — queries running. Some load is normal. A spike out of proportion to traffic is the application doing more work or the queries getting slower.
  • idle in transaction — connections holding open transactions and not doing anything. This is almost always a bug. It blocks vacuum, holds locks, and pollutes the connection pool.
  • Total connections near max_connections — pool exhaustion. The application is going to start failing soon; a quick headroom check is sketched below.
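
To put a number on that last bullet, compare the count to the configured limit. A minimal sketch; note that max_connections also covers slots reserved for superusers, so treat the percentage as approximate:

SELECT
  count(*) AS conns,
  current_setting('max_connections')::int AS max_conns,
  round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity
WHERE backend_type = 'client backend';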

2. Long-running queries

SELECT
  pid,
  now() - query_start AS query_duration,
  now() - xact_start AS xact_duration,
  state,
  query
FROM pg_stat_activity
WHERE state != 'idle'
  AND xact_start < now() - interval '30 seconds'
ORDER BY xact_start;

Anything holding a transaction open for more than 30 seconds, with the current statement's age alongside the transaction's age. If these are not analytical queries that are expected to run long, this is the slowness candidate. Note the pid — you may need to kill it.

-- If you are sure:
SELECT pg_terminate_backend(pid_here);

Terminate is preferable to cancel for stuck transactions: pg_cancel_backend only stops the current statement, not the surrounding transaction, so any locks it holds stay held until the transaction ends.
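
For the idle-in-transaction case specifically, you can sweep every offender in one statement. A hedged sketch; the 10-minute cutoff is illustrative, not a recommendation:

-- Terminate every session that has sat idle in an open transaction too long.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '10 minutes';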

3. Lock waits

SELECT
  blocked.pid AS blocked_pid,
  blocked.usename AS blocked_user,
  blocking.pid AS blocking_pid,
  blocking.usename AS blocking_user,
  blocked.query AS blocked_query,
  blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE NOT blocked.pid = blocking.pid;

If this returns rows, the database has lock contention. Common causes: a migration running in the background, a SELECT FOR UPDATE that the application forgot to commit, a manual ALTER TABLE someone is running interactively. The blocking pid is the culprit; the blocked pid is the symptom.
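
To see exactly which lock the blocked backend is waiting on before deciding what to kill, pg_locks has the detail. A minimal sketch, with 12345 standing in for a blocked pid from the query above; the granted = false row is the lock it is waiting for:

SELECT locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE pid = 12345
ORDER BY granted;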

4. Slow queries from pg_stat_statements

SELECT
  round(total_exec_time::numeric, 0) AS total_ms,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  rows,
  query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

The top 20 queries by cumulative time. The fix order is usually:

  • High total_ms and high calls — query is called often and adds up. Indexes, query rewrites, or caching.
  • High total_ms and low calls — one bad query. Possibly missing index, possibly bad plan, possibly a query that should not exist.
  • High mean_ms — slow per-call. Look for missing indexes or plan regressions.
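
If the view itself is missing, the extension is not enabled. A minimal setup sketch (the column names above are the PostgreSQL 13+ ones; versions 12 and older call them total_time and mean_time):

-- In postgresql.conf (takes effect after a restart):
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Optional: reset counters to measure a fresh window.
SELECT pg_stat_statements_reset();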

5. Replication health

SELECT
  application_name,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_behind_bytes,
  replay_lag
FROM pg_stat_replication;

If any replica's replay_lag is more than a few seconds in normal operation, something is wrong. Likely a heavy write burst on the primary saturating WAL transfer, or a slow replica.
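
This view only exists on the primary. On the replica itself, a quick staleness check; note that on a primary with no write traffic this interval grows even when replication is healthy:

SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;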

6. Autovacuum and bloat

SELECT
  schemaname,
  relname,
  n_live_tup,
  n_dead_tup,
  round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
  last_autovacuum,
  last_analyze
FROM pg_stat_user_tables
WHERE n_live_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 20;

Look for:

  • dead_pct over 30% on tables you care about — vacuum is not keeping up.
  • last_autovacuum more than a few days ago on busy tables — autovacuum may be tuned too conservatively; see the tuning sketch after this list.
  • last_analyze very stale — query plans may be working with bad statistics.
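
When one hot table keeps falling behind, per-table storage parameters are the usual lever. A hedged sketch; orders is a hypothetical table name and the thresholds are illustrative starting points, not recommendations:

ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.02,  -- trigger vacuum at ~2% dead rows
  autovacuum_vacuum_threshold = 1000      -- plus a fixed base of 1000 dead rows
);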

What I do with the results

After running these six in 90 seconds, I usually have one of three answers:

  1. The database is fine. Connections normal, no slow queries, no lock waits, replication green, autovacuum healthy. The problem is in the application or the network. The conversation moves out of database land.

  2. The database has a clear problem. A single long-running query, a stuck transaction, a runaway connection count, a vacuum-starved table. We have a target.

  3. The database is sick in a vague way. Everything is slightly elevated. This is the actual hard case — usually it means either the underlying hardware is constrained, or the workload has shifted faster than the database can absorb. The next step is sustained monitoring and a tuning plan, not a quick fix.

The value of the sweep is that it converts the vague "database is slow" into one of these three answers. From there, the path forward is concrete.