Application metrics look bad. The on-call engineer is starting to type "the database is slow." The database engineer is opening a ticket that says "please give me a query to look at." Neither side has data yet.
This is the moment for a fast health check. Six categories, six queries. Most of the time, it tells me whether the problem is in the database or somewhere else. Either way, the conversation moves forward.
1. Connection pressure
SELECT
state,
count(*) AS conns,
count(*) FILTER (WHERE state = 'active') AS active,
count(*) FILTER (WHERE state = 'idle') AS idle,
count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_tx,
count(*) FILTER (WHERE state = 'idle in transaction (aborted)') AS idle_aborted
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY conns DESC;
What I am looking for:
- `active` climbing — queries running. Some load is normal. A spike out of proportion to traffic means the application is doing more work or the queries are getting slower.
- `idle in transaction` — connections holding open transactions and doing nothing. This is almost always a bug. It blocks vacuum, holds locks, and pollutes the connection pool.
- Total connections near `max_connections` — pool exhaustion. The application is going to start failing soon.
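To put a number on that last point, a quick sketch of headroom against the configured ceiling. This counts every row in `pg_stat_activity`, including background processes, so the headroom figure is slightly pessimistic; it also ignores the slots held back by `superuser_reserved_connections`.

```sql
-- Current connections vs. the configured ceiling.
SELECT
  count(*) AS current_conns,
  current_setting('max_connections')::int AS max_conns,
  current_setting('max_connections')::int - count(*) AS headroom
FROM pg_stat_activity;
```

If headroom is in the single digits on a busy system, the pool is about to start throwing connection errors.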
2. Long-running queries
SELECT
pid,
now() - xact_start AS duration,
state,
query
FROM pg_stat_activity
WHERE state != 'idle'
AND xact_start < now() - interval '30 seconds'
ORDER BY xact_start;
Queries running for more than 30 seconds. If they are not analytical queries (the kind you expect to run long), this is the slowness candidate. Note the pid — you may need to kill it.
-- If you are sure:
SELECT pg_terminate_backend(pid_here);
Terminate is preferable to cancel for stuck transactions: `pg_cancel_backend` only stops the current statement, while `pg_terminate_backend` ends the backend and rolls back the surrounding transaction.
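The gentler counterpart, for the case where a single statement is the problem but the session itself is healthy (`pid_here` is a placeholder, as above):

```sql
-- Cancel only the current statement; the session and its
-- transaction stay alive.
SELECT pg_cancel_backend(pid_here);
```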
3. Lock waits
SELECT
blocked.pid AS blocked_pid,
blocked.usename AS blocked_user,
blocking.pid AS blocking_pid,
blocking.usename AS blocking_user,
blocked.query AS blocked_query,
blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE NOT blocked.pid = blocking.pid;
If this returns rows, the database has lock contention. Common causes: a migration running in the background, a SELECT FOR UPDATE that the application forgot to commit, a manual ALTER TABLE someone is running interactively. The blocking pid is the culprit; the blocked pid is the symptom.
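When the culprit is an interactive DDL statement, the usual defensive habit is a `lock_timeout`, so the DDL gives up instead of queueing behind a long transaction (and making every later query queue behind it). A sketch — the table and column names are hypothetical:

```sql
-- Bail out after 5 seconds instead of waiting in the lock queue.
SET lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN note text;  -- hypothetical DDL
```

If the `ALTER TABLE` times out, retry later; nobody else gets stuck behind it in the meantime.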
4. Slow queries from pg_stat_statements
SELECT
round(total_exec_time::numeric, 0) AS total_ms,
calls,
round(mean_exec_time::numeric, 2) AS mean_ms,
rows,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
The top 20 queries by cumulative time. The fix order is usually:
- High `total_ms` and high `calls` — the query is called often and adds up. Indexes, query rewrites, or caching.
- High `total_ms` and low `calls` — one bad query. Possibly a missing index, possibly a bad plan, possibly a query that should not exist.
- High `mean_ms` — slow per call. Look for missing indexes or plan regressions.
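Two caveats worth knowing. The `total_exec_time` / `mean_exec_time` columns are the Postgres 13+ names; before 13 they were `total_time` / `mean_time`. And the counters are cumulative since the last reset, so during an incident an old hotspot can drown out the current one. If you have permission to lose the history, resetting gives you a clean measurement window:

```sql
-- Requires the pg_stat_statements extension
-- (loaded via shared_preload_libraries).
SELECT pg_stat_statements_reset();
```

A few minutes of traffic after the reset, and the top-20 query above shows only what is happening right now.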
5. Replication health
SELECT
application_name,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_behind
FROM pg_stat_replication;
If any replica is meaningfully behind in normal operation, something is wrong. Note that `replay_behind` is measured in bytes of WAL, not seconds, so interpret it against your write volume. Likely causes: a heavy job on the primary saturating WAL transfer, or a slow replica.
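On Postgres 10 and later, `pg_stat_replication` also exposes lag as time rather than bytes, which is easier to reason about during an incident:

```sql
-- Lag expressed as intervals (Postgres 10+).
SELECT
  application_name,
  state,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;
```

`replay_lag` is the number that matters for read-your-writes staleness on the replica.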
6. Autovacuum and bloat
SELECT
schemaname,
relname,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
last_autovacuum,
last_analyze
FROM pg_stat_user_tables
WHERE n_live_tup > 1000
ORDER BY n_dead_tup DESC
LIMIT 20;
Look for:
- `dead_pct` over 30% on tables you care about — vacuum is not keeping up.
- `last_autovacuum` more than a few days ago on busy tables — autovacuum may be tuned too conservatively.
- `last_analyze` very stale — query plans may be working with bad statistics.
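If one busy table keeps falling behind, per-table storage parameters are usually a better first move than changing the global autovacuum settings. A sketch — the table name is hypothetical, and 5% is a starting point, not a universal answer:

```sql
-- Trigger autovacuum at ~5% dead tuples for this table,
-- instead of the default 20% scale factor.
ALTER TABLE events SET (autovacuum_vacuum_scale_factor = 0.05);
```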
What I do with the results
After running these six in 90 seconds, I usually have one of three answers:
The database is fine. Connections normal, no slow queries, no lock waits, replication green, autovacuum healthy. The problem is in the application or the network. The conversation moves out of database land.
The database has a clear problem. A single long-running query, a stuck transaction, a runaway connection count, a vacuum-starved table. We have a target.
The database is sick in a vague way. Everything is slightly elevated. This is the actual hard case — usually it means either the underlying hardware is constrained, or the workload has shifted faster than the database can absorb. The next step is sustained monitoring and a tuning plan, not a quick fix.
The value of the sweep is that it converts the vague "database is slow" into one of these three answers. From there, the path forward is concrete.