The worst incident meetings start with a guess. Someone says the app is slow. Someone else says the database is fine because CPU is not high. Ten minutes disappear before anyone asks the database what it is actually doing.
A production health check should be small enough to run under pressure. It is not a dashboard with 80 panels. It is the first pass that tells you where to look next.
The framework: find the pressure class
I try to classify the problem quickly:
- Connection pressure: too many sessions or pooled clients stuck.
- Lock pressure: requests waiting behind one transaction.
- Query pressure: a few plans consuming most time.
- Maintenance pressure: vacuum, bloat, or wraparound risk.
- Replication or WAL pressure: writes cannot drain fast enough.
Once you know the class, the next step becomes much cheaper.
Start with active sessions
SELECT
state,
wait_event_type,
wait_event,
count(*) AS sessions
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY 1, 2, 3
ORDER BY sessions DESC;
This tells you whether the database is running queries, waiting on locks, waiting on IO, or mostly holding idle connections. It also catches a common false alarm: a database with low CPU but many sessions waiting on one lock can still be down from the user's point of view.
Find the oldest transaction
SELECT
pid,
usename,
application_name,
state,
now() - xact_start AS xact_age,
now() - query_start AS query_age,
wait_event_type,
wait_event,
left(query, 200) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
Old transactions create strange secondary symptoms. They can hold locks, delay vacuum cleanup, preserve dead tuples, and make an otherwise normal system feel sticky.
Check lock blockers
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocker.pid AS blocker_pid,
blocker.query AS blocker_query,
now() - blocker.xact_start AS blocker_xact_age
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks
ON blocked_locks.pid = blocked.pid
JOIN pg_locks blocker_locks
ON blocker_locks.locktype = blocked_locks.locktype
AND blocker_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocker_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocker_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocker_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocker_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocker_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocker_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocker_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocker_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocker
ON blocker.pid = blocker_locks.pid
WHERE NOT blocked_locks.granted
AND blocker_locks.granted;
If this returns rows during an incident, do not start random tuning. Identify the blocker, the transaction age, and why it is open.
Look at query concentration
If pg_stat_statements is enabled, it is usually the fastest path to query pressure.
SELECT
calls,
round(total_exec_time::numeric, 1) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct_time,
left(query, 180) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
The goal is not to shame the slowest query. The goal is to know whether one query family owns the incident.
Check vacuum and dead tuples
SELECT
relname,
n_live_tup,
n_dead_tup,
last_autovacuum,
autovacuum_count,
vacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
High dead tuple counts do not automatically mean disaster, but they explain why reads got slower after a write-heavy period. Pair this with old transaction checks before blaming autovacuum settings.
Check replication and WAL only if writes are involved
SELECT
application_name,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_bytes_lag
FROM pg_stat_replication;
SELECT
slot_name,
slot_type,
active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
ORDER BY retained_bytes DESC NULLS LAST;
Replication slots are a common reason WAL keeps growing after traffic calms down. If a slot is inactive and retaining a lot of WAL, disk is the next incident.
What this check does not prove
A health check is a triage tool. It does not prove the database is healthy forever. It only tells you which layer deserves the next 15 minutes. If every query is waiting on the app, stop tuning Postgres. If one lock is blocking 80 sessions, stop looking at CPU.
The pragmatic default
Run the same short sequence every time. Sessions, old transactions, blockers, query concentration, vacuum, replication, WAL. Consistency matters because incidents punish creativity. The first pass should be muscle memory.