A Postgres instance pegged on CPU is one of the more frustrating performance issues to diagnose because the symptom is generic and the causes are specific. "The database is slow" is what users see; "CPU at 95%" is what monitoring shows. The question is which queries are doing it.
Here is the diagnostic sequence I run.
Step 1: Confirm CPU is the bottleneck
The cloud provider's CPU metric or top on the host should show CPU near 100%. If CPU is normal but queries are still slow, the issue is elsewhere (locks, I/O, memory).
Within Postgres, look at pg_stat_activity:
SELECT state, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state;
Many active queries mean the workload is genuinely heavy. Few active queries but high CPU means a small number of queries are very expensive.
Step 2: Find the CPU-heavy queries
The canonical query in pg_stat_statements:
SELECT
round(total_exec_time::numeric, 0) AS total_ms,
calls,
round(mean_exec_time::numeric, 2) AS mean_ms,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
The top queries by total_exec_time are where most CPU goes. Two patterns:
- High total_ms and very high calls: a fast query called millions of times. The aggregate is the issue.
- High total_ms and modest calls: each execution is expensive. EXPLAIN one.
For case 1, the fix is usually to reduce calls — add caching, batch the requests, eliminate N+1.
For case 2, the fix is the query itself.
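For the N+1 case, the SQL-side shape of the fix is a single batched lookup instead of one query per id. A minimal sketch, assuming a hypothetical users table (in practice the application binds the id list as one array parameter):
-- Before: executed once per id from an application loop
-- SELECT id, name FROM users WHERE id = $1;
-- After: one lookup for the whole batch of ids
SELECT id, name
FROM users
WHERE id = ANY(ARRAY[101, 102, 103]);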
Step 3: EXPLAIN the worst offender
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
<query from step 2>;
Look for:
- Sequential scans on large tables: missing index. CPU goes to scanning rows.
- Hash join with Batches > 1: the hash table spilled to disk. CPU and I/O go to multiple passes over the data.
- Nested loop with a very high inner loop count: planner misestimate or missing index. CPU goes to repeated lookups.
- Sort that exceeds work_mem: external sort. CPU plus disk I/O.
Most "high CPU" queries fall into one of these patterns.
Step 4: Check planning time vs execution time
A non-obvious cause of high CPU: the planner itself.
EXPLAIN ANALYZE <query>;
-- Output ends with:
-- Planning Time: 0.500 ms
-- Execution Time: 50.000 ms
For most queries, planning is a fraction of execution. If planning time is comparable to or larger than execution time, the issue is the planner.
Causes:
- Many partitions, no partition pruning: the planner has to consider every partition. For 1000+ partitions with cheap queries, planning becomes the dominant cost. Check that the query has predicates that allow pruning.
- Many indexes on the table: the planner considers each one. Drop unused indexes (see the query after this list).
- Highly complex query with many JOINs: planning is intrinsically expensive.
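A quick way to spot candidates for the unused-index case, using the standard statistics view:
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;
Indexes with idx_scan at or near zero are rarely or never chosen by the planner (check replicas too before dropping anything).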
For cases 1 and 2, the fix is structural. For case 3, prepared statements help (they skip the parse-and-plan step on repeated execution).
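A minimal sketch of the prepared-statement fix, assuming a hypothetical lookup by id; after a few executions Postgres can switch to a cached generic plan and skip per-call planning:
-- Parse once; the statement is cached for the session
PREPARE get_order (bigint) AS
  SELECT * FROM orders WHERE id = $1;
-- Subsequent executions reuse the parsed statement
EXECUTE get_order(42);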
Step 5: Connection storms
A sudden spike in active connections — application restart, network blip — produces a CPU spike from connection setup overhead.
-- bucket connection age to the minute so the counts are meaningful
SELECT date_trunc('minute', now() - backend_start) AS conn_age, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY 1
ORDER BY 1
LIMIT 20;
Many very young connections mean a storm. Connection establishment is CPU-intensive in Postgres (especially with SSL).
Fix: connection pooler (PgBouncer). Reduces connection churn.
Step 6: Look at the wait events
If the pg_wait_sampling extension is installed, it gives you per-query wait event breakdowns:
SELECT
  s.query,
  sum(p.count) AS samples,
  array_agg(DISTINCT p.event) AS wait_events
FROM pg_wait_sampling_profile p
JOIN pg_stat_statements s ON s.queryid = p.queryid
GROUP BY s.query
ORDER BY samples DESC
LIMIT 10;
Samples with no wait event mean the backend was on CPU, actually executing. Samples dominated by LWLock events mean the query is stuck on internal locks. Samples on IO events mean it is disk-bound.
This tells you whether the CPU usage is doing work or fighting locks.
Step 7: Common patterns
The specific patterns I see most:
1. Missing index leading to seq scan. Check pg_stat_user_tables for tables with very high seq_tup_read (see the query after this list).
2. Same query called millions of times. Application N+1 bug. Look at the highest-calls queries and trace back to application code.
3. Heavy report running during peak. A weekly cron doing aggregation pushed the database to peak CPU. Move it off-peak or to a replica.
4. Bloat causing extra rows to scan. Vacuum-starved tables read dead tuples. CPU goes to filtering them out.
5. JSONB heavy queries. Path expressions on JSONB are CPU-expensive. Worth optimizing if they are dominant.
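The query referenced in pattern 1, against the standard statistics view:
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
Tables with large seq_tup_read and little idx_scan activity are the first candidates for a missing index.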
What I check first
For any "Postgres CPU is high" investigation:
- Top queries by total_exec_time. Pick the worst.
- EXPLAIN the worst. Find the seq scan or expensive node.
- Check connection count for storm signs.
- Check vacuum status for bloat-induced extra work (see the check after this list).
- Plan a fix (index, query rewrite, application change).
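The vacuum check referenced above, again from pg_stat_user_tables:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
A large n_dead_tup relative to n_live_tup, with an old or null last_autovacuum, points at bloat-driven extra work.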
Most cases resolve at step 2 with an index addition.
A worked example
Real incident: CPU at 95% sustained for 3 hours. Application timeouts climbing.
SELECT round(total_exec_time::numeric), calls, query
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 5;
Top query: 30 million calls in 3 hours at a 50ms mean. That aggregate alone was far more work than a 4-CPU instance could absorb, hence the sustained 95% utilization.
The query was a simple lookup that should have been 2ms but was 50ms because it was doing a seq scan (the index had been dropped during a migration).
Fix: CREATE INDEX CONCURRENTLY .... Within minutes, query time dropped to 2ms, CPU dropped to 30%.
Total time from page to fix: 18 minutes.
The lesson: high CPU is rarely "the server is undersized." It is almost always a specific query doing more work than it should.