I got paged at 2 AM once because a reporting query that normally took 800ms was suddenly taking 4 minutes. The on-call engineer had already restarted the app server (classic), which helped exactly nothing. By the time I SSH'd in, we had a queue of 300 pending requests and an angry VP sending Slack messages.
The fix took 11 minutes once I knew where to look. Getting to "where to look" took 9 of those minutes.
This is the workflow I use now. It's not glamorous. It's just what works.
Start with pg_stat_statements
If you don't have pg_stat_statements enabled, stop reading and go enable it. Everything else is guesswork without it.
# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
Restart PostgreSQL, then:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Now you have a view that accumulates query statistics over time. The key is knowing which columns to look at.
What to actually query
Most people sort by total_exec_time descending and call it a day. That's fine for a quick look, but you'll miss things. Here's the query I actually use:
SELECT
round(total_exec_time::numeric, 2) AS total_ms,
calls,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct_total,
left(query, 120) AS query_snippet
FROM pg_stat_statements
WHERE calls > 50
ORDER BY total_exec_time DESC
LIMIT 25;
The stddev_exec_time column is underrated. High standard deviation means the query is sometimes fast and sometimes slow — which usually points to lock contention or cache misses rather than a fundamentally broken query plan.
A query with mean_ms = 200 and stddev_ms = 800 is a different problem than one with mean_ms = 200 and stddev_ms = 5.
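If you want to surface those erratic queries directly, a rough variant of the query above (same columns, just ordered by how big the deviation is relative to the mean) looks like this:
SELECT
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  round((stddev_exec_time / NULLIF(mean_exec_time, 0))::numeric, 1) AS variability,
  left(query, 120) AS query_snippet
FROM pg_stat_statements
WHERE calls > 50
  AND mean_exec_time > 10  -- skip queries that are trivially fast anyway
ORDER BY variability DESC NULLS LAST
LIMIT 25;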
Also look at this:
SELECT
calls,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(total_exec_time::numeric / 1000, 2) AS total_sec,
left(query, 120) AS query
FROM pg_stat_statements
WHERE mean_exec_time > 1000 -- slower than 1 second on average
ORDER BY mean_exec_time DESC;
Anything averaging over a second is almost always worth looking at, regardless of call count.
EXPLAIN ANALYZE: Reading it Without Losing Your Mind
Once you have a query to investigate, run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT). Not just EXPLAIN. Not EXPLAIN ANALYZE without BUFFERS. All three together.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.name, COUNT(o.id) as order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name
ORDER BY order_count DESC
LIMIT 10;
What the output is actually telling you
The output is a tree. Execution happens bottom-up. The "cost" numbers are the planner's estimate — ignore them at first and focus on the actual times.
Limit  (cost=12483.21..12483.24 rows=10 width=40) (actual time=2341.5..2341.5 rows=10 loops=1)
  ->  Sort  (cost=12483.21..12508.21 rows=10000 width=40) (actual time=2341.4..2341.4 rows=10 loops=1)
        ->  HashAggregate  (actual time=2280.1..2320.8 rows=10000 loops=1)
              ->  Hash Join  (actual time=451.2..1890.3 rows=2400000 loops=1)
                    Buffers: shared hit=1240 read=28400
                    ->  Seq Scan on orders  (actual time=0.1..890.4 rows=5000000 loops=1)
                          Buffers: shared hit=420 read=27800
See that Seq Scan on orders with shared read=27800? That's 27,800 8KB pages read from disk — about 217MB. For a table with 5 million rows. That's your problem right there.
The ratio between shared hit and shared read tells you whether data is in shared_buffers (hot) or on disk (cold). A query that's fast in dev but slow in prod often has this ratio inverted because prod has more data than fits in cache.
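To see the same hot-versus-cold picture per table rather than per query, pg_statio_user_tables has the raw block counts. A quick sketch:
SELECT
  schemaname,
  relname,
  heap_blks_read,
  heap_blks_hit,
  round(heap_blks_hit * 100.0 / NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;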
The estimate vs actual mismatch
This one burns people constantly. If the planner estimated 100 rows but got 50,000, the rest of the plan was built on a lie. Look for nodes where rows=X (estimated) is wildly different from actual rows=Y.
Hash Join (cost=234.1..890.2 rows=150 width=32) (actual time=340.2..4521.8 rows=87432 loops=1)
Estimated 150 rows, got 87,432. That hash join is probably the wrong strategy here — the planner would have chosen a different join type if it knew the real cardinality. This is a stats problem, fixed with ANALYZE.
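The immediate fix is cheap: run ANALYZE on the affected table (orders here, continuing the example above). If the estimates on one skewed column keep going bad, raising its statistics target is worth a try; the target of 500 below is just an illustrative starting point.
ANALYZE orders;
-- Collect more detailed statistics for a column the planner keeps misjudging
ALTER TABLE orders ALTER COLUMN user_id SET STATISTICS 500;
ANALYZE orders;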
Common Problems and Actual Fixes
Missing indexes (the obvious one)
-- Find tables with sequential scans that have lots of rows
SELECT
schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan,
n_live_tup
FROM pg_stat_user_tables
WHERE seq_scan > 0
AND n_live_tup > 10000
ORDER BY seq_tup_read DESC
LIMIT 20;
One team I worked with had a notifications table with 12 million rows and zero indexes except the primary key. Every status update check was a full table scan. Adding a partial index on user_id WHERE status = 'unread' dropped query time from 8 seconds to 2ms. That's not a typo.
CREATE INDEX CONCURRENTLY idx_notifications_user_unread
ON notifications (user_id)
WHERE status = 'unread';
Always use CONCURRENTLY on production tables. Without it, CREATE INDEX takes a lock that blocks all writes to the table for the entire build.
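One caveat: if a CONCURRENTLY build fails partway (deadlock, cancelled statement, crash), it leaves an invalid index behind that you have to drop and recreate. A quick way to check for those:
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;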
Table bloat killing your queries
MVCC means dead tuples stick around until VACUUM cleans them. On a high-churn table, you can end up with 90% dead tuples — meaning your "sequential scan" is actually reading mostly garbage.
SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup,
round(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_pct DESC;
If you're seeing dead_pct above 20% on frequently-queried tables, your autovacuum config needs tuning. I cover that in detail in the VACUUM guide.
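As a stopgap, per-table storage parameters let the worst offenders get vacuumed more aggressively without touching the global config. The table and thresholds below are illustrative, not a recommendation:
ALTER TABLE notifications SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold = 1000
);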
Bad joins — the N+1 in disguise
Sometimes the slow query isn't slow in isolation. It's an ORM generating 5,000 individual queries instead of one query with a JOIN. You can spot this in pg_stat_statements by looking for queries with very high call counts and low mean_exec_time.
SELECT calls, mean_exec_time, query
FROM pg_stat_statements
WHERE query LIKE '%WHERE id = %'
AND calls > 10000
ORDER BY calls DESC;
If you see SELECT * FROM users WHERE id = $1 called 50,000 times in an hour, you have an N+1 problem in your application code.
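The real fix is in the application code (eager loading, a JOIN, or batching the lookups), but the shape you want PostgreSQL to see is one round trip instead of fifty thousand. Something like:
-- One query for the whole batch of ids instead of one query per id
SELECT * FROM users WHERE id = ANY(ARRAY[101, 102, 103]);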
Lock contention — the intermittent killer
This is why stddev_exec_time matters. A query that sometimes takes 2ms and sometimes takes 8 seconds is probably waiting on a lock. Check:
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state,
wait_event_type,
wait_event
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
AND state != 'idle'
ORDER BY duration DESC;
If you see wait_event_type = 'Lock', something is blocking your query. Check pg_locks to find the holder.
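pg_blocking_pids() does the pg_locks legwork for you; this pairs each stuck backend with whoever is blocking it:
SELECT
  blocked.pid AS blocked_pid,
  left(blocked.query, 80) AS blocked_query,
  blocking.pid AS blocking_pid,
  left(blocking.query, 80) AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;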
Monitoring Ongoing Query Performance
One-time investigation is fine. But you want a continuous picture. MonPG tracks query performance over time so you can see when a query degraded — not just that it's slow now.
The manual approach is to periodically snapshot pg_stat_statements and compare. Here's a simple version:
-- Reset stats (do this rarely, maybe weekly)
SELECT pg_stat_statements_reset();
-- Or just track the delta manually:
-- Save pg_stat_statements to a table every hour, diff against previous snapshot
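A minimal sketch of that snapshot approach (the qss_snapshots table name is made up, and you'd schedule the INSERT with cron or pg_cron):
-- One-time setup: an empty table with the columns we care about
CREATE TABLE IF NOT EXISTS qss_snapshots AS
SELECT now() AS captured_at, queryid, calls, total_exec_time, left(query, 200) AS query
FROM pg_stat_statements
WITH NO DATA;
-- Hourly snapshot
INSERT INTO qss_snapshots
SELECT now(), queryid, calls, total_exec_time, left(query, 200)
FROM pg_stat_statements;
-- Which queries accumulated the most execution time between the last two snapshots?
WITH latest AS (SELECT max(captured_at) AS t FROM qss_snapshots),
     previous AS (SELECT max(captured_at) AS t FROM qss_snapshots
                  WHERE captured_at < (SELECT t FROM latest))
SELECT
  curr.queryid,
  curr.calls - prev.calls AS calls_delta,
  round((curr.total_exec_time - prev.total_exec_time)::numeric, 2) AS ms_delta,
  curr.query
FROM qss_snapshots curr
JOIN qss_snapshots prev
  ON prev.queryid = curr.queryid
 AND prev.captured_at = (SELECT t FROM previous)
WHERE curr.captured_at = (SELECT t FROM latest)
ORDER BY ms_delta DESC
LIMIT 25;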
The 2 AM Fix Revisited
That 2 AM incident? The reporting query was doing a sequential scan on a 400GB table because someone had added a new filter WHERE report_type = 'weekly' — but report_type wasn't in the index.
The filter itself was reasonable: report_type = 'weekly' matched only 0.01% of rows, so with an index on that column the query would have been nearly instant. Without one, PostgreSQL had no way to exploit that selectivity and read the entire 400GB table.
EXPLAIN ANALYZE showed a sequential scan. pg_stat_statements showed the query had degraded exactly 3 days earlier — when the new filter was deployed.
CREATE INDEX CONCURRENTLY idx_reports_type_created
ON reports (report_type, created_at DESC);
Query went from 4 minutes to 12ms. We went back to sleep.
The lesson isn't "add indexes." It's "have the data to know what changed and when." That's what separates a 9-minute diagnosis from a 9-hour one.