pg_stat_statements: Turning Slow Query Noise Into a Tuning Queue

pg_stat_statements is where PostgreSQL performance work becomes concrete. The trick is ranking by impact, variance, I/O, and calls instead of staring at one dramatic slow query.

The first slow-query meeting usually starts with one scary log line. Someone found a query that took 18 seconds, pasted it into chat, and now everyone is arguing about an index. Sometimes that query matters. Often it is just the loudest thing that happened once.

I would rather start with pg_stat_statements. It turns the conversation from anecdotes into a queue. Which query shape consumes the most total time? Which one runs constantly? Which one has huge variance? Which one spills to temp files or reads too many shared blocks?

That shift matters because performance tuning is an allocation problem. You have a limited number of risky changes you can make this week. The first job is choosing the query that deserves one.
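If pg_stat_statements is not enabled yet, the setup is standard PostgreSQL: preload the module, restart, and create the extension in the database you want to inspect. Nothing here is workload-specific:

-- postgresql.conf: preload the module (server restart required)
shared_preload_libraries = 'pg_stat_statements'

-- then, once per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;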

Start with cumulative pain, not the slowest single call

A query that takes 400 ms and runs 2 million times a day may hurt users more than a query that took 20 seconds once. Sort by total execution time first. It gives you the workload's tax bill.

-- Top 20 query shapes by cumulative time (PG 13+ *_exec_time column names)
SELECT
  queryid,
  calls,
  round(total_exec_time::numeric, 2) AS total_ms,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  rows,
  left(query, 180) AS query_sample
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Variance tells you where averages are lying

Mean latency can hide a query with two personalities. One tenant is fast, another is slow. One parameter value uses an index, another causes a broad scan. A high standard deviation is a prompt to test multiple parameter sets before changing anything.

-- Coefficient of variation: high values mean the average hides two behaviors
SELECT
  queryid,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(stddev_exec_time::numeric, 2) AS stddev_ms,
  round((stddev_exec_time / NULLIF(mean_exec_time, 0))::numeric, 2) AS variation,
  left(query, 160) AS query_sample
FROM pg_stat_statements
WHERE calls > 100
ORDER BY variation DESC NULLS LAST
LIMIT 20;
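When a query lands on this list, I run the same statement with a fast and a slow parameter before touching anything. A minimal sketch; the orders table and the two tenant ids are hypothetical stand-ins for whatever the high-variance query actually touches:

-- Hypothetical: compare plans and buffer traffic across parameter values
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE tenant_id = 42;  -- the fast tenant

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE tenant_id = 7;   -- the slow tenant

If the two plans differ, the fix is usually about statistics or indexing for the slow class, not the query text.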

I/O counters separate CPU work from storage work

The same slow query can be CPU-bound, cache-miss-bound, temp-file-bound, or simply waiting on storage. The block counters are not a full profiler, but they quickly tell you whether the query is dragging data through shared buffers.

-- Cache misses (shared_blks_read) versus hits, temp spill, and WAL volume
SELECT
  queryid,
  calls,
  shared_blks_hit,
  shared_blks_read,
  temp_blks_written,
  wal_bytes,
  left(query, 160) AS query_sample
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 20;
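A spill-focused variant of the same ranking catches sorts and hashes that overflow work_mem and hit disk:

-- Same view, ranked by temp spill instead of shared reads
SELECT
  queryid,
  calls,
  temp_blks_written,
  left(query, 160) AS query_sample
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 20;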

Reset deliberately

I like taking snapshots before and after a change instead of resetting casually during an incident. If you reset every time someone asks a question, you lose the only baseline you had.

For a tuning window, reset once, let the workload run long enough to be representative, then compare the same queryid after the change. If the queryid changes because the SQL changed, keep the old and new query text in the ticket.

-- Clears all counters; do this once at the start of a tuning window
SELECT pg_stat_statements_reset();
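To keep a baseline across that workflow, copy the counters into a scratch table at the end of the first window, before you make the change. A minimal sketch; pgss_baseline is a hypothetical name:

-- Snapshot the baseline window (hypothetical table name)
CREATE TABLE pgss_baseline AS
SELECT
  queryid,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  shared_blks_read,
  temp_blks_written
FROM pg_stat_statements;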

The queue I actually use

This queue keeps the team honest. The next tuning target should have a reason beyond being annoying in the logs.

  1. Top queries by total execution time.
  2. High-call queries with non-trivial mean time (see the sketch after this list).
  3. High-variance queries by tenant or parameter class.
  4. Queries with temp block writes.
  5. Queries with high shared block reads.
  6. Queries that got worse after the last deploy.
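For item 2, the sketch is a frequency ranking with a latency floor so trivial statements do not flood the list. The 5 ms threshold is a placeholder, not a rule:

-- High-call queries above a placeholder mean-time floor
SELECT
  queryid,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(total_exec_time::numeric, 2) AS total_ms,
  left(query, 160) AS query_sample
FROM pg_stat_statements
WHERE mean_exec_time > 5
ORDER BY calls DESC
LIMIT 20;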

The production habit

The teams that get good at PostgreSQL performance do not chase every knob. They turn a vague complaint into a named failure mode, collect one clean measurement, make one change, and then compare the next measurement against the first. That rhythm is slower than guessing for the first hour and much faster by the end of the incident.
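The comparison step is mechanical once the baseline exists. Assuming the hypothetical pgss_baseline table from earlier, a per-queryid delta looks like this; note that queryid can repeat across users and databases, so filter to one if the join fans out:

-- Which query shapes got slower since the baseline window
SELECT
  cur.queryid,
  base.mean_ms AS before_mean_ms,
  round(cur.mean_exec_time::numeric, 2) AS after_mean_ms,
  round(cur.mean_exec_time::numeric, 2) - base.mean_ms AS delta_ms
FROM pg_stat_statements AS cur
JOIN pgss_baseline AS base USING (queryid)
ORDER BY delta_ms DESC
LIMIT 20;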