
EXPLAIN ANALYZE BUFFERS: Reading PostgreSQL Plans Like a Production Trace

EXPLAIN ANALYZE is not just a plan tree. With BUFFERS and real row counts, it becomes a production trace for estimates, I/O, loops, spills, and wasted work.

A PostgreSQL plan can look intimidating because it is written like a tree and people read it like a paragraph. That is the first mistake. A plan is not prose. It is a trace of where the executor spent work.

When a query is slow, I want EXPLAIN (ANALYZE, BUFFERS) as soon as I can run it safely. The estimated plan tells me what PostgreSQL expected. The analyzed plan tells me what actually happened. The gap between those two stories is usually where the fix lives.

Costs are useful, but they are not the first thing I trust. I trust actual rows, loops, timing, and buffers because those are the footprints the query left behind.

Always ask for real work

Plain EXPLAIN can be useful when you cannot run the query, but it is still only a prediction. For tuning, run the query in a safe environment with the real parameters and inspect the actual work.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT o.id, o.created_at, c.email
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.tenant_id = 42  -- substitute the real parameter value
  AND o.created_at >= now() - interval '7 days'
ORDER BY o.created_at DESC
LIMIT 100;
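
One caveat before making this a habit: EXPLAIN (ANALYZE) really executes the statement, so a data-modifying query really modifies data. For writes, a minimal sketch of the safe pattern is to wrap the measurement in a transaction you throw away (the status column here is hypothetical):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'archived'
WHERE created_at < now() - interval '1 year';
ROLLBACK;

The plan and its counters survive; the row changes do not. The executor still did the work, though, so run this where you can afford it.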

Row estimate gaps are planner debt

If a node estimated 100 rows and returned 100,000, the rest of the plan was built on bad math. One bad estimate can push the planner into the wrong join order, the wrong index, the wrong aggregate strategy, and the wrong memory expectation.

I look for the first node where estimates and actuals diverge badly. Fix that upstream lie before blaming the downstream join.
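
The cheapest fixes live at the statistics layer. A hedged sketch against the orders table above; the 1000 target is an arbitrary illustration, not a recommendation:

-- Stale statistics are the most common source of estimate gaps.
ANALYZE orders;

-- For a skewed column the default sample may be too small;
-- raise the per-column target and re-analyze.
ALTER TABLE orders ALTER COLUMN tenant_id SET STATISTICS 1000;
ANALYZE orders;

If the gap comes from correlated equality filters across columns, CREATE STATISTICS with functional dependencies is the next tool. Either way, re-check the first divergent node after each change instead of stacking fixes.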

Loops multiply pain

Nested loops are not bad. A nested loop over three rows can be perfect. A nested loop over 300,000 rows with an expensive inner lookup is a different story. The loops field tells you how many times a node really ran.

When a plan says an inner index scan took 0.05 ms, do not stop there. If it ran 600,000 times, that tiny number became the incident.
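
The arithmetic is worth making explicit. In EXPLAIN ANALYZE output, actual time is reported per loop, so the real cost of a node is its time multiplied by its loops. The numbers below are invented for illustration:

->  Index Scan using customers_pkey on customers c
      (actual time=0.050..0.052 rows=1 loops=600000)

Per loop this looks free, but 0.05 ms × 600,000 loops is roughly 30 seconds of total work hiding behind a tiny number.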

Buffers show whether the query touched the world

A query can be slow because it reads from disk. It can also be slow because it hits millions of cached pages and burns CPU walking them. The buffer counters separate those cases; the sketch after this list shows what they look like in a plan.

  • shared hit: pages found in shared buffers.
  • shared read: pages not found in shared buffers and read in from the OS, which can mean physical disk or the OS page cache.
  • shared dirtied: pages changed by the query.
  • temp read/write: temporary file activity from sorts, hashes, or materialization.
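
A hedged sketch of how those counters appear under a node, with invented values:

Buffers: shared hit=412300 read=180

A huge hit count beside a tiny read count means the time went to CPU walking cached pages, not to storage. If your role is allowed to change it, SET track_io_timing = on; makes the plan include an I/O Timings line, which shows how long those reads actually waited rather than just how many there were.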

Sorts and hashes tell you when memory was not enough

Look for external sort methods, large hash batches, or temp buffers. Those are signs that the executor needed more memory than the current operation had available. The fix might be a better index, a smaller input, a different join strategy, or a targeted work_mem change.
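
When a scoped memory bump is the right fix, keep it local to the one statement instead of raising the server-wide setting. A minimal sketch, with the value and query purely illustrative:

BEGIN;
SET LOCAL work_mem = '64MB';  -- reverts automatically at COMMIT or ROLLBACK
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, count(*)
FROM orders
GROUP BY customer_id;
COMMIT;

Then compare: a sort that said external merge should now say quicksort, and a hash that reported multiple batches should drop back to one, if memory really was the constraint.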

The plan review checklist

  1. Find the slowest branch by actual time and loops.
  2. Find the first major estimate error.
  3. Check whether rows removed by filter are excessive.
  4. Check shared reads versus shared hits.
  5. Check temp files, sort methods, and hash batches (sample lines for steps 3-5 follow this list).
  6. Change one thing and compare the same plan again.
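
Here is what those lines look like in real output; the values are invented for illustration:

Filter: (o.tenant_id = 42)
Rows Removed by Filter: 4821000
Buffers: shared hit=1200 read=98000
Sort Method: external merge  Disk: 210448kB

Reading top to bottom: the scan discarded almost everything it touched, most pages came from outside shared buffers, and the sort spilled to disk. Each line maps to one checklist step.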

The production habit

The teams that get good at PostgreSQL performance do not chase every knob. They turn a vague complaint into a named failure mode, collect one clean measurement, make one change, and then compare the next measurement against the first. That rhythm is slower than guessing for the first hour and much faster by the end of the incident.
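
One way to make that rhythm cheap is to let PostgreSQL capture the first clean measurement for you. A sketch using the auto_explain extension, assuming you can load it and accept the instrumentation overhead on a busy server:

-- Typically loaded via shared_preload_libraries in postgresql.conf;
-- LOAD works for a single session.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '500ms';  -- log only statements slower than this
SET auto_explain.log_analyze = on;            -- include actual rows and timings
SET auto_explain.log_buffers = on;            -- include the buffer counters

Slow statements then land in the server log with the same ANALYZE and BUFFERS detail, so the baseline measurement exists before anyone starts guessing.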