The most common scenario for auto_explain: a query was slow yesterday, the application is fine today, and the team wants to know what happened. Without auto_explain, you have aggregate stats from pg_stat_statements but no per-execution plan. With auto_explain, the plan from the slow execution is in the log.
For any production database, auto_explain is one of those settings that pays for itself the first time it catches a regression.
The basics
auto_explain is bundled with Postgres but disabled by default. Enable it via shared_preload_libraries:
shared_preload_libraries = 'pg_stat_statements,auto_explain'
This requires a restart. Once enabled, configure thresholds:
auto_explain.log_min_duration = 5000      # log queries over 5 seconds
auto_explain.log_analyze = on             # include actual row counts
auto_explain.log_buffers = on             # include buffer hit/read stats
auto_explain.log_format = json            # machine-parsable
auto_explain.log_nested_statements = on   # include functions and triggers
Now every query that runs longer than 5 seconds has its EXPLAIN ANALYZE output dumped to the Postgres log.
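Before committing to the restart, the settings can be trialed in a single superuser session; a minimal sketch:

-- Load auto_explain into this session only (superuser, no restart needed):
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;  -- log every statement in this session
SET auto_explain.log_analyze = on;

-- Run anything; its plan appears in the server log:
SELECT count(*) FROM pg_class;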
Picking the threshold
The threshold is workload-dependent.
- 5000ms (5 seconds): starting point. Catches very slow queries without flooding logs.
- 1000ms: more aggressive. Useful for OLTP where 1-second queries are notable.
- 500ms: very aggressive. Generates a lot of log volume; only do this for short investigation periods.
- 100ms: only during specific diagnosis. Will flood the log.
For most production workloads, 5000ms is the right starting point. Lower it during specific investigations, raise it back when done.
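Because auto_explain's settings are superuser-changeable at runtime, the lower-then-restore dance needs no restart; a sketch:

-- Drop to 500ms for an investigation window:
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';
SELECT pg_reload_conf();

-- Restore the baseline when done:
ALTER SYSTEM SET auto_explain.log_min_duration = '5000ms';
SELECT pg_reload_conf();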
What the output looks like
LOG:  duration: 8542.234 ms  plan:
{
  "Query Text": "SELECT * FROM orders o JOIN customers c ...",
  "Plan": {
    "Node Type": "Hash Join",
    "Actual Total Time": 8542.123,
    "Plans": [
      {
        "Node Type": "Seq Scan",
        "Relation Name": "orders",
        "Actual Rows": 50000000,
        ...
      }
    ]
  }
}
This is JSON-formatted EXPLAIN ANALYZE. Tools can parse it; humans can read it. Better than fishing for individual plan elements in unstructured text.
What auto_explain does NOT do
- It does not log queries below the duration threshold.
- It does not log queries that error out before the duration is measured.
- It does not come for free with log_analyze = on: per-node timing then runs for every statement, logged or not, which carries measurable overhead. auto_explain.log_timing = off keeps the actual row counts while dropping most of that cost.
- It does not retroactively log past slow queries — it only works for queries running while it is enabled.
For the full historical view, combine pg_stat_statements (aggregate timings) with auto_explain (per-execution plans).
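The combined workflow, as a sketch: rank queries by variance in pg_stat_statements, then pull the matching slow executions' plans out of the auto_explain log. Column names below assume PostgreSQL 13+ (older releases use mean_time / stddev_time):

-- Queries with erratic latency are the ones auto_explain will catch:
SELECT queryid,
       calls,
       round(mean_exec_time::numeric, 1)   AS mean_ms,
       round(stddev_exec_time::numeric, 1) AS stddev_ms,
       left(query, 60)                     AS query
FROM pg_stat_statements
ORDER BY stddev_exec_time DESC
LIMIT 10;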
On managed services
- RDS: enable in parameter group. Logs go to CloudWatch.
- Cloud SQL: enable via the flag cloudsql.enable_auto_explain=on. Logs go to Cloud Logging.
- Azure Flexible Server: similar via server parameters. Logs go to Log Analytics.
The exact flag names vary; the concept is the same.
Querying the captured plans
If you forward Postgres logs to a structured logging service (CloudWatch Insights, Splunk, etc.), you can query them:
fields @timestamp, @message
| filter @message like /auto_explain/
| filter @message like /Seq Scan/
| sort @timestamp desc
This finds slow queries that did sequential scans — the most common bad plan.
For aggregation, parse the JSON server-side and treat each plan as a record; a sketch follows. Tools like pgsplain (pgsplain.dev) can format and analyze individual plans.
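A minimal sketch of that server-side approach, assuming the JSON payloads have already been extracted into a hypothetical captured_plans table (jsonpath requires PostgreSQL 12+):

-- Hypothetical table: one auto_explain JSON payload per row
CREATE TABLE captured_plans (
    logged_at timestamptz,
    plan      jsonb
);

-- Find every captured plan with a Seq Scan node anywhere in the tree;
-- $.** is the recursive wildcard accessor (PostgreSQL 12+).
SELECT logged_at,
       plan ->> 'Query Text'               AS query,
       plan #>> '{Plan,Actual Total Time}' AS total_ms
FROM captured_plans
WHERE jsonb_path_exists(plan, '$.** ? (@."Node Type" == "Seq Scan")');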
The log volume question
auto_explain at 5000ms threshold generates moderate log volume. At 100ms, it can be enormous on a busy database — every "slightly slow" query produces a multi-line plan.
Budget log storage accordingly. For temporary deep diagnosis (lower threshold for an hour to capture a specific issue), it is fine. For sustained operation, keep the threshold high.
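If the threshold has to stay low for a while, auto_explain.sample_rate bounds the volume by explaining only a fraction of statements per session; a sketch:

-- Keep a 500ms threshold but sample ~10% of statements:
ALTER SYSTEM SET auto_explain.log_min_duration = '500ms';
ALTER SYSTEM SET auto_explain.sample_rate = 0.1;
SELECT pg_reload_conf();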
What to do with captured plans
The usual interpretations, with a remediation sketch after the list:
1. Sequential scan on a large table: missing index. The plan tells you exactly which table.
2. Hash join with high Batches: hash table spilled. Either work_mem too low or the inner side larger than expected.
3. Nested loop with very high inner loops: bad statistics or missing index. The plan shows the actual loop count.
4. Sort that spilled to disk: same as above but for sort operations.
5. Unexpected plan structure: planner picked something surprising. Compare to what you would have expected; investigate why the planner thought otherwise.
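The remediation sketch for those patterns, with illustrative object names (orders, status) rather than anything prescriptive:

-- 1. Missing index: cover the filtered column.
CREATE INDEX CONCURRENTLY orders_status_idx ON orders (status);

-- 2 and 4. Hash or sort spills: give the session more memory and retest.
SET work_mem = '256MB';

-- 3. Bad statistics: refresh them, or raise the sample size for a skewed column.
ANALYZE orders;
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;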
Most slow query investigations now start with "check auto_explain" and end with a specific plan node to fix.
A worked example
Real scenario: a customer-facing query was sometimes 100ms, sometimes 30 seconds. pg_stat_statements showed high stddev but no single bad execution.
With auto_explain enabled at 5000ms:
LOG:  duration: 28341.123 ms  plan:
{
  "Plan": {
    "Node Type": "Seq Scan",
    "Relation Name": "orders",
    "Filter": "((status)::text = 'pending'::text)",
    "Actual Rows": 12,
    "Actual Total Time": 28339.000
  }
}
The plan was seq-scanning 50M orders to find 12 pending ones. The pending count fluctuated: when a large fraction of the table was pending, a seq scan was a defensible choice and the statistics reflected those periods; when only a handful of rows were pending, the same plan read 50M rows to return a dozen.
The fix was a partial index:
CREATE INDEX CONCURRENTLY orders_pending_idx
ON orders (created_at DESC)
WHERE status = 'pending';
Latency stabilized at 5-15ms regardless of pending count.
The fix was straightforward once we had the plan. Without auto_explain, finding which queries were slow was easy; finding why was the hard part.
What I commit to for production
- auto_explain in shared_preload_libraries from day 1.
- log_min_duration = 5000ms baseline.
- log_analyze = on, log_buffers = on, log_format = json, log_nested_statements = on.
- Logs forwarded to a queryable system.
- Lowered threshold during active investigations, raised back afterward.
This adds moderate log volume in exchange for the ability to retroactively diagnose any slow query. The tradeoff is heavily worth it.