The deploy went out at 14:00. By 14:30, an alert fires: one query's p99 latency has jumped from 50ms to 4 seconds. The query is unchanged — same SQL, same parameters, same data — but the plan is different.
Plan regressions are one of the harder Postgres incidents to diagnose because the database is doing what it always does (planning each query); it just happens to be doing it differently now.
Here is the diagnosis workflow.
Step 1: Confirm it is a plan regression, not a workload change
SELECT
queryid, calls, total_exec_time, mean_exec_time, stddev_exec_time, query
FROM pg_stat_statements
WHERE query ILIKE '%suspect_pattern%'
ORDER BY total_exec_time DESC;
If the same queryid exists with similar calls but much higher mean_exec_time than yesterday's snapshot, the plan changed. If calls also spiked, the workload changed.
To distinguish the two, compare against a snapshot:
-- Today
SELECT mean_exec_time FROM pg_stat_statements WHERE queryid = ?;
-- Yesterday's snapshot (assuming you have history)
SELECT mean_exec_time FROM pg_stat_statements_history
WHERE queryid = ? AND captured_at = '2025-01-14 00:00';
Mean execution time doubled with similar call count = plan regression.
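If you do not keep history yet, a minimal snapshot job is enough. This is a sketch; pg_stat_statements_history is the same hypothetical table used in the query above:
-- One-time setup: snapshot table with a capture timestamp
CREATE TABLE pg_stat_statements_history AS
SELECT now() AS captured_at, * FROM pg_stat_statements;
-- Run on a schedule (nightly cron, or right before a deploy) to append a snapshot
INSERT INTO pg_stat_statements_history
SELECT now(), * FROM pg_stat_statements;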
Step 2: Capture the current plan
If auto_explain is logging at a low enough threshold, the slow plan is in recent logs. Extract it.
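A minimal auto_explain setup looks like this; the parameters are real, the threshold is illustrative:
# postgresql.conf -- shared_preload_libraries requires a restart
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '500ms'  # log plans for statements slower than this
auto_explain.log_analyze = on            # include actual row counts and timings
auto_explain.log_buffers = on            # include buffer usage in the logged plan
Note that log_analyze adds measurement overhead to every statement above the threshold, so pick the threshold with care.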
If not, run EXPLAIN ANALYZE manually with realistic parameters:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT ... -- the regressed query, with realistic parameters
Note the plan structure. Save it somewhere.
Step 3: Capture what the plan should be
If you have a saved good plan from before the regression (e.g., from an auto_explain log entry pre-deploy), compare. Look for:
- Different join order.
- Different join type (hash vs nested loop vs merge).
- Different index choice.
- Sequential scan instead of index scan.
The difference reveals what the planner now believes that it did not before.
If you do not have the old plan, you can sometimes reproduce it on a replica that has not yet picked up the change, or on a clone.
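For example, a regression from an index scan to a sequential scan is visible at a glance. These are illustrative plan fragments, not real output; the table and index names are made up:
-- Before (good plan):
Index Scan using orders_customer_idx on orders  (cost=0.43..8.45 rows=1 width=72)
  Index Cond: (customer_id = 42)
-- After (regressed plan):
Seq Scan on orders  (cost=0.00..35811.00 rows=498213 width=72)
  Filter: (customer_id = 42)
The row-estimate blow-up from 1 to roughly 500k is the tell: the planner's selectivity estimate changed, not the query.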
Step 4: Why did the plan change?
The usual causes:
1. Statistics changed. Auto-analyze ran on a table where the data shape shifted. The new stats produce different cost estimates, leading to a different plan.
Diagnostic:
SELECT relname, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'affected_table';
If last_autoanalyze was right before the regression, statistics are the suspect.
Fix: increase the statistics target on relevant columns or use extended statistics.
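Both are one-liners; the table and column names here are illustrative:
-- Raise the per-column statistics target (default is 100), then re-analyze
ALTER TABLE affected_table ALTER COLUMN customer_id SET STATISTICS 1000;
-- Or capture cross-column correlation with extended statistics
CREATE STATISTICS affected_table_dep (dependencies, ndistinct)
  ON customer_id, region FROM affected_table;
ANALYZE affected_table;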
2. Index dropped or invalidated. Someone dropped an index, or a CONCURRENTLY index build failed and left an INVALID index.
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE indrelid = 'affected_table'::regclass;
If an index is INVALID or missing, the planner cannot use it.
Fix: rebuild the index.
3. Configuration change. Someone changed random_page_cost, effective_cache_size, or another planner parameter. The change shifted plan costs.
Check for recent settings changes. Compare current values to known-good values.
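pg_settings shows both the current value and where it came from, which catches an ALTER SYSTEM or a stray override in a config file:
SELECT name, setting, source, sourcefile
FROM pg_settings
WHERE name IN ('random_page_cost', 'effective_cache_size',
               'work_mem', 'enable_seqscan', 'jit');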
4. Application started passing different parameter shapes. A prepared statement's generic plan became wrong for newly-common parameter values.
Check if the application started using a new value space. The fix is plan_cache_mode = 'force_custom_plan' for the affected query.
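You can reproduce the generic-plan flip by hand. After five executions with custom plans, Postgres may switch a prepared statement to a cached generic plan if it looks competitive on average; the names below are illustrative:
PREPARE q(int) AS SELECT * FROM orders WHERE customer_id = $1;
-- Executions 1-5 use custom plans built for the actual parameter value.
EXPLAIN (ANALYZE) EXECUTE q(42);
-- From the sixth execution on, the planner may cache and reuse a generic
-- plan; with skewed data, that plan can be badly wrong for hot values.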
5. Postgres version upgrade. Major versions change planner behavior. The query that was fine on v14 picks a different plan on v16.
For this case, often the fix is to update statistics targets, add hints via pg_hint_plan, or tune cost parameters.
Step 5: Mitigate immediately
While diagnosing root cause, get production back to functional:
For statistics-driven regressions:
ANALYZE affected_table;
This may refresh the stats and restore the old plan. Sometimes the new plan persists regardless, because the data shape genuinely changed.
For index issues:
REINDEX INDEX CONCURRENTLY broken_index;
-- Or recreate if missing
For config issues:
Revert the change.
For prepared statement issues:
SET plan_cache_mode = 'force_custom_plan';
Deploy this in the application's session config for the affected query.
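If threading a session-level SET through the application is awkward, the same setting can be scoped to a role or issued per-transaction; the role and statement names here are illustrative:
-- Role-level: applies to every session connecting as this role
ALTER ROLE app_user SET plan_cache_mode = 'force_custom_plan';
-- Or transaction-local, only around the affected statement
BEGIN;
SET LOCAL plan_cache_mode = 'force_custom_plan';
EXECUTE suspect_statement(42);  -- hypothetical prepared statement
COMMIT;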
As a last resort:
Use pg_hint_plan to force a specific plan for the query while you fix the underlying issue. This is a workaround, not a fix.
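A pg_hint_plan hint is a comment on the statement itself; the table and index names here are illustrative:
/*+ IndexScan(orders orders_customer_idx) */
SELECT * FROM orders WHERE customer_id = 42;
Keep a ticket open to remove the hint once the root cause is fixed; hints silently pin the plan even after the planner would choose correctly on its own.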
Step 6: Validate
Run the query with the mitigation. EXPLAIN ANALYZE confirms the plan is now what you expected. The application's latency metrics return to baseline.
A real diagnostic
Real scenario: a search query was 100ms, suddenly became 8 seconds the morning after a deploy.
Diagnostic sequence:
- pg_stat_statements: same query, same calls, mean_exec_time jumped 80x. Plan regression.
- EXPLAIN: now using a sequential scan instead of an index scan.
- Checked pg_index: the relevant index existed but indisvalid = false.
- Last night's deploy ran CREATE INDEX CONCURRENTLY that timed out and left an invalid index. The application code change relied on this index.
- Fix: DROP INDEX CONCURRENTLY broken_index; CREATE INDEX CONCURRENTLY new_index;
Back to baseline within 15 minutes of detection.
The lesson: invalid indexes from failed concurrent builds are a quiet failure mode. The failed build raises an error, but it leaves the invalid index sitting in the catalog, and deploy tooling that swallows the error makes everything look fine while the index is unusable. Always verify after CREATE INDEX CONCURRENTLY that the index is valid.
Prevention
Things I have added to deploy processes after seeing regressions:
- After every CREATE INDEX CONCURRENTLY, query pg_index to confirm indisvalid = true. Fail the deploy if not (the check is sketched below).
- Capture EXPLAIN plans for hot queries before and after major changes. Compare.
- Snapshot pg_stat_statements before and after deploys. Catch regressions in the metrics rather than waiting for user reports.
- For prepared statements, set plan_cache_mode = 'force_custom_plan' for queries with parameter-dependent plans.
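The deploy-gate check is a single query; the table name is illustrative:
-- Returns the names of any invalid indexes on the table; a non-empty
-- result means the concurrent build failed and the deploy should abort.
SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE indrelid = 'affected_table'::regclass
  AND NOT indisvalid;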
The goal is to detect plan regressions in seconds, not when users notice. With the right monitoring, this is achievable.
What I commit to
- pg_stat_statements snapshot before deploys, comparison after.
- auto_explain at sane threshold to capture slow plans.
- Verify CONCURRENTLY index builds succeeded.
- Documented mitigations for each common regression type.
Most plan regressions are detectable and reversible if the team has the tools. Without them, the only signal is user complaint, by which time the impact has propagated.