PostgreSQL Extended Statistics: Fixing Planner Lies at the Source

The PostgreSQL planner is only as good as its estimates. Extended statistics help when correlated columns, multi-column filters, and skewed values make ordinary statistics lie.

Most bad PostgreSQL plans start as bad estimates. The planner thinks a predicate will return 200 rows and it returns 2 million. It picks a nested loop because the inner side looks tiny. It skips an index because selectivity looks weak. The execution plan is wrong because the model of the data is wrong.

Regular column statistics are powerful, but they are mostly column-local. Production schemas are not column-local. Country and currency are correlated. Tenant and status are correlated. Deleted rows cluster by lifecycle. A customer with one enterprise account does not behave like a customer with one trial account.

Extended statistics give PostgreSQL a way to understand relationships across columns. They do not replace indexing, and they do not force a plan. They fix the planner's inputs so normal planning has a better chance.

The framework: find estimate errors before adding indexes

When a query is slow, the tempting move is to add another index. Sometimes that is right. But if the plan is bad because row estimates are wrong, another index may only give the planner another bad choice.

I first look for estimate gaps in EXPLAIN (ANALYZE, BUFFERS):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM invoices
WHERE tenant_id = 42
  AND status = 'overdue'
  AND deleted_at IS NULL;

If the estimate says hundreds of rows and the actual result is hundreds of thousands, the planner is making decisions from a distorted picture. That is where extended statistics are worth testing.
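
The signature is unmistakable once you know to look for it. A hypothetical fragment of such a plan (the index name and every number here are invented for illustration):

Index Scan using invoices_tenant_idx on invoices
  (cost=0.43..1250.10 rows=210 width=96)
  (actual time=0.812..1450.233 rows=480000 loops=1)

An estimate of 210 rows against an actual 480,000 means every decision downstream of this node (join order, join strategy, memory sizing) was made against a fiction.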

Dependencies: when columns imply each other

Dependency statistics help when one column strongly predicts another. A simple example is a SaaS table where each tenant only uses a few statuses or regions.

CREATE STATISTICS invoices_tenant_status_deps
  (dependencies)
ON tenant_id, status
FROM invoices;

ANALYZE invoices;

Without this, PostgreSQL multiplies the per-column selectivities as if tenant_id and status were independent, so it assumes tenant_id = 42 and status = 'overdue' overlap far less than they really do and underestimates the row count. With dependency stats, it can model that relationship more honestly.

This matters most when correlated predicates decide join order, index choice, and whether a LIMIT query can stop early.
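
Before trusting the new object, check what ANALYZE actually learned. A minimal sketch using the pg_stats_ext view (available since PostgreSQL 12), assuming the statistics object above:

SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE statistics_name = 'invoices_tenant_status_deps';

Dependency degrees run from 0.0 to 1.0. A degree near 1.0 means one column almost fully determines the other; a degree near 0.0 means the columns really are independent and the object is not earning its keep.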

MCV lists: when combinations are skewed

Most-common-values statistics track common value combinations across columns. That is useful when a few combinations dominate the table.

CREATE STATISTICS orders_region_channel_mcv
  (mcv)
ON region, sales_channel, status
FROM orders;

ANALYZE orders;

Imagine the combination region = 'us', sales_channel = 'marketplace', status = 'pending' is massively more common than every other combination. Single-column stats can see each column's distribution, but not the important combination. MCV stats can.

The practical result is better row estimates for exactly the query shapes that dashboards and operational queues tend to run all day.
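
You can inspect the captured combinations directly. A sketch, assuming the object above and the privileges needed to read pg_statistic_ext_data:

SELECT m.*
FROM pg_statistic_ext AS s
JOIN pg_statistic_ext_data AS d ON d.stxoid = s.oid,
     pg_mcv_list_items(d.stxdmcv) AS m
WHERE s.stxname = 'orders_region_channel_mcv';

Each item reports a frequency (how common the combination actually is) and a base_frequency (what independence would have predicted). The combinations with the largest gap between the two are exactly the ones the planner had been getting wrong.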

ndistinct: when group counts drive the plan

Distinct-count statistics help when the planner needs to estimate the number of distinct combinations across columns. This shows up in GROUP BY, joins, and deduplication-heavy queries.

CREATE STATISTICS events_tenant_type_ndistinct
  (ndistinct)
ON tenant_id, event_type
FROM usage_events;

ANALYZE usage_events;

If each tenant emits only a small subset of event types, assuming independence can inflate group estimates. That can push the planner toward expensive aggregate strategies or memory assumptions that do not match reality.
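
The quickest way to see whether the object matters is to compare the aggregate node's estimated row count before and after creating it:

EXPLAIN
SELECT tenant_id, event_type, count(*)
FROM usage_events
GROUP BY tenant_id, event_type;

Without the statistics object, the group estimate tends toward distinct tenants multiplied by distinct event types. With it, the estimate should move toward the true number of combinations, which is what the choice between hash and sorted aggregation actually depends on.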

Extended statistics do not help every predicate

Extended statistics are not magic metadata. They help when the planner can use them for compatible clauses. They do not fix every expression, every function-wrapped predicate, or every join condition.

If the query hides the column inside a function, start by making the predicate indexable and understandable:

-- Harder for the planner and indexes.
WHERE date_trunc('day', created_at) = date '2026-05-08'

-- Easier to estimate and index.
WHERE created_at >= timestamp '2026-05-08 00:00:00'
  AND created_at <  timestamp '2026-05-09 00:00:00'

Good statistics cannot compensate for a query shape that hides the useful data relationship.
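
One related option: PostgreSQL 14 and later can build extended statistics on an expression itself, which gives the planner a distribution for the computed value even when the query cannot be rewritten. A sketch, assuming the invoices table from earlier:

-- Expression statistics (PostgreSQL 14+): the planner gets
-- a distribution for the truncated timestamp directly.
CREATE STATISTICS invoices_created_day_stats
ON (date_trunc('day', created_at))
FROM invoices;

ANALYZE invoices;

The range rewrite is still the better fix when you control the query, because a plain btree index on created_at can serve it.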

How I roll it out

  1. Capture the bad plan with EXPLAIN (ANALYZE, BUFFERS).
  2. Name the estimate error: which node expected the wrong number of rows?
  3. Identify the correlated columns that drive the error.
  4. Create the narrowest useful statistics object.
  5. Run ANALYZE and compare plans on representative parameters.
  6. Keep the statistics object only if the estimate and plan improve.

The rollback path is simple: drop the statistics object and analyze again. That makes extended statistics a low-risk experiment compared with many index changes.
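
The whole experiment, forward and back, fits in a few statements. A minimal sketch using the dependencies object from earlier:

DROP STATISTICS IF EXISTS invoices_tenant_status_deps;
ANALYZE invoices;

No index rebuild, no table rewrite, no storage to reclaim.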

What I monitor

  • Estimate ratio: actual rows divided by estimated rows on critical plans.
  • Plan stability: whether join order changes after ANALYZE or data imports.
  • Autovacuum analyze cadence: stats that are correct once can drift after churn.
  • Stats target pressure: whether key columns need a higher statistics target (see the sketch after this list).
  • Dashboard parameter classes: plans should be checked across common and rare values.
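
For the stats target bullet, a hypothetical tuning pass (object and column names reuse the earlier examples):

-- Raise per-column detail for a heavily filtered column.
ALTER TABLE invoices ALTER COLUMN tenant_id SET STATISTICS 500;

-- Extended statistics objects accept a target too (PostgreSQL 13+).
ALTER STATISTICS invoices_tenant_status_deps SET STATISTICS 500;

ANALYZE invoices;

Higher targets mean slower ANALYZE and larger catalogs, so raise them for the columns that drive plans, not everywhere.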

The pragmatic default

Use extended statistics when bad estimates come from correlated columns, not when the table simply lacks an index. They are especially useful on multi-tenant tables, operational queues, analytics filters, and dashboards with repeated multi-column predicates.

The mistake is treating the planner as mysterious. It is usually doing the best it can with the facts you gave it. Extended statistics are how you give it better facts.