By dbivioSlow QueriesJune 6, 20269 min read

PostgreSQL Extended Statistics: When the Planner's Estimates Are Off by 1000x

The planner assumes your columns are independent. When they're actually correlated, its row estimates collapse, it picks a nested loop over a hash join, and a fast query becomes a slow one. CREATE STATISTICS fixes the math.

The query plan looked reasonable. Nested loop, index scans, estimated 3 rows. The actual runtime was 40 seconds and the actual row count was 90,000. The planner had estimated 3 rows and gotten back tens of thousands, so it had chosen a plan that is great for 3 rows and a disaster for 90,000.

The estimate was wrong because the planner assumed two columns were independent when they were almost perfectly correlated. city and postal_code: knowing the postal code basically determines the city. The planner did not know that, so it multiplied two selectivities and got a number that was off by four orders of magnitude.

This is one of the highest-value, least-used features in PostgreSQL. When estimates are catastrophically wrong on multi-column filters, extended statistics is usually the cure.

The independence assumption

By default, PostgreSQL keeps single-column statistics and assumes columns are statistically independent. To estimate the selectivity of WHERE city = 'X' AND postal_code = 'Y', it multiplies the selectivity of each predicate. If each matches 1% of rows, it estimates 0.01 * 0.01 = 0.0001 of the table.

When the columns are independent, that math is right. When they are correlated, it is wildly wrong. A postal code essentially implies its city, so the combined predicate matches roughly the same rows as the postal code alone, not a tiny fraction of it. The estimate undershoots by the correlation factor.

How a bad estimate becomes a slow query

Row estimates drive plan choice. A tiny estimate makes the planner pick a nested loop, expecting to probe an index a handful of times. When the real count is huge, that nested loop probes the index tens of thousands of times, and what should have been one hash join becomes a storm of index lookups.

So the symptom is not just a wrong number in EXPLAIN. It is the wrong algorithm. The fix is to make the estimate accurate so the planner reaches for the join strategy that actually fits the data.

-- The tell: estimated rows vs actual rows are wildly different
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM addresses
WHERE city = 'Springfield' AND postal_code = '62704';
-- Look for: (rows=3) ... (actual rows=41000). That gap is the bug.

CREATE STATISTICS, the dependencies kind

Extended statistics let you tell PostgreSQL that a group of columns is related. The dependencies kind captures functional dependencies, exactly the "postal code implies city" relationship. After you create it and ANALYZE, the planner stops naively multiplying and uses the real combined selectivity.

You define it once on the columns that travel together in filters. ANALYZE then computes the statistics, and from that point the planner's multi-column estimates reflect reality.

CREATE STATISTICS addr_city_zip (dependencies)
  ON city, postal_code FROM addresses;

ANALYZE addresses;

-- Re-check: the estimate should now track actual rows
EXPLAIN ANALYZE
SELECT * FROM addresses
WHERE city = 'Springfield' AND postal_code = '62704';

ndistinct for GROUP BY estimates

There is a second common failure: GROUP BY over multiple columns. The planner estimates the number of groups by multiplying per-column distinct counts, which overcounts badly when columns are correlated. That bad group estimate leads to wrong memory sizing and the wrong aggregation strategy.

The ndistinct kind of extended statistics records the real number of distinct combinations, so the planner sizes hash aggregates and chooses between hashing and sorting correctly.

CREATE STATISTICS sales_region_product (ndistinct)
  ON region, product_category FROM sales;
ANALYZE sales;
-- Now GROUP BY region, product_category is sized correctly

The MCV kind for skewed combinations

There is a third kind, mcv (most common values), which stores the most frequent multi-column combinations. It helps when specific value pairs are far more common than independence would predict, for example a handful of high-volume (country, currency) pairs in a global table. With MCV stats the planner can estimate those hot combinations precisely instead of averaging them away.

You can request multiple kinds in one statistics object, and in most cases there is no reason not to, since the cost is a little extra work during ANALYZE.

-- Capture all useful kinds at once
CREATE STATISTICS orders_geo (dependencies, ndistinct, mcv)
  ON country, currency, region FROM orders;
ANALYZE orders;

-- Inspect what was gathered
SELECT * FROM pg_stats_ext WHERE statistics_name = 'orders_geo';

How to spot where this helps

You do not sprinkle extended statistics everywhere. You apply it where the planner is provably wrong. The workflow is mechanical: find queries where EXPLAIN ANALYZE shows estimated rows far from actual rows on a multi-column filter or group-by, identify the correlated columns, create the matching statistics object, ANALYZE, and confirm the estimate and the plan improved.

Find a slow query with a large estimated-vs-actual row gap.
Identify the correlated columns in its filter or GROUP BY.
CREATE STATISTICS with dependencies and/or ndistinct and/or mcv.
Run ANALYZE on the table.
Re-run EXPLAIN ANALYZE and confirm the estimate and plan are now sane.

The practical standard

The best PostgreSQL performance work is boring in the right way. Name the failure mode, capture the before plan or metric, make one change, and compare the exact same signal afterward. Anything else is just a more confident guess.

← Back to all articles