
Hash Joins in Postgres: When to Pick Them, When the Planner Picks Them For You

Hash joins are the workhorse for large equi-joins. They are also the join type that quietly spills to disk when work_mem is too small, turning a fast query into a slow one.

Hash join is the join type Postgres reaches for when both sides of an equi-join are big enough that nested loops would be wasteful. It builds a hash table from the smaller side, then probes it for each row of the larger side. With enough memory, it is fast. Without enough memory, it spills to disk and gets slow.

The planner usually picks the right join type. The cases worth understanding are when it is wrong, why, and how to fix the input rather than fighting the planner.

How hash join executes

Three phases:

  1. Build phase: read the smaller ("inner") side fully into memory. Hash each row by the join key. Insert into a hash table.
  2. Probe phase: read the larger ("outer") side row by row. Hash the join key, probe the hash table, emit any matching rows.
  3. Cleanup: free the hash table.
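
To watch these phases in a plan, run EXPLAIN ANALYZE on a plain equi-join. A minimal sketch — the column names are illustrative, but the table names match the plan in the next section:

-- Any large equi-join like this is a hash join candidate.
-- Postgres builds the hash table from whichever side it estimates is smaller.
EXPLAIN ANALYZE
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;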

The cost: O(N + M), where N is the smaller side and M is the larger — every row is hashed or probed exactly once. A nested loop is O(N × M) without an index on the inner side, roughly O(M log N) with one; hash join wins when both sides are non-trivial.

The critical detail: the hash table must fit in work_mem. If it does not, Postgres falls back to a multi-pass algorithm that spills batches to disk. Disk-spill hash join is much slower.

Recognizing hash join in EXPLAIN ANALYZE

Hash Join  (cost=15000.00..50000.00 rows=10000 width=64)
  Hash Cond: (orders.customer_id = customers.id)
  -> Seq Scan on orders
  -> Hash
        Buckets: 4096  Batches: 1  Memory Usage: 1024kB
        -> Index Scan on customers

The key fields:

  • Buckets — the number of slots in the in-memory hash table.
  • Batches — 1 means the whole build side fit in memory; 2+ means it spilled to disk.
  • Memory Usage — peak memory the hash table actually consumed.

Batches: 1 is the goal. Batches: 16 means 16 disk-resident partitions, each processed separately.

When hash join is wrong

The planner sometimes picks hash join when it should not:

1. The smaller side is much smaller than the planner thought. With bad statistics, the planner may estimate that a side has 100,000 rows when it actually has 10. A nested loop with the small side as the outer would be much faster, but the planner picked hash because the estimate was wrong.

Fix: ANALYZE the affected table. If the estimate is structurally hard to get right, raise the column's statistics target so ANALYZE samples it more thoroughly.
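
Both fixes are one-liners. A sketch, assuming the misestimate is on orders.customer_id; the target of 1000 is illustrative (the default is 100):

-- Refresh statistics on the misestimated table
ANALYZE orders;

-- If the estimate is structurally hard to get right, sample the
-- column more aggressively, then re-analyze
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;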

2. The join key distribution is heavily skewed. When a large share of rows carry the same key value, they all land in the same bucket, and probing that bucket degenerates toward a linear scan. Most workloads never hit this, but extreme skew can make hash join perform worse than the alternatives.

Fix: usually rewrite the query so the smaller, more selective side is filtered earlier.
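
As a sketch — the active flag here is hypothetical — pulling the inner side's filter into a subquery makes the intent explicit and shrinks the hash build:

-- Hypothetical 'active' filter: the point is that the inner side
-- is reduced before the hash table is built
SELECT o.id, c.name
FROM orders o
JOIN (SELECT id, name FROM customers WHERE active) c
  ON c.id = o.customer_id;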

When hash join is right but slow

Most commonly, the slowness is a Batches value above 1 — the hash table is spilling.

-> Hash
     Buckets: 65536  Batches: 32  Memory Usage: 1024kB

32 batches means the hash table needed roughly 32x more space than work_mem allowed. The fix is one of:

  • Raise work_mem for this query: SET work_mem = '256MB' for the session (see the sketch after this list).
  • Filter the inner side earlier: a WHERE clause that reduces the inner side before the hash build.
  • Add an index that enables a different join type: an index on the inner side's join column can make a merge join viable, which is far more memory-efficient than a spilling hash.
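
The first fix is the quickest to test. A session-level sketch:

-- Raise work_mem for this session only, then confirm the spill is gone
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;
-- Under the Hash node, look for Batches: 1
RESET work_mem;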

work_mem sizing

The rule of thumb: hash table size is roughly the inner side's row count × (row width + per-tuple overhead). For a 1-million-row inner side with 50-byte rows, the raw data is 50MB, but tuple headers and bucket pointers typically inflate that to two to four times the raw size — plan for roughly 100-200MB.
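
You can get a back-of-envelope estimate from the planner's own statistics. A sketch — the 24-byte per-tuple overhead is an assumption, and bucket pointers are ignored:

-- Rough hash table size if the build side is customers
-- (needs a recent ANALYZE; 24 bytes/tuple of overhead is a guess)
SELECT c.reltuples::bigint AS est_rows,
       pg_size_pretty((c.reltuples * (s.width + 24))::bigint) AS est_hash_size
FROM pg_class c,
     (SELECT sum(avg_width) AS width
      FROM pg_stats
      WHERE tablename = 'customers') s
WHERE c.relname = 'customers';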

Default work_mem is 4MB. That is enough for joins where the inner side is a few thousand rows. For joins on hundreds of thousands or millions of rows, you need more.

For known heavy queries, use SET LOCAL work_mem = '256MB' inside the transaction that runs them; plain SET lasts for the whole session. Do not raise the global default casually — work_mem is a per-operation limit, so a busy server with 100 concurrent queries each allowed 256MB can commit 25GB+ of RAM.
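
The transaction-scoped pattern, as a sketch:

-- SET LOCAL reverts automatically at COMMIT or ROLLBACK, so the raised
-- limit cannot leak to later queries on the same connection
BEGIN;
SET LOCAL work_mem = '256MB';
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;
COMMIT;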

The merge join alternative

For very large joins where hash spills badly, merge join is sometimes the answer:

-- Both sides have indexes on the join column
SELECT *
FROM orders o
JOIN customers c ON c.id = o.customer_id;

If both sides have an index on the join column (or the data happens to be sorted), Postgres can do a merge join: scan both in order, merge them. Memory usage is constant — a few rows at a time.

Merge join is rarely the absolute fastest, but it is the most predictable on very large data. The planner picks it when the cost analysis shows hash would spill.
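
To preview what the planner would do without hash join — say, to compare a merge join's runtime before committing to anything — you can disable it for one session. A diagnostic sketch, not something to leave on:

-- Planner toggles are session-local; use them only for diagnosis.
-- Merge join needs sorted input, so an index on the join column
-- (e.g. CREATE INDEX ON orders (customer_id)) is the usual prerequisite.
SET enable_hashjoin = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM orders o
JOIN customers c ON c.id = o.customer_id;
RESET enable_hashjoin;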

A worked diagnosis

Real scenario: a query joining a 50M-row events table to a 1M-row users table. Plan was hash join with Batches: 200. Query took 45 seconds.

Diagnostic:

EXPLAIN (ANALYZE, BUFFERS) <query>;

Found: hash table needed ~600MB, work_mem was 4MB.

Fix sequence:

  1. SET LOCAL work_mem = '512MB'. Re-run. New plan: hash join with Batches: 1. Runtime: 4 seconds.
  2. Verified that the application's connection pool would not multiply this enough to cause memory pressure.
  3. Set the work_mem change in the application's session config for the affected query.

The memory cost was the same overall; we just gave it to the right query at the right time.

What I check first

When a hash join is slow:

  1. Look at Batches. If > 1, the hash table is spilling. Raise work_mem.
  2. Check the inner side row estimate vs actual. Bad estimates can pick the wrong side as inner.
  3. Verify the inner side is actually the smaller side. Sometimes the planner gets this wrong; check both sides' actual row counts.
  4. Consider whether merge join would be feasible — both sides indexed on the join column.

Most slow hash joins are work_mem problems. A handful are bad estimate problems. The rest are usually queries that genuinely need to do that much work, and the answer is materialization or schema changes.