Hash join is the join type Postgres reaches for when both sides of an equi-join are big enough that nested loops would be wasteful. It builds a hash table from the smaller side, then probes it for each row of the larger side. With enough memory, it is fast. Without enough memory, it spills to disk and gets slow.
The planner usually picks the right join type. The cases worth understanding are when it is wrong, why, and how to fix the input rather than fighting the planner.
How hash join executes
Three phases:
- Build phase: read the smaller ("inner") side fully into memory. Hash each row by the join key. Insert into a hash table.
- Probe phase: read the larger ("outer") side row by row. Hash the join key, probe the hash table, emit any matching rows.
- Cleanup: free the hash table.
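The build and probe phases can be sketched in a few lines of Python. This is a toy in-memory version to show the shape of the algorithm, not Postgres's implementation; the table and key names are invented:

```python
# Toy hash join: build a hash table from the smaller side, probe with the larger.
from collections import defaultdict

def hash_join(inner, outer, inner_key, outer_key):
    # Build phase: hash every inner row by its join key.
    table = defaultdict(list)
    for row in inner:
        table[inner_key(row)].append(row)
    # Probe phase: stream the outer side, emit matching pairs.
    result = []
    for row in outer:
        for match in table.get(outer_key(row), []):
            result.append((match, row))
    return result

customers = [(1, "Ada"), (2, "Bob")]          # (id, name) -- the smaller side
orders = [(10, 1), (11, 1), (12, 2)]          # (order_id, customer_id)
joined = hash_join(customers, orders,
                   inner_key=lambda c: c[0],
                   outer_key=lambda o: o[1])
# Three orders, each matched to its customer: 3 output pairs.
```

The inner side is read once to build, the outer side once to probe, which is where the O(N + M) cost comes from.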
The cost: O(N + M) where N is the smaller side and M is the larger. Compared to a nested loop's O(N × M) (or roughly O(M log N) with an index on the inner side), hash join wins when both sides are non-trivial.
The critical detail: the hash table must fit in work_mem. If it does not, Postgres falls back to a multi-pass algorithm that spills batches to disk. Disk-spill hash join is much slower.
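The multi-pass fallback can also be sketched: both sides are partitioned by a hash of the join key into batches, and each batch is joined independently with a hash table that does fit in memory. This is a toy Python version where lists stand in for the temp files Postgres actually writes to disk:

```python
# Toy batched hash join. Rows are (join_key, payload) tuples.
from collections import defaultdict

def batched_hash_join(inner, outer, key, n_batches):
    inner_parts = [[] for _ in range(n_batches)]
    outer_parts = [[] for _ in range(n_batches)]
    # Partition both sides; matching keys always land in the same batch.
    for row in inner:
        inner_parts[hash(key(row)) % n_batches].append(row)
    for row in outer:
        outer_parts[hash(key(row)) % n_batches].append(row)
    # Join each batch with a small in-memory hash table.
    result = []
    for b in range(n_batches):
        table = defaultdict(list)
        for row in inner_parts[b]:
            table[key(row)].append(row)
        for row in outer_parts[b]:
            result.extend((m, row) for m in table.get(key(row), []))
    return result

customers = [(1, "Ada"), (2, "Bob"), (3, "Cy")]   # (id, name)
orders = [(1, 10), (2, 11), (1, 12)]              # (customer_id, order_id)
pairs = batched_hash_join(customers, orders, key=lambda r: r[0], n_batches=2)
```

Correctness is preserved, but every batched row is written and read back, which is why a spilling hash join is so much slower than an in-memory one.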
Recognizing hash join in EXPLAIN ANALYZE
Hash Join (cost=15000.00..50000.00 rows=10000 width=64)
Hash Cond: (orders.customer_id = customers.id)
-> Seq Scan on orders
-> Hash
Buckets: 4096 Batches: 1 Memory Usage: 1024kB
-> Index Scan on customers
The key fields:
- Buckets: hash table size.
- Batches: 1 means in-memory, 2+ means spilled to disk.
- Memory Usage: actual memory consumed.
Batches: 1 is the goal. Batches: 16 means 16 disk-resident partitions, each processed separately.
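When triaging a pile of saved plans, it helps to scan for spilling Hash nodes mechanically. A small convenience script (my own, not part of Postgres; the regex matches the plan line format shown above):

```python
import re

def spilled_hash_nodes(plan_text):
    """Return (batches, memory_usage) for each Hash node with Batches > 1."""
    spilled = []
    pattern = r"Buckets: (\d+)\s+Batches: (\d+)\s+Memory Usage: (\S+)"
    for m in re.finditer(pattern, plan_text):
        batches = int(m.group(2))
        if batches > 1:
            spilled.append((batches, m.group(3)))
    return spilled

plan = """Hash Join  (cost=15000.00..50000.00 rows=10000 width=64)
  -> Hash
       Buckets: 65536  Batches: 32  Memory Usage: 1024kB"""
# This plan has one spilling Hash node: 32 batches at the 1MB-ish cap.
```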
When hash join is wrong
The planner sometimes picks hash join when it should not:
1. The smaller side is much smaller than the planner thought. With bad statistics, the planner may estimate that a side has 100,000 rows when it actually has 10. A nested loop with the small side as the outer would be much faster, but the planner picked hash because the estimate was wrong.
Fix: run ANALYZE on the affected table. If the estimate is structurally hard to get right, raise the column's statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS and re-run ANALYZE.
2. The join columns have very high cardinality and skewed distribution. Hash collisions become a problem when many rows hash to the same bucket. Most workloads do not hit this, but extreme cases can cause hash join to perform worse than alternatives.
Fix: usually rewrite the query so the smaller, more selective side is filtered earlier.
When hash join is right but slow
Most commonly, the slowness is Batches: > 1 — the hash table is spilling.
-> Hash
Buckets: 65536 Batches: 32 Memory Usage: 1024kB
32 batches means the hash table needed roughly 32x more space than work_mem allowed. The fix is one of:
- Raise work_mem for this query: SET work_mem = '256MB' for the session.
- Filter the inner side earlier: a WHERE clause that reduces the inner side before the hash build.
- Add an index that allows a different join type: a composite index on the inner side might enable a merge join, which is more memory-efficient than hash.
work_mem sizing
The rule of thumb: hash table size is roughly the inner side's row count × (row width + per-tuple overhead). For a 1-million-row inner side with 50-byte rows, the hash table needs about 200-300MB once overhead is included.
Default work_mem is 4MB. That is enough for joins where the inner side is a few thousand rows. For joins on hundreds of thousands or millions of rows, you need more.
For known heavy queries, use SET LOCAL work_mem = '256MB' inside the transaction that runs them (SET LOCAL reverts at commit; plain SET lasts for the session). Do not raise the global default casually: work_mem is per sort or hash operation, and a busy server running 100 concurrent queries at 256MB each can need 25GB+ of RAM.
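The back-of-envelope arithmetic from the rule of thumb above, with the per-tuple overhead as an explicit assumption (it is my estimate for tuple headers plus bucket pointers, not a number Postgres reports):

```python
# Rough hash table sizing. overhead_bytes is an assumed per-tuple cost.
def hash_table_estimate_mb(rows, row_bytes, overhead_bytes=200):
    return rows * (row_bytes + overhead_bytes) / (1024 * 1024)

# 1M-row inner side, 50-byte rows: lands in the 200-300MB range.
need_mb = hash_table_estimate_mb(1_000_000, 50)

# Why not raise the global default: work_mem is per-operation.
concurrent_queries = 100
per_query_mb = 256
total_gb = concurrent_queries * per_query_mb / 1024   # 25.0 GB
```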
The merge join alternative
For very large joins where hash spills badly, merge join is sometimes the answer:
-- Both sides have indexes on the join column
SELECT *
FROM orders o
JOIN customers c ON c.id = o.customer_id;
If both sides have an index on the join column (or the data happens to be sorted), Postgres can do a merge join: scan both in order, merge them. Memory usage is constant — a few rows at a time.
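The constant-memory property is easiest to see in a sketch. A toy merge join over two inputs already sorted on the join key, assuming unique keys on the left side (Postgres's real implementation handles duplicates on both sides with a mark-and-restore mechanism):

```python
# Toy merge join: advance two cursors over sorted inputs, emit on key match.
def merge_join(left, right, left_key, right_key):
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left_key(left[i]), right_key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            result.append((left[i], right[j]))
            j += 1   # keep the left row: more right rows may share this key
    return result

lhs = [(1, "Ada"), (2, "Bob"), (3, "Cy")]    # sorted by id, unique keys
rhs = [(1, 10), (1, 12), (3, 20)]            # sorted by customer_id
pairs = merge_join(lhs, rhs,
                   left_key=lambda c: c[0],
                   right_key=lambda o: o[0])
```

Only the current row from each side is held at any moment, which is why memory usage stays flat no matter how large the inputs are.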
Merge join is rarely the absolute fastest, but it is the most predictable on very large data. The planner picks it when the cost analysis shows hash would spill.
A worked diagnosis
Real scenario: a query joining a 50M-row events table to a 1M-row users table. Plan was hash join with Batches: 200. Query took 45 seconds.
Diagnostic:
EXPLAIN (ANALYZE, BUFFERS) <query>;
Found: hash table needed ~600MB, work_mem was 4MB.
Fix sequence:
- SET LOCAL work_mem = '512MB'. Re-run. New plan: hash join with Batches: 1. Runtime: 4 seconds.
- Verified that the application's connection pool would not multiply this enough to cause memory pressure.
- Set the work_mem change in the application's session config for the affected query.
The overall cost was the same memory; we just gave it to the right query at the right time.
What I check first
When a hash join is slow:
- Look at Batches. If > 1, the hash table is spilling. Raise work_mem.
- Check the inner side row estimate vs actual. Bad estimates can pick the wrong side as inner.
- Verify the inner side is actually the smaller side. Sometimes the planner gets this wrong; check both sides' actual row counts.
- Consider whether merge join would be feasible — both sides indexed on join column.
Most slow hash joins are work_mem problems. A handful are bad estimate problems. The rest are usually queries that genuinely need to do that much work, and the answer is materialization or schema changes.