Indexes10 min read

PostgreSQL Multi-Tenant Performance: Tenant Skew, Partial Indexes, and Noisy Neighbors

Average latency looked great while one tenant's pages timed out. Tenant skew, plan caching against the wrong tenant, and noisy neighbors are the things that actually break multi-tenant Postgres.

Our dashboards said the multi-tenant app was healthy. Average query time was fine, p95 was fine, and yet one customer kept opening tickets that their pages timed out. Averages were lying to me because almost all of our tenants were small and one was enormous.

Multi-tenant PostgreSQL fails in ways a single-tenant system never shows you, because the data is wildly uneven and the planner only gets to pick one strategy.

Tenant skew breaks the planner's assumptions

Most of my tenants had a few thousand rows; one had tens of millions. The planner builds statistics for the table as a whole, so a plan that's perfect for the typical tenant is wrong for the giant one, and vice versa. The fast nested loop for a small tenant becomes a catastrophe at scale.

The first thing I started doing was testing every important query with both a tiny tenant id and the largest one. If the plan flips badly between them, that query is a future incident.

-- Test the same query against a small and a huge tenant
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE tenant_id = 7   ORDER BY created_at DESC LIMIT 50;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE tenant_id = 991 ORDER BY created_at DESC LIMIT 50;

Keep tenant_id in front of every index

The reliable win was making sure tenant_id led the composite indexes that matched our access patterns, so the planner could prune to one tenant's slice immediately instead of scanning across all of them. A trailing tenant_id is almost useless here; a leading one is everything.

For a handful of extreme tenants I added partial indexes scoped to them, which kept those indexes small and the giant tenant's queries fast without bloating the shared indexes everyone else uses.

CREATE INDEX events_tenant_created_idx
ON events (tenant_id, created_at DESC);

-- Optional: a partial index for one whale tenant
CREATE INDEX events_whale_idx ON events (created_at DESC)
WHERE tenant_id = 991;

Watch the noisy neighbor separately

Aggregate metrics hide the tenant that's hurting everyone. I started tracking p95 and p99 by tenant class, so a single heavy customer eating connections and cache couldn't disappear into a healthy-looking average. Once you can see the noisy neighbor, you can rate-limit it, move it to its own path, or schedule its heavy jobs off-peak.

  • Track p95/p99 by tenant class, not just overall.
  • Lead composite indexes with tenant_id so the planner can prune.
  • Use partial indexes for a few extreme tenants instead of one-off indexes everywhere.
  • Isolate the giants only when measurements justify it.

What I keep coming back to

None of this is clever. Name the failure mode, capture the before number, change one thing, and compare the same signal afterward. The boring version of database work is the version that actually holds up at 2 AM.