A connection storm is when your application opens hundreds of new database connections in seconds. Postgres reaches max_connections, refuses new ones, and the application starts erroring. This looks like a database failure but is almost always something else upstream.
I have helped diagnose this maybe a dozen times. The pattern is consistent enough that I have a checklist.
What it looks like
During the storm, the application logs are full of connection refused or too many clients already. Postgres logs show rejected connection attempts. pg_stat_activity shows close to max_connections rows, often most of them in idle state.
The key symptom: the database itself is not slow. CPU is fine. Disk I/O is fine. Active queries (those in state = 'active') are not the problem. The problem is the count of established connections.
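A quick way to confirm that, using nothing but pg_stat_activity (the backend_type filter keeps autovacuum and other background workers out of the count):
SELECT
count(*) AS total_conns,
count(*) FILTER (WHERE state = 'active') AS active,
count(*) FILTER (WHERE state = 'idle') AS idle,
current_setting('max_connections')::int AS max_conns
FROM pg_stat_activity
WHERE backend_type = 'client backend';
During a storm, total_conns sits near max_conns while active stays small.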
Why it happens
Three common causes:
1. The connection pool collapsed and is rebuilding. A network blip, a database restart, a load balancer change — anything that drops existing connections. The application sees them gone and immediately tries to recreate all of them. If 200 application instances each try to open 10 connections simultaneously, that is 2000 new connections in seconds.
2. The application has a leak. Connections opened but never returned to the pool. Over time, the pool grows because old connections are still nominally "in use" but doing nothing. Eventually the pool exhausts the database's connection limit.
3. A workload spike that the pool was not sized for. Black Friday, a viral mention, a cron job that fires more requests than expected. The application scales horizontally, each instance opens its share of the pool, the database cannot handle the multiplied total.
Diagnosis: who is connected and what are they doing
SELECT
state,
application_name,
count(*) AS conns,
count(*) FILTER (WHERE now() - state_change > interval '1 minute') AS stale
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state, application_name
ORDER BY conns DESC;
This tells you who is connected, in what state, and how many. A healthy pool shows mostly idle connections that were recently active (low stale count). An unhealthy pool shows lots of idle in transaction (a leak) or lots of stale idle connections (the pool has grown larger than it needs to be).
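If the stale or idle in transaction counts are high, the follow-up question is which sessions are holding those connections and what they last ran. A query in the same vein:
SELECT
pid,
application_name,
state,
now() - state_change AS idle_for,
left(query, 60) AS last_query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
AND state IN ('idle', 'idle in transaction')
AND now() - state_change > interval '1 minute'
ORDER BY idle_for DESC;
Old idle in transaction rows are the leak signature: something took a connection, started a transaction, and never finished it.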
For the storm specifically, check the rate at which connections are being established:
SELECT date_trunc('minute', backend_start) AS started, count(*) AS conns
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY 1
ORDER BY 1 DESC
LIMIT 20;
If a large share of the connections landed in the last minute or two, you are mid-storm.
The first-line fix: a connection pooler in front
A connection pooler (PgBouncer is the standard) sits between your application and Postgres. The application opens many connections to the pooler; the pooler maintains a small number of long-lived connections to Postgres. The application's volatile connection pattern becomes invisible to the database.
# pgbouncer.ini (excerpt; the full file also needs a [databases] section mapping names to the real server)
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
pool_mode = transaction is the right default for most applications: PgBouncer multiplexes connections at transaction boundaries. The application sees many available connections; Postgres sees a stable 50.
The one limitation of transaction mode: session-level state (prepared statements, advisory locks, SET outside transactions) does not persist across transactions. Most applications can adapt; some need session mode, which is less efficient.
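One common adaptation: if what you miss is a per-request SET, use SET LOCAL instead. It is scoped to the current transaction, so it behaves correctly under transaction pooling. A minimal sketch:
BEGIN;
-- SET LOCAL lasts only until COMMIT, so it cannot leak onto a shared server connection
SET LOCAL statement_timeout = '5s';
SELECT count(*) FROM pg_stat_activity;  -- runs under the 5-second timeout
COMMIT;
A plain SET outside a transaction would stick to whichever server connection PgBouncer happened to use and would not reliably follow this client afterwards.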
Tune the application pool size
A common mistake: each application instance has 100 max connections in its pool. With 50 instances, that is 5000 potential connections to a Postgres with max_connections = 200. This is overprovisioned by 25x.
The right shape: per-instance pool size × instance count ≤ max_connections − 20 (reserve for admin and auto-vacuum).
For 50 application instances and max_connections = 200, each instance gets roughly 3-4 connections. That sounds tiny, but it is enough if the application is not doing many concurrent transactions per instance, which most are not.
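The same arithmetic against the live setting, if you want Postgres to do it for you (the instance count of 50 and the reserve of 20 are the assumptions to substitute):
-- per-instance budget = (max_connections - reserve) / instance_count
SELECT (current_setting('max_connections')::int - 20) / 50 AS per_instance_pool;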
If the math does not work, that is a sign you need PgBouncer.
Idle timeouts and statement timeouts
For connections that linger, two timeouts to set:
-- Drop idle-in-transaction connections after 5 minutes
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
-- Cancel statements that run longer than 1 minute (adjust per workload)
ALTER SYSTEM SET statement_timeout = '1min';
SELECT pg_reload_conf();
These protect against the slow-leak failure mode where a stuck request holds a connection open until the database runs out.
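To confirm the reload actually picked the new values up:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('idle_in_transaction_session_timeout', 'statement_timeout');
Both report in milliseconds (unit = ms).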
When the storm is happening right now
If you are paged during a storm, the order of operations:
- Find the offending application:
SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;
- Confirm the connections are from the storming app, not legitimate traffic.
- If safe, kill the storming app's connections in batches (a batched variant is sketched after this list):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name = 'storming_app' AND state = 'idle';
- The application will reconnect, but with a smaller burst because some of its connections will return clean.
- Once stable, investigate root cause.
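Terminating hundreds of connections at once just invites another reconnect burst, so batch it. A sketch, with the 50-row batch size and the 'storming_app' name as placeholders:
SELECT pg_terminate_backend(pid)
FROM (
SELECT pid
FROM pg_stat_activity
WHERE application_name = 'storming_app'
AND state = 'idle'
ORDER BY backend_start
LIMIT 50
) AS victims;
Run it, watch the connection count settle, repeat.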
Killing connections is a tactical fix during an incident. The strategic fix is whatever produced the storm — usually a pool misconfiguration or a missing PgBouncer.
What I do for permanent prevention
- PgBouncer in front of any Postgres serving more than 10 application instances.
- Per-instance pool sized so all instances combined fit in max_connections − 20.
- idle_in_transaction_session_timeout set to a reasonable value.
- Application pool configured to detect and remove dead connections.
- Monitoring on pg_stat_activity count, with alerts before the storm is fatal (a minimal gauge query follows).
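The monitoring piece does not need anything elaborate; a gauge on the query below, alerting somewhere around 70-80 percent, gives you time to react (the threshold is a judgment call, not a rule):
SELECT round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_of_max
FROM pg_stat_activity
WHERE backend_type = 'client backend';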
With these in place, connection storms are mostly prevented. When they happen, they self-resolve. Without these, they require manual intervention every time, and the next one is always a question of when, not if.