A long-running transaction is one of those problems that produces no errors and no slow queries — at first. Vacuum starts to fall behind. Bloat grows. The autovacuum log fills with messages about "oldest xmin held back." The application is fine. Until, three weeks later, an index is twice the size it should be and a VACUUM cannot complete because something is still holding an old snapshot.
I have inherited several Postgres clusters with this exact disease. The fix is always the same: find the long-running transaction, kill it, fix the application code that produced it.
Here is how I do that.
Why long transactions hurt
Postgres uses MVCC: every transaction sees a snapshot of the database as it existed when the transaction began. To support this, dead row versions cannot be cleaned up while any transaction might still be looking at them.
VACUUM computes the oldest xmin still needed by any open transaction and refuses to remove dead rows newer than that. If you have a transaction that started an hour ago and is still open, every dead row created in the last hour is pinned in place, even if every other transaction has long since moved on.
This is not theoretical. A single forgotten transaction held open for a day on a high-write table can produce gigabytes of bloat that vacuum cannot reclaim.
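You do not have to guess who is pinning the horizon. pg_stat_activity exposes each backend's xmin:
-- Backends holding back the cleanup horizon, worst first
SELECT pid, state, backend_xmin,
       age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
xmin_age is roughly the number of transactions that have begun since that snapshot; a large value on an idle connection is the smoking gun.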
Long transactions also retain locks. A SELECT FOR UPDATE in a long-open transaction blocks every other writer to the same rows. A DDL statement queued behind that lock then blocks every new reader of the affected table.
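To see whether a long transaction is blocking others right now, pg_blocking_pids gives the chain directly:
-- Sessions that are waiting, and the pids blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;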
The query that finds them
-- Every open client transaction older than one minute, oldest first
SELECT
pid,
usename,
application_name,
client_addr,
state,
now() - xact_start AS xact_duration,
now() - state_change AS time_in_state,
query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
AND state != 'idle'
AND xact_start < now() - interval '1 minute'
ORDER BY xact_start;
This lists every transaction older than a minute. Adjust the threshold for your workload — for OLTP, a minute is already a problem; for batch jobs, you might tolerate longer.
The state column is the diagnostic:
- active — the transaction is running a query right now. The query is the cause.
- idle in transaction — the connection has an open transaction but is not running anything. The application started a transaction and forgot to commit/rollback. This is almost always a bug.
- idle in transaction (aborted) — same as above, but a previous statement errored. The application is waiting to ROLLBACK or recover. Almost always a bug.
idle in transaction is the worst category. The connection is doing literally nothing while pinning the snapshot. I have seen these last for hours because the application's connection pool was holding the connection but the application code that produced the transaction had moved on.
How to kill them
If you have identified the offending pid:
-- Cancel the current statement, leave transaction open
SELECT pg_cancel_backend(pid_here);
-- Terminate the entire connection (rolls back the transaction)
SELECT pg_terminate_backend(pid_here);
For stuck idle-in-transaction connections, use pg_terminate_backend. Cancel does not help — there is no statement to cancel.
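When there are many of them, a bulk version works; the ten-minute threshold here is a judgment call:
-- Terminate every client connection that has sat idle in transaction too long
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND state LIKE 'idle in transaction%'
  AND state_change < now() - interval '10 minutes';
The LIKE catches the (aborted) variant as well.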
The application will receive a connection error. If the application is well-built, it retries on a fresh connection. If not, the user sees an error. Either way, this is preferable to letting the transaction continue to damage the database.
Preventing them
For idle in transaction specifically, set a server-side timeout:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();
Now Postgres terminates any connection that holds an open transaction for more than 5 minutes without doing anything. The application gets the same connection error it would get from a manual termination, but automatically.
For active long-running queries:
ALTER SYSTEM SET statement_timeout = '5min';
SELECT pg_reload_conf();
Any single statement that runs longer than 5 minutes is canceled. Use carefully — analytical queries can legitimately run longer than this, so set per-role or per-session if needed.
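For example, a looser limit for a dedicated reporting role; the role name is illustrative:
-- Per-role override, survives reconnects
ALTER ROLE reporting SET statement_timeout = '2h';
-- Per-session override, set just before a known-long query
SET statement_timeout = '2h';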
For truly long-running operations (bulk imports, migrations), run them with explicit timeouts that match their expected duration.
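SET LOCAL scopes the override to a single transaction, so the relaxed limit cannot leak into later work. A sketch, where the migration statement and the lock_timeout habit are only illustrations:
BEGIN;
SET LOCAL statement_timeout = '45min';
SET LOCAL lock_timeout = '10s';  -- fail fast instead of queueing behind other locks
ALTER TABLE orders ADD COLUMN note text;
COMMIT;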
Where the bugs come from
The most common application-level patterns I have seen:
Forgetting to commit after a SELECT. Most ORMs autocommit plain reads by default, but explicit BEGIN blocks need an explicit COMMIT. The code path BEGIN; SELECT ...; (return the value, never commit) leaks transactions.
Long inter-query work. BEGIN; SELECT * FROM users; (call third-party API for 30 seconds); UPDATE users; COMMIT. Even if everything works, the transaction is open for the API call's duration.
Connection pool checkout for the duration of a request. Some pools are configured to hold a connection across multiple distinct database operations within a request. If any of those operations open a transaction and the request stalls (waiting on something else), the transaction stays open.
The fix is application-level discipline: open a transaction only when you are about to do work, do that work, and commit immediately. Do not call external services, wait on user input, or block on anything else between BEGIN and COMMIT.
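The shape of the fix, with illustrative table and column names:
-- Read with autocommit; no transaction is left open
SELECT email FROM users WHERE id = 42;

-- ...call the external service here, with nothing open...

-- One short transaction for the write
BEGIN;
UPDATE users SET last_synced = now() WHERE id = 42;
COMMIT;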
What I do when I take over a database
First day, two queries:
- The long-transaction query above. Anything longer than a minute, I investigate.
- pg_stat_user_tables to check n_dead_tup (query below). If dead tuples are accumulating faster than vacuum can clear them, long transactions are usually involved.
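For the second check, a query like this surfaces the worst tables:
-- Tables with the most dead tuples and their last autovacuum time
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;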
Fix what is currently long. Set the timeouts. Train the team to look for the pattern in their application code. The improvement in vacuum effectiveness, lock contention, and pool stability is usually visible within a week.