A table reported one million live rows but occupied the space of six million. Five million dead tuples were sitting in it, and a sequential scan had to walk past all of them. The developer's question was fair: 'I deleted those rows. Why are they still on disk?'
The answer is PostgreSQL's concurrency model, MVCC — Multi-Version Concurrency Control. Understanding it is the difference between fighting bloat blindly and knowing exactly where it comes from. PostgreSQL does not overwrite or erase rows when you UPDATE or DELETE; it creates new versions and marks old ones, and a separate process cleans up later.
Every row carries two transaction stamps
Each row version (a tuple) has two hidden system columns: xmin, the ID of the transaction that created it, and xmax, the ID of the transaction that deleted or superseded it. A live row has an xmin and no effective xmax. A deleted or updated-away row has its xmax set to the transaction that ended its life.
You can actually look at them, which makes the abstract concrete.
SELECT xmin, xmax, * FROM accounts WHERE id = 42;
-- xmin = transaction that inserted this version
-- xmax = transaction that deleted/updated it away (0 if still live)
How a query decides what it sees
When a query starts, it takes a snapshot: the set of transactions whose effects it is allowed to see. For each tuple, PostgreSQL compares the tuple's xmin and xmax against that snapshot. A tuple is visible if its creating transaction is visible to the snapshot and its deleting transaction is not.
This is what lets readers and writers not block each other: a writer creates a new version while readers keep seeing the old one, each according to its own snapshot. Two transactions can legitimately see different versions of the same row at the same time.
Where dead tuples come from
An UPDATE does not modify a row in place. It writes a new version (new xmin) and stamps the old version's xmax. A DELETE just stamps xmax. In both cases the old version stays physically in the table, now dead — invisible to new snapshots but still occupying its page.
So a table with heavy update/delete churn accumulates dead tuples fast. The five million dead rows in that table were the corpses of updates and deletes that nothing had cleaned up yet. They count toward table size and toward the work of every sequential scan.
VACUUM is the cleanup crew
Dead tuples are removed by VACUUM, which can reclaim a tuple only once no snapshot could still need it — once it is older than the xmin horizon. Autovacuum normally does this continuously. When it falls behind, or when a long-running or idle-in-transaction session pins the horizon, dead tuples pile up and the table bloats.
That connects MVCC to almost every operational concern in PostgreSQL: bloat, the importance of healthy autovacuum, why long transactions are dangerous, and why index-only scans depend on the visibility map. They are all downstream of how visibility works.
- Tuples carry xmin (creator) and xmax (deleter) transaction IDs.
- A snapshot decides which tuple versions a query may see.
- UPDATE and DELETE leave the old version behind as a dead tuple.
- VACUUM removes dead tuples once they are past the xmin horizon.
- Stalled vacuum or pinned horizons turn dead tuples into bloat.
The practical standard
The best PostgreSQL performance work is boring in the right way. Name the failure mode, capture the before plan or metric, make one change, and compare the exact same signal afterward. Anything else is just a more confident guess.