I have done a lot of major-version upgrades. The first one took me three days, included two rollbacks, and aged me considerably. The most recent one took two hours including the test runs. The difference is mostly preparation.
Here is the workflow I use, the gotchas I have hit, and the parts that cannot be skipped.
The two upgrade paths
pg_upgrade — fastest. Runs pg_upgrade on a stopped database, swaps the binary, restarts. Downtime measured in minutes for most databases. Works in-place.
Logical replication — zero-downtime. Stand up a new cluster on the new version, replicate from the old, cut over with a brief failover window. More moving parts, more to go wrong, but minimal user impact.
For most teams I work with, pg_upgrade with --link is the right answer. It is fast, well-documented, and the downtime is acceptable. Logical replication is the answer when downtime is not acceptable, which is rarer than people initially claim.
What pg_upgrade --link actually does
The --link mode does not copy data. It hardlinks the data files in the old data directory to the new data directory. Postgres then reads the same on-disk pages with the new version's understanding of them.
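A quick way to see what --link did on disk, using the same data-directory layout as the cutover example further down (adjust the paths for your install): hard-linked files have a link count above one, because two directory entries point at the same physical file.
# User relation files shared between the two clusters (system catalogs are
# recreated in the new cluster, so only user data shows up here)
find /var/lib/postgres/16/base -type f -links +1 | head
# Inspect one of them: %h is the link count, %i the inode number
find /var/lib/postgres/16/base -type f -links +1 | head -1 | xargs stat -c '%h %i %n'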
The downside: you cannot roll back by booting the old version against the same data. The hardlinked files now belong to the new cluster, and its writes leave the old cluster's catalog stale. To roll back, you need a backup or a logical replica still on the old version.
In practice, this is fine if you have a recent backup, which you should have anyway. The speed benefit is enormous — multi-TB databases finish in minutes.
The pre-flight checklist
Before touching production, I do all of these:
1. Read the release notes. Yes, all of them. Major versions change planner behavior, default extensions, system catalog schemas. The release notes flag breaking changes that affect specific workloads.
2. Test the upgrade on a clone. Take a recent backup, restore to a test environment, run pg_upgrade against it. Time the operation. If it takes 2 hours on the clone, plan for 3 in production.
3. Run the application's full test suite against the upgraded clone. Catch syntax that no longer works (deprecated features get removed), planner regressions, extension version mismatches.
4. Identify and update extensions. Each extension has its own version compatibility. pg_stat_statements, pg_repack, pgcrypto, custom extensions — every one needs verification; the inventory query after this list is where I start. pg_upgrade fails its pre-flight library check if an extension's shared library is not installed for the new version's binaries.
5. Check connection strings and pooling. PgBouncer, application connection strings, ORM and driver compatibility. The wire protocol itself almost never changes across major versions, but authentication defaults and driver-visible behavior do (the scram-sha-256 change below is the classic example).
6. Backup. Backup. Backup. Taken before any upgrade attempt. Verified by a partial restore. Stored somewhere outside the database server.
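For item 4, a minimal inventory sketch; mydb is a placeholder, and it has to be run against every database in the cluster:
# What is installed, and at which version
psql -d mydb -c "SELECT extname, extversion FROM pg_extension ORDER BY extname;"
# On the upgraded clone: what the new binaries ship vs what is installed
psql -d mydb -c "SELECT name, installed_version, default_version FROM pg_available_extensions WHERE installed_version IS NOT NULL;"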
The failure modes I have hit
Extension version mismatch. New cluster starts, application connects, queries fail because pg_stat_statements is older than the new cluster expects. Fix: ALTER EXTENSION pg_stat_statements UPDATE after the upgrade.
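One way to sweep that fix across everything installed (mydb again a placeholder): generate the ALTER statements from the catalog and feed them back to psql, touching only the extensions that are actually behind.
# Update every installed extension whose version lags the new default
psql -d mydb -Atc "SELECT 'ALTER EXTENSION ' || quote_ident(name) || ' UPDATE;' FROM pg_available_extensions WHERE installed_version IS NOT NULL AND installed_version <> default_version" | psql -d mydb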
Planner regressions. A query that ran in 50ms on the old version takes 5 seconds on the new one. The planner made different choices because cost estimates changed. Fix: run ANALYZE on the affected tables, sometimes increase statistics target, occasionally rewrite the query. This is the most common surprise — set aside time for it.
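The usual first response, sketched against a hypothetical orders.customer_id column that picked up a bad plan:
# pg_upgrade does not carry optimizer statistics over, so analyze first
psql -d mydb -c "ANALYZE orders;"
# Still bad? Give the planner more detail on the skewed column and re-analyze
psql -d mydb -c "ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500; ANALYZE orders;"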
Default-changed settings. Postgres 14 changed the default password_encryption to scram-sha-256. Older clients that only spoke md5 broke. Fix: either update the clients or temporarily set the old value.
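The escape hatch, if updating the clients has to wait. Note the setting only affects passwords set after the change; existing SCRAM hashes stay SCRAM until the password is reset.
# Revert the default for newly set passwords, then reload the config
psql -c "ALTER SYSTEM SET password_encryption = 'md5';"
psql -c "SELECT pg_reload_conf();"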
Removed functionality. Postgres 12 removed WITH OIDS from new tables. If your app ever used OIDs, this hit at upgrade time. Fix: rewrite the application before upgrading.
Locale changes. A glibc update that rides along with the upgrade (new OS image, new host) can change collation order. Indexes on text columns built under the old collation may no longer match the new sort order. The fix is to reindex every index on collatable columns. The Postgres docs describe this; the symptom is silent index corruption that the application notices weeks later.
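Two ways to get ahead of it, assuming you can afford the I/O: verify the btree indexes with pg_amcheck (ships with Postgres 14 and later), or simply rebuild everything.
# Verify btree indexes against the heap; --install-missing adds the amcheck extension where needed
pg_amcheck --all --install-missing --heapallindexed
# Or rebuild indexes without blocking writes (slower, but removes all doubt)
reindexdb --all --concurrently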
The cutover ritual
# 1. Stop the application or put it in maintenance mode
# 2. Final backup
pg_basebackup -D /backup/pre-upgrade -Ft -z -P
# 3. Stop the old cluster
pg_ctl -D /var/lib/postgres/14 stop
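# 3b. Make sure the new data directory has already been initialized with the
#     new version's initdb, matching the old cluster's checksum and locale
#     settings; pg_upgrade expects it to exist. With the paths above, e.g.:
#     /usr/lib/postgres/16/bin/initdb -D /var/lib/postgres/16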
# 4. Run pg_upgrade in check mode first
pg_upgrade \
  --old-bindir=/usr/lib/postgres/14/bin \
  --new-bindir=/usr/lib/postgres/16/bin \
  --old-datadir=/var/lib/postgres/14 \
  --new-datadir=/var/lib/postgres/16 \
  --check
# 5. If check passes, do it for real
pg_upgrade \
  --old-bindir=/usr/lib/postgres/14/bin \
  --new-bindir=/usr/lib/postgres/16/bin \
  --old-datadir=/var/lib/postgres/14 \
  --new-datadir=/var/lib/postgres/16 \
  --link
# 6. Start the new cluster
pg_ctl -D /var/lib/postgres/16 start
# 7. Run analyze on all databases
vacuumdb --all --analyze-in-stages
# 8. Run the application against the new cluster
The --check step is non-destructive and tells you if anything in the data prevents a successful upgrade. Always run it first.
On managed services
RDS, Aurora, Cloud SQL, AlloyDB — they all have managed major-version upgrade flows. The pre-flight checklist still applies. The cutover ritual is mostly automated, but the planner regressions and extension mismatches are still your problem.
For managed services I add one extra step: read the provider's upgrade docs in detail. They list provider-specific gotchas that the Postgres release notes do not cover. Aurora has rules about cluster versus instance upgrades. RDS only supports specific upgrade paths, so an intermediate minor-version upgrade is sometimes required before the major one is allowed.
Post-upgrade verification
After the upgrade succeeds, the next 48 hours are when you find regressions. The checks I run:
- pg_stat_statements snapshot before and after, comparing top queries by total time (scripted below). New top queries are suspect.
- pg_stat_user_tables for last_autovacuum — make sure autovacuum is doing its job.
- Slow query log for the first 24 hours. New entries that did not exist before are planner regressions.
- Application error rate. New patterns of errors point at compatibility issues.
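The before/after snapshot from the first bullet is mechanical enough to script. A minimal sketch, assuming pg_stat_statements is installed and mydb is a placeholder:
# Run once before the upgrade and once after, then diff the two files
psql -d mydb -Atc "SELECT queryid, calls, round(total_exec_time) FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 20" > pgss_before.txt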
Most of the time everything is fine. When it is not, the symptoms appear within the first day, and the fixes are concentrated in a handful of queries.
What I do differently now
The main thing experience changed for me: I do upgrades on a routine, not on a deadline. Skipping a major version is technical debt. Being two majors behind is operational risk. Routine cadence — upgrade within a quarter or two of the new release — keeps each upgrade a smaller delta.
The team that upgraded once every five years had the worst upgrade experience I have seen. The team that upgraded yearly had the smoothest. That difference is mostly tooling and confidence built up by repetition.