I have helped two teams recover from incidents where the backup turned out to be unusable. Both had been backing up for years. Neither had ever restored. Both found out, in the middle of the incident, that something had gone wrong months earlier and nobody had noticed.
The first principle of backups: backups are not real until you have restored from them.
Here is the drill cadence I push for and the tests that prove the backup works.
What "the backup works" actually means
A functional backup means you can answer "yes" to all of:
1. The backup files exist where you think they do.
2. They are restorable to a working Postgres instance.
3. The restored instance contains the data you expected.
4. Point-in-time recovery (if used) restores to the right moment.
5. The application can connect and run real queries against the restored data.
Most teams test (1). Some test (2). Almost nobody tests (3) through (5) regularly.
The basic drill
Quarterly:
- Provision a fresh Postgres instance.
- Restore the most recent full backup to it.
- Run a sample of representative queries against the restored data.
- Compare row counts of key tables to production.
- Document what you learned, then tear down the test instance.
This takes about an hour for most setups and catches 90% of "the backup is broken" failure modes.
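Most of the comparison step can be scripted. A minimal sketch of the row-count check, assuming psql connection strings are supplied as PROD_URL and RESTORE_URL and using an illustrative table list:

# Hypothetical connection strings and table list; adapt to your schema.
# Expect small drift on hot tables, since production keeps writing after
# the backup is taken.
for table in users orders events; do
  prod=$(psql "$PROD_URL" -Atc "SELECT count(*) FROM $table")
  restored=$(psql "$RESTORE_URL" -Atc "SELECT count(*) FROM $table")
  echo "$table: production=$prod restored=$restored"
done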
Restoring pg_basebackup output
For pg_basebackup-style backups, the restore is well-documented:
# Stop the new Postgres if running
pg_ctl -D /var/lib/postgresql/16/main stop
# Wipe and restore
rm -rf /var/lib/postgresql/16/main/*
tar -xf /backup/base.tar.gz -C /var/lib/postgresql/16/main
# If using continuous archiving for PITR, set restore_command
cat > /var/lib/postgresql/16/main/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp s3://your-bucket/wal/%f %p'
recovery_target_time = '2025-01-15 14:30:00 UTC' # for PITR
EOF
touch /var/lib/postgresql/16/main/recovery.signal
pg_ctl -D /var/lib/postgresql/16/main start
Watch the log to see Postgres apply WAL up to the recovery target. The drill is not done when Postgres starts — it's done when you have queried the database and confirmed the data is what you expect.
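Two checks I lean on while WAL is replaying; the log path is an assumption that depends on how logging is configured on the restore target:

# Follow replay progress in the server log (path is an assumption)
tail -f /var/lib/postgresql/16/main/log/postgresql-*.log

# From another session: still in recovery, and how far has replay reached?
psql -Atc "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"

# With a recovery target set, recovery pauses at the target by default;
# promote once you are satisfied it stopped where you wanted
pg_ctl -D /var/lib/postgresql/16/main promote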
Restoring pg_dump output
# Create a fresh database
createdb monpg_restore_test
# Restore the dump
pg_restore -d monpg_restore_test --no-owner --no-acl /backup/monpg_2025-01-15.dump
# Or for plain SQL
psql monpg_restore_test < /backup/monpg_2025-01-15.sql
pg_dump is logical (statement-based). It is portable across Postgres versions but slower to restore than pg_basebackup. For databases over a few hundred GB, the restore can take many hours.
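If restore time is the bottleneck, the custom and directory dump formats support parallelism. A sketch; the job count is an assumption to tune, and monpg as the source database name is inferred from the file names above:

# Parallel restore works with custom- and directory-format dumps
pg_restore -d monpg_restore_test --no-owner --no-acl -j 4 /backup/monpg_2025-01-15.dump

# The dump itself can also run in parallel, but only in directory format
pg_dump -Fd -j 4 -f /backup/monpg_2025-01-15.dir monpg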
What to verify after restore
A short list of queries to run against the restored database:
-- Row counts of key tables
SELECT count(*) FROM users;
SELECT count(*) FROM orders;
SELECT count(*) FROM events WHERE created_at > now() - interval '1 day';
-- Data shape
SELECT max(created_at), max(id) FROM orders;
-- Tables that came back suspiciously small (a common sign of a partial restore)
SELECT relname, pg_total_relation_size(relid) AS bytes
FROM pg_stat_user_tables
ORDER BY bytes ASC
LIMIT 10;
-- For deeper corruption checks, the pg_amcheck tool (Postgres 14+) scans heap and B-tree pages.
Compare these to production. Small differences on hot tables are expected, since production has kept writing since the backup was taken; large gaps or missing recent data mean the backup is incomplete or stale.
For critical data integrity, run a sampled checksum against production:
-- On production:
SELECT md5(string_agg(id::text || created_at::text, ',' ORDER BY id))
FROM orders
WHERE id BETWEEN 1000 AND 1100;
-- Same on restored:
-- Should match exactly.
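One caveat: the checksum only matches if none of the sampled rows were updated after the backup was taken, so pick an id range of old or immutable rows. A small wrapper to run the same query on both sides, with PROD_URL and RESTORE_URL as assumed connection strings:

# Run the sampled checksum on production and on the restored instance, then compare
Q="SELECT md5(string_agg(id::text || created_at::text, ',' ORDER BY id))
   FROM orders WHERE id BETWEEN 1000 AND 1100"
prod=$(psql "$PROD_URL" -Atc "$Q")
restored=$(psql "$RESTORE_URL" -Atc "$Q")
[ "$prod" = "$restored" ] && echo "orders sample checksum: OK" || echo "orders sample checksum: MISMATCH"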
Point-in-time recovery drill
If you use continuous archiving for PITR, the most useful drill is:
- Pick a known timestamp from the past (within retention window).
- Identify a row that was modified after that timestamp.
- Restore to that timestamp.
- Verify the row reflects its pre-modification state.
This tests the entire WAL replay chain, not just the base backup. PITR works in theory until it doesn't, and the failure modes (missing WAL segments, corrupted archive, restore_command bugs) are not visible until you test.
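A sketch of that drill end to end, reusing the restore_command and recovery_target_time settings from the pg_basebackup section; the orders row, its id, and the status/updated_at columns are illustrative assumptions:

# 1. Before restoring, capture the row's current (post-modification) state on production
psql "$PROD_URL" -c "SELECT id, status, updated_at FROM orders WHERE id = 42"

# 2. Restore with recovery_target_time set to the chosen timestamp, then query the
#    restored instance; the row should show its pre-modification state
psql "$RESTORE_URL" -c "SELECT id, status, updated_at FROM orders WHERE id = 42"

# 3. Sanity check: nothing in the restored database should postdate the target
psql "$RESTORE_URL" -Atc "SELECT max(created_at) FROM events"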
Common surprises
Things I have seen go wrong on a first drill:
- Missing extension binaries. Production has pg_stat_statements, pgvector, postgis. The restore target does not have them installed at the OS level. Restore aborts on CREATE EXTENSION. (A cheap pre-flight check is sketched below.)
- Mismatched Postgres version. Backup taken from Postgres 14, restore target is Postgres 16. pg_basebackup is version-locked; pg_dump is not.
- Mismatched locale. A glibc collation difference between the source and target machines. Indexes that depend on collation may need to be reindexed.
- Permission errors. Backup files were written as one user; the restore is running as another. chown first.
- WAL gaps. The restore command can find some WAL segments but not all. The gap stops recovery.
- Bigger data than expected. Production grew faster than the restore environment was sized for. The backup is fine but the restore target's disk is too small.
Each of these is fixable. None of them are visible until you try.
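The extension mismatch in particular is cheap to catch up front, provided a Postgres server is already running on the restore target (as it is before a pg_dump restore). A sketch:

# Extensions production actually has installed
psql "$PROD_URL" -Atc "SELECT extname FROM pg_extension ORDER BY extname"

# Extensions the restore target's binaries can provide
psql "$RESTORE_URL" -Atc "SELECT name FROM pg_available_extensions ORDER BY name"

# Anything in the first list but missing from the second will fail at CREATE EXTENSION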
What I track
After every drill:
- Was the restore successful (yes/no).
- Time from start to first query, in minutes.
- Time from start to fully verified, in minutes.
- Anything that surprised us.
Over time, the trend tells you whether your backup capability is improving or degrading. The ideal is that drills get faster as you automate the steps; the warning sign is that drills find more surprises as the production schema evolves.
A pragmatic rhythm
My default cadence:
- Quarterly: full restore drill on the latest production backup.
- Monthly: a lighter pass on a recent backup: provision the restore target, pull the backup files, and confirm they unpack, but skip the full restore and verification unless you have time.
- Continuously: monitor that backups are completing and that backup file checksums match what was uploaded.
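The continuous piece can be as small as a cron job that alerts when the newest backup looks stale. A sketch against S3, with the bucket prefix and the 26-hour threshold as assumptions (GNU date syntax):

# Alert if the newest object under the backup prefix is older than 26 hours
latest=$(aws s3 ls s3://your-bucket/base/ | sort | tail -n 1 | awk '{print $1" "$2}')
age=$(( $(date +%s) - $(date -d "$latest" +%s) ))
if [ "$age" -gt $(( 26 * 3600 )) ]; then
  echo "ALERT: newest backup is $(( age / 3600 ))h old" >&2
  exit 1
fi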
And one rule everyone knows: a backup that has not been restored within the last 90 days is not a backup. It is hope.
The two teams I helped recover both moved to this rhythm afterward. Neither has had a backup-fails-during-incident moment since.