
PostgreSQL Backup Strategies: Restores, WAL, and Test Drills

Backups matter only if the restore path is known. Choose logical or physical backups based on recovery goals, WAL history, and rehearsal discipline.

I've had this conversation too many times: "We have backups." "When did you last test a restore?" Silence.

An untested backup is not a backup. It's a hope. I've seen pg_dump jobs silently fail for weeks because of a permissions issue — cron job exits zero, no actual dump created, nobody noticed. When they needed to restore, the most recent good backup was 6 weeks old.

That story has a very bad ending.

Two Approaches: Logical vs Physical

Logical backups (pg_dump): Export data as SQL statements or a custom binary format. Plain dumps are human-readable, all formats are cross-version compatible, and you can be selective (dump one table or one schema). Cons: slow on large databases, no point-in-time recovery, and it doesn't capture everything (roles and tablespaces need separate handling).

Physical backups (pg_basebackup + WAL archiving): Copy the actual data files. With archived WAL on top, you get point-in-time recovery: restore to any moment within your retention window. Faster for large databases, captures everything. Cons: must restore to the same major PostgreSQL version, and the backups are bigger.

For anything important, you want both.

pg_dump: The Right Way

# Custom format (recommended — smaller, can restore single tables)
pg_dump -h localhost -U postgres -d mydb -F c -f mydb_backup.dump

# Parallel dump for large databases
pg_dump -h localhost -U postgres -d mydb -F d -j 4 -f mydb_backup_dir/

# Verify the dump is valid
pg_restore --list mydb_backup.dump | head -50

# Restore
pg_restore -h localhost -U postgres -d mydb_new -F c mydb_backup.dump

# Restore single table from dump
pg_restore -h localhost -U postgres -d mydb -F c -t orders mydb_backup.dump

Use custom format (-F c) or directory format (-F d), not plain SQL (-F p). Plain SQL can't be fed to pg_restore for selective restores, and it's usually much larger because custom and directory formats compress by default.

For globals (roles, tablespaces):

pg_dumpall -h localhost -U postgres --globals-only -f globals.sql
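
Restore them with plain psql on the target cluster, before the database dumps, so object ownership maps cleanly:

psql -h localhost -U postgres -d postgres -f globals.sql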

Automating pg_dump

#!/bin/bash
# backup-postgres.sh
set -euo pipefail

BACKUP_DIR="/var/backups/postgres"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="mydb"
RETENTION_DAYS=14

mkdir -p "$BACKUP_DIR"

# Dump
pg_dump -h localhost -U postgres -d "$DB_NAME" -F c \
  -f "$BACKUP_DIR/${DB_NAME}_${DATE}.dump"

# Verify dump is non-empty and readable
DUMP_SIZE=$(stat -c%s "$BACKUP_DIR/${DB_NAME}_${DATE}.dump")
if [ "$DUMP_SIZE" -lt 1000 ]; then
  echo "ERROR: Backup file too small, likely failed" >&2
  exit 1
fi

pg_restore --list "$BACKUP_DIR/${DB_NAME}_${DATE}.dump" > /dev/null

# Cleanup old backups
find "$BACKUP_DIR" -name "${DB_NAME}_*.dump" -mtime +${RETENTION_DAYS} -delete

echo "Backup completed: ${DB_NAME}_${DATE}.dump (${DUMP_SIZE} bytes)"

The size check and pg_restore --list check catch most silent failures. This script exits non-zero on failure, so your cron monitoring (or alerting on failed jobs) will catch it.
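
A minimal cron entry for it (schedule, user, and alert address are placeholders):

# /etc/cron.d/backup-postgres
MAILTO=oncall@example.com
0 2 * * * postgres /usr/local/bin/backup-postgres.sh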

WAL Archiving and Point-in-Time Recovery

This is the professional setup. Every change PostgreSQL makes is first written to the WAL. If you archive WAL segments as they're created, you can restore to any point in time within your archive window.

# postgresql.conf
wal_level = replica     # minimum for WAL archiving
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'

# Or to S3:
archive_command = 'aws s3 cp %p s3://my-backup-bucket/wal/%f'
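
Archiving can fail quietly (full disk, expired credentials), so watch pg_stat_archiver; failed_count should stay at zero:

psql -U postgres -c "SELECT archived_count, failed_count, last_archived_time FROM pg_stat_archiver;"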

Then take base backups periodically:

pg_basebackup -h localhost -U replicator \
  -D /var/backups/postgres/basebackup -P -Xs -R
# -Xs: stream WAL during backup
# -R: write standby.signal and connection settings (for replica use; omit when the backup is only for PITR)
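
On PostgreSQL 13+, pg_verifybackup checks the copy against the manifest pg_basebackup wrote alongside it:

pg_verifybackup /var/backups/postgres/basebackup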

For PITR restore:

# Restore base backup (server stopped, target data directory empty)
rsync -a /var/backups/postgres/basebackup/ /var/lib/postgresql/data/

# Configure recovery (PostgreSQL 12+: recovery.conf is gone; targets go in
# postgresql.auto.conf and recovery.signal arms recovery mode)
cat >> /var/lib/postgresql/data/postgresql.auto.conf << EOF
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2026-03-15 14:30:00'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal

# Start PostgreSQL — it will apply WAL until target time
pg_ctl start
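
When recovery reaches the target, the promote action brings the server out of recovery; confirm before pointing applications at it:

psql -U postgres -c "SELECT pg_is_in_recovery();"   # should return f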

pgBackRest: What I Actually Use

pgBackRest does all of this — base backups, WAL archiving, PITR — but properly. Compression, encryption, S3/Azure/GCS support, backup verification, parallel restore. It's what I deploy on production systems.

# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=7

# Or use S3:
repo1-type=s3
repo1-s3-bucket=my-backup-bucket
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com

[mydb]
pg1-path=/var/lib/postgresql/data
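
The config gets pgBackRest reading and writing its repository, but PostgreSQL still has to hand it finished WAL segments. That's pgBackRest's archive-push command, set in postgresql.conf:

archive_mode = on
archive_command = 'pgbackrest --stanza=mydb archive-push %p'

Then the day-to-day commands:
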
# Initialize repository
pgbackrest --stanza=mydb stanza-create

# Full backup
pgbackrest --stanza=mydb backup --type=full

# Differential backup (only changes since last full)
pgbackrest --stanza=mydb backup --type=diff

# Verify backup integrity
pgbackrest --stanza=mydb check

# List backups
pgbackrest --stanza=mydb info

# Restore to specific time
pgbackrest --stanza=mydb restore \
  --type=time --target="2026-03-15 14:30:00" \
  --target-action=promote

The Part Everyone Skips: Testing Restores

This is the uncomfortable truth. Your backup strategy is meaningless until you've proven you can restore from it under pressure.

What I do on every database I'm responsible for:

  1. Monthly restore test: Take the most recent backup, restore to a fresh test instance, verify data looks right. This catches backup corruption, configuration drift, and operational gaps.

  2. PITR test: Pick a timestamp in the past, restore to that point, verify the expected data state. This proves your WAL archiving is working.

  3. Timed restore drill: Actually time how long a restore takes. If your backup is 200GB, you need to know it takes 45 minutes, not assume it's fast. That 45 minutes is your minimum RTO.

#!/bin/bash
# Simple restore test script
set -euo pipefail

RESTORE_PATH="/tmp/restore_test"
TEST_PORT=5433

# Time the restore: this number is your minimum RTO (drill #3)
time pgbackrest --stanza=mydb restore \
  --pg1-path="$RESTORE_PATH" \
  --db-include=mydb

# Start a throwaway PostgreSQL instance on a separate port
pg_ctl -D "$RESTORE_PATH" -o "-p $TEST_PORT" -w start

# Verify row counts match expected
psql -h localhost -p "$TEST_PORT" -U postgres -d mydb -c "
  SELECT 'orders' AS table, count(*) FROM orders
  UNION ALL
  SELECT 'users', count(*) FROM users;
"

pg_ctl -D "$RESTORE_PATH" stop
rm -rf "$RESTORE_PATH"

Managed Service Backups

RDS, Cloud SQL, and Azure Database all have automated backups. They're fine for most use cases, but:

  • Recovering from logical corruption (someone dropped a table) means restoring the whole instance to a new copy, not surgically restoring one table
  • PITR windows are typically 7-35 days
  • Cross-region restore often isn't instant
  • You're dependent on the provider for restore execution

I still recommend taking your own logical backups (pg_dump) in addition to managed service backups. The operational overhead is low and it gives you options.

MonPG can track when your last successful backup ran and alert if it hasn't run recently — one of those "simple but saved us" monitoring checks.
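
A standalone version of that check is only a few lines if you want it in cron as well (path and threshold are placeholders):

# Alert if no dump newer than 36 hours exists
if [ -z "$(find /var/backups/postgres -name '*.dump' -mmin -2160)" ]; then
  echo "CRITICAL: newest PostgreSQL dump is older than 36 hours" >&2
  exit 2
fi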

Don't be the team with 6-week-old backups. Test your restores.