
Point-in-Time Recovery in Postgres: The Capability You Hope You Never Use

PITR lets you restore the database to any moment within your retention window. The feature is well-documented; the operational reality is messier.

PITR is the feature that lets you say: "restore the database to 2 p.m. yesterday, before the bad migration." In theory, this is straightforward. In practice, it depends on three things working together — base backups, continuous WAL archiving, and the recovery process — and any of them can be subtly broken.

The last team I helped through a PITR found that their archive_command had been failing silently for two weeks. They could restore to two weeks ago. Anything more recent was gone.

Here is what to verify before you need it.

What PITR requires

Three pieces:

  1. A base backup taken via pg_basebackup or equivalent (a minimal invocation is sketched below).
  2. Continuous WAL archiving, where every WAL segment from the moment of the base backup forward has been shipped somewhere durable.
  3. A working restore_command that can fetch those WAL segments during recovery.

If any of these is broken, PITR cannot reach the moment you want.
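
For the first piece, a minimal pg_basebackup invocation might look like this sketch. The host, user, and backup path are placeholders; -Ft -z produces the base.tar.gz layout used in the restore example later in this post.

# Take a compressed tar-format base backup; WAL comes from the archive (-X none)
pg_basebackup -h primary-host -U replication_user \
    -D /backup/$(date +%F) -Ft -z -P -X none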

Setting up the foundation

-- postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://your-bucket/wal/%f'
wal_level = replica

The archive_command is your responsibility. It must:

  • Return exit code 0 on success.
  • Be idempotent (it can be retried safely).
  • Handle network failures by failing fast (Postgres will retry the segment).
  • Not delete the file on the source — Postgres deletes it itself once archived.

For S3, the standard form is aws s3 cp %p s3://bucket/wal/%f --no-progress. For anything more sophisticated (compression, encryption, retention management), use a dedicated tool such as pgBackRest or WAL-G.
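
If you want those properties spelled out rather than implied, a minimal wrapper might look like this sketch. The script path and bucket are placeholders, and none of this replaces pgBackRest or WAL-G; it only makes the contract visible.

#!/bin/bash
# /usr/local/bin/archive-wal.sh — hypothetical wrapper, invoked as:
#   archive_command = '/usr/local/bin/archive-wal.sh %p %f'
set -euo pipefail

WAL_PATH="$1"   # %p: path to the segment, relative to the data directory
WAL_FILE="$2"   # %f: the segment file name
BUCKET="s3://your-bucket/wal"

# Idempotent: if the segment is already in the archive, report success.
if aws s3 ls "$BUCKET/$WAL_FILE" > /dev/null 2>&1; then
    exit 0
fi

# Fail fast on any error; a non-zero exit makes Postgres retry this segment.
exec aws s3 cp "$WAL_PATH" "$BUCKET/$WAL_FILE" --no-progress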

Monitoring the archive

The one query to know:

SELECT
  archived_count,
  last_archived_wal,
  last_archived_time,
  failed_count,
  last_failed_wal,
  last_failed_time
FROM pg_stat_archiver;

What you want:

  • last_archived_time is recent (within the last few minutes).
  • failed_count is low and stable.
  • last_failed_time is older than last_archived_time.

If last_failed_time > last_archived_time, you have a quiet incident. WAL is accumulating on the primary because it cannot be archived. The disk fills. PITR breaks.

The alert: last_failed_time newer than last_archived_time for more than 5 minutes → page.
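
As a sketch, that check fits in one query run from cron or any monitoring agent. Connection settings come from the environment; the paging command at the end is a placeholder for whatever your on-call tooling expects, and the query approximates "failing for 5 minutes" as "a failure since the last success, and no success in the last 5 minutes."

#!/bin/bash
# Hypothetical archive-health check
STALE=$(psql -At -c "
  SELECT last_failed_time IS NOT NULL
         AND last_failed_time > coalesce(last_archived_time, 'epoch')
         AND coalesce(last_archived_time, 'epoch') < now() - interval '5 minutes'
  FROM pg_stat_archiver;")

if [ "$STALE" = "t" ]; then
    echo "WAL archiving failing for more than 5 minutes" | page-oncall  # placeholder
fi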

How recovery actually works

Given a base backup and the WAL archive, the recovery process is:

# 1. Restore the base backup to a fresh data directory (with the server stopped)
rm -rf /var/lib/postgres/16/main/*
tar -xf /backup/base.tar.gz -C /var/lib/postgres/16/main

# 2. Configure recovery
cat > /var/lib/postgres/16/main/postgresql.auto.conf << 'EOF'
restore_command = 'aws s3 cp s3://your-bucket/wal/%f %p'
recovery_target_time = '2025-01-15 14:00:00 UTC'
recovery_target_action = 'promote'
EOF

# 3. Tell Postgres this is a recovery start
touch /var/lib/postgres/16/main/recovery.signal

# 4. Start Postgres
pg_ctl -D /var/lib/postgres/16/main start

Postgres replays WAL from the base backup forward, stopping at the configured target time. Then it promotes itself to a writable cluster.
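
Before pointing anything at it, confirm the promotion happened and the data matches the target moment. Two checks; the table and column in the second are hypothetical stand-ins for data whose timeline you know.

# Recovery finished and the cluster is writable?
psql -c "SELECT pg_is_in_recovery();"    # expect: f

# Does the newest data predate the recovery target?
psql -c "SELECT max(created_at) FROM orders;"   # hypothetical table/column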

Recovery target options

Four ways to specify the target:

  • recovery_target_time — restore to a specific timestamp. Most common.
  • recovery_target_xid — restore to a specific transaction ID. Useful if you know the exact transaction that broke things.
  • recovery_target_lsn — restore to a specific WAL position. The most precise but requires you to identify the LSN.
  • recovery_target_name — restore to a named restore point set earlier with pg_create_restore_point(). Useful if you proactively named a moment before a risky operation (example after this list).

For most incidents, recovery_target_time is the answer. "Restore to 2 p.m. yesterday" is a clear ask.
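
If you know a risky change is coming, naming the moment beforehand is cheap. A sketch; the point name is illustrative:

# Before the risky operation:
psql -c "SELECT pg_create_restore_point('before_orders_migration');"

# During a later recovery, target it by name in postgresql.auto.conf:
# recovery_target_name = 'before_orders_migration'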

What goes wrong on first try

Things I have seen during real PITRs:

WAL gap. A segment was lost between base backup and recovery target. Recovery aborts at the gap. You can either accept the truncated state or find the missing WAL (occasionally it is in a backup elsewhere).

Restore command fails authentication. The restore environment does not have S3 credentials. The base backup unpacks fine, but recovery cannot fetch any WAL. Export AWS_ACCESS_KEY_ID and its companions in the environment the Postgres server process runs under.
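
A cheap preflight: run the restore_command by hand before starting the server, as the same OS user and environment Postgres will use. The segment name here is illustrative; pick one you know was archived.

# Hypothetical preflight; run as the user that will run Postgres
sudo -u postgres aws s3 cp s3://your-bucket/wal/000000010000000A00000042 /tmp/wal-test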

Postgres version mismatch. The base backup is from Postgres 14; the recovery host runs Postgres 16. A cluster can only be recovered by the major version that produced it, so install the matching binaries first.

Target time predates the base backup. The base backup itself is newer than the target time, so there is nothing to roll back through. You need an older base backup, or you cannot reach that moment.

Promotion did not happen. Without recovery_target_action = 'promote', Postgres reaches the target and pauses recovery, still read-only (the default action is 'pause'). Add the directive or call pg_promote() afterward.
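
If you are already sitting in that paused state, promotion is one call:

psql -c "SELECT pg_promote();"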

Drills

The drill cadence I recommend:

  • Quarterly: full PITR drill. Pick a known timestamp from a few hours ago. Restore to a fresh test instance. Verify the data reflects that moment.
  • Monthly: light test. Confirm the most recent base backup is restorable.
  • Continuously: monitor archive health (above query) and base backup completion.

The drill tests catch the silent failures. The monitoring catches them as they start. Together they prevent the "we cannot restore" surprise during an incident.

On managed Postgres

Managed services (RDS, Cloud SQL, Azure Flex) handle the foundation for you. PITR is a click in the UI:

  • RDS: RestoreDBInstanceToPointInTime (CLI sketch below).
  • Cloud SQL: clone the instance with the point-in-time recovery option.
  • Azure Flexible Server: point-in-time restore to a new server.

The retention window is plan-dependent (typically 7-35 days). Within that window, PITR "just works" with the cloud's machinery.
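
On RDS, for example, the same ask through the CLI looks roughly like this. Both instance identifiers are placeholders, and the restore creates a new instance rather than rewinding the old one.

aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier prod-db \
    --target-db-instance-identifier prod-db-pitr \
    --restore-time 2025-01-15T14:00:00Z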

What is still your problem on managed:

  • Verifying the cloud's PITR capability actually works for your workload (drills).
  • Backups that go further back than the cloud's retention window (handle separately, e.g. with pg_dump; sketch after this list).
  • Recovery to a different region (varies by cloud).
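
A minimal long-retention sketch, with paths, bucket, and database name as placeholders:

# Monthly logical dump, kept independently of the cloud's PITR window
DUMP="/backup/monthly/mydb-$(date +%F).dump"
pg_dump -Fc -f "$DUMP" mydb
aws s3 cp "$DUMP" s3://your-bucket/long-retention/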

Coordinating PITR with application state

An often-overlooked detail: PITR restores the database, not the application's view of the world. After recovery, the application may have stale caches, in-flight messages, or partial state from after the target time.

For a typical web app, this is fine — point connections at the recovered database, redeploy the application. For a system with persistent application state (in-memory queues, attached blob storage), PITR alone is incomplete. Plan how to bring the application back to a consistent point too.

What I commit to before relying on PITR

  • Archive command working, monitored, alerted.
  • Base backups completing daily, success monitored.
  • Restore drill within the last 90 days.
  • A documented runbook for "do PITR to time X." Everyone on call has read it.
  • If managed: verified that the cloud's PITR works for our actual data scale.

The last point is more important than people realize. Cloud PITR is not always tested at the scale your data has grown to. The first time you try, you might find the cloud's restore takes longer than the target downtime allows. Drills surface this before incidents do.