6 min read

AWS Aurora Postgres Replica Lag: Different from Vanilla, Different to Diagnose

Aurora's replica lag has different mechanics than vanilla streaming replication. The dashboard metric "replica lag" can be misleading. Here is what it actually measures.

Aurora is Postgres-compatible but not vanilla Postgres. Its replication architecture is fundamentally different — replicas read from shared storage, not from a stream of WAL. This produces lower lag in normal operation and different failure modes when things go wrong.

The metrics, the diagnostics, and the operational mental model are all different from vanilla. Treating Aurora as if it were vanilla has caused at least three incidents I have helped diagnose.

How Aurora replicas actually work

Vanilla Postgres replication: primary writes WAL → ships to replica → replica applies WAL to its local storage → data is now visible.

Aurora replication: the primary writes to shared storage, and replicas read from that same shared storage. There is no full WAL stream to ship and replay; the writer sends replicas lightweight cache-invalidation messages, and each replica catches up to the storage state lazily.

The consequence:

  • Lag is much lower in normal operation. Single-digit milliseconds is typical for Aurora; a vanilla streaming replica often has a 50-200ms baseline.
  • The cause of lag is different. It is not WAL ship rate; it is the replica's ability to invalidate its caches and catch up to the storage state.
  • Failure modes are different. A vanilla replica that falls behind can be re-synced from a base backup. An Aurora replica that is having trouble usually needs to be restarted.
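One habit that does transfer: Aurora readers run in recovery mode just like vanilla hot standbys, so the standard pg_is_in_recovery() check still tells you whether you are connected to the writer or a reader.

-- Works on both vanilla and Aurora: readers run in recovery mode
SELECT CASE WHEN pg_is_in_recovery() THEN 'reader' ELSE 'writer' END AS role;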

The metrics that matter on Aurora

AuroraReplicaLag (CloudWatch metric): how far behind the replica is in reading shared storage. Typically <100ms in healthy operation. Spikes during heavy write activity.

AuroraReplicaLagMaximum and AuroraReplicaLagMinimum: the highest and lowest lag across all replicas in the cluster. The max identifies the worst-performing replica.

The key distinction: Aurora's AuroraReplicaLag measures storage-read lag, not the WAL-replay lag that vanilla Postgres has. They are related but not the same metric.
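For contrast, here is the vanilla-side measurement that AuroraReplicaLag does not correspond to: on a vanilla primary, pg_stat_replication exposes per-standby WAL lag (the lag columns exist on PostgreSQL 10 and later). Aurora has no equivalent WAL stream to measure this way.

-- Vanilla Postgres primary: per-standby WAL lag (PostgreSQL 10+)
SELECT application_name, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;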

What can cause Aurora replica lag

1. Heavy DDL on the writer. A large index build creates a lot of cache invalidations. Replicas have to catch up to the new state.

2. Long-running queries on the replica. Aurora's replica has to maintain a consistent view for queries in flight. If a query pins an old snapshot, the replica cannot fully catch up until it finishes.

3. Aurora storage layer issues. Rare but real — the underlying shared storage can have hotspots. AWS surfaces this as VolumeReadIOPs and VolumeWriteIOPs saturating.

4. Replica is sized too small. A replica running at high CPU cannot process invalidations as fast as the writer produces them.

Diagnosis on Aurora

Start with AuroraReplicaLag on the RDS console metrics dashboard. Then, from SQL:

-- On the replica
SELECT * FROM aurora_replica_status();

This Aurora-specific function shows replica state, including the LSN it has caught up to.
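To pull just the lag figure, filter the output. One hedge: the column names below (server_id, replica_lag_in_msec) are what I have seen on recent engine versions; the shape of this function's output has varied, so check yours.

-- Per-instance lag as the storage layer reports it
SELECT server_id, replica_lag_in_msec
FROM aurora_replica_status();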

-- On the writer
SELECT * FROM aurora_global_db_status();

For global database setups, this shows cross-region replication status.
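The same filtering idea applies to the global status function; on the versions I have used, aws_region and durability_lag_in_msec are the useful columns, but verify against your engine version.

-- Writer in the primary region: cross-region durability lag
SELECT aws_region, durability_lag_in_msec
FROM aurora_global_db_status();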

If the metric shows lag is high, the next question is which of the four causes above is responsible.

Lag spikes during DDL

The most common pattern I see: a CREATE INDEX or REINDEX on the writer causes Aurora replica lag to spike to 5-30 seconds. The application notices stale reads during this window.

This is normal Aurora behavior. The DDL is invalidating caches; replicas are catching up. The lag is temporary and resolves when the DDL completes.
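To confirm that a lag spike lines up with an index build, progress reporting on the writer shows builds in flight. This view exists on PostgreSQL 12 and later, including Aurora versions based on them.

-- On the writer: index builds (CREATE INDEX / REINDEX) in flight right now
SELECT pid, relid::regclass AS table_name, phase
FROM pg_stat_progress_create_index;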

For planned DDL during business hours, the right approach is to route reads to the writer for the duration of the operation, or to accept the staleness.

Long-running queries on replicas

Aurora's replicas hold a consistent snapshot for the duration of any query. A query that runs for an hour holds an hour-old snapshot, and during that hour the replica cannot fully catch up.

The diagnostic:

-- On the replica: long-running queries that pin an old snapshot
SELECT pid, now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND backend_type = 'client backend'
ORDER BY xact_start;

If there is a long-running query, killing it allows the replica to fully catch up. For analytical workloads that need long queries, run them on a dedicated reader without write traffic.
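Terminating the offender is standard Postgres. The pid below is a placeholder for one returned by the query above; pg_cancel_backend is the gentler option when stopping the current statement is enough.

-- 12345 is a placeholder pid from the previous query
SELECT pg_cancel_backend(12345);     -- stop the current statement only
SELECT pg_terminate_backend(12345);  -- kill the whole backend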

When Aurora is the wrong fit

Aurora's replication model is great when reads can tolerate some staleness, even brief spikes. It is less great when:

  • Strict read-after-write consistency is required. Vanilla Postgres can guarantee this with synchronous replication in remote_apply mode (see the sketch after this list); Aurora cannot, by design.
  • You need cross-region replication with consistently low lag. Aurora Global Database is a separate product; baseline lag is 100ms-1s depending on the region pair.
  • The workload interacts poorly with shared storage. Heavily I/O-bound tables on Aurora are sometimes slower than the equivalent on vanilla RDS with provisioned IOPS.
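A minimal sketch of that vanilla-side guarantee, assuming a standby registered under the hypothetical name replica1. With remote_apply, a commit does not return until the standby has applied it, so a read that follows the commit sees the write on that standby.

-- Vanilla Postgres primary: commits wait until the standby has applied them
ALTER SYSTEM SET synchronous_standby_names = 'replica1';  -- hypothetical standby name
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();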

For most read-heavy applications, Aurora is the right choice. The replication model is one of its core advantages over vanilla.

What I tell teams new to Aurora

The key mental shifts from vanilla Postgres:

  1. Replica lag is normally near-zero. Treat any sustained lag above 100ms as worth investigating.
  2. Lag spikes during DDL are normal. Plan for it; do not page on transient spikes during known DDL operations.
  3. Long queries on replicas hold up catch-up. Use a dedicated analytics replica for long queries.
  4. The metrics are different. AuroraReplicaLag is Aurora's main signal, not pg_stat_replication.

Most vanilla Postgres operational habits transfer to Aurora. The replication-specific mental model needs to be updated.

A real diagnostic

A team I worked with reported "intermittent stale reads on Aurora." The investigation:

  1. AuroraReplicaLag was sub-second normally, spiking to 30 seconds occasionally.
  2. The spikes correlated with a nightly bulk update job on the writer (1 million rows updated).
  3. During the update, replicas could not fully catch up because of the volume of cache invalidation.
  4. After the job, lag returned to normal within 1-2 minutes.

Fix: either run the bulk update during off-hours, or route the affected reads to the writer for the duration. They chose to schedule the job for 3 AM. Problem solved.
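If pg_cron is enabled on the cluster (Aurora PostgreSQL supports it as an extension), the schedule can live in the database itself. The UPDATE below is a stand-in for the team's actual bulk job, not their real statement.

-- Stand-in bulk update, scheduled for 03:00 daily via pg_cron
SELECT cron.schedule('0 3 * * *', $$UPDATE accounts SET needs_rollup = true$$);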

The lesson: Aurora's replica lag tracks workload patterns more closely than vanilla's. Heavy writes produce visible lag. Predicting and accommodating it is part of running on Aurora.