Long analytics queries on our replica kept getting cancelled mid-run. Turning on hot_standby_feedback stopped the cancellations instantly — and then the primary started bloating. That trade is the whole story.
PostgreSQL Topic Archive
Replication and WAL PostgreSQL Articles
Replica lag, WAL growth, failover readiness, hot standby behavior, and replication slots.
We scaled reads to a replica and started getting bug reports about data that 'disappeared' right after saving. The cause was replication lag, and the fix was being honest about which reads can tolerate it.
Write-heavy PostgreSQL systems usually fail through WAL pressure, checkpoint I/O, replication lag, or storage stalls. The fix starts with measuring the write path, not raising random knobs.
Logical replication is more flexible than physical replication, and more fragile. Use it when you need partial replication, cross-version replication, or selective sync. Don't use it for HA.
Physical replication slots make sure replicas can catch up after a disconnect. They also make sure your primary's disk fills if a replica is gone and forgotten.
Read replicas are eventually consistent. The application's assumption that "after I wrote, my read should see it" is often wrong by milliseconds, sometimes by minutes.
Failover is mostly fine when you do not need it and broken when you do. Here is how to know which you have.
If your Postgres disk is growing and you cannot identify the culprit, replication slots are usually the answer. Here is the diagnostic sequence.
Replication monitoring is not one lag number. You need to know stale-read risk, slot retention, replay delay, WAL growth, and whether failover would help or hurt.
Aurora's replica lag has different mechanics than vanilla streaming replication. The dashboard metric "replica lag" can be misleading. Here is what it actually measures.
WAL problems usually show up as disk problems, and by then it is too late. Monitor generation rate, checkpoints, archiving, replication lag, and slot retention before pg_wal owns the incident.
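
Several of the articles above come down to the same arithmetic: how far behind is a replica or slot, measured in bytes of WAL. As a rough sketch of that calculation, the Python below converts PostgreSQL LSN strings (the `X/Y` hex form returned by `pg_current_wal_lsn()` and stored in `pg_replication_slots.restart_lsn`) into byte positions and reports how much WAL each slot is holding back. The function names, the slot names, and the 1 GiB warning threshold are illustrative assumptions, not taken from any of the articles; on the server itself, `pg_wal_lsn_diff()` does the same subtraction.

```python
from typing import Dict, List, Optional, Tuple


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to an absolute byte position.

    The part before the slash is the high 32 bits, the part after is the
    low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff_bytes(newer: str, older: str) -> int:
    """Bytes of WAL between two LSNs (positive when newer is ahead)."""
    return lsn_to_int(newer) - lsn_to_int(older)


def retained_wal_report(
    current_lsn: str,
    slots: Dict[str, Optional[str]],
    warn_bytes: int = 1024 ** 3,  # hypothetical threshold: 1 GiB
) -> List[Tuple[str, int, bool]]:
    """Return (slot_name, retained_bytes, over_threshold), largest first.

    `slots` maps slot name to restart_lsn. Slots whose restart_lsn is
    None have not reserved any WAL yet and are skipped.
    """
    report = []
    for name, restart_lsn in slots.items():
        if restart_lsn is None:
            continue
        retained = lsn_diff_bytes(current_lsn, restart_lsn)
        report.append((name, retained, retained > warn_bytes))
    # Biggest retainer first, since that is the likely disk culprit.
    report.sort(key=lambda row: row[1], reverse=True)
    return report
```

In practice the inputs would come from a query against `pg_replication_slots` on the primary; the sketch keeps them as plain strings so the arithmetic can be checked without a database.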