AWS Aurora Postgres Replica Lag: Different from Vanilla, Different to Diagnose
Aurora's replica lag has different mechanics than vanilla streaming replication. The dashboard metric "replica lag" can be misleading. Here is what it actually measures.
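For contrast, here is how lag is usually read on vanilla streaming replication; a minimal psycopg2 sketch (the DSN is a placeholder), showing the readings that Aurora's dashboard number does not simply map onto:

```python
import psycopg2

def lag_on_replica(dsn):
    """On a streaming replica: wall-clock time since the last replayed commit.
    Caveat: this grows on an idle primary even when real lag is zero."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT now() - pg_last_xact_replay_timestamp()")
        return cur.fetchone()[0]

def lag_on_primary(dsn):
    """On the primary (Postgres 10+): per-replica write/flush/replay lag."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT application_name, write_lag, flush_lag, replay_lag"
                    " FROM pg_stat_replication")
        return cur.fetchall()
```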
Notes on the problems that show up after launch: bad plans, awkward migrations, index debt, vacuum pressure, replica lag, and the small decisions that make PostgreSQL easier to operate.
Lock incidents look mysterious until you map the blockers. Start with who is waiting, who is holding, and whether the application creates the pattern.
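A starting point, sketched with psycopg2 around pg_blocking_pids() (Postgres 9.6+); the connection string is a placeholder:

```python
import psycopg2

# Pair every waiting backend with the backend(s) blocking it.
BLOCKERS_SQL = """
SELECT waiting.pid    AS waiting_pid,
       waiting.query  AS waiting_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query,
       blocking.state AS blocking_state
FROM pg_stat_activity AS waiting
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(waiting.pid))
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(BLOCKERS_SQL)
    for row in cur.fetchall():
        print(row)
```

When blocking_state reads "idle in transaction", the application is creating the pattern: a transaction was opened, and then the code wandered off.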
Cloud SQL maintenance windows are mostly fine and occasionally not. Here is what happens during them, what gets restarted, and how to make sure your application survives.
Azure Flex's defaults are conservative. The Server parameters blade is where most of the meaningful tuning happens. Here are the parameters I always touch.
Postgres on Kubernetes is feasible now in a way it was not five years ago. The operators are mature, the storage is good enough, and the failure modes are tractable. Here is what to know.
N+1 is the most common ORM-induced performance bug. The query count tells the story in production; the application code makes it easy to miss at review time.
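The shape of the bug, stripped of ORM syntax (plain psycopg2; the posts and comments tables are hypothetical):

```python
def n_plus_one(conn, post_ids):
    """1 query for the list, then 1 query per post: N+1 round trips."""
    comments = {}
    with conn.cursor() as cur:
        for post_id in post_ids:
            cur.execute("SELECT body FROM comments WHERE post_id = %s",
                        (post_id,))
            comments[post_id] = cur.fetchall()
    return comments

def single_query(conn, post_ids):
    """One round trip: fetch everything, group in the application."""
    comments = {pid: [] for pid in post_ids}
    with conn.cursor() as cur:
        cur.execute("SELECT post_id, body FROM comments"
                    " WHERE post_id = ANY(%s)", (list(post_ids),))
        for post_id, body in cur.fetchall():
            comments[post_id].append(body)
    return comments
```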
Prepared statements skip the planning step on repeated execution. Sometimes that is a 5x speedup. Sometimes it is a 50x slowdown. Knowing the difference matters.
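A sketch of the trap (the events table is hypothetical; the escape hatch needs Postgres 12+):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("PREPARE by_status AS"
                " SELECT count(*) FROM events WHERE status = $1")
    # The first five EXECUTEs are planned against the actual parameter
    # (custom plans). From the sixth on, the planner may switch to a cached
    # generic plan; if 'status' is heavily skewed, that plan can be far worse.
    for _ in range(10):
        cur.execute("EXECUTE by_status(%s)", ("rare_status",))
        print(cur.fetchone()[0])
    # Postgres 12+: opt this session out of generic plans when skew bites.
    cur.execute("SET plan_cache_mode = force_custom_plan")
```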
Wrapping a transaction around an external API call sounds careful and is actually one of the worst patterns in production database code.
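Both shapes, sketched with psycopg2 and requests; the orders table and the payment endpoint are stand-ins:

```python
import psycopg2
import requests

def bad(conn, order_id):
    """Holds a transaction (and its row locks) open across a network call.
    A slow or hung API now blocks vacuum and every queued writer."""
    with conn:  # transaction spans the HTTP request
        with conn.cursor() as cur:
            cur.execute("UPDATE orders SET state = 'charging' WHERE id = %s",
                        (order_id,))
            requests.post("https://payments.example.com/charge",
                          json={"order": order_id}, timeout=30)
            cur.execute("UPDATE orders SET state = 'charged' WHERE id = %s",
                        (order_id,))

def better(conn, order_id):
    """Two short transactions; the API call happens between them."""
    with conn:
        with conn.cursor() as cur:
            cur.execute("UPDATE orders SET state = 'charging' WHERE id = %s",
                        (order_id,))
    requests.post("https://payments.example.com/charge",
                  json={"order": order_id}, timeout=30)
    with conn:
        with conn.cursor() as cur:
            cur.execute("UPDATE orders SET state = 'charged' WHERE id = %s",
                        (order_id,))
```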
Backups matter only if the restore path is known. Choose logical or physical backups based on recovery goals, WAL history, and rehearsal discipline.
Most application-side retry logic is wrong. It either retries everything (and corrupts data) or nothing (and surfaces transient failures to users). Here is the right framework.
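A sketch of that framework: retry only the errors Postgres has already rolled back, with a bounded budget and backoff. Everything here besides the two SQLSTATEs is illustrative:

```python
import time
import psycopg2

RETRYABLE = {"40001", "40P01"}  # serialization_failure, deadlock_detected

def run_transaction(conn, fn, attempts=3):
    """fn must touch only the database; external side effects inside fn
    make retries unsafe (that is the 'retries everything' failure mode)."""
    for attempt in range(1, attempts + 1):
        try:
            with conn:  # one transaction per attempt; rolls back on error
                with conn.cursor() as cur:
                    return fn(cur)
        except psycopg2.Error as exc:
            if exc.pgcode in RETRYABLE and attempt < attempts:
                time.sleep(0.1 * 2 ** attempt)  # backoff before retrying
                continue
            raise  # not retryable, or out of budget
```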
A batch job that worked fine in dev can saturate production. The fixes are not exotic — chunk size, lock duration, retry budget — but most teams skip them.
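A backfill sketch that bounds chunk size and lock duration; pair it with the retry budget from the previous note. The items table and columns are hypothetical:

```python
import time
import psycopg2

def backfill(conn, chunk_size=5000, pause_s=0.5):
    while True:
        with conn:  # each chunk commits, and releases locks, independently
            with conn.cursor() as cur:
                cur.execute(
                    """
                    UPDATE items SET normalized = lower(raw)
                    WHERE id IN (
                        SELECT id FROM items
                        WHERE normalized IS NULL
                        LIMIT %s
                    )
                    """,
                    (chunk_size,),
                )
                if cur.rowcount == 0:
                    return  # nothing left to do
        time.sleep(pause_s)  # let replicas and vacuum breathe
```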
Using a Postgres table as a job queue used to be a recipe for contention. SKIP LOCKED makes it tractable. Here is the pattern that actually scales.
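The core of the pattern (FOR UPDATE SKIP LOCKED, Postgres 9.5+), with a hypothetical jobs table:

```python
def work_one(conn, handle):
    """Claim, process, and complete one job in a single transaction.
    If the worker dies mid-job, the lock drops and the row is re-claimable."""
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, payload FROM jobs
                WHERE state = 'queued'
                ORDER BY created_at
                FOR UPDATE SKIP LOCKED
                LIMIT 1
                """
            )
            row = cur.fetchone()
            if row is None:
                return False  # queue empty
            job_id, payload = row
            handle(payload)  # caveat: long handlers hold the transaction open
            cur.execute("UPDATE jobs SET state = 'done' WHERE id = %s",
                        (job_id,))
            return True
```

SKIP LOCKED is what makes this scale: workers skip rows another worker holds instead of queuing behind them, so adding workers adds throughput rather than contention.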