Postgres Failover Readiness: The Drill That Tells You If You Are Lying to Yourself
Failover is mostly fine when you do not need it and broken when you do. Here is how to know which you have.
Notes for the problems that show up after launch: bad plans, awkward migrations, index debt, vacuum pressure, replica lag, and the small decisions that make PostgreSQL easier to operate.
If your Postgres disk is growing and you cannot identify the culprit, replication slots are usually the answer. Here is the diagnostic sequence.
Most production databases run application traffic as a superuser. This is convenient and wrong. Here is the role hierarchy that takes a few hours to set up and saves you in the worst case.
Replication monitoring is not one lag number. You need to know stale-read risk, slot retention, replay delay, WAL growth, and whether failover would help or hurt.
RLS pushes access control into the database. It is more secure than application-side filters and slightly slower than no filter at all. Here is the framework for using it well.
SSL on Postgres is a one-line config change to enable and a multi-day project to do correctly. The default settings are not good enough.
"Did anyone read the customer table outside expected hours" is a common audit question. The answer is harder to produce than it should be unless you set up auditing deliberately.
Most Postgres connection strings live in places they should not: environment variables, config files, scripts, screenshots in Slack. Here is the discipline for handling them.
RDS exposes hundreds of Postgres parameters through Parameter Groups. About a dozen of them are worth tuning. Here are the ones I always change and why.
Aurora's replica lag works differently from lag in vanilla streaming replication. The dashboard metric "replica lag" can be misleading. Here is what it actually measures.
Lock incidents feel mysterious because the database looks idle while requests wait. The fix starts with blockers, waiters, transaction age, and code paths that take locks in different orders.
Cloud SQL maintenance windows are mostly fine and occasionally not. Here is what happens during them, what gets restarted, and how to make sure your application survives.