Google Cloud SQL applies maintenance updates within a configurable maintenance window: a recurring one-hour slot on a day and time of your choosing. The actual maintenance is mostly transparent — until it isn't.
The last team I helped through a Cloud SQL maintenance issue learned that the application's connection pool did not handle the brief failover gracefully. Five minutes of partial outage during what should have been a routine update.
Here is what happens, what gets restarted, and how to make sure your stack handles it cleanly.
What "maintenance" means on Cloud SQL
Google pushes patches and upgrades into the Cloud SQL fleet. For your specific instance, maintenance lands during your configured window. The update typically:
- Applies to the underlying Postgres binary or the surrounding Cloud SQL infrastructure.
- Requires a brief restart of the database process.
- Lasts a few minutes for most updates; can be 15-30 minutes for major updates.
- Triggers a brief failover if the instance is in an HA configuration (the old primary is patched, then rejoins as the standby).
The instance is unavailable for the duration of the restart. Connections drop. New connections fail until the database accepts them again.
Configuring the window
In the Cloud SQL console:
- Maintenance window: pick a day of the week and an hour. Choose a low-traffic window. If you never pick one, Cloud SQL defaults to choosing the timing itself.
- Order of update: "Earlier" or "Later". Earlier is the canary track; those instances get updates ahead of "Later" ones, which makes it useful for staging environments to catch issues before they reach production.
- Notification preferences: opt in and Cloud SQL emails you about a week before maintenance is scheduled. The email tells you what will be done, and you can defer or reschedule if the timing is bad.
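You can also check for pending maintenance programmatically instead of relying on the email. A minimal sketch using google-api-python-client against the Cloud SQL Admin API; the project and instance names are placeholders:

```python
from googleapiclient import discovery

# Cloud SQL Admin API client; authenticates via Application Default Credentials.
sqladmin = discovery.build("sqladmin", "v1")

# Placeholder project/instance names; substitute your own.
instance = sqladmin.instances().get(
    project="my-project", instance="my-instance"
).execute()

# scheduledMaintenance only appears on the resource when something is scheduled.
pending = instance.get("scheduledMaintenance")
if pending:
    print("Maintenance scheduled for:", pending["startTime"])
else:
    print("No maintenance currently scheduled.")
```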
For a production database, my preferences:
- Window: Sunday 02:00-03:00 local time (or whatever your lowest-traffic time is).
- Order: Later (production is the conservative choice).
- Notifications: enabled, with the email going to the on-call rotation.
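The same settings can be applied through the Admin API, which is handy when you manage many instances. A sketch with the same placeholder names; note that the API expresses the hour in UTC and names the tracks "canary" (Earlier) and "stable" (Later):

```python
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1")

# Sunday 02:00 UTC on the conservative track.
# API convention: day is 1-7 starting Monday, hour is 0-23 in UTC.
body = {
    "settings": {
        "maintenanceWindow": {
            "day": 7,                 # Sunday
            "hour": 2,                # 02:00 UTC
            "updateTrack": "stable",  # "Later" in the console
        }
    }
}

sqladmin.instances().patch(
    project="my-project", instance="my-instance", body=body
).execute()
```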
What happens during the restart
In order:
1. Cloud SQL initiates a failover (if HA is configured), promoting the standby to primary.
2. New connections start landing on the new primary. Existing connections are dropped.
3. The old primary is patched and brought back as the new standby.
4. Optionally, Cloud SQL can fail back (rare; not the default).
For non-HA instances, step 1 is skipped. The instance is just restarted in place; downtime is the time it takes to stop, patch, start.
For HA instances, the actual user-visible downtime is the failover window — typically 30-90 seconds for a clean handover.
What the application sees
During the maintenance:
- Open connections close abruptly. Pending queries error.
- New connections to the writer endpoint fail for a brief window during failover.
- The application's connection pool gets refilled with connections to the new primary.
The application has to:
- Detect dropped connections and retry on a fresh connection.
- Handle a brief window where connections fail entirely.
- Not corrupt state if a transaction was interrupted.
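In driver terms, that means catching the connection-level error class, discarding the connection, and retrying on a fresh one with backoff. A minimal sketch using psycopg2; the DSN, attempt count, and backoff values are placeholder choices, not recommendations:

```python
import time

import psycopg2

DSN = "host=db.internal.example dbname=app user=app_user"  # placeholder

def run_with_retry(sql, params=None, attempts=5, backoff=2.0):
    """Run one query, reconnecting if the connection drops mid-flight."""
    for attempt in range(attempts):
        conn = None
        try:
            conn = psycopg2.connect(DSN, connect_timeout=5)
            with conn, conn.cursor() as cur:  # 'with conn' = one transaction
                cur.execute(sql, params)
                return cur.fetchall() if cur.description else None
        except psycopg2.OperationalError:
            # Dropped connection or the failover window itself: back off, retry.
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (attempt + 1))
        finally:
            if conn is not None:
                conn.close()

rows = run_with_retry("SELECT 1")
```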
Things that go wrong
The failure modes I have seen:
Connection pool not configured for retries. The pool reports the connections as bad, then... gives up. The application throws errors at users until the next request that triggers a fresh connection attempt.
Fix: configure the pool for connection validation and aggressive replacement of dead connections.
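With SQLAlchemy, for example, the two relevant knobs are pool_pre_ping, which validates a connection with a cheap round trip before handing it out, and pool_recycle, which retires connections past a given age. A sketch; the URL and pool sizes are placeholders:

```python
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://app_user:secret@db.internal.example/app",  # placeholder
    pool_pre_ping=True,   # test each pooled connection before use, replace if dead
    pool_recycle=1800,    # retire connections older than 30 minutes
    pool_size=10,
    max_overflow=5,
)
```

pool_pre_ping costs one lightweight round trip per checkout, which is usually a fair price for never handing the application a dead connection.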
Long transactions during maintenance. A transaction that was open when the maintenance started gets rolled back when its connection closes. The application has to detect this and retry the entire operation.
Fix: keep transactions short. Do not start long transactions if maintenance is imminent.
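Retrying the entire operation means wrapping the whole transaction in a function and re-running it from the top, never resuming a half-finished one. A sketch reusing the SQLAlchemy engine from the previous example; mark_job_done is a hypothetical transactional unit that must be safe to re-run:

```python
import time

from sqlalchemy import text
from sqlalchemy.exc import OperationalError

def run_transaction(engine, txn_fn, attempts=3):
    """Re-run an entire transaction function if its connection dies mid-flight."""
    for attempt in range(attempts):
        try:
            with engine.begin() as conn:  # commits on success, rolls back on error
                return txn_fn(conn)
        except OperationalError:
            if attempt == attempts - 1:
                raise
            time.sleep(attempt + 1)

# Must be idempotent or otherwise safe to execute from the top more than once.
def mark_job_done(conn):
    conn.execute(text("UPDATE jobs SET done = true WHERE id = :id"), {"id": 42})

run_transaction(engine, mark_job_done)
```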
Cached query plans tied to the old session. Some libraries cache prepared statements per-connection. After failover, those statements need to be re-prepared on new connections. If the library does not handle this transparently, queries fail with "prepared statement does not exist."
Fix: most modern libraries handle this. Older or custom code might not.
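With psycopg 3, for instance, automatic server-side preparation can be switched off per connection, trading a little parse overhead for immunity to stale prepared-statement state. A sketch of that workaround; the connection details are placeholders:

```python
import psycopg

conn = psycopg.connect("host=db.internal.example dbname=app user=app_user")

# psycopg 3 auto-prepares a statement server-side after it has run
# prepare_threshold times (default 5). None disables auto-preparation,
# so no query depends on prepared-statement state from a dead session.
conn.prepare_threshold = None
```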
Application uses the IP address instead of the DNS name. Cloud SQL connections should use the Cloud SQL connector or the instance's DNS name. IP addresses can change during maintenance.
Fix: use the Cloud SQL Auth Proxy or the appropriate connector for your language.
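With the Cloud SQL Python Connector, the application addresses the instance by its connection name rather than an IP. A sketch assuming the cloud-sql-python-connector and pg8000 packages; the instance connection name and credentials are placeholders:

```python
import sqlalchemy
from google.cloud.sql.connector import Connector

connector = Connector()

def getconn():
    # Instance connection name is project:region:instance (placeholder values).
    return connector.connect(
        "my-project:us-central1:my-instance",
        "pg8000",
        user="app_user",
        password="change-me",
        db="app",
    )

engine = sqlalchemy.create_engine(
    "postgresql+pg8000://",
    creator=getconn,     # the pool pulls fresh connections through the connector
    pool_pre_ping=True,
)
```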
Drilling for maintenance
The right preparation:
- Run a maintenance drill in staging. Manually trigger a failover (the Failover button in the Cloud SQL console, or gcloud sql instances failover). Confirm the application reconnects cleanly.
- Check connection pool config. Validate that dead connections are detected and replaced. Test by killing the database mid-request.
- Set up application monitoring. During maintenance, you want to see the error rate spike briefly and recover. If errors persist, the application has a reconnection issue (a probe sketch follows this list).
- Document the runbook. What to do if maintenance does not complete cleanly. Cloud SQL support contact, escalation procedure.
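For the drill itself, a dumb availability probe is often more useful than dashboards: poll the database once a second and log the transitions, and the outage window reads straight off the output. A sketch with psycopg2 and a placeholder DSN:

```python
import time

import psycopg2

DSN = "host=db.internal.example dbname=app user=app_user connect_timeout=3"

last_ok = None
while True:
    conn = None
    try:
        conn = psycopg2.connect(DSN)
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
        ok = True
    except psycopg2.OperationalError:
        ok = False
    finally:
        if conn is not None:
            conn.close()
    if ok != last_ok:  # log only the UP/DOWN transitions
        print(time.strftime("%H:%M:%S"), "UP" if ok else "DOWN")
        last_ok = ok
    time.sleep(1)
```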
Special case: PostgreSQL major version upgrades
Major version upgrades on Cloud SQL are not maintenance-window operations. They are explicit user actions: "Upgrade my Cloud SQL instance from Postgres 14 to 16."
Major upgrades:
- Are scheduled by you, not automatic.
- Take longer (10-60 minutes typically).
- Have a rollback option (snapshot before, restore if needed; a backup sketch follows this list).
- Affect application compatibility (extension versions, syntax changes, behavior shifts).
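The snapshot step is scriptable: take an on-demand backup through the Admin API right before you trigger the upgrade. A sketch with placeholder names:

```python
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1")

# On-demand backup as the rollback point before a major version upgrade.
sqladmin.backupRuns().insert(
    project="my-project",
    instance="my-instance",
    body={"description": "pre-upgrade rollback point"},
).execute()
```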
For major upgrades, the pre-flight checks I run (see the major version upgrade post) all apply. Do not treat them as routine maintenance.
What I commit to
For any Cloud SQL production database:
- Maintenance window configured for low-traffic time.
- Notification on, going to the right people.
- Quarterly maintenance drill in staging.
- Application connection pool configured for resilience.
- Documented runbook for maintenance windows.
With these in place, routine maintenance is a non-event. Without them, every maintenance window is a small risk that adds up over time.