Checkpoint pain often looks like a ghost in the latency graph.
Every so often, writes slow down, query latency climbs, then everything recovers. No deploy. No obvious slow query. CPU may look fine. The missing piece is often that Postgres is flushing dirty pages harder than your storage can absorb comfortably.
Checkpoint tuning is not about eliminating writes. It is about making them predictable.
The real pain is periodic latency nobody owns
Users do not care that the spike came from checkpoint behavior. They care that the app feels randomly slow. Your job is to connect latency spikes to checkpoint timing, WAL volume, write pressure, and storage capacity.
Once the pattern is proven, the fix is usually smoothing: larger WAL budget, sane checkpoint timing, and enough I/O headroom for the workload.
What is actually happening
Postgres does not write your data to disk at the moment of COMMIT. It writes the WAL record (write-ahead log), and the actual table page sits in shared_buffers waiting to be flushed. A checkpoint is the periodic event where Postgres says, "OK, time to actually flush dirty pages from shared_buffers to disk so I do not have to keep them in WAL forever."
When a checkpoint kicks in, it does a burst of disk writes. On default settings, that burst can be enough to saturate your I/O for tens of seconds. Concurrent writes from the application have to wait. Latency climbs. Then the checkpoint finishes, latency returns to normal.
This is a default-settings problem. Postgres ships configured for a small workload. A serious production database needs different defaults.
The four settings that matter
checkpoint_timeout — how often a checkpoint runs by time. Default is 5 minutes. For a busy database, 15 to 30 minutes is more reasonable. Longer means each checkpoint flushes more pages, but it also means fewer checkpoints, less overall write amplification, and better cache hit rates between checkpoints.
max_wal_size — soft cap on WAL accumulated between checkpoints. Default is 1GB, which is comically low for any database doing real writes. A modern busy database wants 8GB to 64GB. The bigger this is, the less likely a checkpoint is forced by WAL pressure rather than by time.
checkpoint_completion_target — how much of the checkpoint interval to spread the writes across. Default is 0.9. Leave it. Older guides talk about tuning this; modern Postgres handles it well by default.
bgwriter_lru_maxpages and bgwriter_delay — the background writer flushes a small number of dirty pages continuously, smoothing out the work the checkpoint has to do. Bumping bgwriter_lru_maxpages from the default 100 to 500 or 1000 means more pages flushed continuously, less to flush at checkpoint time.
The shape I usually settle on
For a busy production database (RDS or self-managed, doesn't matter):
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET bgwriter_lru_maxpages = 500;
ALTER SYSTEM SET bgwriter_delay = '20ms';
SELECT pg_reload_conf();
These are starting points. The exact values depend on workload.
How to know if you are checkpoint-bound
The single most useful query:
SELECT * FROM pg_stat_bgwriter;
Look at:
checkpoints_timedvscheckpoints_req: the first is checkpoints triggered bycheckpoint_timeout, the second bymax_wal_size. Ifcheckpoints_reqis more than 10% of total, yourmax_wal_sizeis too low — checkpoints are being forced by WAL pressure.buffers_checkpoint: pages written by checkpoints. Compare tobuffers_clean(background writer) andbuffers_backend(writes done synchronously by the queries themselves). Ifbuffers_backendis the dominant number, the application is doing the work the bgwriter and checkpointer should be doing — that is bad.checkpoint_write_timeandcheckpoint_sync_time: how long the average checkpoint spent writing and syncing. If sync time is large, your underlying disk is the bottleneck.
A healthy database has most checkpoints _timed, low buffers_backend, and checkpoint_write_time consuming most of the checkpoint interval (i.e., spread out, not bursty).
A real diagnostic walkthrough
The last time I tuned this, the customer's metric was "writes get slow for 40 seconds every 8 minutes." pg_stat_bgwriter showed:
checkpoints_timed: 200,checkpoints_req: 1100. The 1:5 ratio meantmax_wal_sizewas the binding constraint, not the timer.buffers_backendwas 4xbuffers_clean. Background writer was barely keeping up.checkpoint_sync_timeaveraged 38 seconds per checkpoint.
The fix:
- Bumped
max_wal_sizefrom 1GB to 16GB. Checkpoints are now mostly time-triggered. - Bumped
bgwriter_lru_maxpagesfrom 100 to 1000. Background writer doing more work. - Bumped
checkpoint_timeoutfrom 5 minutes to 15 minutes.
The write latency spikes disappeared. Total disk writes per hour went down because fewer checkpoints meant less write amplification of frequently changed pages.
The thing nobody tells you
Increasing checkpoint_timeout and max_wal_size makes recovery time after a crash longer. Postgres has to replay all WAL accumulated since the last checkpoint. If your last checkpoint was 30 minutes ago and you wrote 64GB of WAL, recovery could take several minutes.
For most teams this is fine — Postgres crashes are rare and a few-minute recovery time is acceptable. If you have stricter recovery time objectives, the conversation moves to replication and failover, not to shorter checkpoints.
The bottom line
If your write latency has periodic spikes that defy explanation, look at checkpoints first. The default settings are wrong for any serious workload. Four settings, one reload, problem solved most of the time. The remaining cases are usually about disk I/O capacity, which is a different conversation entirely.