Postgres Checkpoint Tuning: The Latency Spike Nobody Sees Coming

Latency climbs for 30 seconds, then recovers. CPU is fine. Queries are fine. Nobody deployed anything. Welcome to the checkpoint problem, and the four settings that make it go away.

There is a very specific shape to a checkpoint problem. The application is fine for a while. Then write latency climbs for 30 to 60 seconds. Then it recovers. CPU is not maxed. Queries are not obviously worse. Nobody deployed anything. The shape repeats every 5 to 15 minutes, give or take.

The first time I saw this in production, we spent two days blaming the application before we put pg_stat_bgwriter on the same graph as request latency. The two were perfectly correlated. The fix took 20 minutes once we knew where to look.

What is actually happening

Postgres does not write your table data to disk at the moment of COMMIT. It writes the WAL record (write-ahead log), and the modified table page sits in shared_buffers as a dirty page waiting to be flushed. A checkpoint is the periodic event where Postgres says, "OK, time to actually flush the dirty pages from shared_buffers to disk so the WAL written before this point can be recycled."

When a checkpoint kicks in, it does a burst of disk writes. On default settings, that burst can be enough to saturate your I/O for tens of seconds. Concurrent writes from the application have to wait. Latency climbs. Then the checkpoint finishes, latency returns to normal.
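You can watch this happening directly. log_checkpoints (on by default since Postgres 15, worth enabling on older versions) writes a summary line for every checkpoint, and those timestamps will line up with your latency spikes. The log line below is illustrative, not copied from a real system:

ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();

-- LOG:  checkpoint complete: wrote 48231 buffers (23.0%); ...
--       write=28.4 s, sync=9.7 s, total=38.6 s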

This is a default-settings problem. Postgres ships configured for a small workload. A serious production database needs different defaults.

The four settings that matter

checkpoint_timeout — how often a checkpoint runs by time. Default is 5 minutes. For a busy database, 15 to 30 minutes is more reasonable. Longer means each checkpoint flushes more pages, but it also means fewer checkpoints and less overall write amplification: a hot page gets flushed once per cycle instead of once every 5 minutes, and fewer full-page images go into the WAL.

max_wal_size — soft cap on WAL accumulated between checkpoints. Default is 1GB, which is comically low for any database doing real writes. A modern busy database wants 8GB to 64GB. The bigger this is, the less likely a checkpoint is forced by WAL pressure rather than by time.

checkpoint_completion_target — what fraction of the checkpoint interval the writes are spread across. The default has been 0.9 since Postgres 14. Leave it. Older guides tell you to raise it from the old 0.5 default; modern Postgres already ships with the right value.

bgwriter_lru_maxpages and bgwriter_delay — the background writer flushes a small number of dirty pages continuously, smoothing out the work the checkpoint has to do. Bumping bgwriter_lru_maxpages from the default 100 to 500 or 1000 lets each pass flush more pages, and lowering bgwriter_delay from the default 200ms makes those passes more frequent. More flushed continuously means less to flush at checkpoint time.
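Before touching anything, grab the values you are currently running; pg_settings shows each live setting with its unit:

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'max_wal_size',
               'checkpoint_completion_target',
               'bgwriter_lru_maxpages', 'bgwriter_delay');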

The shape I usually settle on

For a busy production database, self-managed or RDS (on RDS you set the same values through the parameter group, since ALTER SYSTEM is not available there):

ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET bgwriter_lru_maxpages = 500;
ALTER SYSTEM SET bgwriter_delay = '20ms';

SELECT pg_reload_conf();

These are starting points. The exact values depend on workload.
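After the reload, the same pg_settings view confirms the new values took effect; all five parameters are reloadable, so pending_restart should be false for each:

SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'max_wal_size',
               'checkpoint_completion_target',
               'bgwriter_lru_maxpages', 'bgwriter_delay');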

How to know if you are checkpoint-bound

The single most useful query:

SELECT * FROM pg_stat_bgwriter;

Look at:

  • checkpoints_timed vs checkpoints_req: the first is checkpoints triggered by checkpoint_timeout, the second by max_wal_size. If checkpoints_req is more than 10% of total, your max_wal_size is too low — checkpoints are being forced by WAL pressure.
  • buffers_checkpoint: pages written by checkpoints. Compare to buffers_clean (background writer) and buffers_backend (writes done synchronously by the queries themselves). If buffers_backend is the dominant number, the application is doing the work the bgwriter and checkpointer should be doing — that is bad.
  • checkpoint_write_time and checkpoint_sync_time: cumulative milliseconds spent writing and syncing across all checkpoints since the stats were last reset; divide by the checkpoint count for a per-checkpoint figure. If sync time is large, your underlying disk is the bottleneck.

A healthy database has mostly timed checkpoints, low buffers_backend, and per-checkpoint write time that spans most of the checkpoint interval (i.e., the writes are spread out, not bursty).
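If you prefer the ratios computed for you, something like this works on Postgres 16 and earlier (on 17 and newer, the checkpoint counters moved to pg_stat_checkpointer and the backend write counters effectively moved to pg_stat_io, so adjust the view and column names):

SELECT checkpoints_timed,
       checkpoints_req,
       round(100.0 * checkpoints_req
             / nullif(checkpoints_timed + checkpoints_req, 0), 1) AS pct_requested,
       buffers_checkpoint,
       buffers_clean,
       buffers_backend
FROM pg_stat_bgwriter;
-- pct_requested much above 10 means max_wal_size is forcing checkpoints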

A real diagnostic walkthrough

The last time I tuned this, the customer's metric was "writes get slow for 40 seconds every 8 minutes." pg_stat_bgwriter showed:

  • checkpoints_timed: 200, checkpoints_req: 1100. Nearly six forced checkpoints for every timed one meant max_wal_size was the binding constraint, not the timer.
  • buffers_backend was 4x buffers_clean. Background writer was barely keeping up.
  • checkpoint_sync_time averaged 38 seconds per checkpoint.

The fix:

  1. Bumped max_wal_size from 1GB to 16GB. Checkpoints are now mostly time-triggered.
  2. Bumped bgwriter_lru_maxpages from 100 to 1000. Background writer doing more work.
  3. Bumped checkpoint_timeout from 5 minutes to 15 minutes.

The write latency spikes disappeared. Total disk writes per hour went down because fewer checkpoints meant less write amplification of frequently changed pages.
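One practical note: the pg_stat_bgwriter counters are cumulative since the last stats reset, so the old bad ratios will linger after a config change. Resetting them gives you a clean before/after comparison:

SELECT pg_stat_reset_shared('bgwriter');
-- on Postgres 17+, also reset the checkpointer counters:
-- SELECT pg_stat_reset_shared('checkpointer');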

The thing nobody tells you

Increasing checkpoint_timeout and max_wal_size makes recovery time after a crash longer. Postgres has to replay all WAL accumulated since the last checkpoint. If your last checkpoint was 30 minutes ago and you wrote 64GB of WAL, recovery could take several minutes.
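To get a feel for the exposure on your own system, you can check how much WAL currently sits between the last checkpoint's redo pointer and the current write position; roughly speaking, that is what crash recovery would have to replay if the server died right now:

SELECT checkpoint_time,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn))
         AS wal_since_last_checkpoint
FROM pg_control_checkpoint();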

For most teams this is fine — Postgres crashes are rare and a few-minute recovery time is acceptable. If you have stricter recovery time objectives, the conversation moves to replication and failover, not to shorter checkpoints.

The bottom line

If your write latency has periodic spikes that defy explanation, look at checkpoints first. The default settings are wrong for any serious workload. Four settings, one reload, problem solved most of the time. The remaining cases are usually about disk I/O capacity, which is a different conversation entirely.