CPU, Memory, and I/O8 min read

Postgres Checkpoint Tuning: The Latency Spike Nobody Sees Coming

Checkpoint problems show up as periodic latency spikes. The work is smoothing writes, sizing WAL, and proving checkpoints are the thing users feel.

Checkpoint pain often looks like a ghost in the latency graph.

Every so often, writes slow down, query latency climbs, then everything recovers. No deploy. No obvious slow query. CPU may look fine. The missing piece is often that Postgres is flushing dirty pages harder than your storage can absorb comfortably.

Checkpoint tuning is not about eliminating writes. It is about making them predictable.

The real pain is periodic latency nobody owns

Users do not care that the spike came from checkpoint behavior. They care that the app feels randomly slow. Your job is to connect latency spikes to checkpoint timing, WAL volume, write pressure, and storage capacity.

Once the pattern is proven, the fix is usually smoothing: larger WAL budget, sane checkpoint timing, and enough I/O headroom for the workload.

What is actually happening

Postgres does not write your data to disk at the moment of COMMIT. It writes the WAL record (write-ahead log), and the actual table page sits in shared_buffers waiting to be flushed. A checkpoint is the periodic event where Postgres says, "OK, time to actually flush dirty pages from shared_buffers to disk so I do not have to keep them in WAL forever."

When a checkpoint kicks in, it does a burst of disk writes. On default settings, that burst can be enough to saturate your I/O for tens of seconds. Concurrent writes from the application have to wait. Latency climbs. Then the checkpoint finishes, latency returns to normal.

This is a default-settings problem. Postgres ships configured for a small workload. A serious production database needs different defaults.

The four settings that matter

checkpoint_timeout — how often a checkpoint runs by time. Default is 5 minutes. For a busy database, 15 to 30 minutes is more reasonable. Longer means each checkpoint flushes more pages, but it also means fewer checkpoints, less overall write amplification, and better cache hit rates between checkpoints.

max_wal_size — soft cap on WAL accumulated between checkpoints. Default is 1GB, which is comically low for any database doing real writes. A modern busy database wants 8GB to 64GB. The bigger this is, the less likely a checkpoint is forced by WAL pressure rather than by time.

checkpoint_completion_target — how much of the checkpoint interval to spread the writes across. Default is 0.9. Leave it. Older guides talk about tuning this; modern Postgres handles it well by default.

bgwriter_lru_maxpages and bgwriter_delay — the background writer flushes a small number of dirty pages continuously, smoothing out the work the checkpoint has to do. Bumping bgwriter_lru_maxpages from the default 100 to 500 or 1000 means more pages flushed continuously, less to flush at checkpoint time.

The shape I usually settle on

For a busy production database (RDS or self-managed, doesn't matter):

ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET bgwriter_lru_maxpages = 500;
ALTER SYSTEM SET bgwriter_delay = '20ms';

SELECT pg_reload_conf();

These are starting points. The exact values depend on workload.

How to know if you are checkpoint-bound

The single most useful query:

SELECT * FROM pg_stat_bgwriter;

Look at:

  • checkpoints_timed vs checkpoints_req: the first is checkpoints triggered by checkpoint_timeout, the second by max_wal_size. If checkpoints_req is more than 10% of total, your max_wal_size is too low — checkpoints are being forced by WAL pressure.
  • buffers_checkpoint: pages written by checkpoints. Compare to buffers_clean (background writer) and buffers_backend (writes done synchronously by the queries themselves). If buffers_backend is the dominant number, the application is doing the work the bgwriter and checkpointer should be doing — that is bad.
  • checkpoint_write_time and checkpoint_sync_time: how long the average checkpoint spent writing and syncing. If sync time is large, your underlying disk is the bottleneck.

A healthy database has most checkpoints _timed, low buffers_backend, and checkpoint_write_time consuming most of the checkpoint interval (i.e., spread out, not bursty).

A real diagnostic walkthrough

The last time I tuned this, the customer's metric was "writes get slow for 40 seconds every 8 minutes." pg_stat_bgwriter showed:

  • checkpoints_timed: 200, checkpoints_req: 1100. The 1:5 ratio meant max_wal_size was the binding constraint, not the timer.
  • buffers_backend was 4x buffers_clean. Background writer was barely keeping up.
  • checkpoint_sync_time averaged 38 seconds per checkpoint.

The fix:

  1. Bumped max_wal_size from 1GB to 16GB. Checkpoints are now mostly time-triggered.
  2. Bumped bgwriter_lru_maxpages from 100 to 1000. Background writer doing more work.
  3. Bumped checkpoint_timeout from 5 minutes to 15 minutes.

The write latency spikes disappeared. Total disk writes per hour went down because fewer checkpoints meant less write amplification of frequently changed pages.

The thing nobody tells you

Increasing checkpoint_timeout and max_wal_size makes recovery time after a crash longer. Postgres has to replay all WAL accumulated since the last checkpoint. If your last checkpoint was 30 minutes ago and you wrote 64GB of WAL, recovery could take several minutes.

For most teams this is fine — Postgres crashes are rare and a few-minute recovery time is acceptable. If you have stricter recovery time objectives, the conversation moves to replication and failover, not to shorter checkpoints.

The bottom line

If your write latency has periodic spikes that defy explanation, look at checkpoints first. The default settings are wrong for any serious workload. Four settings, one reload, problem solved most of the time. The remaining cases are usually about disk I/O capacity, which is a different conversation entirely.