pgbench for Load Testing: Useful, Limited, and Often Misinterpreted

pgbench measures Postgres throughput under a synthetic workload. It tells you something useful, but only if you understand what its numbers mean.

pgbench is the standard tool for benchmarking Postgres. It runs a configurable workload against the database and reports transactions per second. The number it gives you is precise. What that number means in production terms is less precise.

Here is what pgbench is good for, what it is not, and how to use it without misleading yourself.

The basics

# Initialize the test database
pgbench -i -s 100 testdb

# Run a 5-minute test with 50 concurrent clients
pgbench -c 50 -j 4 -T 300 testdb

Flags:

  • -s 100: scale factor. Creates ~10M rows in pgbench_accounts (100,000 rows per unit of scale).
  • -c 50: concurrent clients (database connections).
  • -j 4: parallel threads on the client side.
  • -T 300: run duration in seconds.

Output:

latency average = 5.234 ms
tps = 9551.234567 (without initial connection time)

The latency is per-transaction average. The TPS is total transactions per second.
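
As a sanity check, the two are linked through the client count (Little's law): with 50 clients each running one transaction at a time,

tps ≈ clients / latency = 50 / 0.005234 s ≈ 9553

which lines up with the sample output above.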

What pgbench's default workload measures

The default workload is a TPC-B-style banking scenario: update an account balance, read it back, update teller and branch totals, and insert a history record. Specifically:

BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES ...;
END;

This is a write-heavy mixed workload. It exercises B-tree index maintenance, WAL generation, and lock contention (pgbench_branches has one row per unit of scale, so branch updates serialize at low scale factors).

The TPS number reflects this specific shape. It does NOT reflect:

  • Your application's actual queries.
  • Your data shape or distribution.
  • Your read/write ratio.

Useful as a relative number across configurations. Misleading if interpreted as "how fast my Postgres is."
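
For a read-only point of comparison, pgbench ships a built-in select-only script (-S) that runs a single primary-key SELECT per transaction; the gap between that number and the default's shows roughly what the writes cost on your setup.

# Built-in read-only workload: one indexed SELECT per transaction
pgbench -c 50 -j 4 -T 300 -S testdb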

Custom workloads

For more realistic testing, write a custom script:

-- ~/my_workload.sql
\set cid random(1, 1000000)
SELECT * FROM orders WHERE customer_id = :cid;
SELECT * FROM customers WHERE id = :cid;

Run with:

pgbench -c 50 -j 4 -T 300 -f ~/my_workload.sql testdb

Now pgbench measures your specific query mix. The TPS number reflects your actual workload's throughput potential.

For weighted multi-script tests:

pgbench -c 50 -j 4 -T 300 \
  -f read_workload.sql@70 \
  -f write_workload.sql@30 \
  testdb

70% of transactions run the read script, 30% the write script.
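
The contents of the two scripts are up to you. A minimal sketch, assuming the same hypothetical orders schema as above (column names are illustrative):

-- read_workload.sql (hypothetical)
\set cid random(1, 1000000)
SELECT * FROM orders WHERE customer_id = :cid;

-- write_workload.sql (hypothetical)
\set cid random(1, 1000000)
\set amount random(1, 500)
INSERT INTO orders (customer_id, amount, created_at)
  VALUES (:cid, :amount, now());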

What the latency tells you

The "latency average" is per-transaction mean latency. This is interesting but rarely the metric to optimize.

More useful is the latency distribution:

pgbench -c 50 -j 4 -T 300 --report-per-command testdb

With --report-per-command (-r), pgbench reports per-statement average latency, which lets you see which parts of the transaction are slow.

For p99 latency, log every transaction with -l:

pgbench -c 50 -j 4 -T 300 -l testdb

This writes one log line per transaction (files named pgbench_log.<pid> by default), each containing the transaction's elapsed time in microseconds. Note that --aggregate-interval produces per-interval summaries (count, mean, min, max); those are good for spotting dips over time but cannot be turned into percentiles, so p99 needs the raw log.
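
A minimal post-processing sketch, assuming the default log prefix and the standard per-transaction log format (latency is the third field, in microseconds):

# Compute p99 from raw pgbench logs
cat pgbench_log.* | awk '{print $3}' | sort -n \
  | awk '{v[NR]=$1} END {print "p99 =", v[int(NR*0.99)]/1000, "ms"}'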

Common misuses

Comparing to internet benchmarks. "This blog says Postgres does 100k TPS, why am I only getting 10k?" The blog ran a different workload on different hardware with different settings. Numbers across setups are not comparable.

Treating TPS as a hardware property. TPS depends on workload shape, dataset size, settings, hardware, and concurrency level. Changing any one of them can move TPS by an order of magnitude. There is no single "this database does X TPS."

Running too short. A 30-second pgbench run measures cache-warm, checkpoint-free performance. Real workloads see cold caches, checkpoint pauses, and vacuum runs. Run for at least 5 minutes (at or beyond the default checkpoint_timeout) so the number includes at least one checkpoint.
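
Adding -P prints per-interval throughput and latency while the test runs, which makes checkpoint and vacuum dips visible:

# Report progress every 10 seconds; watch for periodic TPS dips
pgbench -c 50 -j 4 -T 600 -P 10 testdb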

Running on the same machine as Postgres. pgbench's own CPU usage competes with Postgres for cores. On a CPU-bound server the measured TPS comes out lower than what a remote client would see; run pgbench from a separate machine when the absolute number matters.

Useful pgbench patterns

Comparing settings. Run pgbench with one config, change a setting, run again. The TPS delta tells you whether the setting helped.

Must hold everything else constant: same hardware, same data, same workload, same duration.
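
A minimal A/B sketch, assuming superuser access on a disposable test instance (work_mem happens to be reloadable; many settings need a restart instead):

# Baseline
pgbench -c 50 -j 4 -T 300 -f my_workload.sql testdb | tee baseline.txt

# Change exactly one setting
psql -d testdb -c "ALTER SYSTEM SET work_mem = '64MB'"
psql -d testdb -c "SELECT pg_reload_conf()"

# Re-run, everything else held constant
pgbench -c 50 -j 4 -T 300 -f my_workload.sql testdb | tee tuned.txt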

Stress testing connection limits. Increase -c stepwise to find the point where throughput flattens and latency climbs, as in the sweep below.
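
A sketch, assuming the server's max_connections accommodates the largest value in the sweep:

# Sweep client counts; watch where tps stops scaling
for c in 50 100 200 400 800; do
  echo "=== clients: $c ==="
  pgbench -c "$c" -j 8 -T 120 testdb | grep -E 'latency average|^tps'
done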

Identifying the I/O ceiling. Run pgbench with a dataset that does not fit in memory. Once the working set overflows shared_buffers and the OS page cache, TPS converges toward what the disk can serve in random I/O.
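
Rough sizing: each unit of scale adds 100,000 accounts rows, on the order of 15 MB with indexes (the exact figure varies by version), so on a 64 GB machine something like scale 5000 guarantees disk traffic:

# ~75 GB of data on a 64 GB machine: the cache cannot hold it
pgbench -i -s 5000 testdb
pgbench -c 50 -j 4 -T 600 testdb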

Replica lag testing. Run a write workload via pgbench against the primary; measure replica lag (pg_stat_replication.replay_lag) under sustained load. Tells you the lag profile under realistic write rates.
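
A sketch of the measurement loop, assuming psql access to the primary and at least one attached replica:

# Terminal 1: sustained write load against the primary
pgbench -c 50 -j 4 -T 600 testdb

# Terminal 2: sample replica lag every 5 seconds
while true; do
  psql -d testdb -Atc \
    "SELECT application_name, replay_lag FROM pg_stat_replication"
  sleep 5
done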

What pgbench cannot tell you

  • How your application will actually perform. Application logic, network latency, ORM overhead, and many other factors are outside Postgres.
  • How a real outage will look. pgbench is steady-state. Real workloads have spikes, failures, and weird shapes.
  • The right index strategy. TPS does not tell you whether you should add an index; query patterns and EXPLAIN do.

For those questions, real production traffic with proper monitoring is the only honest answer.

When to use it

The right use cases:

  • Comparing configurations. "Does increasing work_mem help my workload?" Run pgbench before and after.
  • Capacity sizing. "This instance can handle X TPS sustained — is that enough?" Run pgbench at the target load and see if latency stays acceptable.
  • Tuning a specific workload. Scripted workload that mimics your application; measure before and after each change.

What I do for capacity testing

For a new Postgres setup that I want to size correctly:

  1. Estimate target workload: requests per second, read/write ratio, concurrent connections.
  2. Build a custom pgbench script approximating that workload.
  3. Run at target load for 30+ minutes: measure TPS, latency p99, CPU, I/O (see the rate-limited sketch after this list).
  4. Run at 2x target load: confirm the system has headroom.
  5. Run at 5x target load: identify the breaking point.
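
Holding a fixed rate (instead of running flat out) is what -R is for: pgbench schedules transactions at the target rate and reports how far execution falls behind schedule. A sketch of step 3, assuming a target of 2,000 TPS and the custom script from earlier:

# Step 3: hold 2,000 TPS for 30 minutes; count transactions slower than 50 ms
pgbench -c 50 -j 4 -T 1800 -R 2000 --latency-limit=50 \
  -f my_workload.sql testdb

For steps 4 and 5, raise -R accordingly; the breaking point announces itself as schedule lag that grows without recovering.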

The results tell me whether the instance is sized correctly, where the bottleneck is, and how much headroom exists for growth.

This is more rigor than most teams apply. It catches sizing mistakes before they become production problems.