RDS gives you a Parameter Group with several hundred Postgres parameters. The default values are tuned for a generic mid-sized workload, which means they are slightly wrong for everyone. Most teams I work with leave them alone and live with the defaults; some change a couple based on a guide they found; very few do a deliberate review.
A deliberate review is worth a few hours of one engineer's time. The win is usually a 20-50% improvement in workload throughput at no infrastructure cost. Here are the parameters I always look at, and the rationale for each.
Memory: shared_buffers, work_mem, effective_cache_size
shared_buffers — Postgres's main page cache. Default on RDS is {DBInstanceClassMemory/32768} which works out to ~25% of memory on most instances. Reasonable.
For read-heavy workloads on instances with >16GB RAM, bumping to 30-35% sometimes helps. For write-heavy workloads, leaving at 25% is usually right because more shared buffers means more dirty pages to flush at checkpoint time.
On smaller instances (<8GB), the default tends to be slightly low. Move to 30%.
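Before moving shared_buffers in either direction, it is worth checking whether the cache is actually the bottleneck. One quick sanity check (a sketch, not a complete methodology) is the buffer cache hit ratio from pg_stat_database:

```sql
-- Buffer cache hit ratio per database. Sustained values well below
-- ~0.99 on a read-heavy workload suggest the working set does not
-- fit in shared_buffers; values near 1.0 mean raising it is unlikely
-- to help.
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS hit_ratio
FROM pg_stat_database
WHERE datname NOT LIKE 'template%'
ORDER BY blks_read DESC;
```

Note these counters are cumulative since the last stats reset, so compare snapshots over time rather than reading a single absolute value.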
work_mem — per-operation memory for sorts and hash tables. RDS default is 4MB. This is too low for almost any production workload.
For OLTP, set globally to 16-32MB. For dashboards/analytics, set per-session via SET LOCAL work_mem = '256MB' for the heavy queries.
The trap: work_mem is per-operation. A query with 5 hash joins can allocate 5x work_mem. Multiplied by concurrent connections, the total memory pressure grows fast. Do not set work_mem globally above what your peak concurrent connection count can absorb.
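To tell whether the current work_mem is actually too low, look for disk spills rather than guessing. One approach is the temp-file counters, since Postgres writes a temp file whenever a sort or hash exceeds work_mem:

```sql
-- Cumulative temp-file activity per database. High and growing
-- counters mean some queries are spilling sorts/hashes to disk,
-- i.e. work_mem is too small for them.
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE temp_files > 0
ORDER BY temp_bytes DESC;
```

For a specific query, EXPLAIN (ANALYZE) showing "Sort Method: external merge" tells you the same thing at statement level.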
effective_cache_size — the planner's estimate of total OS+Postgres cache. RDS default is {DBInstanceClassMemory*3/4}, which is about right.
This does not allocate memory; it is purely a planner hint. The planner uses it to decide whether to expect index pages to be cached. Setting it correctly produces better plans, especially for index-only scans.
Concurrency: max_connections, max_worker_processes
max_connections — RDS scales this with instance size, typically LEAST({DBInstanceClassMemory/9531392}, 5000). For a db.m5.xlarge (16GB), that is roughly 1700.
This is too high for most applications. Each connection is a dedicated backend process that costs several MB of RAM (a common working estimate is ~10MB once its caches warm up). 1700 connections × 10MB ≈ 17GB just for connection overhead — more than the instance has.
The right max_connections depends on your application. For most:
- Run PgBouncer in front, and set max_connections to maybe 200-400.
- Without PgBouncer, set it to (peak concurrent transactions × 1.2 + admin reserve).
Blindly raising max_connections is a common cause of OOM on RDS.
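To size max_connections from observed usage rather than the formula, sample actual connection counts at peak load (or chart the DatabaseConnections CloudWatch metric over a week or two):

```sql
-- Current connections broken down by state. "idle" connections held
-- open by app pools still count against max_connections and still
-- cost backend memory.
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
```

If the peak observed count is 80 and max_connections is 1700, that gap is the argument for lowering it.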
max_worker_processes — for parallel queries and replication. Default is 8. Raise to 16 if you do parallel scans on large tables.
Vacuum: autovacuum_*
The default autovacuum settings are conservative. Most production tables benefit from more aggressive defaults:
autovacuum_vacuum_scale_factor = 0.05 -- default 0.2
autovacuum_analyze_scale_factor = 0.02 -- default 0.1
autovacuum_naptime = '15s' -- default 60s
autovacuum_max_workers = 6 -- default 3
autovacuum_vacuum_cost_limit = 2000 -- default 200
autovacuum_vacuum_cost_delay = '2ms' -- already the default in PG 12+; older versions default to 20ms
This is roughly 4x more aggressive than defaults. It runs autovacuum more often, with more workers, with less throttling. The trade is more I/O during vacuum, which is usually fine on modern instance storage.
For specific tables that are very high-churn or very large, override at the table level:
ALTER TABLE events SET (
autovacuum_vacuum_scale_factor = 0.02,
autovacuum_analyze_scale_factor = 0.01
);
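To verify the new settings are actually keeping up, watch dead-tuple counts and vacuum timestamps per table:

```sql
-- Tables with the most dead tuples and when autovacuum last touched
-- them. A table whose n_dead_tup keeps growing despite the settings
-- above is a candidate for the per-table overrides shown earlier.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```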
Checkpoint: max_wal_size, checkpoint_timeout
See the dedicated checkpoint-tuning post. Default max_wal_size = 1GB is too low for most workloads:
max_wal_size = '8GB' -- or higher for write-heavy
checkpoint_timeout = '15min'
This reduces checkpoint frequency and the latency spikes that come with them.
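Whether max_wal_size is still too small shows up in the checkpoint counters (on PG 16 and earlier these live in pg_stat_bgwriter; PG 17 moved them to pg_stat_checkpointer):

```sql
-- Timed checkpoints fire on checkpoint_timeout; requested ones fire
-- because WAL hit max_wal_size. A high requested ratio means the WAL
-- ceiling, not the timer, is driving checkpoints -- raise max_wal_size.
SELECT checkpoints_timed,
       checkpoints_req,
       round(checkpoints_req::numeric /
             nullif(checkpoints_timed + checkpoints_req, 0), 2) AS req_ratio
FROM pg_stat_bgwriter;
```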
Statement-level: statement_timeout, idle_in_transaction_session_timeout
Defaults are 0 (disabled). Set them:
statement_timeout = '60s' -- adjust per workload
idle_in_transaction_session_timeout = '5min'
These protect against runaway queries and connection-leak bugs. The values depend on your workload (analytical queries may need longer); the defaults of 0 are wrong for any production system.
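The timeouts do not have to be one-size-fits-all: Postgres lets you scope them per role or per database, which avoids one global value that is too strict for analytics and too loose for OLTP. A sketch (the role and database names here are hypothetical):

```sql
-- Strict default for the application role, looser for reporting.
ALTER ROLE app_user SET statement_timeout = '60s';
ALTER ROLE reporting SET statement_timeout = '15min';

-- Connection-leak protection for everything connecting to this database.
ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '5min';
```

Per-role settings take effect on the next login for that role, so existing sessions keep their old values until they reconnect.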
Logging: log_min_duration_statement, log_lock_waits
For diagnostics:
log_min_duration_statement = '1s' -- log queries over 1 second
log_lock_waits = on
log_temp_files = 1024 -- log temp files over 1MB (the unit is kB; catches sort spills)
log_autovacuum_min_duration = '1s'
log_checkpoints = on
These give you the diagnostic foundation when something goes wrong. RDS exports logs to CloudWatch; useful settings turn into useful CloudWatch queries.
SPL: shared_preload_libraries
The most consequential of the static (reboot-required) parameters is shared_preload_libraries. The standard set:
shared_preload_libraries = 'pg_stat_statements,auto_explain,pg_buffercache,pg_cron'
Add what you need at instance creation if possible. Adding later requires a reboot, which is downtime.
Per-extension parameters
For pg_stat_statements:
pg_stat_statements.max = 10000
pg_stat_statements.track = all
pg_stat_statements.track_utility = off
For auto_explain:
auto_explain.log_min_duration = 5000 -- 5 seconds
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_format = json
auto_explain.log_nested_statements = on
These are the diagnostic foundation for slow-query investigation. Without them, you have aggregate stats but no per-execution visibility.
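Once pg_stat_statements is loaded, the usual starting query is cumulative time by statement. This assumes PG 13+ column names (on PG 12 and earlier the columns are total_time and mean_time instead):

```sql
-- Top statements by total execution time: where the database is
-- actually spending its CPU, regardless of individual query speed.
SELECT round(total_exec_time::numeric, 1) AS total_ms,
       calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

When one of these statements looks suspicious, auto_explain supplies the per-execution plan that the aggregate view cannot.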
What I do for a new RDS instance
The checklist:
- Custom Parameter Group from the start (do not use the default — you cannot modify it).
- Apply the settings above as starting points.
- Adjust work_mem and max_connections based on instance size and workload.
- Add pg_stat_statements, auto_explain, pg_buffercache to shared_preload_libraries.
- Reboot once to apply.
For existing instances, I do this in waves: dynamic parameters first (no reboot), then SPL changes during a maintenance window.
What does not need to be tuned
A list of parameters I have learned to leave alone:
- random_page_cost — defaults to 4 on RDS, which assumes spinning disks. For SSDs (which all modern RDS storage is), 1.1 is more accurate, but the change rarely affects plans meaningfully.
- seq_page_cost — default 1, leave it.
- default_statistics_target — 100 is fine for most. Increase per-column when needed.
- synchronous_commit — leave at on unless you have a specific reason; off trades durability for speed.
- wal_level — logical if you use logical replication, else replica (the default) is fine.
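For reference, the per-column statistics override mentioned above looks like this (the column name here is illustrative):

```sql
-- Raise the statistics target for one skewed column instead of
-- raising default_statistics_target globally, then re-analyze so
-- the planner sees the richer histogram.
ALTER TABLE events ALTER COLUMN account_id SET STATISTICS 1000;
ANALYZE events;
```

This keeps ANALYZE cheap for the rest of the schema while giving the planner better estimates exactly where the data is skewed.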
Validation
After changing parameters, verify with:
SELECT name, setting, source
FROM pg_settings
WHERE name IN ('work_mem', 'max_connections', 'autovacuum_vacuum_scale_factor');
The source column tells you where the value came from (configuration file, default, session, etc.). If a change you applied is not showing up, the Parameter Group did not apply yet — RDS sometimes needs a few minutes for pending-reboot parameters to propagate.
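pg_settings can also tell you directly which changes are waiting on a reboot (the pending_restart column exists since PG 9.5):

```sql
-- Parameters changed in the Parameter Group that will only take
-- effect after the next reboot.
SELECT name, setting, pending_restart
FROM pg_settings
WHERE pending_restart;
```

An empty result means everything you applied is already live.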
What I have measured
On a real production workload (typical e-commerce OLTP), tuning the above from defaults produced:
- ~30% reduction in p99 latency.
- ~40% reduction in CPU during checkpoints.
- ~50% reduction in vacuum-related backlog.
The instance size did not change. The cost was a few hours of careful review and one reboot.
RDS defaults are conservative; deliberate tuning is one of the highest ROI things a Postgres team can do.