Cloud Providers9 min read

AWS RDS PostgreSQL Monitoring Field Guide

AWS RDS PostgreSQL removes server ownership, but it does not remove query risk. This field guide shows what to watch across CloudWatch, Database Insights, Enhanced Monitoring, and PostgreSQL itself.

AWS RDS PostgreSQL is managed, but the workload is still yours. RDS can patch the operating system, automate backups, expose CloudWatch metrics, and remove a lot of undifferentiated database administration. It cannot tell you whether a new deploy doubled one query family, whether an idle transaction is holding back vacuum, or whether a read replica is quietly serving stale data to a customer-facing path.

That is the trap with managed Postgres. The infrastructure looks safer, so teams wait too long to add database-level observability. By the time the first RDS incident arrives, the team has CloudWatch graphs for CPU and storage, but not enough evidence to answer the real question: which database behavior changed?

This guide is the practical monitoring baseline I would want before trusting a production RDS PostgreSQL fleet. It assumes you already have the AWS side visible and focuses on the gap between platform metrics and PostgreSQL evidence.

The native RDS signals are split across layers

RDS exposes useful signals, but they live in different places. CloudWatch metrics show instance pressure: CPU, memory-related indicators, read and write I/O, storage, connections, replication lag, and burst balance where it applies. CloudWatch Database Insights shows database load and lets you slice load by waits, SQL statements, hosts, or users. Enhanced Monitoring adds operating-system process and host detail through CloudWatch Logs when you enable the required IAM role.

Those layers are valuable, but none of them replaces PostgreSQL-level diagnosis. A CPU spike is a symptom. The database evidence is the query text, wait event, lock chain, autovacuum state, plan change, or connection pattern that explains the spike.

There is also a current AWS transition to handle. AWS announced that the Performance Insights console experience, flexible retention periods, and associated pricing end on June 30, 2026. The Performance Insights API continues, but AWS is steering paid-tier users toward CloudWatch Database Insights Advanced mode. If your runbooks still say "open Performance Insights" without mentioning Database Insights, update them now.

Start with the managed-service failure modes

The first RDS monitoring mistake is treating it like self-hosted PostgreSQL with fewer chores. RDS has its own operational edges.

  • Storage pressure: Free space, autoscaling events, IOPS, throughput, and disk queue depth matter because a storage bottleneck often looks like a query problem first.
  • Connection pressure: RDS will not save an app that opens too many sessions. Watch active, idle, and waiting connections, not just max_connections.
  • Replica lag: Read replicas can quietly serve stale reads. Alert on lag by business tolerance, not only on a generic threshold.
  • Parameter-group drift: Shared buffers, work_mem, log settings, pg_stat_statements, and autovacuum tuning usually move through parameter groups. Drift between environments creates misleading tests.
  • Maintenance windows: Minor version upgrades, failovers, storage changes, and parameter changes need an evidence trail around the event, not a vague "AWS changed something" note.

Keep pg_stat_statements at the center

CloudWatch tells you when the instance is hot. Query monitoring tells you why. For PostgreSQL, the first durable baseline is pg_stat_statements. It groups query families, tracks calls, total time, mean time, rows, and block behavior, and gives you a way to compare the workload before and after a deploy.

SELECT queryid,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       rows,
       shared_blks_read,
       shared_blks_hit,
       left(query, 160) AS sample_query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

The point is not to stare at a top-20 list forever. The point is to preserve history. If a query goes from 30 ms to 800 ms after a deploy, a live snapshot after the incident may not explain it. You need the before and after shape.

Wait events explain the symptom

RDS metrics often say the database is busy. Wait events tell you what the sessions are waiting for. CPU saturation, lock waits, I/O waits, WAL pressure, client waits, and extension-related waits lead to different fixes.

SELECT wait_event_type,
       wait_event,
       state,
       count(*) AS sessions
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY wait_event_type, wait_event, state
ORDER BY sessions DESC;

If most sessions are waiting on locks, adding a bigger RDS instance class is usually the wrong first move. If they are waiting on I/O while one query family reads millions of blocks, the fix may be an index, a query rewrite, or a workload isolation change. If sessions are mostly idle, the app may be leaking connections or holding transactions open.

Do not let RDS hide vacuum and bloat

Managed service does not mean managed MVCC. Autovacuum still needs room to work. Long transactions still hold old row versions. Dead tuples still build up. Index bloat still hurts cache efficiency. RDS gives you less operating-system control than a self-hosted box, so database-level vacuum evidence becomes more important, not less.

SELECT schemaname,
       relname,
       n_dead_tup,
       n_live_tup,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

This query is not an alert by itself, but it is a starting point. Pair it with oldest transaction age, long-running queries, table growth, and autovacuum activity. The worst RDS vacuum incidents usually have a long quiet period where the evidence was present but nobody was collecting it.

Alert on customer risk, not only instance risk

A useful RDS alerting baseline has two layers. The first layer is platform health: CPU, memory pressure, storage free, disk queue depth, IOPS and throughput utilization, replica lag, failed connections, and backup or maintenance failures. The second layer is PostgreSQL behavior: slow query regression, lock waits, idle transactions, autovacuum lag, bloat growth, connection pool saturation, replication slot growth, and error-rate spikes in server logs.

The thresholds should reflect the application. A reporting replica can tolerate more lag than a checkout path reading from a replica. A batch database can tolerate longer queries than an API database. A small instance can look scary at 80% CPU and still be fine if p95 query latency is flat; a larger instance can be in real trouble at 40% CPU if lock waits are climbing.

What MonPG adds on top of RDS

MonPG is not a replacement for AWS monitoring. It fills the database evidence gap. The AWS RDS PostgreSQL monitoring page shows the provider-specific setup, while the broader PostgreSQL monitoring guide explains the shared query, lock, vacuum, and replication signals that matter across every environment.

For RDS, the practical win is correlation. MonPG keeps pg_stat_statements history, query plan context, lock chains, index usage, autovacuum signals, replica lag, and resource pressure in the same workflow. That means an RDS incident review can move from "CPU was high" to "this query family changed after deploy, started reading 40x more blocks, caused I/O waits, then backed up the pool."

The operating model is simple: keep CloudWatch and Database Insights for AWS-level visibility, keep MonPG for PostgreSQL workload evidence, and make the handoff between them explicit in every runbook. Managed Postgres is easier to operate when the team can still see the database.