Cloud Providers9 min read

Azure Database for PostgreSQL Monitoring Field Guide

Azure Database for PostgreSQL has strong platform metrics, Query Store, enhanced metrics, and portal dashboards. This guide explains how to turn those signals into PostgreSQL incident evidence.

Azure Database for PostgreSQL Flexible Server gives teams a managed PostgreSQL surface with backups, high availability options, Azure Monitor metrics, diagnostic settings, embedded dashboards, and Query Store. That is a strong starting point. It is not the same as knowing why the database became slow after a deploy.

The Azure portal can show CPU, memory, storage, IOPS, connections, and PostgreSQL logs. Query Store can track query performance over time. Enhanced metrics can expose sessions, wait event types, database counters, autovacuum signals, and PgBouncer behavior. The mistake is assuming those signals automatically become an incident workflow.

A production monitoring design has to connect Azure resource health to PostgreSQL workload evidence. This guide gives a practical baseline for that connection.

Understand what Query Store is and is not

Azure Query Store for PostgreSQL is an opt-in feature. When enabled, it captures query runtime statistics and can also capture wait sampling. Microsoft stores Query Store data in the azure_sys database, with views such as query_store.qs_view and query_store.pgms_wait_sampling_view available for analysis.

That is useful because performance history is the difference between a real regression and a guess. A single slow query snapshot tells you what is happening now. Query history tells you what changed.

There are still operational choices to make. Query Store has capture modes, retention, maximum captured-query settings, interval length, query text limits, and optional plan storage. If you enable it casually, you may collect too little to diagnose a deploy regression or too much for a noisy workload. Treat it as part of the monitoring design, not a checkbox.

Use Azure Monitor for platform pressure

Azure Monitor metrics are the right place to start for platform health. Flexible Server emits default metrics such as active connections, failed connections, CPU percent, memory percent, storage used, storage free, IOPS, read and write throughput, transaction log storage used, and replication lag signals. Many of these metrics are emitted at a one-minute interval, and Microsoft documents metric retention separately from chart query limits.

Those metrics answer the resource question: is the server under pressure? They do not fully answer the database question: which queries, sessions, locks, or maintenance tasks explain the pressure?

The best Azure PostgreSQL dashboards have both layers. A CPU chart without top query families is incomplete. A connection chart without session state is incomplete. Storage growth without table and index growth is incomplete. Replica lag without WAL generation and long transaction evidence is incomplete.

Turn on the enhanced metrics that match the workload

Azure has enhanced metrics for several PostgreSQL internals, but many are disabled by default. The activity metrics can show sessions by state, sessions by wait event type, oldest backend, longest query, longest transaction, and xmin-related signals. Database metrics can expose blocks read and hit, temporary files, deadlocks, tuples changed, and transaction counters. There are also autovacuum and PgBouncer metric families.

The important part is intent. If your app uses PgBouncer, collect PgBouncer diagnostics. If the incident pattern is long transactions and bloat, collect autovacuum diagnostics. If the application has many databases, understand the documented dimension limits before assuming every database will appear as a separate clean line in Azure Monitor.

Do not turn on every metric because a dashboard looks better with more lines. Turn on the metrics that map to the failure modes you actually need to catch.

Keep PostgreSQL views in the runbook

Query Store and Azure Monitor are useful, but incident response still needs direct PostgreSQL views. The fastest path to clarity often starts with session state, wait events, locks, and top queries.

SELECT state,
       wait_event_type,
       wait_event,
       count(*) AS sessions
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;

This tells you whether the system is waiting on locks, I/O, client behavior, or active query execution. From there, you can inspect blockers.

SELECT blocked.pid AS blocked_pid,
       blocker.pid AS blocker_pid,
       blocked.query AS blocked_query,
       blocker.query AS blocker_query
FROM pg_locks blocked_lock
JOIN pg_stat_activity blocked ON blocked.pid = blocked_lock.pid
JOIN pg_locks blocker_lock
  ON blocker_lock.locktype = blocked_lock.locktype
 AND blocker_lock.database IS NOT DISTINCT FROM blocked_lock.database
 AND blocker_lock.relation IS NOT DISTINCT FROM blocked_lock.relation
 AND blocker_lock.page IS NOT DISTINCT FROM blocked_lock.page
 AND blocker_lock.tuple IS NOT DISTINCT FROM blocked_lock.tuple
 AND blocker_lock.transactionid IS NOT DISTINCT FROM blocked_lock.transactionid
 AND blocker_lock.classid IS NOT DISTINCT FROM blocked_lock.classid
 AND blocker_lock.objid IS NOT DISTINCT FROM blocked_lock.objid
 AND blocker_lock.objsubid IS NOT DISTINCT FROM blocked_lock.objsubid
JOIN pg_stat_activity blocker ON blocker.pid = blocker_lock.pid
WHERE NOT blocked_lock.granted
  AND blocker_lock.granted;

These are not replacements for Azure-native tools. They are the bridge from "the server looks bad" to "this session is blocking the workload."

Watch the Azure-specific PostgreSQL edges

Azure PostgreSQL incidents often cluster around a few patterns.

  • Burstable tier misuse: CPU credit behavior can hide the real shape of a workload until credits are gone. Query Store is also not a good fit for every small or burstable setup.
  • PgBouncer blind spots: A healthy database can still have waiting clients if the pool is saturated or incorrectly sized.
  • Autovacuum visibility: Dead tuples, oldest xmin, and long transactions need explicit monitoring. Managed service does not remove MVCC.
  • Diagnostic settings drift: PostgreSQL logs are not useful in Log Analytics if nobody configured them to stream there.
  • Replica lag and HA assumptions: High availability and read replicas still need lag, WAL, and failover-readiness checks.

Alert on the symptom and the cause

A good Azure PostgreSQL alert set pairs platform symptoms with database causes. Alert on CPU, memory, storage, failed connections, IOPS, throughput, and replica lag. Also alert on lock waits, longest transaction age, query regression, temporary file spikes, deadlocks, autovacuum lag, connection pool saturation, and slow query families.

The thresholds should be workload-specific. A nightly analytics job can use CPU heavily without being an incident. A checkout database with growing lock waits is an incident even if CPU looks calm. Query Store history, wait sampling, and PostgreSQL views give you the evidence to make that distinction.

Where MonPG fits

The Azure PostgreSQL monitoring page covers the MonPG setup for Azure-hosted PostgreSQL. The broader PostgreSQL monitoring guide covers the shared database signals that matter no matter where PostgreSQL runs.

MonPG complements Azure Monitor by keeping PostgreSQL workload history close to the incident workflow: query fingerprints, plan context, index usage, lock chains, connection states, autovacuum and bloat signals, replication lag, and resource pressure. Instead of switching between a portal graph, a query view, a log workspace, and a spreadsheet of incident notes, the team can keep a single PostgreSQL evidence trail.

The right model is not Azure Monitor versus MonPG. Use Azure Monitor for the managed resource, Query Store for Azure-native query history, and MonPG for cross-environment PostgreSQL diagnosis. The point is to make the database understandable when the cloud platform says something is wrong.