monitoring-database-health

'Monitor use when you need to work with monitoring and observability.

v1.0.0

Jeremy Longshore

MIT

8 Tools

database-health-monitor Plugin

database Category

Allowed Tools
        ReadWriteEditGrepGlobBash(psql:*)Bash(mysql:*)Bash(mongosh:*)
      

Provided by Plugin

database-health-monitor

Database plugin for database-health-monitor

database v1.0.0

View Plugin

Installation

This skill is included in the database-health-monitor plugin:

/plugin install database-health-monitor@claude-code-plugins-plus

Click to copy

Instructions

Database Health Monitor

Overview

Monitor database server health across PostgreSQL, MySQL, and MongoDB by tracking key performance indicators including connection utilization, query throughput, replication lag, disk usage, cache hit ratios, vacuum activity, and lock contention.

Prerequisites

Database credentials with access to system statistics views (pgstat*, performance_schema, serverStatus)
psql, mysql, or mongosh CLI tools for running health check queries
Permissions: pg_monitor role (PostgreSQL), PROCESS privilege (MySQL)
Baseline metrics from a period of normal operation for threshold calibration
Alerting channel configured (email, Slack webhook, PagerDuty)

Instructions

Check connection utilization:

PostgreSQL: SELECT count() AS activeconnections, (SELECT setting::int FROM pgsettings WHERE name = 'maxconnections') AS maxconnections, round(count()::numeric / (SELECT setting::int FROM pgsettings WHERE name = 'maxconnections') * 100, 1) AS utilizationpct FROM pgstat_activity
MySQL: SELECT VARIABLEVALUE AS connections FROM performanceschema.globalstatus WHERE VARIABLENAME = 'Threads_connected'
Alert threshold: utilization above 80%

Monitor query throughput and error rate:

PostgreSQL: SELECT datname, xactcommit AS commitstotal, xactrollback AS rollbackstotal, xactrollback::float / GREATEST(xactcommit, 1) AS rollbackratio FROM pgstatdatabase WHERE datname = currentdatabase()
MySQL: SHOW GLOBAL STATUS LIKE 'Comcommit' and SHOW GLOBAL STATUS LIKE 'Comrollback'
Alert threshold: rollback ratio above 5% or throughput drops more than 50% from baseline

Check disk usage and growth:

PostgreSQL: SELECT pgsizepretty(pgdatabasesize(currentdatabase())) AS dbsize and SELECT tablename, pgsizepretty(pgtotalrelationsize(tablename::text)) AS size FROM pgtables WHERE schemaname = 'public' ORDER BY pgtotalrelation_size(tablename::text) DESC LIMIT 10
Alert threshold: disk usage above 80% or growth rate projecting full disk within 7 days

Monitor cache hit ratio:

PostgreSQL: SELECT sum(heapblkshit)::float / GREATEST(sum(heapblkshit) + sum(heapblksread), 1) AS cachehitratio FROM pgstatiouser_tables
MySQL: SELECT (1 - (VARIABLEVALUE / (SELECT VARIABLEVALUE FROM performanceschema.globalstatus WHERE VARIABLENAME = 'Innodbbufferpoolreadrequests'))) AS hitratio FROM performanceschema.globalstatus WHERE VARIABLENAME = 'Innodbbufferpoolreads'
Alert threshold: cache hit ratio below 95% indicates sharedbuffers or innodbbufferpoolsize needs increasing

Check vacuum and autovacuum health (PostgreSQL):

SELECT relname, lastvacuum, lastautovacuum, ndeadtup, nlivetup, round(ndeadtup::numeric / GREATEST(nlivetup, 1) * 100, 1) AS deadpct FROM pgstatusertables WHERE ndeadtup > 1000 ORDER BY ndeadtup DESC LIMIT 10
Alert threshold: dead tuple percentage above 20% or autovacuum not running for more than 24 hours on active tables

Monitor replication lag (if replicas exist):

PostgreSQL: SELECT clientaddr, state, pgwallsndiff(sentlsn, replaylsn) AS lagbytes FROM pgstat_replication
MySQL: SHOW REPLICA STATUS\G - check SecondsBehindSource
Alert threshold: lag above 30 seconds or replication stopped

Check for long-running queries:

PostgreSQL: SELECT pid, now() - querystart AS duration, state, query FROM pgstatactivity WHERE state != 'idle' AND now() - querystart > interval '5 minutes' ORDER BY duration DESC
Alert threshold: any query running longer than 10 minutes (OLTP) or 1 hour (analytics)

Monitor lock contention:

PostgreSQL: SELECT count(*) AS waitingqueries FROM pgstatactivity WHERE waitevent_type = 'Lock'
Alert threshold: more than 10 queries waiting for locks simultaneously

Compile all health checks into a single monitoring script that runs via cron every 60 seconds, outputs metrics in a structured format (JSON), and triggers alerts when thresholds are breached.

Create a health summary dashboard query that returns a single-row result with RAG (Red/Amber/Green) status for each health dimension: connections, throughput, disk, cache, vacuum, replication, queries, and locks.

Output

Health check queries tailored to the specific database engine
Monitoring script (shell or Python) for scheduled health checks with alerting
Threshold configuration with default values and tuning guidance
Dashboard summary query providing RAG status across all health dimensions
Alert notification templates for Slack, email, or PagerDuty integration

Error Handling

Error	Cause	Solution
`pgstatactivity` returns incomplete data	`track_activities = off` in postgresql.conf	Enable `trackactivities = on` and `track``counts = on`; reload configuration
Health check query itself times out	Database under heavy load or lock contention	Set `statement_timeout = '5s'` for monitoring queries; use a dedicated monitoring connection
False alerts during maintenance windows	Planned maintenance triggers threshold breaches	Implement alert suppression windows; add maintenance mode flag to monitoring script
Disk usage alert but no obvious growth	WAL files, temporary files, or pgstattmp consuming space	Check `pgwal` directory size; check for orphaned temporary files; verify `wal``keep_size` setting
Cache hit ratio drops after restart	Buffer pool/shared_buffers cold after database restart	Implement cache warming script that runs key queries after restart; alert will self-resolve as cache warms

Examples

PostgreSQL health dashboard for a production SaaS application: A single cron-based script checks 8 health dimensions every 60 seconds, writing results to a metrics table. A dashboard query shows: connections 45/200 (GREEN), cache hit 98.5% (GREEN), dead tuples 2.1% (GREEN), disk 62% (GREEN), replication lag 0.5s (GREEN), long queries 0 (GREEN), lock waiters 1 (GREEN), rollback ratio 0.3% (GREEN).

Detecting impending disk full condition: Health monitor tracks daily disk growth rate. Current usage: 72%, daily growth: 1.2GB, remaining: 280GB. Projected full date: 233 days. Alert triggers at 80% with recommendation to archive old data or add storage. A second alert at 90% escalates to PagerDuty.

Identifying autovacuum falling behind: Health check shows the events table with 15M dead tuples (45% dead ratio) and last autovacuum 3 days ago. Root cause: autovacuumvacuumcostdelay too conservative for a high-write table. Fix: set per-table autovacuumvacuumcostdelay = 2 and autovacuumvacuumscale_factor = 0.01.

Resources

PostgreSQL monitoring statistics: https://www.postgresql.org/docs/current/monitoring-stats.html
MySQL performance_schema: https://dev.mysql.com/doc/refman/8.0/en/performance-schema.html
MongoDB serverStatus: https://www.mongodb.com/docs/manual/reference/command/serverStatus/
PostgreSQL autovacuum tuning: https://www.postgresql.org/docs/current/runtime-config-autovacuum.html
checkpostgres monitoring tool: https://bucardo.org/checkpostgres/

Allowed Tools

Provided by Plugin

database-health-monitor

Installation

Instructions

Database Health Monitor

Overview

Prerequisites

Instructions

Output

Error Handling

Examples

Resources

Ready to use database-health-monitor?

Related Skills

analyzing-database-indexes

analyzing-query-performance

archiving-databases

automating-database-backups

comparing-database-schemas

database-documentation-gen