---
name: monitoring-stack-deployer
description: Use when deploying monitoring stacks including Prometheus, Grafana, and Datadog.
allowed-tools: Read, Write, Edit, Grep, Glob, Bash(docker:*), Bash(kubectl:*)
---
# Deploying Monitoring Stacks
## Overview
Deploy production monitoring stacks (Prometheus + Grafana, Datadog, or VictoriaMetrics) with metric collection, custom dashboards, and alerting rules. Configure exporters, scrape targets, recording rules, and notification channels for comprehensive infrastructure and application observability.
## Prerequisites
- Target infrastructure identified: Kubernetes cluster, Docker hosts, or bare-metal servers
- Metric endpoints accessible from the monitoring platform (application `/metrics` endpoints, node exporters)
- Storage backend capacity planned for time-series data (Prometheus TSDB, Thanos, or Cortex for long-term)
- Alert notification channels defined: Slack webhook, PagerDuty integration key, or email SMTP
- Helm 3+ for Kubernetes deployments using kube-prometheus-stack or similar charts
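With the prerequisites in place, the chart repository can be registered and cluster access verified before deploying. The commands below are a sketch of the usual setup; the repository name and URL are the standard ones for the prometheus-community charts:

```shell
# Register the chart repository that hosts kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Confirm the target cluster is reachable and nodes are Ready before deploying
kubectl cluster-info
kubectl get nodes
```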
## Instructions
- Select the monitoring platform: Prometheus + Grafana for open-source self-hosted, Datadog for managed SaaS, VictoriaMetrics for high-cardinality workloads
- Deploy the monitoring stack: `helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack` on Kubernetes, or Docker Compose for non-Kubernetes hosts
- Install exporters on monitored systems: node-exporter for host metrics, kube-state-metrics for Kubernetes object states, application-specific exporters
- Configure scrape targets in `prometheus.yml`: define job names, scrape intervals, and relabeling rules for service discovery
- Create recording rules for frequently queried aggregations to reduce dashboard query load
- Define alerting rules with meaningful thresholds: high CPU (>80% for 5m), high memory (>90%), error rate (>1%), latency P99 (>500ms)
- Configure Alertmanager with routing, grouping, and notification channels (Slack, PagerDuty, email)
- Build Grafana dashboards: RED metrics (Rate, Errors, Duration) for services, USE metrics (Utilization, Saturation, Errors) for resources
- Set up data retention: configure TSDB retention period (15-30 days local), set up Thanos/Cortex for long-term storage if needed
- Test the full pipeline: trigger a test alert and verify notification delivery
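The scrape-target, recording-rule, and alerting-rule steps above can be sketched as Prometheus configuration fragments. All job names, targets, file paths, and metric names (e.g. `http_requests_total`) are illustrative assumptions, not fixed values:

```yaml
# --- prometheus.yml (fragment) ---
global:
  scrape_interval: 30s
  evaluation_interval: 30s
rule_files:
  - /etc/prometheus/rules/*.yml
scrape_configs:
  - job_name: node            # host metrics from node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: app             # application /metrics endpoint
    static_configs:
      - targets: ["app:8080"]

# --- /etc/prometheus/rules/app.yml (separate file, matched by rule_files) ---
groups:
  - name: app.rules
    rules:
      # Recording rule: pre-aggregate per-job request rate for dashboards
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Alerting rule: error ratio above 1% sustained for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP error rate above 1% for 5 minutes"
```

With kube-prometheus-stack, the equivalent objects are usually expressed as `ServiceMonitor` and `PrometheusRule` custom resources rather than a hand-edited `prometheus.yml`.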
## Output
- Helm values file or Docker Compose file for the monitoring stack
- Prometheus configuration with scrape targets, recording rules, and alerting rules
- Alertmanager configuration with routing tree and notification receivers
- Grafana dashboard JSON files for infrastructure and application metrics
- Exporter deployment manifests (node-exporter DaemonSet, application ServiceMonitor)
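As an example of the Alertmanager artifact above, a minimal routing tree with Slack and PagerDuty receivers might look like the following sketch; the webhook URL, routing key, and channel are placeholders:

```yaml
# alertmanager.yml (fragment) -- placeholder credentials, illustrative timings
route:
  receiver: slack-default        # fallback receiver for all alerts
  group_by: [alertname, job]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"    # critical alerts escalate to PagerDuty
      receiver: pagerduty-critical

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
        channel: "#alerts"
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: REPLACE_ME
```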
## Error Handling
| Error | Cause | Solution |
|---|---|---|
| No data points in dashboard | Scrape target not reachable or metric name wrong | Check the Targets page in the Prometheus UI and verify the metric name in the expression browser |
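When troubleshooting missing data points, target health can also be inspected through the Prometheus HTTP API. The service name below assumes a kube-prometheus-stack release named `kube-prometheus-stack`; adjust it to your deployment:

```shell
# Expose Prometheus locally (service name depends on your Helm release)
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 &

# List scrape targets and their health; "down" targets explain empty panels
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'

# Confirm the metric name actually exists before blaming the dashboard query
curl -s http://localhost:9090/api/v1/label/__name__/values | grep http_requests_total
```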