Monitoring, alerting, and automation aren’t just a checklist of tools; they’re a continuous feedback loop that turns reactive firefighting into proactive system resilience.
Let’s watch Alertmanager process a firing alert from Prometheus.
Imagine a Prometheus server scraping metrics from a Kubernetes cluster. When the metrics scraped from a node’s node_exporter show CPU usage above 95% for 5 minutes, Prometheus fires an alert and forwards it to Alertmanager.
```yaml
# Prometheus rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCpuUsage
        # Average non-idle CPU across all cores on an instance, as a percentage.
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "Node {{ $labels.instance }} has been running with high CPU utilization for over 5 minutes."
```
```yaml
# Prometheus config.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.monitoring.svc.cluster.local:9093']
```
```yaml
# Alertmanager config.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://your-webhook-receiver.example.com/alert' # Or Slack, PagerDuty, etc.
```
When Prometheus detects that the HighCpuUsage condition has held for the `for` duration, it sends an HTTP POST to Alertmanager’s alerts API endpoint (`/api/v2/alerts` in current versions). Alertmanager then applies its routing rules: it checks whether an alert group with the same `alertname` and `cluster` labels already exists. If not, it opens a new group and waits `group_wait` (30s), so that related alerts arriving in that window are bundled together. Once `group_wait` elapses, Alertmanager dispatches the group to the configured receiver; in this example, a webhook receiver.
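On the receiving end, the webhook gets a JSON document (webhook payload version 4) containing the grouped alerts. Here is a minimal Python sketch of handling that payload; the field names follow Alertmanager’s documented webhook format, while the function name and sample values are invented for illustration:

```python
def summarize_firing_alerts(payload: dict) -> list[str]:
    """Return one human-readable line per firing alert in an
    Alertmanager webhook payload (version 4 format)."""
    lines = []
    for alert in payload.get("alerts", []):
        # Each alert carries its own status; resolved alerts are skipped here.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f"[{labels.get('severity', 'unknown')}] "
            f"{labels.get('alertname', '?')} on {labels.get('instance', '?')}: "
            f"{annotations.get('summary', '')}"
        )
    return lines

# Example payload shaped like Alertmanager's webhook format (values invented):
example = {
    "version": "4",
    "status": "firing",
    "receiver": "default-receiver",
    "groupLabels": {"alertname": "HighCpuUsage"},
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HighCpuUsage",
                "instance": "node-1:9100",
                "severity": "critical",
            },
            "annotations": {"summary": "High CPU usage on node-1:9100"},
        }
    ],
}

for line in summarize_firing_alerts(example):
    print(line)
# → [critical] HighCpuUsage on node-1:9100: High CPU usage on node-1:9100
```

In a real receiver this function would sit behind an HTTP handler; the point is that the payload already arrives grouped, so one POST can carry many alerts.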
The fundamental problem this solves is signal overload. Without aggregation, routing, and deduplication, every single metric breach would flood communication channels. Alertmanager acts as a smart dispatcher. It groups related alerts (e.g., multiple pods on the same node failing), suppresses flapping alerts (briefly firing and resolving), and ensures alerts are sent to the right people or systems at the right time, based on severity and labels. This prevents alert fatigue and makes sure critical issues are addressed promptly.
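Grouping is driven by the `group_by` setting shown earlier; cross-alert suppression is expressed with inhibit rules. A sketch of one, assuming the `severity` and `instance` labels from the rule above:

```yaml
# Alertmanager config.yml (continued)
inhibit_rules:
  # While a critical alert is firing for an instance, mute warning-level
  # alerts from the same instance so responders see only the root cause.
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['instance']
```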
What surprises many people is that Alertmanager’s routing tree is evaluated top-down, and the first matching route determines the receiver. You can have a broad rule at the top to catch all critical alerts and send them to PagerDuty, and more specific rules further down for warning alerts destined for Slack, but if a critical alert also matches a rule lower down, it will only go to PagerDuty because that route matched first. The `continue: true` directive is crucial if you want an alert to be processed by multiple matching routes.
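A sketch of such a routing tree, assuming the receiver names `pagerduty` and `slack` are defined under `receivers`:

```yaml
route:
  receiver: 'default-receiver'   # fallback if no child route matches
  routes:
    # Evaluated first: critical alerts go to PagerDuty.
    - matchers:
        - severity = critical
      receiver: 'pagerduty'
      # Without this, matching stops here and the Slack route below
      # never sees critical alerts.
      continue: true
    # Also evaluated for critical alerts because of continue: true above.
    - matchers:
        - severity =~ "critical|warning"
      receiver: 'slack'
```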
The next challenge is ensuring your automation tools can effectively act on these routed alerts, perhaps by integrating with the same webhook receiver.