Alerts are supposed to tell you when something’s wrong, but too many meaningless alerts just make you tune them out.
Let’s watch an alert get created and then silenced by a real system. Imagine we’re running a web service, and we’ve got Prometheus scraping metrics from it.
# prometheus.yml
scrape_configs:
  - job_name: 'my-webapp'
    static_configs:
      - targets: ['localhost:9090']
# Alerting rules in rules.yml
groups:
  - name: webapp_alerts
    rules:
      - alert: HighRequestLatency
        # increase() over a 5m window, not the raw counter: a counter
        # compared against 0 would stay true forever after the first error.
        expr: increase(http_requests_total{status_code=~"5.."}[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected on webapp"
          description: "{{ $value }} HTTP 5xx errors in the last 5 minutes."
Prometheus evaluates the HighRequestLatency rule on every evaluation interval. If it sees 5xx errors continuously for 5 minutes, it fires. Now imagine this alert flaps constantly because of a transient network blip that resolves itself in 30 seconds: the alert fires, resolves, then fires again at the next blip. It’s noisy.
Here’s how we’d silence it temporarily using amtool, Alertmanager’s command-line tool. Matchers are passed as positional name=value pairs (point amtool at your Alertmanager with --alertmanager.url if it isn’t set in amtool’s config file).
amtool silence add \
  alertname="HighRequestLatency" \
  job="my-webapp" \
  --comment="Temporary silence due to known transient issue" \
  --duration="1h"
This tells Alertmanager (which Prometheus sends alerts to) to ignore any alerts named HighRequestLatency that come from the my-webapp job for the next hour. This is a manual fix, though. We need to automate this.
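One way to automate it is to create the same silence programmatically through Alertmanager’s v2 HTTP API (POST /api/v2/silences). Here is a minimal sketch, assuming Alertmanager on its default port 9093; the helper name build_silence and the "auto-silencer" author string are illustrative:

```python
# Sketch: creating a temporary silence via the Alertmanager v2 API.
import json
from datetime import datetime, timedelta, timezone

ALERTMANAGER_URL = "http://localhost:9093"  # assumption: default Alertmanager port

def build_silence(alert_name: str, job: str, hours: float, comment: str) -> dict:
    """Build the JSON payload expected by POST /api/v2/silences."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "alertname", "value": alert_name, "isRegex": False},
            {"name": "job", "value": job, "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "auto-silencer",  # illustrative author string
        "comment": comment,
    }

payload = build_silence("HighRequestLatency", "my-webapp", 1,
                        "Temporary silence due to known transient issue")
# To actually create the silence, POST this payload to
# f"{ALERTMANAGER_URL}/api/v2/silences" with Content-Type: application/json.
print(json.dumps(payload, indent=2))
```

A script like this could be triggered by a runbook automation step once the transient condition is confirmed, rather than typed by hand at 3 a.m.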
The core problem is that alerts are often too broad, firing on symptoms rather than root causes, or they fire on transient conditions that don’t require immediate human intervention. We need to shift from "what’s happening now" to "what matters and persists."
The best alerts are actionable. They tell you not just that something is wrong, but what is wrong and how to fix it. This often means alerting on service-level objectives (SLOs) rather than raw system metrics.
Consider an SLO for request success rate. Instead of alerting when http_requests_total{status_code=~"5.."} is greater than 0, we alert when the error rate over a longer period exceeds our SLO.
# Alerting rules in rules.yml
groups:
  - name: webapp_slo_alerts
    rules:
      - alert: WebappErrorRateTooHigh
        expr: |
          sum(increase(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum(increase(http_requests_total[5m]))
          > 0.01  # Alert if error rate exceeds 1%
        for: 10m  # Must persist for 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "Webapp error rate is {{ $value | humanizePercentage }}, exceeding the 1% SLO."
          description: "The error rate for the webapp has been above 1% for 10 minutes."
This alert only fires if the rate of errors is consistently high, indicating a real degradation of service, not just a few hiccups. The for: 10m is crucial; it means the condition must hold true for 10 minutes before the alert fires. This filters out transient spikes.
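The effect of the for: clause can be sketched in a few lines of Python. This simulation (the function name and sample series are hypothetical) replays a per-minute error-rate series and mirrors Prometheus’s inactive → pending → firing transitions:

```python
# Sketch: how a "for:" hold duration filters transient spikes.

def alert_states(error_rates, threshold=0.01, hold_minutes=10):
    """Return one state per sample: 'inactive', 'pending', or 'firing'."""
    states, breached_for = [], 0
    for rate in error_rates:
        if rate > threshold:
            breached_for += 1  # condition has held for this many minutes
            states.append("firing" if breached_for >= hold_minutes else "pending")
        else:
            breached_for = 0   # any dip below threshold resets the clock
            states.append("inactive")
    return states

# A 3-minute transient spike never leaves "pending"; a sustained breach pages.
transient = [0.002] * 5 + [0.05] * 3 + [0.002] * 5
sustained = [0.002] * 2 + [0.05] * 12
print("firing" in alert_states(transient))  # -> False
print("firing" in alert_states(sustained))  # -> True
```

The reset-on-dip behavior is the important detail: one good evaluation restarts the pending clock, which is exactly why flapping conditions never page.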
Another common cause of fatigue is alerting on the same issue from multiple angles. If your disk is full, you might get alerts for: "Disk Usage High," "Service X Down (because it can’t write logs)," and "Application Y Crashing (because it can’t write data)." You only need one alert that points to the root cause: disk full. This requires understanding your system’s dependencies and how failures cascade.
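Alertmanager’s inhibition rules exist for exactly this: suppress symptom alerts while the root-cause alert is firing. A sketch, assuming a root-cause alert named DiskFull and an instance label shared by the related alerts (both names are illustrative):

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  # While DiskFull is firing for an instance, mute lower-severity
  # alerts (the cascading symptoms) on that same instance.
  - source_matchers:
      - alertname="DiskFull"
    target_matchers:
      - severity="warning"
    equal: ["instance"]
```

The equal clause is what ties cause and symptoms together: without a shared label, a disk filling on one host would silence warnings everywhere.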
The most effective way to combat alert fatigue is to have an "alerting budget." For every alert you create, you should be able to articulate:
- What service/feature does this protect?
- What is the user impact if this alert fires?
- What action should the on-call engineer take?
- How quickly can this action resolve the issue?
- What is the acceptable rate of false positives for this alert? (This is where SLOs shine).
If you can’t answer these, the alert is probably not ready.
Many teams fall into the trap of just adding more rules to their alerting system without a process for retiring old, noisy, or irrelevant alerts. Regularly review your active alerts. For each one, ask: "Has this alert fired in the last month? If so, was it actionable? If not, should we remove it or tune it?" Tools like PagerDuty or Opsgenie often have dashboards showing alert frequency, which is a great starting point for this review.
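If you only have Prometheus, its synthetic ALERTS series gives a rough version of the same review data. This query (a sketch; the 30-day window assumes your retention covers it) ranks alerts by how many samples they spent in the firing state over the last month, a proxy for total time firing:

```promql
sort_desc(count_over_time(ALERTS{alertstate="firing"}[30d]))
```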
Consider the severity label. It’s not just for routing; it’s a signal for urgency. A critical alert should mean "stop everything and fix this now." A warning might mean "investigate within the hour" or "fix before end of day." If you have 50 critical alerts firing daily, the label has lost its meaning. Re-evaluate your severity levels and ensure they align with genuine, immediate user impact.
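In Alertmanager, that distinction should be wired into routing so each severity reaches a channel matching its urgency. A sketch, with hypothetical receiver names:

```yaml
# alertmanager.yml (fragment)
route:
  receiver: slack-warnings        # default: non-urgent, review in working hours
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall  # pages a human immediately
receivers:
  - name: slack-warnings
    # slack_configs: ...
  - name: pagerduty-oncall
    # pagerduty_configs: ...
```

If warnings and criticals land in the same pager queue, the severity label is decorative; the routing tree is what makes it mean something.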
Often, the most impactful "fix" for an alert isn’t changing the alert rule itself, but improving the system it monitors. If your application frequently crashes due to out-of-memory errors, the alert will keep firing. The real solution is to fix the memory leak, not just silence the alert. This requires a feedback loop where alert investigations lead to system improvements, which in turn reduce alert volume.
The next logical step after reducing alert fatigue is to implement automated remediation for your most common, well-understood alerts.