Datadog monitors are the eyes and ears of your SRE practice, constantly watching for anomalies and potential issues before they impact users.
Here’s what it looks like when a Datadog monitor is actively catching a problem:
```
# Example: High Error Rate on Web Service
@slack-prod-alerts
#web-service-errors
**High Error Rate on Web Service**
*Alerting*
This monitor is alerting because the error rate for the web service has exceeded 5% over the last 5 minutes.
**Metric:** `avg:web.request.error_rate{service:web-service} by {host}`
**Threshold:** `> 0.05`
**Evaluation Period:** `last 5 minutes`
**Current Value:** `0.07`
**Runbook:** [Link to Runbook](https://your-runbook-url.com/web-service-errors)
```
This monitor is designed to catch a spike in HTTP 5xx errors from your web service. The `avg` function calculates the average error rate, and the `> 0.05` threshold means it triggers if that average exceeds 5%. The `by {host}` clause ensures that if any single host has a high error rate, it will be flagged.
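To make the evaluation logic concrete, here is a local sketch of what this monitor decides, with hypothetical per-host samples. This is illustrative only; Datadog's real engine runs the query server-side against ingested metrics.

```python
# Sketch of the monitor's decision: average each host's error-rate
# samples over the window and flag hosts above the threshold,
# mirroring the `by {host}` grouping.
from statistics import mean

THRESHOLD = 0.05        # 5% error rate, matching the monitor above
WINDOW_MINUTES = 5

def evaluate_per_host(samples_by_host):
    """Return hosts whose average error rate over the window exceeds the threshold."""
    return {
        host: mean(samples)
        for host, samples in samples_by_host.items()
        if mean(samples) > THRESHOLD
    }

# Hypothetical samples: one value per minute for the last 5 minutes.
samples = {
    "web-01": [0.01, 0.02, 0.01, 0.02, 0.01],   # healthy
    "web-02": [0.06, 0.08, 0.07, 0.07, 0.07],   # averaging 0.07, alerts
}
alerting = evaluate_per_host(samples)
```

Because the grouping is per host, `web-02` fires on its own even though the fleet-wide average might still look acceptable.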
Setting up monitors and dashboards in Datadog is about translating your understanding of system health into actionable alerts and intuitive visualizations. It’s not just about seeing numbers; it’s about seeing meaningful numbers, tied to the user experience and system reliability.
The Core Problem: Unseen Failures
Before robust monitoring, failures would often manifest as user complaints or outright outages. The core problem Datadog solves is making system behavior transparent. It allows SREs to identify deviations from normal, expected behavior proactively. This means catching a disk filling up before it causes application crashes, or noticing increased latency before users experience slow load times.
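The disk-filling case shows why proactive beats reactive: what matters is not the current usage but the time remaining at the current growth rate. A naive linear projection (a hedged sketch with hypothetical numbers; Datadog's forecast monitors do something far more sophisticated) captures the idea:

```python
# Naive linear projection of disk usage: if the current growth rate
# continues unchanged, how many hours until the disk is full?
def hours_until_full(used_gb, total_gb, growth_gb_per_hour):
    if growth_gb_per_hour <= 0:
        return float("inf")   # not growing; never fills at this rate
    return (total_gb - used_gb) / growth_gb_per_hour

# Hypothetical host: 400 GB used of 500 GB, growing 2 GB per hour.
eta_hours = hours_until_full(400, 500, 2)   # 50 hours of headroom
```

Alerting on a low time-to-full gives the on-call engineer days of runway instead of a crash at 100%.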
How Monitors Work: The Query Engine
At its heart, a Datadog monitor is a scheduled query against your metrics. You define:
- **What to measure:** A metric (e.g., `cpu.usage`, `network.bytes_sent`, `web.request.latency`).
- **What context:** Tags to filter the metric (e.g., `service:api`, `env:production`, `host:web-01`).
- **How to aggregate:** A function (e.g., `avg`, `sum`, `min`, `max`, `p95`).
- **When to alert:** A threshold and evaluation window (e.g., `> 90%` for 5 minutes).
- **What to do:** Notifications (Slack, PagerDuty, email).
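The first three pieces compose into Datadog's query syntax: `<aggregation>:<metric>{<tag filters>} by {<group keys>}`. A small helper (a hypothetical illustration, not an official client) makes the anatomy explicit:

```python
# Assemble a Datadog-style metric query string from its parts:
#   <aggregation>:<metric>{<tag filters>} by {<group keys>}
def build_query(aggregation, metric, tags=None, group_by=None):
    filter_part = ",".join(tags) if tags else "*"   # "*" means no tag filter
    query = f"{aggregation}:{metric}{{{filter_part}}}"
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    return query

q = build_query(
    aggregation="max",
    metric="pg.connection.count",
    tags=["service:database", "env:production"],
)
# -> "max:pg.connection.count{service:database,env:production}"
```

The threshold and notification targets live in the monitor's configuration rather than the query string itself.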
Let’s break down a common scenario: ensuring your database isn’t overloaded.
Example: Database Connection Pool Exhaustion
Imagine your database is struggling because too many applications are trying to connect simultaneously. This can lead to requests being queued and eventually dropped.
- **Metric:** `pg.connection.count` (for PostgreSQL) or `mysql.threads_connected` (for MySQL).
- **Tags:** `service:database`, `env:production`.
- **Aggregation:** `max` (we care about the peak usage).
- **Threshold:** > 80% of `max_connections` (e.g., if `max_connections` is 100, alert if the max exceeds 80).
- **Evaluation:** for 1 minute (we want to catch sustained pressure).
A Datadog monitor for this might look like:
```
# Database Connection Pool Alert
@pagerduty-database-team
**High Database Connection Usage**
*Alerting*
Database connection count has exceeded 80% of the configured maximum for the last minute. This may indicate an application issue or insufficient connection pool sizing.
**Metric:** `max:pg.connection.count{service:database,env:production}`
**Threshold:** `> 80` (assuming `max_connections` is configured to 100 in the monitor)
**Evaluation Period:** `last 1 minute`
**Current Value:** `85`
**Runbook:** [Link to DB Connection Runbook](https://your-runbook-url.com/db-connections)
```
Here, we’re using `max` to see the peak number of connections. The threshold is set to `80`, implying that the monitor is configured with an override or a direct value representing 80% of the database’s `max_connections` setting. The 1-minute evaluation window ensures we don’t get alerted by brief, transient spikes.
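The `80` in that monitor is a derived number, not a magic constant. A short sketch (with the hypothetical `max_connections = 100` assumed above) shows the arithmetic the monitor encodes:

```python
# Derive the alert threshold from the database's configured limit.
MAX_CONNECTIONS = 100      # hypothetical max_connections setting
ALERT_FRACTION = 0.80      # alert at 80% of capacity

threshold = MAX_CONNECTIONS * ALERT_FRACTION   # 80 connections

def should_alert(peak_connections):
    """Mirror the monitor: alert when the peak (max) connection
    count over the evaluation window exceeds the threshold."""
    return peak_connections > threshold

# The example alert above reported a current value of 85:
should_alert(85)   # True: 85 > 80
should_alert(60)   # False: well within capacity
```

Deriving the threshold this way means that if `max_connections` is retuned, only one number changes and the 80% intent is preserved.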
Dashboards: The Storyteller
Dashboards are where you bring your monitors, metrics, and logs together to tell a coherent story about your system’s health. A good dashboard isn’t just a collection of graphs; it’s a curated narrative.
Consider a dashboard for your primary API service. It should include:
- Key Performance Indicators (KPIs): User-facing metrics like request latency (p95 and p99), error rates (4xx and 5xx), and throughput (requests per second).
- Resource Utilization: CPU, memory, disk I/O, and network traffic for the hosts running the API.
- Dependency Health: Metrics from services your API relies on (e.g., database connection usage, cache hit rates).
- Logs: A log stream filtered to show errors or important events related to the API.
- Active Monitors: A widget showing currently triggered monitors.
Example Dashboard Widget Configuration (JSON):
```json
{
  "viz": "timeseries",
  "requests": [
    {
      "q": "avg:web.request.latency{service:api} by {host}",
      "display_type": "area",
      "style": {
        "palette": "cool",
        "type": "solid",
        "width": "normal"
      },
      "title": "API Latency (avg)"
    },
    {
      "q": "p95:web.request.latency{service:api} by {host}",
      "display_type": "line",
      "style": {
        "palette": "fire",
        "type": "solid",
        "width": "bold"
      },
      "title": "API Latency (p95)"
    }
  ],
  "autoscale": true,
  "title": "API Request Latency (avg & p95)"
}
```
This widget overlays the average and p95 latency for the `api` service. The `"viz": "timeseries"` field tells Datadog to plot the metric over time, and the `requests` array defines the queries to run: one using the `avg` aggregation function and one using `p95`, both filtered by `service:api`. The `style` options control each trace’s appearance.
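Plotting both series matters because averages hide tail latency. A quick sketch with hypothetical latency samples (using the simple nearest-rank percentile definition for illustration) shows how far apart the two can be:

```python
# Why plot p95 alongside avg: a few slow requests barely move the
# average but dominate the 95th percentile.
import math
from statistics import mean

def p95(samples):
    """Nearest-rank 95th percentile (simple illustrative definition)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))   # 1-based nearest rank
    return ordered[rank - 1]

# 18 fast requests (50 ms) and 2 slow outliers (2000 ms).
latencies_ms = [50] * 18 + [2000] * 2
avg_ms = mean(latencies_ms)   # 245 ms: looks tolerable
tail_ms = p95(latencies_ms)   # 2000 ms: 1 in 10 users waits 2 seconds
```

The average suggests a mildly slow service; the p95 reveals that a meaningful slice of users is having a terrible experience, which is exactly the signal the bold `fire`-palette line is there to surface.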
The Counterintuitive Truth: Alerts Aren’t the Goal
The true goal isn’t to have a high number of alerts; it’s to have zero alerts that indicate a user-impacting problem. Every alert is a signal that something is deviating from expected, reliable behavior. If you’re getting too many alerts, it means either your system is genuinely unstable, or your alerting thresholds are too sensitive or poorly defined. The most effective SRE teams strive for silence, punctuated only by alerts that genuinely require immediate human intervention because they represent a clear and present danger to service availability or performance. This often means tuning thresholds so they only fire when a metric crosses a line that definitively correlates with user impact, rather than just a statistical anomaly.
The next step after setting up effective monitors and dashboards is understanding how to correlate these signals with distributed traces to pinpoint the root cause of complex issues.