Grafana dashboards are often more about the story they tell than the raw data they display, and for SREs, that story needs to be about reliability. Building dashboards that surface Service Level Objectives (SLOs) and the "golden signals" (latency, traffic, errors, saturation) is paramount.

Let’s look at a concrete example of how this plays out. Imagine a dashboard for a hypothetical e-commerce checkout service.

{
  "dashboard": {
    "title": "Checkout Service - Reliability",
    "rows": [
      {
        "panels": [
          {
            "title": "Checkout Latency (p95)",
            "type": "graph",
            "targets": [
              {
                "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"checkout-service\"}[5m])) by (le))",
                "legendFormat": "p95 Latency"
              }
            ],
            "options": {
              "unit": "s"
            }
          },
          {
            "title": "Checkout Traffic (RPS)",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{job=\"checkout-service\"}[5m]))",
                "legendFormat": "Requests/sec"
              }
            ]
          },
          {
            "title": "Checkout Error Rate",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(http_requests_total{job=\"checkout-service\", code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"checkout-service\"}[5m])) * 100",
                "legendFormat": "Error %"
              }
            ],
            "options": {
              "unit": "percent"
            }
          },
          {
            "title": "Checkout Saturation (CPU Usage)",
            "type": "graph",
            "targets": [
              {
                "expr": "avg(rate(process_cpu_seconds_total{job=\"checkout-service\"}[5m])) by (instance)",
                "legendFormat": "CPU Usage"
              }
            ],
            "options": {
              "unit": "percentunit"
            }
          }
        ]
      },
      {
        "panels": [
          {
            "title": "Checkout Service SLO: Availability",
            "type": "stat",
            "targets": [
              {
                "expr": "100 - (sum(increase(http_requests_total{job=\"checkout-service\", code=~\"5..\"}[1h])) / sum(increase(http_requests_total{job=\"checkout-service\"}[1h]))) * 100",
                "legendFormat": "Availability"
              }
            ],
            "fieldConfig": {
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {"value": 0, "color": "red"},
                  {"value": 99.9, "color": "green"}
                ]
              }
            }
          },
          {
            "title": "Checkout Service SLO: Latency",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(rate(http_request_duration_seconds_bucket{job=\"checkout-service\", le=\"1.0\"}[1h])) / sum(rate(http_request_duration_seconds_count{job=\"checkout-service\"}[1h])) * 100",
                "legendFormat": "% under 1s"
              }
            ],
            "fieldConfig": {
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {"value": 0, "color": "red"},
                  {"value": 99.9, "color": "green"}
                ]
              }
            }
          }
        ]
      }
    ]
  }
}

This JSON represents a Grafana dashboard configuration. The first row visualizes the four golden signals:

  • Latency: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="checkout-service"}[5m])) by (le)) shows the 95th percentile latency over a 5-minute window. This is crucial because averages can hide outliers that significantly impact user experience.
  • Traffic: sum(rate(http_requests_total{job="checkout-service"}[5m])) provides the requests per second, indicating the load on the service.
  • Errors: sum(rate(http_requests_total{job="checkout-service", code=~"5.."}[5m])) / sum(rate(http_requests_total{job="checkout-service"}[5m])) * 100 calculates the percentage of 5xx server errors over 5 minutes.
  • Saturation: avg(rate(process_cpu_seconds_total{job="checkout-service"}[5m])) by (instance) monitors CPU time consumed per instance, expressed as a fraction of one core (e.g. 0.5 means half a core busy, hence the percentunit display unit), a common indicator of resource constraint.

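Under the hood, rate() is just the per-second delta of a monotonically increasing counter over the window (Prometheus adds extrapolation at the edges, omitted here). A minimal Python sketch of the error-rate arithmetic, using made-up counter samples:

```python
WINDOW_S = 300  # 5m, matching the [5m] range in the panels

def prom_rate(earlier: float, later: float, window_s: float = WINDOW_S) -> float:
    """Per-second rate of a monotonically increasing counter (extrapolation omitted)."""
    return (later - earlier) / window_s

# Hypothetical counter samples at the window edges
total_start, total_end = 1_000_000, 1_030_000   # http_requests_total, all requests
err_start, err_end = 5_000, 5_060               # 5xx requests only

traffic_rps = prom_rate(total_start, total_end)
error_pct = prom_rate(err_start, err_end) / traffic_rps * 100

print(f"traffic: {traffic_rps:.1f} req/s, errors: {error_pct:.2f}%")
```

The same division structure appears in the error-rate panel above: a rate of failures over a rate of totals, scaled to a percentage.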
The second row then translates these signals into SLOs:

  • Availability SLO: 100 - (sum(increase(http_requests_total{job="checkout-service", code=~"5.."}[1h])) / sum(increase(http_requests_total{job="checkout-service"}[1h]))) * 100 calculates availability over the last hour. It measures the percentage of successful requests (non-5xx) compared to total requests. The threshold is set to 99.9% (indicated by the green color at 99.9).
  • Latency SLO: sum(rate(http_request_duration_seconds_bucket{job="checkout-service", le="1.0"}[1h])) / sum(rate(http_request_duration_seconds_count{job="checkout-service"}[1h])) * 100 computes the percentage of requests served within 1.0 second over the last hour, by dividing the cumulative le="1.0" histogram bucket by the total request count. The le value must match an upper bound that actually exists in the instrumented histogram. Again, the 99.9% threshold is key.

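A useful way to reason about that 99.9% threshold is the error budget it implies. A back-of-the-envelope sketch in Python (the window length and traffic figure are assumptions, not from the dashboard):

```python
SLO = 0.999
budget = 1 - SLO              # fraction of requests allowed to fail (~0.1%)

window_days = 30              # assumed SLO window
avg_rps = 100                 # assumed average traffic

total_requests = avg_rps * window_days * 24 * 3600
allowed_failures = total_requests * budget          # requests that may 5xx
downtime_min = window_days * 24 * 60 * budget       # if every request fails

print(f"budget: {allowed_failures:,.0f} failed requests "
      f"or {downtime_min:.1f} min of full downtime per {window_days}d")
```

Framing the SLO as "259,200 failed requests" or "43 minutes of downtime a month" tends to land harder with stakeholders than "three nines".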
The power here isn’t just seeing the metrics; it’s seeing them together and in relation to the SLOs. When latency spikes, you can immediately see if traffic is also up, if error rates are climbing, or if saturation is hitting limits. The SLO panels provide a clear, at-a-glance view of whether the service is meeting its reliability commitments, with visual cues (colors) indicating breaches.

The most surprising thing about building these dashboards is how much they influence team behavior. When SLOs are visible and tied directly to the underlying signals, teams naturally prioritize work that improves reliability metrics, not just feature development. It creates a shared understanding of what "good" looks like, beyond just "does it work?"

A common pitfall is over-engineering the SLO definitions or the dashboard queries. A short fixed window like [5m] is right for the golden-signal view, where you want fast feedback, but the SLO panels should use a window that matches the period your reliability commitment is stated over (here [1h] for illustration; production SLOs typically roll over 30 days). The availability SLO uses the Prometheus increase() function rather than rate(); in a ratio the two are mathematically equivalent, since increase() is just rate() multiplied by the window length, but increase() reads as event counts ("failed requests out of all requests"), which maps more directly onto how SLOs are usually worded.
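
To see why the choice between rate() and increase() doesn't change a ratio, note that the window length cancels. A quick sketch with made-up counter deltas:

```python
WINDOW_S = 3600  # 1h window

def rate(delta: float) -> float:
    # per-second rate of a counter over the window
    return delta / WINDOW_S

def increase(delta: float) -> float:
    # total increase over the window, as Prometheus derives it
    return rate(delta) * WINDOW_S

errors, total = 120.0, 360_000.0  # 5xx vs all requests in the hour

sli_from_rate = rate(errors) / rate(total)              # per-second ratio
sli_from_increase = increase(errors) / increase(total)  # count ratio

print(sli_from_rate, sli_from_increase)  # both ~0.000333, i.e. 0.033% errors
```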

The next step in dashboard evolution is often integrating alerting directly from these SLO panels, or building out dashboards for specific user journeys that combine multiple service golden signals and SLOs.
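
When you do wire alerts to these SLO panels, a common approach is to alert on burn rate rather than on the raw SLI. A hypothetical sketch of the arithmetic (the 30-day window and error figures are assumptions):

```python
SLO = 0.999
BUDGET = 1 - SLO            # 0.1% of requests may fail
WINDOW_HOURS = 30 * 24      # assumed 30-day SLO window

def burn_rate(observed_error_ratio: float) -> float:
    # 1.0 means the budget is spent exactly at the end of the SLO window;
    # higher values mean the budget runs out early
    return observed_error_ratio / BUDGET

def hours_to_exhaustion(observed_error_ratio: float) -> float:
    # time until the whole budget is gone at the current burn rate
    return WINDOW_HOURS / burn_rate(observed_error_ratio)

# e.g. 1.44% errors over the last hour burns a 30-day budget 14.4x too fast,
# leaving roughly two days of budget
print(round(burn_rate(0.0144), 1), round(hours_to_exhaustion(0.0144), 1))
```

Alerting on burn rate keeps pages tied to real budget consumption instead of transient blips in the error-rate graph.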

Want structured learning?

Take the full SRE course →