SLOs are a surprisingly effective way to prevent unnecessary work by forcing you to define what "good" looks like before things break.

Let’s say you’re running a critical service on Kubernetes, like a user-facing API. You’ve deployed it, it’s humming along, but how do you know it’s actually good for your users? This is where Service Level Objectives (SLOs) come in. An SLO is a target for a specific metric that represents the user experience. For our API, a good SLO might be "99.9% of requests served in under 200ms over a 30-day rolling window."

Here’s how you might instrument that. You’d typically use a metrics system like Prometheus and a service mesh like Istio or Linkerd to capture request latency.

# Example Prometheus recording rules for API request latency
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slo-rules
spec:
  groups:
  - name: api.slo
    rules:
    # Rate of requests served within 200ms
    - record: job:api_requests_good:rate5m
      expr: |
        sum(rate(api_request_duration_seconds_bucket{job="my-api", le="0.2"}[5m]))
    # Rate of all requests
    - record: job:api_requests_total:rate5m
      expr: |
        sum(rate(api_requests_total{job="my-api"}[5m]))

This Prometheus configuration records two key pieces of information: the rate of requests served within 200ms (the job:api_requests_good:rate5m rule) and the total request rate (the job:api_requests_total:rate5m rule). Note that a recording rule can't store a 30-day range directly; the 30-day rolling window is applied later, when you query the recorded series.

With these metrics in place, you can define your SLO in a tool like Thanos Ruler or directly as Prometheus rules (Alertmanager only routes and deduplicates alerts; it doesn't evaluate expressions). The calculation for your SLO would look something like this:

SLO_Compliance = avg_over_time(job:api_requests_good:rate5m[30d]) / avg_over_time(job:api_requests_total:rate5m[30d]) * 100

This formula calculates the percentage of requests that completed within 200ms over the last 30 days. If this SLO_Compliance value drops below 99.9%, you have an "SLO breach."
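The underlying arithmetic is easy to sanity-check offline. A minimal Python sketch; the function name and request counts are hypothetical, chosen only to illustrate the percentage calculation:

```python
def slo_compliance(good_requests: int, total_requests: int) -> float:
    """Percentage of requests that met the latency target."""
    if total_requests == 0:
        return 100.0  # no traffic means nothing has breached
    return 100.0 * good_requests / total_requests

# Hypothetical 30-day totals: 9,995,000 of 10,000,000 requests under 200ms
compliance = slo_compliance(9_995_000, 10_000_000)
print(f"{compliance:.2f}%")  # prints 99.95%
print("SLO breach" if compliance < 99.9 else "SLO met")  # prints SLO met
```

At 99.95% you are above the 99.9% target, but already spending half of the 0.1% error budget.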

Alerting on SLO breaches is crucial. Instead of alerting on every minor blip, you alert when the long-term user experience degrades. This prevents alert fatigue. A common pattern is to set up alert thresholds for "pre-breach" states, like 99.95% and 99.98% compliance.

# Example Prometheus Alerting Rule for SLO breach
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slo-alerts
spec:
  groups:
  - name: api.slo.alerts
    rules:
    - alert: SLOBreakdown_API_Latency
      expr: |
        (sum(rate(api_request_duration_seconds_bucket{job="my-api", le="0.2"}[5m])) / sum(rate(api_requests_total{job="my-api"}[5m]))) * 100 < 99.9
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "API Latency SLO breach detected!"
        description: "Fewer than 99.9% of API requests are being served under 200ms (5m rate), putting the 30-day SLO at risk."

This alert fires if the rate of requests served under 200ms drops below the target, giving you a heads-up before the 30-day SLO officially breaches. The for: 10m means the condition must be true for 10 minutes before the alert fires, reducing flapping.
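To see why this early warning matters, you can estimate how long the error budget will last at the current burn rate. A minimal Python sketch; the function name and example numbers are illustrative, not part of any Prometheus API:

```python
def minutes_until_budget_exhausted(
    slo_target_pct: float,          # e.g. 99.9
    current_compliance_pct: float,  # short-term compliance, e.g. 99.5
    budget_remaining: float,        # fraction of the error budget left, 0.0 to 1.0
    window_days: int = 30,
) -> float:
    """Rough minutes until the error budget runs out at the current burn rate."""
    budget_minutes = window_days * 24 * 60 * (100.0 - slo_target_pct) / 100.0
    burn_per_minute = (100.0 - current_compliance_pct) / 100.0
    if burn_per_minute <= 0:
        return float("inf")  # fully compliant: no budget being burned
    return budget_minutes * budget_remaining / burn_per_minute

# At 99.5% short-term compliance against a 99.9% target, a full 30-day
# budget (43.2 "bad" minutes) lasts roughly 8640 minutes, about 6 days.
print(minutes_until_budget_exhausted(99.9, 99.5, 1.0))
```

A sustained dip like this is survivable for days, which is exactly why alerting on the burn rate beats paging on every blip.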

When an alert does fire, a well-defined runbook is your best friend. A runbook is a step-by-step guide for diagnosing and resolving an incident. For our API latency SLO, a runbook might start with:

  1. Check the dashboard: Navigate to grafana.example.com/d/api-overview/api-performance?orgId=1&from=now-1h&to=now. Look for spikes in error rates, increased latency, or resource saturation (CPU, memory).
  2. Examine the error budget: If the SLO is dipping, check your error budget. How much "badness" do you have left? If it’s nearly depleted, you might need to consider disabling non-critical features or even rolling back a recent deployment.
  3. Trace the request: Use distributed tracing (e.g., Jaeger, Zipkin) to follow a slow request from the user’s browser through your Kubernetes cluster. Identify which service or component is introducing the latency.
  4. Check downstream dependencies: Is your API waiting on a slow database query, an external API call, or another internal service?
  5. Review recent deployments: Has a new deployment or configuration change coincided with the latency increase? kubectl rollout history deployment/my-api can help here.
  6. Scale appropriately: If resource utilization is high, consider increasing the replica count for your API deployment or scaling up underlying nodes. kubectl scale deployment my-api --replicas=5.

The surprising truth about SLOs is that their primary value isn’t in the alerts they generate, but in the shared understanding they create about what "good enough" means, forcing product and engineering teams to agree on user experience targets.

One subtle aspect of SLOs is the concept of the "error budget." If your SLO is 99.9% availability, you have a 0.1% error budget over the period. This budget is a finite resource. If you spend it, you’ve "failed" the SLO for that period. This can be a powerful tool for prioritizing work; if you’re burning through your error budget, all non-essential feature development might pause until the SLO is back on track.
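The size of that budget is simple arithmetic. A minimal Python sketch, with an illustrative function name:

```python
def error_budget_minutes(slo_target_pct: float, window_days: int = 30) -> float:
    """Total allowed 'bad' minutes in the window for a given availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_target_pct) / 100.0

print(round(error_budget_minutes(99.9), 1))   # prints 43.2 (minutes over 30 days)
print(round(error_budget_minutes(99.99), 2))  # prints 4.32
```

Each extra nine shrinks the budget tenfold, which is why tightening an SLO target is a real engineering commitment, not a dashboard tweak.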

The next logical step after mastering SLOs for your services is understanding how to manage the error budget itself, perhaps by automating feature flag rollbacks or incident response based on its consumption.
