Error budgets are a surprisingly simple mechanism for defining and enforcing reliability: they are less about achieving perfect uptime than about deciding how much unreliability you can tolerate.
Let’s see one in action. Imagine a critical microservice, "AuthService," that handles user logins. Its SLO is 99.9% availability over a 30-day rolling window. This means it can be unavailable for a maximum of (30 days * 24 hours/day * 60 minutes/hour) * 0.001 = 43.2 minutes per month. This is the error budget.
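That arithmetic is worth making concrete. A minimal sketch (the function name is ours, not from any SRE tooling):

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed downtime over the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60        # 43,200 minutes in a 30-day window
    return round(total_minutes * (1 - slo), 6)   # round away float noise

print(error_budget_minutes(0.999, 30))  # 43.2 minutes per 30-day window
print(error_budget_minutes(0.99, 30))   # 432.0 -- one fewer nine buys 10x the budget
```

Note how each additional "nine" in the SLO shrinks the budget by a factor of ten, which is why SLO targets are chosen deliberately rather than maximized.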
Here’s how the system tracks it. A Prometheus query might look like this:
```promql
avg(
  (
    1 - avg_over_time(
      up{job="auth-service"}[30d]
    )
  ) * 100
)
```
This query yields the percentage of time the auth-service job has been down over the last 30 days, averaged across its instances. If the result exceeds 0.1 (that is, more than 0.1% downtime, or 43.2 minutes), the error budget is spent.
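The same calculation can be sketched offline. Assuming each scrape sample of the `up` metric is 1 (healthy) or 0 (down) — the sample data below is made up:

```python
def downtime_percent(up_samples: list[int]) -> float:
    """Percentage of samples in which the service was down."""
    return round((1 - sum(up_samples) / len(up_samples)) * 100, 9)

def budget_spent(up_samples: list[int], slo: float = 0.999) -> float:
    """Fraction of the error budget consumed (1.0 means fully spent)."""
    allowed_percent = (1 - slo) * 100   # 0.1% for a 99.9% SLO
    return downtime_percent(up_samples) / allowed_percent

samples = [1] * 9990 + [0] * 10   # 10 failed scrapes out of 10,000
print(downtime_percent(samples))  # 0.1 -- exactly at the 0.1% threshold
print(budget_spent(samples))      # ~1.0 -- the budget is fully consumed
```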
Now, how do we enforce it? When the error budget is depleted, the system automatically triggers actions. For AuthService, this might mean:
- Slowing down new deployments: A CI/CD tool (such as Argo CD or Spinnaker) can be configured to halt all new deployments to AuthService once the error budget reaches zero or goes negative. A common configuration might be:

  ```yaml
  # Simplified example. In Argo CD, sync windows are configured on the
  # AppProject (the Application itself can still use syncPolicy.automated
  # with prune/selfHeal; the windows gate when those syncs may run).
  # Real enforcement would involve a webhook checking the error budget
  # status before allowing a sync.
  apiVersion: argoproj.io/v1alpha1
  kind: AppProject
  metadata:
    name: auth-service
  spec:
    syncWindows:
      - kind: allow
        schedule: "0 0 * * *"   # only allow syncs between midnight and 1 AM daily
        duration: 1h
      - kind: deny
        schedule: "0 1 * * *"   # disallow syncs outside the allowed window
        duration: 23h
  ```

  The mechanical effect is that the deployment pipeline refuses to push new code, preventing potentially destabilizing changes when the service is already performing poorly.
- Reducing traffic: If the service is experiencing unexpected issues, traffic can be automatically routed away from it. An ingress controller (such as Nginx or Traefik) can be dynamically updated to reduce the weight of AuthService instances. An illustrative upstream configuration (the exact schema depends on the controller):

  ```json
  {
    "upstreams": {
      "auth-service": [
        {"server": "10.0.1.1:8080", "weight": 100},
        {"server": "10.0.1.2:8080", "weight": 100}
      ]
    },
    "servers": [
      {
        "port": 80,
        "locations": [
          {"path": "/auth", "proxy_pass": "http://auth-service"}
        ]
      }
    ]
  }
  ```

  When the error budget is spent, the weights might be reduced to 0, or traffic shifted to a redundant, more reliable service. The mechanical effect is to isolate the failing service, protecting users from experiencing errors by directing their requests to a healthy fallback or a reduced set of healthy instances.
- Alerting and Escalation: Alerting rules are configured to fire high-severity alerts when the error budget is low (e.g., less than 20% remaining) and critical alerts when it is depleted, with a router such as Alertmanager delivering them to the on-call SRE team.

  ```yaml
  # Example Prometheus alerting rules (Alertmanager handles the routing)
  groups:
    - name: auth-service-errors
      rules:
        - alert: AuthServiceHighErrorBudgetBurnRate
          expr: |
            (
              avg(
                (1 - avg_over_time(up{job="auth-service"}[30d])) * 100
              ) / 0.1
            ) > 0.8  # more than 80% of the budget burned
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "AuthService error budget burn rate is high."
        - alert: AuthServiceErrorBudgetDepleted
          expr: |
            (
              avg(
                (1 - avg_over_time(up{job="auth-service"}[30d])) * 100
              ) / 0.1
            ) > 1.0  # more than 100% of the budget burned
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "AuthService error budget is depleted!"
  ```

  The mechanical effect is to ensure human intervention is triggered when the automated systems can no longer mitigate the reliability degradation.
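The three enforcement actions above can be tied together with a small policy gate. This is a hypothetical sketch; the thresholds and action names are illustrative, not part of any particular tool:

```python
def enforcement_actions(budget_remaining: float) -> list[str]:
    """Map the fraction of error budget left to enforcement actions (hypothetical)."""
    actions = []
    if budget_remaining <= 0:
        # Budget depleted: freeze changes, divert traffic, wake a human.
        actions += ["block_deploys", "shift_traffic", "page_oncall"]
    elif budget_remaining < 0.2:
        # Budget running low: warn before it is fully spent.
        actions += ["warn_oncall"]
    return actions

print(enforcement_actions(0.5))    # []
print(enforcement_actions(0.1))    # ['warn_oncall']
print(enforcement_actions(-0.05))  # ['block_deploys', 'shift_traffic', 'page_oncall']
```

In practice each action would be a call into the CI/CD tool, the ingress controller, and the pager, but the decision logic is this simple.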
The core idea is that perfect reliability is often prohibitively expensive and may not even be desirable if it means sacrificing innovation speed. Error budgets provide a data-driven way to balance these competing concerns, allowing teams to spend their "right to be unreliable" on feature development and experimentation, but forcing them to focus on stability when that budget is exhausted.
What most people miss is that the "budget" isn’t just about downtime; it’s also about latency, error rates, and any other measurable aspect of service quality that contributes to user experience. For example, an SLO might include a requirement that 99% of login requests complete within 200ms. If requests start taking longer, that also burns the error budget, even if the service is technically "up."
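A sketch of how latency can burn the same budget, assuming any request that errors or exceeds the 200ms threshold counts as "bad" (the data and function name are illustrative):

```python
def bad_request_ratio(latencies_ms: list[float], errors: list[bool],
                      threshold_ms: float = 200.0) -> float:
    """Fraction of requests that violate the SLO: errored OR slower than threshold."""
    bad = sum(1 for lat, err in zip(latencies_ms, errors)
              if err or lat > threshold_ms)
    return bad / len(latencies_ms)

latencies = [120, 95, 210, 150, 480, 130, 90, 100, 110, 140]
errors    = [False] * 9 + [True]
print(bad_request_ratio(latencies, errors))  # 0.3 -- two slow requests plus one error
```

Against a 99% latency SLO, a bad-request ratio above 0.01 burns budget even while every instance reports `up == 1`.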
The next concept you’ll encounter is how to define appropriate SLOs and error budgets for different types of services, distinguishing between user-facing critical paths and internal, less critical components.