The SRE Reliability Hierarchy is built from the bottom up: each layer is a hard prerequisite for the layers above it, not just another building block. You cannot meaningfully set SLOs for a system you cannot observe, and you cannot safely automate infrastructure that is not codified.

Imagine you’re setting up a new service. You’ve got your code, your databases, your load balancers. But before you even think about deploying, you need to consider the foundational layers that support everything.

Let’s map this out with a hypothetical service, user-profile-api, running on Kubernetes.

Layer 1: Infrastructure as Code (IaC)

This is your bedrock. If your infrastructure isn’t codified, it’s ephemeral and prone to drift.

  • What it looks like: Terraform or Pulumi configurations defining your cloud resources, Ansible playbooks for server configuration, Kubernetes YAMLs for your deployments, services, and ingress.
  • In action (Kubernetes example):
    # deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: user-profile-api
      labels:
        app: user-profile-api
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: user-profile-api
      template:
        metadata:
          labels:
            app: user-profile-api
        spec:
          containers:
          - name: api
            image: your-docker-repo/user-profile-api:v1.2.0
            ports:
            - containerPort: 8080
            resources:
              requests:
                memory: "64Mi"
                cpu: "100m"
              limits:
                memory: "128Mi"
                cpu: "200m"
    
    This YAML defines three replicas of your user-profile-api container, specifying resource requests and limits. Running kubectl apply -f deployment.yaml makes it real.
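A Deployment by itself isn't reachable from other services; the Services and Ingress mentioned above are codified the same way. A minimal Service sketch to pair with the Deployment (the port choice is illustrative):

```yaml
# service.yaml — routes cluster traffic on port 80 to the pods' port 8080
apiVersion: v1
kind: Service
metadata:
  name: user-profile-api
spec:
  selector:
    app: user-profile-api  # must match the Deployment's pod labels
  ports:
  - port: 80
    targetPort: 8080
```

Because the selector matches the Deployment's pod labels, the Service automatically load-balances across all three replicas as they come and go.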

Layer 2: Observability (Metrics, Logs, Traces)

Once your infrastructure is defined, you need to see what’s happening within it. This isn’t just about dashboards; it’s about having the right signals to understand system health.

  • What it looks like: Prometheus for metrics, Elasticsearch/Loki for logs, Jaeger/Tempo for traces.
  • In action:
    • Metrics: Your user-profile-api application exposes Prometheus metrics (e.g., HTTP request duration, error counts).
      // Example using the Prometheus Go client library
      // (github.com/prometheus/client_golang); promauto registers
      // the metric with the default registry at creation time.
      var (
          httpRequestsTotal = promauto.NewCounterVec(
              prometheus.CounterOpts{
                  Name: "http_requests_total",
                  Help: "Total number of HTTP requests received.",
              },
              []string{"method", "path", "status"},
          )
      )
      
      // In your handler, once the response status is known.
      // (In production, label with the route pattern rather than the
      // raw URL path to keep label cardinality bounded.)
      httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
      
      Prometheus scrapes these metrics, allowing you to build alerts from PromQL expressions such as sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) > 10.
    • Logs: Structured logs (JSON) sent to a central aggregator.
      {
        "timestamp": "2023-10-27T10:30:00Z",
        "level": "INFO",
        "message": "User profile retrieved successfully",
        "user_id": "abc123xyz",
        "duration_ms": 15
      }
      
    • Traces: Distributed tracing shows the journey of a request across services. A request to user-profile-api might involve calls to auth-service and data-store. Tracing visualizes this entire flow.
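The 5xx-rate alert sketched above would live in a Prometheus alerting rule file. A minimal sketch (group name, threshold, and labels are illustrative):

```yaml
# alerts.yaml — referenced from rule_files in prometheus.yml
groups:
- name: user-profile-api
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) > 10
    for: 5m  # must stay true for 5 minutes before firing
    labels:
      severity: page
    annotations:
      summary: "High 5xx rate on {{ $labels.path }}"
```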

Layer 3: Automation (CI/CD, Incident Response)

With infrastructure defined and observable, you automate repetitive tasks and build resilience.

  • What it looks like: GitHub Actions/GitLab CI for builds and deployments, automated rollback scripts, automated scaling policies.
  • In action:
    • CI/CD: A git push to your main branch triggers a pipeline: build Docker image, push to registry, deploy to Kubernetes using kubectl apply (or Helm/Kustomize).
    • Automated Response: If Prometheus alerts trigger an incident, an automation script could:
      1. Scale up replicas: kubectl scale deployment user-profile-api --replicas=5
      2. If still unhealthy after 5 minutes, roll back to the previous revision: kubectl rollout undo deployment user-profile-api
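The CI/CD flow above can be sketched as a GitHub Actions workflow. This is illustrative only: the registry name, docker login, and kubectl credentials are assumed to be configured separately and are not part of the original setup:

```yaml
# .github/workflows/deploy.yaml — assumes registry and cluster auth
# are already configured (e.g., via repository secrets)
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Build and push image
      run: |
        docker build -t your-docker-repo/user-profile-api:${{ github.sha }} .
        docker push your-docker-repo/user-profile-api:${{ github.sha }}
    - name: Roll out to Kubernetes
      run: |
        kubectl set image deployment/user-profile-api \
          api=your-docker-repo/user-profile-api:${{ github.sha }}
```

Tagging images with the commit SHA rather than a mutable tag makes every rollout traceable to an exact revision, which is what makes kubectl rollout undo trustworthy.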

Layer 4: SLOs and Error Budgets

This is where you define what "reliable" actually means for your service, quantitatively.

  • What it looks like: Service Level Objectives (SLOs) based on error budget consumption.
  • In action:
    • SLO: For user-profile-api, we define an SLO for availability: "99.95% of requests will succeed within 200ms over a rolling 28-day period."
    • Error Budget: This leaves us with a 0.05% error budget. If we burn through this budget too quickly (e.g., due to frequent 5xx errors or high latency), deployments might be paused, and engineering focus shifts to reliability fixes.
    • Measurement: Prometheus queries can track this:
      (sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))) * 100
      
      This metric, when compared against your 99.95% target, shows your current availability and error budget burn rate.
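To make the budget concrete: 0.05% of a 28-day window is only about 20 minutes of total unavailability. A quick sketch of the arithmetic:

```go
package main

import "fmt"

func main() {
	const window = 28 * 24 * 60.0 // window length in minutes
	const slo = 0.9995            // 99.95% availability target

	// The error budget is the fraction of the window you are
	// allowed to be unavailable: (1 - SLO) * window.
	budget := window * (1 - slo)
	fmt.Printf("error budget: %.1f minutes per 28 days\n", budget)
	// prints: error budget: 20.2 minutes per 28 days
}
```

Partial outages consume the budget proportionally: serving 10% of requests with errors for 200 minutes burns the same budget as 20 minutes of total downtime.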

Layer 5: Resilience Patterns

These are architectural choices and patterns that make your system robust against failure.

  • What it looks like: Circuit breakers, retries with exponential backoff, rate limiting, graceful degradation.
  • In action:
    • Circuit Breaker: If user-profile-api starts failing calls to auth-service (indicated by high error rate metrics from auth-service or its own high latency), user-profile-api’s client library (e.g., Hystrix, Resilience4j) trips a circuit breaker. Subsequent calls to auth-service are immediately failed without hitting the network, preventing cascading failures and allowing auth-service time to recover.
    • Rate Limiting: To protect downstream services or itself from overload, user-profile-api might implement rate limiting based on user ID or IP address, returning 429 Too Many Requests when limits are exceeded.

The most surprising thing about this hierarchy is how much it deviates from typical "agile" or "DevOps" thinking that often prioritizes rapid feature delivery. The SRE model explicitly states that without the lower layers being robust and automated, rapid feature delivery will inevitably lead to unreliable systems, burning through error budgets and causing customer pain.

The true power of this hierarchy is that each layer provides the necessary foundation and feedback loop for the layers above. IaC ensures your environment is repeatable, observability tells you if something is wrong, automation fixes it or alerts you, SLOs tell you how wrong is too wrong, and resilience patterns prevent minor issues from becoming catastrophic ones.

The next concept you’ll grapple with is defining meaningful error budgets for different SLOs simultaneously, and how to prioritize work when multiple error budgets are being consumed.

Want structured learning?

Take the full SRE course →