The SRE Reliability Hierarchy isn’t about building up from a base; it’s about building out from a stable core, where each layer is a prerequisite for the next, not just a building block.
Imagine you’re setting up a new service. You’ve got your code, your databases, your load balancers. But before you even think about deploying, you need to consider the foundational layers that support everything.
Let’s map this out with a hypothetical service, `user-profile-api`, running on Kubernetes.
Layer 1: Infrastructure as Code (IaC)
This is your bedrock. If your infrastructure isn’t codified, it’s ephemeral and prone to drift.
- What it looks like: Terraform or Pulumi configurations defining your cloud resources, Ansible playbooks for server configuration, Kubernetes YAMLs for your deployments, services, and ingress.
- In action (Kubernetes example):

```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-profile-api
  labels:
    app: user-profile-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-profile-api
  template:
    metadata:
      labels:
        app: user-profile-api
    spec:
      containers:
      - name: api
        image: your-docker-repo/user-profile-api:v1.2.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
```

This YAML defines three replicas of your `user-profile-api` container, specifying resource requests and limits. Running `kubectl apply -f deployment.yaml` makes it real.
Layer 2: Observability (Metrics, Logs, Traces)
Once your infrastructure is defined, you need to see what’s happening within it. This isn’t just about dashboards; it’s about having the right signals to understand system health.
- What it looks like: Prometheus for metrics, Elasticsearch/Loki for logs, Jaeger/Tempo for traces.
- In action:
  - Metrics: Your `user-profile-api` application exposes Prometheus metrics (e.g., HTTP request duration, error counts).

```go
// Example using the Prometheus client library in Go
var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests received.",
        },
        []string{"method", "path", "status"},
    )
)

// In your handler:
httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
```

Prometheus scrapes these metrics, allowing you to alert on expressions like `sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) > 10`.

  - Logs: Structured logs (JSON) sent to a central aggregator.

```json
{
  "timestamp": "2023-10-27T10:30:00Z",
  "level": "INFO",
  "message": "User profile retrieved successfully",
  "user_id": "abc123xyz",
  "duration_ms": 15
}
```

  - Traces: Distributed tracing shows the journey of a request across services. A request to `user-profile-api` might involve calls to `auth-service` and `data-store`. Tracing visualizes this entire flow.
Layer 3: Automation (CI/CD, Incident Response)
With infrastructure defined and observable, you automate repetitive tasks and build resilience.
- What it looks like: GitHub Actions/GitLab CI for builds and deployments, automated rollback scripts, automated scaling policies.
- In action:
  - CI/CD: A `git push` to your `main` branch triggers a pipeline: build the Docker image, push it to the registry, deploy to Kubernetes using `kubectl apply` (or Helm/Kustomize).
  - Automated response: If Prometheus alerts trigger an incident, an automation script could:
    - Scale up replicas: `kubectl scale deployment user-profile-api --replicas=5`
    - If still unhealthy after 5 minutes, roll back to the previous revision: `kubectl rollout undo deployment user-profile-api`
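That remediation flow can be sketched as a small Go program that shells out to `kubectl`. Everything here is illustrative: the `remediate` helper, the hardcoded replica count, and the stubbed health check (which in practice would query Prometheus):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runCmd is swappable so the remediation logic can be tested without kubectl.
var runCmd = func(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %v: %s", name, args, err, out)
	}
	return nil
}

// remediate scales the deployment up, waits for the grace period, and if the
// health check still fails, rolls back to the previous revision.
// It returns the actions it took, in order.
func remediate(deployment string, grace time.Duration, healthy func() bool) ([]string, error) {
	actions := []string{"scale"}
	if err := runCmd("kubectl", "scale", "deployment", deployment, "--replicas=5"); err != nil {
		return actions, err
	}
	time.Sleep(grace)
	if healthy() {
		return actions, nil
	}
	actions = append(actions, "rollback")
	return actions, runCmd("kubectl", "rollout", "undo", "deployment", deployment)
}

func main() {
	// In production healthy() would check whether the alert has resolved.
	actions, err := remediate("user-profile-api", 5*time.Minute, func() bool { return false })
	fmt.Println(actions, err)
}
```

Keeping the escalation logic in code (scale first, roll back only if scaling doesn't help) makes the response auditable and testable, unlike ad-hoc runbook steps.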
Layer 4: SLOs and Error Budgets
This is where you define what "reliable" actually means for your service, quantitatively.
- What it looks like: Service Level Objectives (SLOs) based on error budget consumption.
- In action:
  - SLO: For `user-profile-api`, we define an availability SLO: "99.95% of requests will succeed within 200ms over a rolling 28-day period."
  - Error budget: This leaves us with a 0.05% error budget. If we burn through this budget too quickly (e.g., due to frequent 5xx errors or high latency), deployments might be paused, and engineering focus shifts to reliability fixes.
  - Measurement: Prometheus queries can track this:

```promql
(sum(rate(http_requests_total{status=~"2.."}[28d])) / sum(rate(http_requests_total[28d]))) * 100
```

This metric, compared against your 99.95% target, shows your current availability and error budget burn rate.
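The budget arithmetic is worth making concrete. A small Go sketch (the `budgetStatus` helper is hypothetical) converts raw request counts into availability and the fraction of the error budget consumed:

```go
package main

import "fmt"

// budgetStatus reports measured availability and the fraction of the error
// budget consumed, given good/total request counts and the SLO target.
func budgetStatus(good, total, slo float64) (availability, budgetUsed float64) {
	availability = good / total
	allowedErrorRate := 1 - slo // e.g., 0.0005 for a 99.95% SLO
	actualErrorRate := 1 - availability
	budgetUsed = actualErrorRate / allowedErrorRate
	return availability, budgetUsed
}

func main() {
	// Over 28 days: 10,000,000 requests, 3,000 of them failed.
	avail, used := budgetStatus(10_000_000-3_000, 10_000_000, 0.9995)
	fmt.Printf("availability %.4f%%, error budget used %.0f%%\n", avail*100, used*100)
	// A 0.03% error rate against a 0.05% budget consumes 60% of the budget.
}
```

Expressing burn as a fraction of the budget (rather than a raw error rate) is what lets you set policy thresholds, e.g., "page if more than 2% of the 28-day budget burns in one hour."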
Layer 5: Resilience Patterns
These are architectural choices and patterns that make your system robust against failure.
- What it looks like: Circuit breakers, retries with exponential backoff, rate limiting, graceful degradation.
- In action:
  - Circuit breaker: If `user-profile-api` starts failing calls to `auth-service` (indicated by high error rate metrics from `auth-service` or its own high latency), `user-profile-api`'s client library (e.g., Hystrix, Resilience4j) trips a circuit breaker. Subsequent calls to `auth-service` are immediately failed without hitting the network, preventing cascading failures and allowing `auth-service` time to recover.
  - Rate limiting: To protect downstream services or itself from overload, `user-profile-api` might implement rate limiting based on user ID or IP address, returning `429 Too Many Requests` when limits are exceeded.
The most surprising thing about this hierarchy is how much it deviates from typical "agile" or "DevOps" thinking that often prioritizes rapid feature delivery. The SRE model explicitly states that without the lower layers being robust and automated, rapid feature delivery will inevitably lead to unreliable systems, burning through error budgets and causing customer pain.
The true power of this hierarchy is that each layer provides the necessary foundation and feedback loop for the layers above. IaC ensures your environment is repeatable, observability tells you if something is wrong, automation fixes it or alerts you, SLOs tell you how wrong is too wrong, and resilience patterns prevent minor issues from becoming catastrophic ones.
The next concept you’ll grapple with is defining meaningful error budgets for different SLOs simultaneously, and how to prioritize work when multiple error budgets are being consumed.