SLOs aren’t just metrics; they’re the contractual agreement between users and the systems serving them.
Consider a hypothetical e-commerce platform. We want to ensure users can reliably browse products, add them to their cart, and complete purchases.
Here’s a simplified view of the services involved:
- product-catalog-service: Serves product information (details, pricing, images).
- cart-service: Manages user shopping carts.
- checkout-service: Handles the payment and order finalization process.
Let’s define SLOs for the product-catalog-service. We’ll focus on two key aspects: availability and latency.
```yaml
# Example SLO definition for product-catalog-service
---
service_name: product-catalog-service
slo_version: 1
availability:
  short_name: avail
  description: "Percentage of successful product catalog requests"
  target: 99.95%
  window: 28d  # Rolling 28-day window
  measurement:
    # This would typically come from an observability system (e.g., Prometheus, Datadog).
    # It's a count of total requests and a count of successful requests.
    # For example: (sum(requests_total) - sum(requests_failed)) / sum(requests_total)
    success_criteria: "request_status_code_is_2xx OR request_status_code_is_3xx"
    failure_criteria: "request_status_code_is_4xx OR request_status_code_is_5xx"
latency:
  short_name: lat95
  description: "95th percentile latency for product catalog requests"
  target: 200ms
  window: 28d  # Rolling 28-day window
  measurement:
    # This would also come from an observability system.
    # It's the 95th percentile of the `request_duration_seconds` metric.
    # For example: histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le))
    distribution: "histogram"
    percentile: 0.95
    threshold_ms: 200
```
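Once written down, a definition like this can be sanity-checked programmatically before it reaches an alerting pipeline. Here is a minimal sketch, with the definition represented as a plain Python dict mirroring the YAML fields above; the validation rules are illustrative assumptions, not a standard schema:

```python
# SLO definition mirroring the YAML fields above, as a plain dict.
SLO = {
    "service_name": "product-catalog-service",
    "slo_version": 1,
    "availability": {"target": 99.95, "window_days": 28},
    "latency": {"percentile": 0.95, "threshold_ms": 200, "window_days": 28},
}

def validate_slo(slo: dict) -> list[str]:
    """Return a list of problems; an empty list means the definition looks sane."""
    problems = []
    if not slo.get("service_name"):
        problems.append("missing service_name")
    avail = slo.get("availability", {})
    if not 0 < avail.get("target", -1) < 100:
        problems.append("availability target must be strictly below 100%")
    lat = slo.get("latency", {})
    if lat.get("threshold_ms", 0) <= 0:
        problems.append("latency threshold must be positive")
    return problems

print(validate_slo(SLO))  # [] — no problems found
```

Note the check that the availability target is strictly below 100%: as discussed later, a 100% target leaves no error budget at all.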
The system might be instrumented with Prometheus. Requests to product-catalog-service on http://localhost:8080/products would generate metrics like:
```
http_requests_total{method="GET", path="/products", status_code="200"}
http_requests_total{method="GET", path="/products", status_code="500"}
http_request_duration_seconds_bucket{le="0.1", path="/products"}
http_request_duration_seconds_bucket{le="0.2", path="/products"}
http_request_duration_seconds_bucket{le="0.5", path="/products"}
```
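To make these metric shapes concrete, here is a small plain-Python sketch of how a client library (such as prometheus_client) accumulates a counter and cumulative histogram buckets; the label values and bucket boundaries follow the examples above, and the request data is made up:

```python
import math
from collections import defaultdict

# Counter: one running total per label combination (method, path, status_code).
http_requests_total = defaultdict(int)

# Histogram: cumulative buckets keyed by upper bound `le`, as Prometheus exposes them.
BUCKET_BOUNDS = [0.1, 0.2, 0.5, math.inf]
http_request_duration_seconds_bucket = defaultdict(int)

def record_request(method: str, path: str, status_code: str, duration_s: float) -> None:
    http_requests_total[(method, path, status_code)] += 1
    for le in BUCKET_BOUNDS:
        if duration_s <= le:  # cumulative: a fast request lands in every larger bucket too
            http_request_duration_seconds_bucket[(path, le)] += 1

# Three hypothetical requests to /products.
record_request("GET", "/products", "200", 0.15)
record_request("GET", "/products", "200", 0.05)
record_request("GET", "/products", "500", 0.45)

print(http_requests_total[("GET", "/products", "200")])          # 2
print(http_request_duration_seconds_bucket[("/products", 0.2)])  # 2
```

The cumulative bucket layout (a 150ms request increments the 0.2s, 0.5s, and +Inf buckets) is what makes percentile estimation from these series possible.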
An observability platform (like Grafana with Prometheus) would then query these metrics to calculate the SLOs:
Availability Calculation:
Imagine that over the last 28 days, we had 100 billion requests to the product-catalog-service.
Out of those, 1 million were 5xx errors, and 500,000 were 4xx errors.
Total bad requests = 1.5 million.
Total requests = 100 billion.
Success rate = (100,000,000,000 - 1,500,000) / 100,000,000,000 = 99.9985%
This is above our target of 99.95%, so the availability SLO is met.
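The same calculation is easy to express as a function. A quick sketch, with arbitrary illustrative request counts rather than the ones from the example above:

```python
def availability(total_requests: int, bad_requests: int) -> float:
    """Success rate as a percentage over the SLO window."""
    return 100.0 * (total_requests - bad_requests) / total_requests

TARGET = 99.95

# Hypothetical counts for one 28-day window.
rate = availability(total_requests=1_000_000_000, bad_requests=150_000)
print(f"{rate:.4f}% (target {TARGET}%)")
print("SLO met" if rate >= TARGET else "SLO missed")
```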
Latency Calculation:
The 95th percentile latency is calculated from the http_request_duration_seconds_bucket metric. If the 95th percentile of all requests served in the last 28 days was 180ms, this SLO is also met.
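The percentile estimate works the way histogram_quantile does: find the first cumulative bucket whose count reaches the target rank, then interpolate linearly within that bucket. A minimal sketch with made-up bucket counts:

```python
import math

def percentile_from_buckets(buckets: list[tuple[float, int]], q: float) -> float:
    """Estimate the q-th quantile from cumulative histogram buckets.

    `buckets` is a sorted list of (upper_bound_le, cumulative_count), mirroring
    Prometheus `_bucket` series. Interpolates linearly inside the target bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # can't interpolate into the +Inf bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + fraction * (le - prev_le)
        prev_le, prev_count = le, count
    return prev_le

# Made-up cumulative counts for http_request_duration_seconds_bucket{path="/products"}:
# 700 of 1000 requests finished within 100ms, 960 within 200ms, all within 500ms.
buckets = [(0.1, 700), (0.2, 960), (0.5, 1000), (math.inf, 1000)]
p95 = percentile_from_buckets(buckets, 0.95)
print(f"p95 ≈ {p95 * 1000:.0f}ms")  # ≈ 196ms, under the 200ms target
```

One consequence of this interpolation is that the estimate's accuracy depends on bucket boundaries: with buckets at 100ms and 200ms, a true p95 anywhere in that range maps to a value spread across it, so bucket layout should be chosen with the SLO threshold in mind.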
The critical insight is that SLOs aren’t just for reporting; they are for driving action. If the availability SLO dips to 99.94%, it triggers an incident response: engineers are paged, and their immediate priority shifts from feature development to restoring the SLO. This creates a clear, objective mechanism for prioritizing reliability work. The "error budget", the amount of unreliability the SLO target permits (0.05% of requests for a 99.95% target), becomes a tangible quantity. If the error budget is spent, new features are paused until the SLO is back on track.
The most counterintuitive aspect of SLOs is that they are intentionally set below 100%. Aiming for 100% availability is often an engineering impossibility and leads to over-provisioning, unnecessary complexity, and a stifling of innovation. By accepting a small, quantifiable amount of unreliability (the error budget), teams can iterate faster, knowing they have a buffer. This buffer allows for planned maintenance, occasional outages, and the inherent imperfections of complex distributed systems without jeopardizing the user experience for the vast majority of the time.
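The error budget falls directly out of the target. Expressed as downtime minutes (one common framing), a quick sketch of the arithmetic for the 99.95% target over a 28-day window:

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO target permits per window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_pct) / 100.0

budget = error_budget_minutes(99.95, 28)
print(f"{budget:.2f} minutes of downtime per 28 days")  # 20.16 minutes
```

In other words, a 99.95% target leaves roughly 20 minutes of total unavailability per 28-day window, which the team can "spend" on risky deploys, maintenance, or unplanned incidents.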
The next step is understanding how to use the error budget.