Service Level Objectives (SLOs) are the backbone of modern SRE, but most teams get them wrong by focusing on availability instead of user-facing success.
Let’s watch an SLO in action. Imagine a simple API that serves user profile data.
```json
{
  "service": "user-profile-api",
  "endpoints": {
    "/users/{id}": {
      "method": "GET",
      "success_codes": [200],
      "error_codes": [400, 404, 500, 503],
      "latency_threshold_ms": 150
    }
  },
  "slo_targets": {
    "availability": 99.9,
    "latency_p95": 150
  }
}
```
This config defines two SLOs for the /users/{id} GET endpoint:
- Availability: 99.9% of requests must return a success code (200).
- Latency (95th percentile): 95% of requests must complete within 150 milliseconds.
Now, let’s see how a request impacts these SLOs. A single successful GET request to /users/123 within 100ms increments both the "good" request count for availability and the "fast" request count for latency. Conversely, a 500 error, or a successful request that takes 300ms, would count as a "bad" event for its respective SLO. The system continuously tracks these counts over a rolling window (e.g., 30 days).
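The per-request bookkeeping above can be sketched in a few lines of Python. This is a minimal illustration, not a real monitoring pipeline; the `Request` record and `classify` function are hypothetical names invented for this example. Note that, per the definitions above, each request is scored independently against each SLO.

```python
from dataclasses import dataclass

# Hypothetical per-request record; field names are assumptions for this sketch.
@dataclass
class Request:
    status_code: int
    latency_ms: float

def classify(req: Request, latency_threshold_ms: float = 150) -> tuple[bool, bool]:
    """Return (good_for_availability, good_for_latency) for one request."""
    good_avail = 200 <= req.status_code < 300
    # Latency is tracked over all requests, matching the formula
    # (Requests within Threshold / Total Requests) used for the latency SLI.
    good_latency = req.latency_ms <= latency_threshold_ms
    return good_avail, good_latency

classify(Request(200, 100))  # (True, True): good for both SLOs
classify(Request(200, 300))  # (True, False): a success, but too slow
classify(Request(500, 50))   # (False, True): fast, but an error
```

Each request therefore increments (or fails to increment) two independent "good" counters, which the monitoring system sums over the rolling window.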
The core problem SLOs solve is the ambiguity in "is the service working?". Without them, "working" is a subjective debate between engineering and product. SLOs provide a shared, objective language. They force teams to quantify what good user experience actually means, not just whether the servers are technically "up." This means thinking about what users experience: Is the page loading fast enough? Is the data being returned correctly?
Internally, measuring SLOs typically involves a monitoring system that samples requests. For each request, it records:
- Status: Was it a success (e.g., HTTP 2xx) or an error (e.g., HTTP 5xx, network errors)?
- Latency: How long did the request take from the client’s perspective?
These measurements are aggregated over the rolling window (e.g., 30 days). Availability is (Successful Requests / Total Requests) * 100. Latency compliance is (Requests within Threshold / Total Requests) * 100 for the specified percentile. The denominator is crucial: it counts all requests that were attempted, not just the successful ones.
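The two formulas above can be checked with a small worked example. This is a sketch under the assumption that the window's events fit in memory as `(status_code, latency_ms)` tuples; real systems aggregate counters incrementally instead.

```python
def compute_slis(events, latency_threshold_ms=150):
    """Compute (availability %, latency SLI %) over a window of
    (status_code, latency_ms) tuples, per the formulas above."""
    total = len(events)  # all attempted requests, including errors
    if total == 0:
        return 100.0, 100.0
    ok = sum(1 for status, _ in events if 200 <= status < 300)
    fast = sum(1 for _, ms in events if ms <= latency_threshold_ms)
    return ok / total * 100, fast / total * 100

# 3 errors (each fast) among 1,000 requests:
events = [(200, 100)] * 997 + [(500, 50)] * 3
compute_slis(events)  # → (99.7, 100.0): availability dips, latency holds
```

Note how the two SLIs move independently: the errors here were fast, so only the availability number suffers.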
The real power of SLOs comes from enforcement via the "error budget", which is the complement of the SLO target (0.1% of requests for a 99.9% availability target). Every "bad" event depletes the budget; a breach means the budget has run out. When it does, the team loses the "permission" to deploy new features, and all engineering effort shifts to reliability work until the budget is replenished. This creates a direct, data-driven feedback loop between feature velocity and service stability.
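The budget arithmetic is simple enough to sketch directly. The function name and the deploy/freeze framing below are illustrative assumptions, not a standard API:

```python
def error_budget_remaining(total_requests, bad_requests, slo_target=99.9):
    """Remaining error budget, in requests, for the current window.
    Negative means the budget is exhausted (the SLO is breached)."""
    allowed_bad = total_requests * (1 - slo_target / 100)  # e.g., 0.1% of traffic
    return allowed_bad - bad_requests

# With 1,000,000 requests and a 99.9% target, the budget is ~1,000 bad requests.
error_budget_remaining(1_000_000, 400)    # ≈ 600 left: deploys can continue
error_budget_remaining(1_000_000, 1_200)  # negative: freeze features, fix reliability
```

In practice a release gate would check this value (or a burn-rate projection of it) before allowing a deploy.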
Most teams focus on the availability SLO, which in practice often collapses into a measure of server uptime. But a service can be 100% available and still feel completely broken to users if it is slow or returns incorrect data. The latency SLO, especially at higher percentiles like p95 or p99, is often a much better proxy for user experience: a slow response, even a successful one, means the user didn't get what they wanted efficiently.
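A tiny example shows why the tail percentiles matter. This sketch uses the nearest-rank percentile method over an in-memory list of latency samples; that is an assumption for illustration, since production monitoring systems typically use histograms or quantile sketches instead.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Every request "succeeds" (100% availability), but 10% of them are painful:
latencies = [100] * 90 + [900] * 10
percentile(latencies, 50)  # → 100: the median looks perfectly healthy
percentile(latencies, 95)  # → 900: p95 exposes the slow tail users actually feel
```

The median hides the slow tail entirely; only the higher percentile reveals that one in ten users waits nearly a second.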
The next step after defining and measuring SLOs is to integrate them into your incident response and release processes.