The most surprising thing about Service Level Objectives (SLOs) is that their primary purpose isn’t to ensure reliability, but to manage expectations about it.
Let’s see this in action. Imagine a hypothetical e-commerce checkout service. We want to track how often users successfully complete a purchase.
Here’s a simplified view of how we might collect this data:
```json
{
  "timestamp": "2023-10-27T10:00:01Z",
  "event": "checkout_attempt",
  "user_id": "user123",
  "status": "success"
}
{
  "timestamp": "2023-10-27T10:00:05Z",
  "event": "checkout_attempt",
  "user_id": "user456",
  "status": "failure",
  "error_code": "payment_timeout"
}
{
  "timestamp": "2023-10-27T10:00:06Z",
  "event": "checkout_attempt",
  "user_id": "user789",
  "status": "success"
}
```
From this stream of events, we can define our Service Level Indicator (SLI). A common SLI for a checkout service is "successful checkout rate." We calculate this by taking the count of `checkout_attempt` events with `status: "success"` over a given period and dividing it by the total count of `checkout_attempt` events in that same period.
For example, over one minute:
- Total `checkout_attempt` events: 150
- Successful `checkout_attempt` events: 145
- SLI = (145 / 150) * 100 ≈ 96.67%
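The calculation above is simple enough to sketch directly. Here is a minimal Python version, using illustrative event records shaped like the sample stream:

```python
# Hypothetical batch of checkout events for one measurement period.
events = [
    {"event": "checkout_attempt", "status": "success"},
    {"event": "checkout_attempt", "status": "failure", "error_code": "payment_timeout"},
    {"event": "checkout_attempt", "status": "success"},
]

# SLI = successful checkout attempts / total checkout attempts.
attempts = [e for e in events if e["event"] == "checkout_attempt"]
successes = [e for e in attempts if e["status"] == "success"]
sli = len(successes) / len(attempts)

print(f"SLI: {sli:.2%}")  # 2 of 3 attempts succeeded: 66.67%
```

In production this ratio would be computed by your metrics backend over millions of events, but the arithmetic is exactly this.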
Now, we need to decide what level of success is acceptable for our users. This is where the Service Level Objective (SLO) comes in. An SLO is a target value or range for an SLI, usually over a specific time window. For our checkout service, we might set an SLO like: "99.9% of checkout attempts will be successful over a rolling 30-day period."
The Service Level Agreement (SLA) is the contractual commitment that often accompanies an SLO. It defines the consequences if the SLO is not met. This usually involves financial penalties, service credits, or other remedies for the customer. For instance, an SLA might state: "If the successful checkout rate falls below 99.9% over 30 days, the customer will receive a 10% credit on their next invoice." The SLA is what makes reliability a business concern, not just an engineering one.
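The SLA clause above is, at bottom, a simple rule over the measured SLI. A sketch of that rule in Python (the constants mirror the hypothetical contract in the text):

```python
SLO_TARGET = 0.999   # 30-day successful checkout rate objective
CREDIT_PERCENT = 10  # contractual remedy if the objective is missed

def sla_credit(measured_sli: float) -> int:
    """Return the invoice credit percentage owed for this billing period."""
    return CREDIT_PERCENT if measured_sli < SLO_TARGET else 0

print(sla_credit(0.9995))  # objective met: 0
print(sla_credit(0.9980))  # objective missed: 10
```

Real SLAs usually have tiered remedies (larger credits for deeper misses) and exclusions for planned maintenance, but the shape is the same: a contractual consequence keyed to a measured number.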
The system works by continuously measuring the SLI (successful checkouts) and comparing it against the SLO target (99.9%). If the SLI dips below the SLO, it triggers an alert: the crucial signal that something is wrong and needs immediate attention. The system doesn't fix the problem; it surfaces the problem by showing the SLI deviating from the SLO.
Here’s how you might configure this in a monitoring system (conceptual, not specific syntax):
```yaml
service: ecommerce-checkout
monitors:
  - name: successful_checkout_rate
    type: ratio
    numerator:
      query: "count(event{name='checkout_attempt', status='success'})"
    denominator:
      query: "count(event{name='checkout_attempt'})"
    window: 30d
    target_sli: 0.999
    alert_threshold: 0.995  # Alert if it drops below this, giving room to recover before SLO breach
    severity: critical
```
This configuration tells the system: "For the 'ecommerce-checkout' service, track the ratio of successful checkout attempts to total checkout attempts over the last 30 days. If this ratio ever drops below 99.5%, raise a critical alert. If it drops below 99.9% for the entire 30-day period, the SLO is breached."
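The evaluation logic that configuration describes can be sketched in a few lines. This is an illustrative simplification, not any particular monitoring system's implementation; it assumes you already have the short-term ratio and the rolling 30-day ratio computed:

```python
SLO_TARGET = 0.999       # breach if the 30-day ratio falls below this
ALERT_THRESHOLD = 0.995  # critical alert if the ratio ever dips below this

def check(rolling_30d_ratio: float, short_term_ratio: float) -> list[str]:
    """Classify the current state as described by the configuration above."""
    signals = []
    if short_term_ratio < ALERT_THRESHOLD:
        signals.append("critical_alert")
    if rolling_30d_ratio < SLO_TARGET:
        signals.append("slo_breached")
    return signals

print(check(0.9995, 0.9990))  # healthy: no signals
print(check(0.9985, 0.9940))  # dipped hard and breached the 30-day objective
```

Note the two conditions are independent: a brief dip below 99.5% fires an alert and burns error budget, while the SLO breach is judged only on the full 30-day aggregate.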
The power of SLOs lies in their ability to create a shared understanding of reliability between engineering and business stakeholders. When an SLO is met, it means users are experiencing the service at an acceptable level of quality, and the business is protected from penalties. When it’s not met, it’s a clear, data-driven signal that resources must be prioritized to improve the system.
What most people miss is that the SLI isn’t a direct measure of user happiness, but a proxy for it. A high successful checkout rate usually means happy users, but it doesn’t account for other factors like checkout speed, UI friction, or payment option availability. Over-optimizing for a single SLI without considering the broader user experience can lead to a system that technically meets its objective but is frustrating to use.
The next step after defining and measuring your SLOs is understanding error budgets and how to use them to balance reliability with feature velocity.