Reliability Metrics: SLO, SLI, SLA Demystified

The most surprising truth about SLOs, SLIs, and SLAs is that they aren’t just jargon for "good enough"; they’re fundamentally different tools for managing user trust and system reliability, and most teams get them subtly wrong.

Let’s see this in action. Imagine a critical user journey: a customer searching for a product on an e-commerce site.

Here’s a simplified view of how we’d monitor this journey:

{
  "service": "product-search-api",
  "metrics": {
    "request_count": 15000,
    "error_count": 75,
    "latency_p95_ms": 450,
    "latency_p99_ms": 800
  },
  "timestamp": "2023-10-27T10:30:00Z"
}

This JSON snippet is a snapshot of metrics reported by our product-search-api service. It tells us that, over the reporting interval, the service handled 15,000 requests and saw 75 errors, that 95% of requests completed within 450 milliseconds, and that 99% completed within 800 milliseconds.

Now, let’s break down how SLOs, SLIs, and SLAs fit into this picture.

SLI (Service Level Indicator): The Raw Measurement

An SLI is a quantitative measure of some aspect of the service’s performance, computed from the raw data you collect. From the snapshot above, we can derive SLIs such as:

Request Success Rate: (request_count - error_count) / request_count
Latency (p95): The 95th percentile of request latency.
Availability: Often measured as the percentage of successful requests over a given period.

For our product-search-api, the success rate SLI works out to (15000 - 75) / 15000 = 99.5%. For latency, the p95 SLI is 450ms. These are just numbers; they don’t inherently tell you if things are "good" or "bad."
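
As a minimal sketch, the SLI arithmetic above can be reproduced from the snapshot in a few lines of Python. The helper name `success_rate` and the choice to treat zero traffic as fully successful are assumptions made for illustration, not part of any monitoring product.

```python
import json

# Snapshot copied from the example above.
snapshot = json.loads("""
{
  "service": "product-search-api",
  "metrics": {
    "request_count": 15000,
    "error_count": 75,
    "latency_p95_ms": 450,
    "latency_p99_ms": 800
  },
  "timestamp": "2023-10-27T10:30:00Z"
}
""")


def success_rate(metrics: dict) -> float:
    """Fraction of requests that did not error (hypothetical helper)."""
    total = metrics["request_count"]
    if total == 0:
        # Assumption: with no traffic, report 100% rather than divide by zero.
        return 1.0
    return (total - metrics["error_count"]) / total


# The SLIs are plain measurements; no target is applied yet.
slis = {
    "success_rate": success_rate(snapshot["metrics"]),
    "latency_p95_ms": snapshot["metrics"]["latency_p95_ms"],
}

print(f"success rate: {slis['success_rate']:.1%}")
print(f"p95 latency: {slis['latency_p95_ms']} ms")
```

Nothing in this code says whether 99.5% is acceptable; that judgment is the job of the SLO.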

SLO (Service Level Objective): The Target

An SLO is a target value or range for an SLI over a specified period. It’s your goal for reliability. It answers the question: "What level of performance are we aiming to provide our users?"

Based on our SLIs, we might set SLOs like:

Search Success Rate SLO: 99.9% of requests succeed over a 30-day rolling window.
Search Latency SLO (p95): p95 latency below 500ms over a 30-day rolling window.

Notice the period (30-day rolling window). This is crucial. An SLO isn’t about a single moment; it’s about sustained performance. If our search success rate dips to 99.5% for a few minutes but recovers, it might not violate the 99.9% SLO over 30 days. However, if we have a prolonged outage or a consistent stream of errors, we will violate the SLO.
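
To make the effect of the window concrete, here is a rough sketch that evaluates the 99.9% success rate SLO over 30 days of per-minute buckets. The traffic level (1,000 requests per minute), the two scenarios, and the helper name `window_success_rate` are all assumptions chosen for illustration.

```python
from typing import Iterable, Tuple

SLO_TARGET = 0.999                # 99.9% success, from the SLO above
MINUTES_IN_WINDOW = 30 * 24 * 60  # 43,200 one-minute buckets in 30 days


def window_success_rate(buckets: Iterable[Tuple[int, int]]) -> float:
    """Success rate over all (request_count, error_count) buckets."""
    total = errors = 0
    for requests, failed in buckets:
        total += requests
        errors += failed
    return 1.0 if total == 0 else (total - errors) / total


# Assumed traffic: 1,000 requests per minute.
healthy = (1000, 0)   # no errors in this minute
degraded = (1000, 5)  # 99.5% success in this minute

# Scenario 1: a five-minute dip to 99.5%, healthy otherwise.
brief_dip = [degraded] * 5 + [healthy] * (MINUTES_IN_WINDOW - 5)

# Scenario 2: a full week at 99.5%, healthy otherwise.
week = 7 * 24 * 60
long_degradation = [degraded] * week + [healthy] * (MINUTES_IN_WINDOW - week)

for name, window in [("brief dip", brief_dip), ("week-long degradation", long_degradation)]:
    rate = window_success_rate(window)
    verdict = "meets" if rate >= SLO_TARGET else "violates"
    print(f"{name}: {rate:.4%} {verdict} the SLO")
```

Under these assumptions the brief dip barely moves the 30-day figure, while a week at 99.5% pulls the window below 99.9%.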

The real power of SLOs comes from setting them based on user experience, not just system health. If users abandon searches after 1 second, then a p95 latency of 500ms is a meaningful target for user satisfaction.

SLA (Service Level Agreement): The Contract

An SLA is a contract that states the service level promised to customers and the consequences if that level is not met. It’s the external promise, often with financial penalties. Think of it as the "promise to the customer" and what happens when that promise is broken.

For our e-commerce site, an SLA might state:

"If the Search Success Rate SLO (99.9% over 30 days) is breached, customers who purchased a product within 24 hours of a failed search will receive a 10% discount on their next order."
"If the Search Latency SLO (p95 < 500ms over 30 days) is breached, we will offer free expedited shipping on all orders placed in the subsequent month."

The critical distinction is that SLAs are almost always tied to SLOs. You don’t have an SLA for raw SLIs; you have an SLA for failing to meet an SLO. This is why you often hear "SLO is what you aim for, SLA is what happens when you miss it."

SLA thresholds are typically looser than the internal SLOs, which leaves a buffer between the two. For instance, a company might have an SLO of 99.9% availability, but their SLA might only promise 99.5% availability, with penalties kicking in if they fall below that. (The example clauses above reuse the SLO figures for simplicity.) This buffer accounts for the inherent difficulty of achieving perfect reliability and allows for some downtime or degraded performance without triggering contractual penalties.
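
The buffer between the two thresholds can be sketched as a three-way classification. This uses the 99.9% SLO and 99.5% SLA figures from the paragraph above; the function name `classify` and the sample measurements are hypothetical.

```python
SLO_TARGET = 99.9  # internal objective, percent availability
SLA_TARGET = 99.5  # contractual threshold, percent availability


def classify(availability_percent: float) -> str:
    """Place a measured availability relative to the SLO and the SLA."""
    if availability_percent >= SLO_TARGET:
        return "within SLO"
    if availability_percent >= SLA_TARGET:
        # The buffer zone: an internal miss, but no contractual penalty.
        return "SLO missed, SLA still met"
    return "SLA breached"


# Hypothetical measurements for three different 30-day windows.
for measured in (99.95, 99.7, 99.2):
    print(f"{measured}% -> {classify(measured)}")
```

The middle state is the point of the buffer: the team has an internal signal to act on before any customer-facing penalty applies.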

The one thing most people don’t know is how expensive it can be to achieve extremely high levels of reliability. Going from 99.9% to 99.999% availability (often called "five nines") doesn’t just double or triple the cost; it can increase it by an order of magnitude or more due to the complexity of the redundant systems, failover mechanisms, and continuous monitoring required. This is why SLOs are so important: they help you find the right balance between reliability and cost, and SLAs protect users when you fail to strike that balance.
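
One way to see why each extra nine is hard is to convert availability targets into allowed downtime. The sketch below assumes a 365-day year and a hypothetical helper named `downtime_budget_minutes`; it says nothing about cost itself, only about how little room for failure remains.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for target in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

At 99.9% the yearly allowance is roughly 526 minutes (under nine hours); at 99.999% it shrinks to a little over five minutes, which leaves almost no time for a human to respond to an incident.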

Ultimately, SLOs and SLIs are internal tools to guide your engineering efforts, while SLAs are external commitments that manage customer expectations and risk. Understanding their distinct roles is key to building trustworthy and resilient systems.

The next concept you’ll grapple with is error budgets: how much unreliability your SLO allows, and what to do when you use it up.
