The most surprising thing about SRE metrics is that the ones everyone talks about are often the least useful for actually improving reliability.
Let’s say you’re running a web service. You’ve got your standard dashboard: requests per second, CPU utilization, memory usage, network traffic. These are activity metrics. They tell you what the system is doing, but not how well it’s doing it from the user’s perspective.
Imagine this: your request rate is steady, CPU is at 30%, memory is fine. Looks good, right? But what if, for the last hour, 50% of those requests have actually been failing, or taking 10 seconds to complete? Your activity metrics would look completely normal, but your users would be experiencing a catastrophic outage.
This is where reliability metrics come in. They focus on the user experience and the system’s ability to deliver on its promises. The core concept is the Service Level Objective (SLO). An SLO is a target value or range of values for a service level indicator (SLI).
An SLI is a quantitative measure of some aspect of the level of service that is provided. Common SLIs fall into a few categories:
- Availability: Is the service up and responding?
- Latency: How fast is the service responding?
- Throughput: How much work can the service handle? (Less common for user-facing reliability, more for capacity planning).
- Error Rate: What percentage of requests are failing?
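To make these categories concrete, here is a minimal sketch in plain Python (with a hypothetical batch of request records, not any particular monitoring system's API) showing how availability, error rate, and throughput could each be computed over one observation window:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int      # HTTP status code
    duration: float  # seconds

# Hypothetical sample: 10 requests observed in one 60-second window.
window = [
    Request(200, 0.05), Request(200, 0.08), Request(301, 0.04),
    Request(200, 0.12), Request(500, 0.90), Request(200, 0.06),
    Request(200, 0.07), Request(503, 1.20), Request(200, 0.09),
    Request(200, 0.05),
]

# Availability: share of requests with a 2xx/3xx status.
ok = sum(1 for r in window if 200 <= r.status < 400)
availability = ok / len(window) * 100          # 80.0%

# Error rate: the complement, counting only 5xx responses.
errors = sum(1 for r in window if r.status >= 500)
error_rate = errors / len(window) * 100        # 20.0%

# Throughput: requests handled per unit time (window assumed to be 60s).
throughput = len(window) / 60                  # requests per second

print(availability, error_rate, throughput)
```

In a real system you would compute these over rolling windows from your metrics pipeline rather than a list in memory, but the arithmetic is exactly this simple.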
Let’s look at a concrete example of how you’d measure and use these. Suppose you’re running an API that serves product information.
SLI: Availability
- What to measure: The percentage of successful requests over a rolling 5-minute window. A "successful" request is one that returns an HTTP 2xx or 3xx status code.
- How to measure: Most monitoring systems (Prometheus, Datadog, etc.) can track this. In Prometheus, you might have metrics like `http_requests_total` and `http_responses_total{code=~"2..|3.."}`, and you'd calculate the ratio.
- Example Calculation (PromQL):

```promql
sum(rate(http_responses_total{code=~"2..|3.."}[5m])) by (instance)
  / sum(rate(http_requests_total[5m])) by (instance) * 100
```

This gives you the percentage of successful requests per `instance` over the last 5 minutes.
SLI: Latency
- What to measure: The latency of requests, specifically the 95th percentile (p95) latency over a rolling 5-minute window. This means 95% of requests are faster than this value.
- How to measure: You need to instrument your application or use a proxy/load balancer that records request durations. Prometheus's `histogram_quantile` function is invaluable here.
- Example Calculation (PromQL):

```promql
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, instance))
```

This calculates the 95th percentile latency for requests on each `instance` over the last 5 minutes.
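To demystify what a histogram quantile actually computes, here is a simplified re-implementation in Python (an illustrative sketch of the idea, not Prometheus's actual code): it walks the cumulative `le` buckets and linearly interpolates within the bucket where the target rank falls.

```python
def histogram_quantile(q, buckets):
    """Approximate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, sorted by
    bound, ending with the +Inf bucket. Linearly interpolates inside
    the bucket containing the target rank (simplified sketch).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into +Inf
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical buckets from http_request_duration_seconds_bucket:
# 90 requests under 100ms, 6 more under 200ms, 3 under 500ms, 1 slower.
buckets = [(0.1, 90), (0.2, 96), (0.5, 99), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # p95 falls in the 100-200ms bucket
```

Note the consequence of bucketed data: the result is an interpolated estimate whose accuracy depends on how finely your bucket boundaries are chosen around your SLO threshold.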
SLO: Putting it together
Now, you set an objective based on these SLIs.
- SLO for Availability: 99.95% of requests are successful over a 30-day period.
- SLO for Latency: 95% of requests complete in under 200ms over a 30-day period.
Notice the difference: the SLIs are measured over a short, rolling window (e.g., 5 minutes) to give you real-time feedback. The SLOs are measured over a longer period (e.g., 30 days) and represent the actual promise you’re making to your users. If you miss your SLO, you’ve failed to meet your commitment.
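Those 30-day targets translate directly into an error budget: the small slice of failure the SLO permits. As a quick sanity check (plain arithmetic, no monitoring system required), here is how much total downtime a 99.95% availability SLO actually allows:

```python
slo = 99.95                       # availability SLO, percent
window_days = 30

budget_fraction = 1 - slo / 100   # 0.0005: the fraction allowed to fail
budget_minutes = budget_fraction * window_days * 24 * 60

print(f"{budget_minutes:.1f} minutes of full outage per {window_days} days")
# 0.0005 * 30 * 24 * 60 = 21.6 minutes
```

Seeing "21.6 minutes per month" makes the promise tangible in a way "99.95%" does not, and it is the number you spend when you take risks like deploys or migrations.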
The key takeaway is that you should be tracking SLIs that directly reflect the user experience. For most web services, this means focusing on availability and latency of user-facing requests. Metrics like "CPU utilization" or "disk I/O" are inputs to reliability, but they aren’t reliability itself. You can have high CPU and still have a perfectly available and fast service, or low CPU and a completely unusable one.
When you observe a spike in error rates or latency, it’s critical to look at the underlying system metrics (CPU, memory, network, disk) to diagnose the cause. But the signal that something is wrong, the alarm bell that should ring, is a breach of your SLO, or a trend indicating you’re about to breach it.
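One common way to catch "about to breach" before the breach happens is a burn-rate check: compare the error rate observed in a short window against the steady rate that would exactly exhaust the budget over the full SLO window. A sketch (the 14.4 paging threshold is the illustrative 1-hour-window multiplier popularized by Google's SRE Workbook; treat all the numbers here as examples, not recommendations):

```python
def burn_rate(observed_error_rate, slo):
    """How fast the error budget is burning, relative to a steady burn
    that would exactly use up the budget over the whole SLO window.
    burn_rate == 1 means exactly on track; >> 1 means draining fast."""
    budget_fraction = 1 - slo / 100
    return observed_error_rate / budget_fraction

# 99.95% availability SLO -> 0.05% error budget.
# Observing a 1% error rate over the last hour:
rate = burn_rate(observed_error_rate=0.01, slo=99.95)

# ~20x faster than sustainable: at this pace the 30-day budget
# is gone in about a day and a half.
if rate > 14.4:
    print("page: budget will be exhausted well before the window ends")
```

This is why the alarm bell is defined in terms of the SLO rather than raw CPU or memory: the burn rate encodes "how bad, for how long, against what promise" in a single number.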
The real power of SLIs and SLOs isn’t just in monitoring; it’s in driving engineering decisions. If you’re consistently failing your latency SLO, it’s a clear mandate to invest in performance improvements, caching, or scaling, rather than just adding more servers because the CPU looks "high." The SLO becomes the objective truth that guides your priorities and justifies the effort.
Many teams mistakenly set SLOs based on component availability rather than user-facing availability. For example, they might aim for 99.99% availability of their database. This sounds good, but if that database is only used for an optional feature, or if there’s a downstream service that fails to query it correctly, the user might still experience an outage even if the database itself is "up." Always tie your SLOs to the actual user journey and the business impact.
The next step is understanding error budgets and how they influence your release velocity and operational load.