The most surprising thing about Service Level Indicators (SLIs) is that the best ones often measure something you don’t directly control.

Let’s say you’re running a web service. You might think request_latency is the obvious SLI. But what if your service is just a thin wrapper around a downstream API? If that downstream API slows down, your request_latency SLI will also degrade, even if your own service is humming along perfectly. The real SLI that matters to your users is probably the latency of the full user journey, which includes that downstream call.

Here’s a simplified example of how we might track SLIs for a hypothetical "Product Catalog" service. This service fetches product data from a database and returns it.

// Example Go code snippet simulating an SLI metric collection
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"time"
)

// Simulate a database call
func fetchProductFromDB(productID string) (string, error) {
	// Simulate network latency and potential errors
	if rand.Float64() < 0.02 { // 2% chance of DB error
		return "", fmt.Errorf("database connection error for product %s", productID)
	}
	time.Sleep(time.Duration(rand.Intn(50)+10) * time.Millisecond) // 10-60ms latency
	return fmt.Sprintf("Product data for %s", productID), nil
}

// ProductHandler is our main HTTP handler
func ProductHandler(w http.ResponseWriter, r *http.Request) {
	startTime := time.Now()
	productID := r.URL.Query().Get("id")
	if productID == "" {
		http.Error(w, "Missing product ID", http.StatusBadRequest)
		// Record a "bad request" SLI event (e.g., count of 4xx errors)
		return
	}

	data, err := fetchProductFromDB(productID)
	if err != nil {
		http.Error(w, "Internal server error", http.StatusInternalServerError)
		// Record a "server error" SLI event (e.g., count of 5xx errors)
		return
	}

	// Record a "request latency" SLI event. In a real system this would be
	// sent to a metrics system like Prometheus, e.g.
	// record_success_request(latency) or record_error_request_type("5xx")
	latency := time.Since(startTime)
	log.Printf("served product %s in %v", productID, latency)
	fmt.Fprintf(w, "%s\n", data)
}

func main() {
	http.HandleFunc("/products", ProductHandler)
	fmt.Println("Starting product catalog service on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}

In this example, we have a few potential SLIs:

  • request_latency: The time from when our ProductHandler receives a request to when it sends a response.
  • error_rate: The percentage of requests that result in a 4xx or 5xx HTTP status code.
  • availability: The percentage of requests the service processed successfully (i.e., without returning a 5xx error). Each individual request contributes a success or failure event; the SLI is the ratio over a window.
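
To make the aggregation concrete, here is a minimal sketch of how raw request counts might roll up into the error-rate and availability SLIs above. The `sliSummary` type and its methods are illustrative names, not part of the handler code.

```go
package main

import "fmt"

// sliSummary aggregates raw request events into SLIs.
// The field and method names here are hypothetical.
type sliSummary struct {
	total     int // all requests seen
	errors4xx int // client errors
	errors5xx int // server errors
}

// errorRate returns the fraction of requests ending in a 4xx or 5xx.
func (s sliSummary) errorRate() float64 {
	if s.total == 0 {
		return 0
	}
	return float64(s.errors4xx+s.errors5xx) / float64(s.total)
}

// availability returns the fraction of requests that did not return a 5xx.
func (s sliSummary) availability() float64 {
	if s.total == 0 {
		return 1
	}
	return 1 - float64(s.errors5xx)/float64(s.total)
}

func main() {
	s := sliSummary{total: 1000, errors4xx: 8, errors5xx: 2}
	fmt.Printf("error_rate=%.3f availability=%.3f\n", s.errorRate(), s.availability())
	// Prints: error_rate=0.010 availability=0.998
}
```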

The problem this system solves is providing objective, measurable targets for service reliability. Instead of vague goals like "the API should be fast," SLIs give us concrete numbers: "99.9% of requests should complete within 200ms," or "fewer than 0.1% of requests should result in a 5xx error." This allows SRE teams to make data-driven decisions about where to invest their time and resources for maximum reliability impact.
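
A target like "99.9% of requests should complete within 200ms" reduces to a simple check over observed latencies. The sketch below assumes latencies have already been collected in milliseconds; `meetsLatencySLO` is a hypothetical helper, not part of the handler code.

```go
package main

import "fmt"

// meetsLatencySLO reports whether at least targetFraction of the observed
// latencies (in milliseconds) completed within thresholdMs.
func meetsLatencySLO(latenciesMs []int, thresholdMs int, targetFraction float64) bool {
	if len(latenciesMs) == 0 {
		return true // no data: vacuously within target
	}
	within := 0
	for _, l := range latenciesMs {
		if l <= thresholdMs {
			within++
		}
	}
	return float64(within)/float64(len(latenciesMs)) >= targetFraction
}

func main() {
	// One slow outlier (250ms) among ten requests.
	sample := []int{12, 45, 30, 250, 18, 60, 90, 110, 35, 22}
	fmt.Println(meetsLatencySLO(sample, 200, 0.90))  // true: 9 of 10 within 200ms
	fmt.Println(meetsLatencySLO(sample, 200, 0.999)) // false: misses the 99.9% target
}
```

In practice this comparison is done by the monitoring system over a rolling window, not in application code, but the arithmetic is the same.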

Internally, SLIs are typically implemented by instrumenting your application code. For every relevant event (a successful request, a failed request, a specific duration), you increment counters or record timings in a metrics library. This library then exposes these metrics over an endpoint (like /metrics for Prometheus) that a monitoring system can scrape. The monitoring system then aggregates these raw events into the SLIs you’ve defined.

The exact levers you control are:

  • What you measure: The specific events you instrument in your code. Are you measuring the latency of just your code, or the end-to-end latency including external dependencies?
  • How you aggregate: The calculation you perform on the raw events. This could be a simple count, a sum, a histogram (for latency distributions), or a ratio (like errors/total requests).
  • The thresholds: The target values for your SLIs, which form your Service Level Objectives (SLOs).
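
The "how you aggregate" lever is the least obvious of the three, so here is a small sketch of histogram-style aggregation: raw latency events counted into cumulative ("less than or equal") buckets, the way metrics libraries typically store latency distributions. The bucket bounds are illustrative choices.

```go
package main

import "fmt"

// bucketLatencies counts observations into cumulative histogram buckets:
// counts[i] is the number of latencies <= boundsMs[i]. This mirrors how
// a metrics library aggregates latency events for percentile queries.
func bucketLatencies(latenciesMs []int, boundsMs []int) []int {
	counts := make([]int, len(boundsMs))
	for _, l := range latenciesMs {
		for i, b := range boundsMs {
			if l <= b {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	bounds := []int{50, 100, 200, 500} // ms, cumulative buckets
	obs := []int{12, 45, 30, 250, 18, 60, 90, 110, 35, 22}
	fmt.Println(bucketLatencies(obs, bounds)) // [6 8 9 10]
}
```

From such buckets the monitoring system can answer questions like "what fraction of requests finished within 200ms" (here, 9 of 10) without storing every individual latency.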

A common misconception is that an SLI must be a direct measure of your system’s internal state. For example, you might think database_query_duration is a perfect SLI for our catalog service. However, if your database is slow but your application server is still responding quickly by returning cached data or an error, that SLI might look good while the user experience is terrible. The actual SLI that reflects user experience is often the latency of the successful response, which implicitly includes the database’s performance as experienced by the user. This means you’re measuring the outcome for the user, not just the health of one component.

Once you’ve mastered basic SLIs like latency and error rates, you’ll start thinking about more sophisticated ones, like "saturation" or "freshness" of data.

Want structured learning?

Take the full Reliability Engineering (SRE) course →