The most surprising thing about the SRE Golden Signals (latency, traffic, errors, and saturation) is that they aren't about how well your system is performing, but how consistently it's performing.
Let’s watch a request flow through a simple web service and see these signals in action. Imagine a user requests a product page.
- Latency: This is the time it takes to serve a request. We're not just looking at the average, but the distribution. If 99% of requests are served in 100ms but a few take 5 seconds, your users will experience slowness.

```promql
# Example: Using Prometheus to query latency percentiles.
# This query shows the 95th percentile latency for HTTP requests
# to the '/products' endpoint.
histogram_quantile(0.95, sum by (le, uri) (rate(http_request_duration_seconds_bucket{uri="/products"}[5m])))
```

This query shows you the "tail" of your latency distribution. If this number is creeping up, your slowest requests are getting slower, and more of your users are feeling it.
- Traffic: This is the demand placed on your system. It's usually measured in requests per second (RPS) or transactions per second. High traffic isn't inherently bad, but it's crucial context for the other signals.

```promql
# Example: Prometheus query for incoming RPS to the web service.
sum(rate(http_requests_total{job="webserver"}[5m]))
```

This shows the raw incoming load. If traffic spikes unexpectedly, you need to check whether your other signals are holding steady.
- Errors: This is the rate of requests that fail. It's not just HTTP 5xx errors; it includes application-level errors that don't manifest as standard HTTP error codes.

```promql
# Example: Prometheus query for HTTP 5xx errors.
sum(rate(http_requests_total{job="webserver", code=~"5.."}[5m]))
```

This query specifically targets server-side errors. Note that it returns an absolute rate (errors per second); to compare against a percentage SLO (e.g., 0.1%), divide it by the total request rate. If that ratio rises above your target, it's a clear indicator of a problem.
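Turning the absolute 5xx rate into an error ratio is a one-line extension. A sketch, assuming the same `http_requests_total` metric and labels used above:

```promql
# Error ratio: fraction of requests over the last 5 minutes that
# returned a 5xx status. Multiply by 100 for a percentage.
sum(rate(http_requests_total{job="webserver", code=~"5.."}[5m]))
  /
sum(rate(http_requests_total{job="webserver"}[5m]))
```

If this ratio exceeds 0.001, you are outside the 0.1% error-rate target used as an example here.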
- Saturation: This is how "full" your service is. It's a proxy for how close your service is to its capacity limit. It's often measured by resource utilization (CPU, memory, disk I/O, network bandwidth) or by queue lengths.

```promql
# Example: Prometheus query for CPU utilization of webserver pods,
# as a percentage of each pod's CPU limit.
avg by (pod) (rate(container_cpu_usage_seconds_total{container="webserver", namespace="production"}[5m]))
  / avg by (pod) (kube_pod_container_resource_limits{container="webserver", namespace="production", resource="cpu"})
  * 100
```

This shows the percentage of the CPU allocated to the webserver container that's actually being used. If this consistently hovers above 80-90%, you're heading for trouble.
Putting it all together, you’re not just looking at these numbers in isolation. You’re looking for correlations. A sudden spike in traffic (Traffic) might lead to increased latency (Latency) and potentially errors (Errors) if the system can’t scale fast enough, pushing a resource towards saturation (Saturation).
The real magic happens when you combine these signals with your Service Level Objectives (SLOs). An SLO is a target for a specific service level, like "99.9% of requests should have a latency under 500ms" or "the error rate should stay below 0.1%." When your Golden Signals consistently meet those targets, your service is healthy. When they fall outside them, you have an incident.
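The latency SLO above ("99.9% of requests under 500ms") can be measured directly from the same histogram used earlier. A sketch, assuming the histogram has a bucket boundary at 0.5 seconds (`le="0.5"`) and a `job="webserver"` label:

```promql
# Fraction of requests over the past 30 days that completed in
# under 500ms. Compare against the SLO target of 0.999.
sum(rate(http_request_duration_seconds_bucket{job="webserver", le="0.5"}[30d]))
  /
sum(rate(http_request_duration_seconds_count{job="webserver"}[30d]))
```

Queries over long windows like `[30d]` can be expensive; in practice teams often precompute the numerator and denominator as recording rules over short windows and aggregate from there.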
Many teams overlook that saturation isn’t just about hitting 100% utilization. It’s about the point where performance degrades. For some systems, performance starts to suffer well before CPU hits 100%. This could be due to increased context switching, garbage collection pauses, or contention for shared resources. Identifying this "performance saturation" point, which might be at 70% CPU, is often more critical than waiting for the hard limit.
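One way to act on that insight is to alert on the measured degradation point rather than on 100% utilization. A sketch, reusing the CPU utilization query from the saturation section with a hypothetical 70% threshold:

```promql
# Fires for any pod whose CPU usage exceeds 70% of its limit,
# the (hypothetical) point where this service starts to degrade.
avg by (pod) (rate(container_cpu_usage_seconds_total{container="webserver", namespace="production"}[5m]))
  / avg by (pod) (kube_pod_container_resource_limits{container="webserver", namespace="production", resource="cpu"})
  > 0.70
```

The threshold itself should come from load testing or from observing where your latency tail starts to climb, not from a round number.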
The next step is understanding how to set effective SLOs based on these signals and your user experience.