The most surprising thing about SRE’s "Golden Signals" is that they aren’t about performance at all. They’re about reliability.
Let’s see what that looks like in practice. Imagine a web service that handles user requests.
```json
{
  "request_id": "abc123xyz789",
  "timestamp": "2023-10-27T10:30:00Z",
  "service": "user_api",
  "endpoint": "/users/profile",
  "method": "GET",
  "status_code": 200,
  "latency_ms": 150,
  "request_size_bytes": 512,
  "response_size_bytes": 2048
}
```
This single log line tells us a few things immediately: it was a `GET` request to `/users/profile` on the `user_api` service, it completed successfully (`status_code: 200`), and it took 150 milliseconds (`latency_ms: 150`). This is a piece of traffic and latency data.
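Pulling the signal data out of such a log line is straightforward. Here is a minimal Python sketch; the field names come from the example above, and the 5xx-means-error convention is an assumption carried through the rest of this piece:

```python
import json

# The example log line from above.
log_line = '''{
  "request_id": "abc123xyz789",
  "timestamp": "2023-10-27T10:30:00Z",
  "service": "user_api",
  "endpoint": "/users/profile",
  "method": "GET",
  "status_code": 200,
  "latency_ms": 150,
  "request_size_bytes": 512,
  "response_size_bytes": 2048
}'''

record = json.loads(log_line)

# Each request contributes to the signals directly:
# its existence is a Traffic data point, its duration
# is a Latency data point, and its status feeds Errors.
is_error = record["status_code"] >= 500   # assumed 5xx convention
latency_ms = record["latency_ms"]

print(is_error, latency_ms)  # False 150
```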
Now, let’s look at a slightly different one:
```json
{
  "request_id": "def456uvw012",
  "timestamp": "2023-10-27T10:30:05Z",
  "service": "user_api",
  "endpoint": "/users/profile",
  "method": "GET",
  "status_code": 500,
  "latency_ms": 75,
  "error_type": "database_connection_failed",
  "request_size_bytes": 512,
  "response_size_bytes": 128
}
```
Here, the `status_code` is 500, indicating an error. The `latency_ms` is actually lower, 75ms, but that's misleading: the request failed before it could complete its intended work. This is an error signal.
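In aggregate, the Errors signal is simply failed requests divided by total requests over some window. A toy sketch with made-up status codes:

```python
# Hypothetical window of status codes from parsed log records.
status_codes = [200, 200, 500, 200, 200, 503, 200, 200, 200, 200]

# Count 5xx responses as failures (same convention as above).
errors = sum(1 for s in status_codes if s >= 500)
error_rate = errors / len(status_codes)

print(f"{error_rate:.1%}")  # 20.0%
```

In a real system you would compute this per service and per endpoint, since a 20% failure rate on one endpoint can hide inside a healthy-looking global average.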
The Golden Signals are:
- Latency: The time it takes to serve a request. This isn’t just the average; you need percentiles (e.g., 95th, 99th) to catch outliers that impact a small but significant number of users. High latency means users are waiting, which erodes trust.
- Traffic: The demand placed on your system. This is typically measured in requests per second (RPS) or similar units. It tells you how busy your service is. A sudden drop in traffic can indicate a problem upstream or that users are leaving.
- Errors: The rate of requests that fail. This is usually a percentage of total requests. Even a small error rate (e.g., 0.1%) can affect thousands of users in a high-traffic system.
- Saturation: How "full" your service is. This is about resource utilization – CPU, memory, disk I/O, network bandwidth. When a service is saturated, it can’t handle more traffic, leading to increased latency and errors.
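The latency point above, that averages hide outliers, is easy to demonstrate. A small sketch using a nearest-rank percentile (real systems typically use histogram buckets or streaming estimators such as t-digest rather than sorting raw samples; all values here are made up):

```python
import math

# Toy latency sample in milliseconds, with one slow outlier.
latencies = sorted([120, 135, 150, 145, 160, 155, 140, 130, 900, 150])

def percentile(sorted_values, p):
    # Nearest-rank percentile: the smallest value such that
    # at least p% of samples are at or below it.
    k = math.ceil(p / 100 * len(sorted_values)) - 1
    return sorted_values[k]

# The median looks fine; the 95th percentile exposes the outlier.
print(percentile(latencies, 50))  # 145
print(percentile(latencies, 95))  # 900
```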
These four signals work together to give you a holistic view of your service’s health. You can’t just look at one in isolation. For instance, high traffic alone isn’t necessarily bad. If latency and errors remain low, your service is scaling well. But if high traffic is accompanied by rising latency and errors, it’s a clear sign of impending or current overload.
Consider a simple web application. You’d track the number of HTTP requests per second hitting your load balancer (Traffic). You’d measure the time from when the load balancer receives a request to when it sends the response back (Latency). You’d count the number of 5xx HTTP status codes returned to clients (Errors). And you’d monitor the CPU and memory utilization of your application servers (Saturation).
The real power comes when you correlate these. If your traffic spikes and your CPU utilization (Saturation) hits 95%, you’ll likely see your 95th percentile latency increase, and eventually, your error rate will climb as requests start timing out or being dropped.
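That correlation can be sketched as a simple health check over a snapshot of the four signals. All names and thresholds here are illustrative, not a recommendation:

```python
# A hypothetical point-in-time snapshot of the four signals.
signals = {
    "traffic_rps": 1250,       # Traffic
    "latency_p95_ms": 180,     # Latency
    "error_rate": 0.002,       # Errors: 0.2% of requests
    "cpu_saturation": 0.72,    # Saturation: fraction of capacity
}

# High traffic alone is not alarming; it is the combination of
# saturation AND rising latency that signals overload.
overloaded = (signals["cpu_saturation"] > 0.90
              and signals["latency_p95_ms"] > 500)

print(overloaded)  # False: busy, but scaling well
```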
The "Saturation" signal is often the trickiest to define and measure because it isn't directly tied to a request itself. It's about the capacity of the underlying resources. For a web server, this might be `sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) / count(node_cpu_seconds_total{mode!="idle"})` in Prometheus, which calculates the average CPU usage across all cores over the last 5 minutes. For a database, it could be the number of active connections versus the configured maximum.
When you’re building dashboards or alerts, you’re not just looking for absolute numbers. You’re looking for changes and trends. A sudden jump in 99th percentile latency from 200ms to 800ms is a critical alert. A steady increase in CPU saturation from 60% to 90% over an hour is a strong warning sign. A consistent 0.01% error rate is usually acceptable, but if it suddenly jumps to 1%, that’s an emergency.
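Alerting on changes rather than absolutes can start as something as simple as comparing the current window against a baseline. A minimal sketch, with purely illustrative thresholds:

```python
# Alert on the relative jump, not the absolute number.
# The 2x ratio threshold is an illustrative choice, not a standard.
def should_alert(baseline_ms: float, current_ms: float,
                 ratio_threshold: float = 2.0) -> bool:
    return current_ms >= baseline_ms * ratio_threshold

print(should_alert(200, 800))  # True: a 4x p99 jump warrants a page
print(should_alert(200, 250))  # False: within normal variation
```

Production alerting would add a smoothed baseline (e.g. a rolling window) and a minimum-traffic guard so that a handful of requests can't trip the alert.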
The most counterintuitive part of the Golden Signals is that "Errors" are often the last signal to degrade under load, not the first. A system may start queuing or delaying requests (driving up latency) long before it explicitly returns a 5xx status code. This is because many systems have internal timeouts or queues that fill up, causing requests to be delayed or silently dropped before a hard error is ever generated.
The next concept you’ll need to grapple with is how to effectively define and alert on these signals, especially when dealing with distributed systems where a single logical request might traverse multiple services.