Distributed tracing is the unsung hero of debugging microservices, yet most teams use it only to find where a request failed, not why it was so slow.
Imagine a user reports their dashboard is sluggish. You’ve got your observability tools fired up.
Here’s a typical request flow for loading that dashboard:
- Frontend (browser): user clicks "Load Dashboard" and makes an API call to `dashboard-service`.
- `dashboard-service`: receives the request; it needs user preferences and widget data.
  - Calls `user-service` for preferences.
  - Calls `widget-service` for data.
- `user-service`: fetches user preferences from `user-db`.
- `widget-service`: fetches widget configurations from `widget-config-db`, then aggregates data from multiple downstream services (`data-aggregator-a`, `data-aggregator-b`).
- `data-aggregator-a`: fetches data from `metric-db-a`.
- `data-aggregator-b`: fetches data from `metric-db-b`.
Your tracing system shows you this entire flow, with each span representing a request from one service to another. You can see the total time and the time spent in each service.
```json
{
  "traceId": "abc123xyz",
  "spanId": "span-1",
  "operationName": "GET /dashboard",
  "startTime": "2023-10-27T10:00:00Z",
  "duration": 1500, // ms
  "tags": {
    "http.method": "GET",
    "http.url": "/dashboard"
  },
  "children": [
    {
      "spanId": "span-2",
      "operationName": "GET /users/{id}/preferences",
      "startTime": "2023-10-27T10:00:00.100Z",
      "duration": 250, // ms
      "tags": {
        "http.method": "GET",
        "http.url": "/users/{id}/preferences"
      },
      "children": [
        {
          "spanId": "span-3",
          "operationName": "user-db.query",
          "startTime": "2023-10-27T10:00:00.150Z",
          "duration": 100, // ms
          "tags": {
            "db.type": "postgresql"
          }
        }
      ]
    },
    {
      "spanId": "span-4",
      "operationName": "GET /widgets/data",
      "startTime": "2023-10-27T10:00:00.300Z",
      "duration": 1000, // ms
      "tags": {
        "http.method": "GET",
        "http.url": "/widgets/data"
      },
      "children": [
        {
          "spanId": "span-5",
          "operationName": "widget-config-db.query",
          "startTime": "2023-10-27T10:00:00.350Z",
          "duration": 50, // ms
          "tags": {
            "db.type": "redis"
          }
        },
        {
          "spanId": "span-6",
          "operationName": "GET /data/aggregator-a",
          "startTime": "2023-10-27T10:00:01.000Z",
          "duration": 300, // ms
          "tags": {
            "http.method": "GET",
            "http.url": "/data/aggregator-a"
          },
          "children": [
            {
              "spanId": "span-7",
              "operationName": "metric-db-a.query",
              "startTime": "2023-10-27T10:00:01.100Z",
              "duration": 200, // ms
              "tags": {
                "db.type": "influxdb"
              }
            }
          ]
        },
        {
          "spanId": "span-8",
          "operationName": "GET /data/aggregator-b",
          "startTime": "2023-10-27T10:00:01.400Z",
          "duration": 400, // ms
          "tags": {
            "http.method": "GET",
            "http.url": "/data/aggregator-b"
          },
          "children": [
            {
              "spanId": "span-9",
              "operationName": "metric-db-b.query",
              "startTime": "2023-10-27T10:00:01.500Z",
              "duration": 300, // ms
              "tags": {
                "db.type": "prometheus"
              }
            }
          ]
        }
      ]
    }
  ]
}
```
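A useful first pass over a trace like this is to compute each span's "self time": its duration minus the time accounted for by its children. A minimal sketch, assuming span dicts shaped like the JSON above (durations in milliseconds, and sequential rather than parallel child calls):

```python
# Compute each span's "self time": its own duration minus the sum of its
# children's durations. A large self time in a parent span usually means
# work (or waiting) happening outside any instrumented child call.
def self_times(span, out=None):
    if out is None:
        out = {}
    children = span.get("children", [])
    child_total = sum(c["duration"] for c in children)
    out[span["operationName"]] = span["duration"] - child_total
    for child in children:
        self_times(child, out)
    return out

# The widget-service subtree from the trace above, trimmed to the
# fields this sketch needs.
trace = {
    "operationName": "GET /widgets/data",
    "duration": 1000,
    "children": [
        {"operationName": "widget-config-db.query", "duration": 50},
        {"operationName": "GET /data/aggregator-a", "duration": 300},
        {"operationName": "GET /data/aggregator-b", "duration": 400},
    ],
}

# GET /widgets/data has 1000 - (50 + 300 + 400) = 250ms of self time
print(self_times(trace))
```

One caveat: if children run in parallel, summing their durations overstates the covered time, so treat self time as a rough signal rather than an exact measurement.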
This trace clearly shows GET /dashboard took 1500ms end to end. The user-service call took 250ms, and the widget-service call took 1000ms. Within widget-service, the call to aggregator-a took 300ms and the call to aggregator-b took 400ms.
But what if widget-service spent 800ms waiting for aggregator-b to respond, while aggregator-b itself spent only 200ms processing and 200ms querying metric-db-b? The trace above shows the widget-service span as 1000ms and the aggregator-b span as 400ms. The missing time is network latency and queueing: time spent before aggregator-b ever started processing the request, which no span records.
This is where logs become critical. Logs provide the granular, internal state of a service. When a trace shows a service took 1000ms to process a request, you need logs to see what it was doing during that time.
Let’s say you’ve instrumented your services to log key events with the traceId and spanId attached.
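One common way to get that correlation automatically is a logging filter that pulls the current trace context from a context variable and stamps it onto every record. A minimal sketch using only Python's standard library; the `trace_ctx` variable and its field names are illustrative, not from any particular tracing SDK (in practice something like OpenTelemetry manages this context for you):

```python
import contextvars
import logging

# Holds the active trace context for the current request. Illustrative:
# a real tracing SDK maintains and propagates this for you.
trace_ctx = contextvars.ContextVar(
    "trace_ctx", default={"traceId": "-", "spanId": "-"}
)

class TraceContextFilter(logging.Filter):
    """Copy traceId/spanId from the context variable onto every record."""
    def filter(self, record):
        ctx = trace_ctx.get()
        record.traceId = ctx["traceId"]
        record.spanId = ctx["spanId"]
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    '{"timestamp": "%(asctime)s", "level": "%(levelname)s", '
    '"message": "%(message)s", "traceId": "%(traceId)s", '
    '"spanId": "%(spanId)s"}'
))
logger = logging.getLogger("widget-service")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.INFO)

# Set the context once at the start of handling a request...
trace_ctx.set({"traceId": "abc123xyz", "spanId": "span-4"})
# ...and every log line in that request carries it automatically.
logger.info("Received request for widget data")
```

The point of the filter is that individual log calls never mention the trace context; it cannot be forgotten on the one log line you end up needing.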
`widget-service` logs (filtered by `traceId: abc123xyz`, `spanId: span-4`):

```json
{
  "timestamp": "2023-10-27T10:00:00.300Z",
  "level": "INFO",
  "message": "Received request for widget data",
  "traceId": "abc123xyz",
  "spanId": "span-4"
}
{
  "timestamp": "2023-10-27T10:00:00.350Z",
  "level": "INFO",
  "message": "Fetched widget configurations from redis",
  "traceId": "abc123xyz",
  "spanId": "span-4",
  "duration_ms": 50
}
{
  "timestamp": "2023-10-27T10:00:00.400Z",
  "level": "INFO",
  "message": "Initiating call to data aggregator A",
  "traceId": "abc123xyz",
  "spanId": "span-4"
}
{
  "timestamp": "2023-10-27T10:00:01.000Z",
  "level": "INFO",
  "message": "Received response from data aggregator A",
  "traceId": "abc123xyz",
  "spanId": "span-4",
  "duration_ms": 300 // This is the duration of the *call* to aggregator A, matching the trace.
}
{
  "timestamp": "2023-10-27T10:00:01.050Z",
  "level": "INFO",
  "message": "Initiating call to data aggregator B",
  "traceId": "abc123xyz",
  "spanId": "span-4"
}
// --- MISSING LOG ENTRY HERE ---
// The next log entry would be the response from aggregator B.
// If there's a gap between "Initiating call" and "Received response",
// that's your waiting time.
{
  "timestamp": "2023-10-27T10:00:01.850Z", // Notice this timestamp
  "level": "INFO",
  "message": "Received response from data aggregator B",
  "traceId": "abc123xyz",
  "spanId": "span-4",
  "duration_ms": 400 // This is the duration of the *call* to aggregator B, matching the trace.
}
{
  "timestamp": "2023-10-27T10:00:01.900Z",
  "level": "INFO",
  "message": "Finished processing widget data",
  "traceId": "abc123xyz",
  "spanId": "span-4"
}
```
The trace told you widget-service (span-4) took 1000ms. The logs reveal an 800ms gap between initiating the call to aggregator-b (10:00:01.050Z) and receiving its response (10:00:01.850Z): time widget-service spent blocked, waiting on aggregator-b. The trace records aggregator-b's own span as only 400ms, so the other 400ms of that wait was network latency and queueing that aggregator-b itself never saw. This is the critical distinction: a span's duration shows how long the operation took where it executed, while the caller's logs show how long the caller actually waited for it.
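Finding that kind of gap doesn't have to be done by eye. Given a span's log entries sorted by timestamp, you can flag any pause between consecutive entries above a threshold. A hypothetical helper, assuming ISO-8601 timestamps like the ones above:

```python
from datetime import datetime

def find_gaps(entries, threshold_ms=500):
    """Return (message_before_gap, gap_ms) for every pause between
    consecutive log entries that is at least threshold_ms long."""
    # Python < 3.11 fromisoformat() does not accept a trailing "Z",
    # so normalize it to an explicit UTC offset first.
    times = [
        datetime.fromisoformat(e["timestamp"].replace("Z", "+00:00"))
        for e in entries
    ]
    gaps = []
    for prev, cur, entry in zip(times, times[1:], entries):
        gap_ms = round((cur - prev).total_seconds() * 1000)
        if gap_ms >= threshold_ms:
            gaps.append((entry["message"], gap_ms))
    return gaps

# The two widget-service log entries bracketing the wait on aggregator B.
logs = [
    {"timestamp": "2023-10-27T10:00:01.050Z",
     "message": "Initiating call to data aggregator B"},
    {"timestamp": "2023-10-27T10:00:01.850Z",
     "message": "Received response from data aggregator B"},
]

print(find_gaps(logs))  # [('Initiating call to data aggregator B', 800)]
```

Run over all of span-4's entries, this surfaces the 800ms wait immediately, without scanning timestamps by hand.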
The true power comes from correlating them. When a trace shows a long duration for a specific span, you can jump to the logs for that exact traceId and spanId. The logs then reveal the internal processing, including any waits, retries, or specific errors that happened within that span’s execution.
For example, if aggregator-b had a log like this:
aggregator-b logs (filtered by traceId: abc123xyz, spanId: span-8):
{
"timestamp": "2023-10-27T10:00:01.400Z",
"level": "INFO",
"message": "Received request for aggregated metrics",
"traceId": "abc123xyz",
"spanId": "span-8"
}
{
"timestamp": "2023-10-27T10:00:01.500Z",
"level": "INFO",
"message": "Querying prometheus",
"traceId": "abc123xyz",
"spanId": "span-8"
}
{
"timestamp": "2023-10-27T10:00:01.700Z",
"level": "WARN",
"message": "Prometheus query timed out after 200ms, retrying",
"traceId": "abc123xyz",
"spanId": "span-8",
"duration_ms": 200 // Time spent on the first query attempt
}
{
"timestamp": "2023-10-27T10:00:01.900Z",
"level": "INFO",
"message": "Prometheus query successful on retry",
"traceId": "abc123xyz",
"spanId": "span-8",
"duration_ms": 200 // Time spent on the second query attempt
}
{
"timestamp": "2023-10-27T10:00:01.950Z",
"level": "INFO",
"message": "Finished processing aggregated metrics",
"traceId": "abc123xyz",
"spanId": "span-8"
}
This log snippet from aggregator-b shows that its 400ms duration (as seen in the trace) was actually composed of two 200ms queries to Prometheus, with a retry in between. This level of detail, linking internal execution steps (logs) with the overall request flow (traces), is what makes debugging distributed systems manageable. You see the slow request in the trace, identify the slow service, and then dive into that service’s logs using the trace context to see exactly what it was doing during that slow period.
The most powerful correlation happens when your logging framework automatically injects trace context (traceId, spanId) into every log message. This allows you to search your entire log volume for all messages related to a specific slow trace, or even a specific slow span within that trace. Without this, you’re manually sifting through log files, which is a recipe for burnout.
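With trace context on every log line, "show me everything this slow trace did" becomes a simple filter rather than a manual hunt. A sketch of that query over a stream of JSON log lines; the field names match the examples above, and `logs_for_span` is a hypothetical helper, standing in for whatever your log backend's query language provides:

```python
import json

def logs_for_span(lines, trace_id, span_id=None):
    """Return parsed log entries matching a traceId, optionally narrowed
    to a single spanId, sorted into timestamp order."""
    matches = []
    for line in lines:
        entry = json.loads(line)
        if entry.get("traceId") != trace_id:
            continue
        if span_id is not None and entry.get("spanId") != span_id:
            continue
        matches.append(entry)
    # ISO-8601 timestamps in the same zone sort correctly as strings.
    return sorted(matches, key=lambda e: e["timestamp"])

# A tiny out-of-order log stream mixing two traces.
log_stream = [
    '{"timestamp": "2023-10-27T10:00:01.050Z", "traceId": "abc123xyz",'
    ' "spanId": "span-4", "message": "Initiating call to data aggregator B"}',
    '{"timestamp": "2023-10-27T10:00:00.300Z", "traceId": "abc123xyz",'
    ' "spanId": "span-4", "message": "Received request for widget data"}',
    '{"timestamp": "2023-10-27T10:00:00.500Z", "traceId": "other-trace",'
    ' "spanId": "span-9", "message": "Unrelated request"}',
]

for entry in logs_for_span(log_stream, "abc123xyz", "span-4"):
    print(entry["timestamp"], entry["message"])
```

In practice you would run the equivalent query in your log backend, but the shape is the same: filter on the injected trace context, then read the span's story in order.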
The next step is understanding how to aggregate and query these correlated traces and logs at scale.