Jaeger’s distributed tracing can feel like magic, but it’s actually a meticulously orchestrated dance of telemetry data across your services.

Let’s see it in action. Imagine a user requests a product. This request doesn’t just hit one server; it might pass through a frontend API, then a product catalog service, a pricing service, and finally a database. Each of these steps generates trace data – a span representing the work done by that specific service. Jaeger collects these spans, stitches them together into a complete trace, and visualizes the entire journey.

Here’s a simplified trace snippet you might see in the Jaeger UI for a product lookup:

```json
[
  {
    "traceID": "a1b2c3d4e5f67890",
    "spans": [
      {
        "traceID": "a1b2c3d4e5f67890",
        "spanID": "1122334455667788",
        "operationName": "GET /products/{id}",
        "startTime": "2023-10-27T10:00:00.123Z",
        "duration": "150ms",
        "tags": [
          {"key": "http.method", "type": "string", "value": "GET"},
          {"key": "http.url", "type": "string", "value": "/products/123"},
          {"key": "span.kind", "type": "string", "value": "server"}
        ],
        "logs": [...]
      },
      {
        "traceID": "a1b2c3d4e5f67890",
        "spanID": "9988776655443322",
        "parentSpanID": "1122334455667788",
        "operationName": "product-catalog-service.getProduct",
        "startTime": "2023-10-27T10:00:00.130Z",
        "duration": "80ms",
        "tags": [
          {"key": "service.name", "type": "string", "value": "product-catalog-service"},
          {"key": "db.statement", "type": "string", "value": "SELECT * FROM products WHERE id = '123'"}
        ],
        "logs": [...]
      },
      {
        "traceID": "a1b2c3d4e5f67890",
        "spanID": "aabbccddeeff0011",
        "parentSpanID": "9988776655443322",
        "operationName": "pricing-service.getEffectivePrice",
        "startTime": "2023-10-27T10:00:00.150Z",
        "duration": "50ms",
        "tags": [
          {"key": "service.name", "type": "string", "value": "pricing-service"}
        ],
        "logs": [...]
      }
    ]
  }
]
```

In this snippet, the main request (GET /products/{id}) is the root span. It calls product-catalog-service.getProduct, which in turn calls pricing-service.getEffectivePrice. Notice that the traceID is the same for every span in the trace, while parentSpanID links each child span to its parent. This hierarchical structure is what allows Jaeger to reconstruct the request flow.
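That reconstruction is mechanical: index spans by ID and attach each one to its parent. Here’s a minimal stdlib-Python sketch using the simplified field names from the snippet above (not Jaeger’s actual storage model):

```python
# Rebuild the request-flow tree from a flat list of spans.
# Field names (spanID, parentSpanID, operationName) mirror the
# simplified snippet above, not Jaeger's real storage schema.

def build_tree(spans):
    """Index spans by ID and attach each child to its parent."""
    by_id = {s["spanID"]: dict(s, children=[]) for s in spans}
    roots = []
    for span in by_id.values():
        parent = by_id.get(span.get("parentSpanID"))
        if parent:
            parent["children"].append(span)
        else:
            roots.append(span)  # no parent in this trace: a root span
    return roots

def print_tree(span, depth=0):
    print("  " * depth + span["operationName"])
    for child in span["children"]:
        print_tree(child, depth + 1)

spans = [
    {"spanID": "1122334455667788", "operationName": "GET /products/{id}"},
    {"spanID": "9988776655443322", "parentSpanID": "1122334455667788",
     "operationName": "product-catalog-service.getProduct"},
    {"spanID": "aabbccddeeff0011", "parentSpanID": "9988776655443322",
     "operationName": "pricing-service.getEffectivePrice"},
]

for root in build_tree(spans):
    print_tree(root)
# prints the three operations indented by depth, root first
```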

The fundamental problem Jaeger solves is opacity in distributed systems: the services are independently deployed, but their interactions are invisible. Without tracing, debugging a slow request or an error that crosses multiple services becomes a nightmare of correlating logs from disparate systems. Jaeger provides a single pane of glass for the entire lifecycle of a request.

Setting up Jaeger involves four main components: the Agent, the Collector, the Query service, and the UI. The Agent runs on your hosts and receives spans from instrumented applications, batching them and forwarding them to the Collector. The Collector validates, indexes, and writes traces to a storage backend like Elasticsearch or Cassandra. The Query service reads from that backend and powers the UI’s visualization and search. (Modern OpenTelemetry SDKs can also export spans directly to the Collector, bypassing the Agent.)

The key to getting this working is instrumentation: your application code needs to be aware of the tracing context. Libraries like OpenTelemetry (the path Jaeger now recommends; Jaeger’s own client libraries are deprecated) do this by:

  1. Starting a trace: When a request enters a service, it either starts a new trace or continues an existing one based on incoming headers (like the W3C traceparent header).
  2. Creating spans: For each operation (e.g., an HTTP call, a database query), a new span is created, linked to the current trace and its parent span.
  3. Propagating context: Crucially, when one service calls another, the traceID, spanID, and other context information are injected into the outgoing request headers. This allows the downstream service to create a child span that’s correctly linked.
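The steps above can be sketched with the stdlib alone. This toy shows the mechanics of step 3 using the W3C traceparent header format (version-traceID-spanID-flags); in practice the OpenTelemetry SDK’s propagators do all of this for you:

```python
# Stdlib-only sketch of W3C trace context propagation. Real services
# would let the OpenTelemetry SDK extract/inject these headers.
import secrets

def extract(headers):
    """Continue an existing trace, or start a new one if no header arrived."""
    raw = headers.get("traceparent")
    if raw:
        _version, trace_id, parent_span_id, _flags = raw.split("-")
        return trace_id, parent_span_id
    return secrets.token_hex(16), None  # new 128-bit trace ID, no parent

def inject(headers, trace_id, span_id):
    """Put the current context into outgoing request headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return headers

# Service A receives a fresh request and starts a trace...
trace_id, parent = extract({})
span_a = secrets.token_hex(8)  # 64-bit span ID for A's server span

# ...then calls service B, injecting its context into the outgoing headers.
outgoing = inject({}, trace_id, span_a)

# Service B extracts the same trace ID and records A's span as its parent,
# which is exactly what lets Jaeger link the child span correctly.
trace_id_b, parent_b = extract(outgoing)
assert trace_id_b == trace_id and parent_b == span_a
```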

You control the granularity of your tracing by how you instrument your code. You can create spans for entire request handlers, individual function calls, or even specific database queries. The tags and logs fields on spans are invaluable for adding context – think http.status_code, db.statement, or custom error messages.
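To make the granularity choice concrete, here is a hypothetical, stdlib-only span context manager (a toy stand-in for the OpenTelemetry API, not Jaeger’s actual client) wrapping a whole handler and a single database query inside it:

```python
# Toy span context manager: illustrates wrapping units of work at
# different granularities and attaching tags. A stand-in for the
# OpenTelemetry API, for illustration only.
import time
from contextlib import contextmanager

finished = []  # completed spans, standing in for an exporter

@contextmanager
def span(operation_name, tags=None):
    record = {"operationName": operation_name, "tags": dict(tags or {})}
    start = time.perf_counter()
    try:
        yield record  # callers can add tags mid-flight
    except Exception as exc:
        record["tags"]["error"] = True  # record the failure on the span
        record["tags"]["error.message"] = str(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        finished.append(record)

# Coarse span around the handler, a finer one around a single query inside it.
with span("GET /products/{id}", {"http.method": "GET"}):
    with span("db.query", {"db.statement": "SELECT * FROM products WHERE id = '123'"}) as s:
        s["tags"]["db.rows"] = 1

# The inner span finishes first, just as in a real trace.
print([s["operationName"] for s in finished])
```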

What most people don’t realize is how much overhead distributed tracing can add if left unmanaged. While the OpenTelemetry SDKs are generally efficient, enabling tracing everywhere without thought can significantly increase network traffic and storage costs. It’s crucial to configure sampling strategies (e.g., always sample critical paths, probabilistically sample the rest) and to ensure your trace backend can scale with the ingestion rate.
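A common way to keep that overhead bounded is head-based probabilistic sampling that hashes the trace ID, so every service makes the same keep-or-drop decision for a given trace. A sketch (the 10% rate and the always-sample rule for checkout are illustrative assumptions, not Jaeger defaults):

```python
# Deterministic probabilistic sampler: hash the trace ID into a bucket,
# keep the trace if the bucket falls under the sampling rate. The rate
# and the critical-path rule below are illustrative assumptions.
import hashlib

def should_sample(trace_id, operation, rate=0.10):
    if operation.startswith("POST /checkout"):
        return True  # always sample a critical path
    # Same trace ID -> same decision on every service that sees it.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

kept = sum(should_sample(f"trace-{i}", "GET /products") for i in range(100_000))
print(f"sampled {kept / 1000:.1f}% of traces")  # close to the 10% target
```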

The next step in mastering distributed tracing is understanding how to effectively query and analyze your traces to pinpoint performance bottlenecks and root causes of errors.
