OpenTelemetry is the universal translator for distributed systems, letting you see what’s happening inside your applications regardless of what language they’re written in.

Let’s see it in action. Imagine a simple e-commerce checkout flow: a user adds an item, goes to checkout, enters their details, and pays. In a monolithic app, you’d just tail logs. But in a microservice world, that checkout might involve a frontend service, an order service, a payment service, and a user service. When a payment fails, where do you look?

Here’s a simplified trace of that checkout flow using OpenTelemetry, showing requests flowing between services:

{
  "traceId": "a1b2c3d4e5f67890",
  "spans": [
    {
      "traceId": "a1b2c3d4e5f67890",
      "spanId": "001",
      "parentSpanId": null,
      "name": "POST /checkout",
      "kind": "SERVER",
      "startTimeUnixNano": 1678886400000000000,
      "endTimeUnixNano": 1678886401000000000,
      "attributes": {
        "http.method": "POST",
        "http.url": "/checkout",
        "http.status_code": 200
      },
      "status": {"code": 0}
    },
    {
      "traceId": "a1b2c3d4e5f67890",
      "spanId": "002",
      "parentSpanId": "001",
      "name": "POST /orders",
      "kind": "CLIENT",
      "startTimeUnixNano": 1678886400100000000,
      "endTimeUnixNano": 1678886400500000000,
      "attributes": {
        "http.method": "POST",
        "http.url": "/orders",
        "http.status_code": 200
      },
      "status": {"code": 0}
    },
    {
      "traceId": "a1b2c3d4e5f67890",
      "spanId": "003",
      "parentSpanId": "002",
      "name": "POST /payments",
      "kind": "CLIENT",
      "startTimeUnixNano": 1678886400600000000,
      "endTimeUnixNano": 1678886400900000000,
      "attributes": {
        "http.method": "POST",
        "http.url": "/payments",
        "http.status_code": 500,
        "error.message": "Payment gateway timeout"
      },
      "status": {"code": 2}
    }
  ]
}

This JSON represents a trace: a collection of spans that shows the journey of a request through your system. The traceId links all the spans together. Each span is a unit of work, like an HTTP request or a database query. The spanId uniquely identifies a span, and parentSpanId records the causal relationship: span 002 (calling /orders) was initiated by span 001 (handling /checkout). The IDs here are shortened for readability; real trace IDs are 32 hex characters and span IDs are 16. kind tells us whether the span was a SERVER (handling an incoming request) or a CLIENT (making an outgoing request). Notice that the /payments span (spanId 003) has a status.code of 2, which indicates an error (0 means unset and 1 means OK), and an error.message attribute. This is the power of tracing: you can pinpoint exactly which call failed and why.
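To make the parent/child relationship concrete, here is a minimal sketch that rebuilds the call tree from spanId and parentSpanId and picks out the failing span. It uses only the JDK; TraceTree, SpanData, and the helper methods are hypothetical names for illustration, not part of the OpenTelemetry API, and the sample values are copied from the trace above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model only; these types are not part of the OpenTelemetry API. */
final class TraceTree {

    /** A pared-down span carrying just the fields used from the trace above. */
    record SpanData(String spanId, String parentSpanId, String name,
                    long startNanos, long endNanos, int statusCode) {
        long durationMillis() {
            return (endNanos - startNanos) / 1_000_000;
        }

        boolean isError() {
            return statusCode == 2; // 2 marks an error in the example trace
        }
    }

    /** Groups spans by parent ID; root spans (null parent) are stored under "". */
    static Map<String, List<SpanData>> childrenByParent(List<SpanData> spans) {
        Map<String, List<SpanData>> index = new LinkedHashMap<>();
        for (SpanData span : spans) {
            String key = span.parentSpanId() == null ? "" : span.parentSpanId();
            index.computeIfAbsent(key, k -> new ArrayList<>()).add(span);
        }
        return index;
    }

    /** Appends the subtree under parentId to out, one indented line per span. */
    static void render(Map<String, List<SpanData>> index, String parentId,
                       int depth, StringBuilder out) {
        for (SpanData span : index.getOrDefault(parentId, List.of())) {
            out.append("  ".repeat(depth))
               .append(span.name())
               .append(" (").append(span.durationMillis()).append(" ms)")
               .append(span.isError() ? " ERROR" : "")
               .append('\n');
            render(index, span.spanId(), depth + 1, out);
        }
    }

    /** Returns the names of all spans whose status marks an error. */
    static List<String> failedSpanNames(List<SpanData> spans) {
        return spans.stream().filter(SpanData::isError).map(SpanData::name).toList();
    }

    /** The three spans from the example trace. */
    static List<SpanData> exampleSpans() {
        return List.of(
            new SpanData("001", null, "POST /checkout",
                1678886400000000000L, 1678886401000000000L, 0),
            new SpanData("002", "001", "POST /orders",
                1678886400100000000L, 1678886400500000000L, 0),
            new SpanData("003", "002", "POST /payments",
                1678886400600000000L, 1678886400900000000L, 2));
    }

    public static void main(String[] args) {
        List<SpanData> spans = exampleSpans();
        StringBuilder out = new StringBuilder();
        render(childrenByParent(spans), "", 0, out);
        System.out.print(out);
        System.out.println("Failed: " + failedSpanNames(spans));
    }
}
```

A tracing backend does essentially this grouping when it draws a waterfall view, which is why every span must carry both the shared traceId and its own parentSpanId.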

OpenTelemetry provides APIs, SDKs, and the Collector. The APIs are the language-specific interfaces you call to instrument your code. The SDKs implement those APIs and process telemetry inside your application: sampling, batching, and exporting. The Collector is an optional standalone service, run as an agent or a gateway, that receives, processes, and exports telemetry data to various backends (like Jaeger, Prometheus, or cloud-native observability platforms).

The core problem OpenTelemetry solves is visibility across service boundaries: services are deployed independently, but debugging them requires understanding their interactions. Without instrumentation, you’re flying blind, relying on fragmented logs and guesswork. With it, you get a unified view, a breadcrumb trail for every request.

The mental model is simple: every meaningful operation in your system should emit a span. This includes:

  • Incoming requests: HTTP servers, gRPC servers, message queue consumers.
  • Outgoing requests: HTTP clients, gRPC clients, database drivers, message queue producers.
  • Internal operations: Long-running computations, critical function calls, specific business logic steps.

Instrumentation involves adding small code snippets to your application. For example, here is manual instrumentation in Java using the OpenTelemetry SDK and the JDK’s built-in HTTP client (java.net.http):

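import io.opentelemetry.api.baggage.propagation.W3CBaggagePropagator;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.context.propagation.TextMapPropagator;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Initialize OpenTelemetry SDK (simplified)
OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
    .setTracerProvider(SdkTracerProvider.builder().build())
    .setPropagators(ContextPropagators.create(TextMapPropagator.composite(
        W3CTraceContextPropagator.getInstance(), W3CBaggagePropagator.getInstance())))
    .build();

// Get a tracer instance
Tracer tracer = sdk.getTracer("my-instrumented-app");

// Instrumenting an outgoing HTTP request
Span parentSpan = ...; // Assuming this is an incoming request's span
try (Scope parentScope = parentSpan.makeCurrent()) {
    // Start the client span before sending so its duration covers the call;
    // it picks up parentSpan as its parent from the current context.
    Span clientSpan = tracer.spanBuilder("POST /orders")
        .setSpanKind(SpanKind.CLIENT)
        .startSpan();
    try (Scope clientScope = clientSpan.makeCurrent()) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
            .uri(URI.create("http://orders-service/orders"))
            .POST(HttpRequest.BodyPublishers.ofString("{\"item\": \"widget\"}"));

        // Manual instrumentation must inject the trace context into the headers
        // itself; an auto-instrumentation agent would do this for you.
        sdk.getPropagators().getTextMapPropagator()
            .inject(Context.current(), builder, HttpRequest.Builder::header);

        HttpClient client = HttpClient.newBuilder().build();
        HttpResponse<String> response =
            client.send(builder.build(), HttpResponse.BodyHandlers.ofString());

        clientSpan.setAttribute("http.method", "POST");
        clientSpan.setAttribute("http.url", "http://orders-service/orders");
        clientSpan.setAttribute("http.status_code", response.statusCode());
        if (response.statusCode() >= 400) {
            clientSpan.setStatus(StatusCode.ERROR, "HTTP error: " + response.statusCode());
        }
    } catch (IOException | InterruptedException e) {
        clientSpan.recordException(e);
        clientSpan.setStatus(StatusCode.ERROR);
        throw e;
    } finally {
        clientSpan.end();
    }
}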
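
The W3CTraceContextPropagator registered in the setup above carries trace context between services in a traceparent HTTP header. The following is a rough sketch of that header's version 00 layout, using only the JDK; TraceParent and its methods are hypothetical helpers written for illustration, not OpenTelemetry classes, and a real propagator also handles details (such as tracestate and future versions) that are left out here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative helper for the W3C traceparent header; not an OpenTelemetry class. */
final class TraceParent {

    // Layout: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
    private static final Pattern FORMAT =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    /** The fields a downstream service needs to continue the trace. */
    record Fields(String traceId, String parentSpanId, boolean sampled) {}

    /** Builds the header value a client would attach to an outgoing request. */
    static String format(String traceId, String spanId, boolean sampled) {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    /** Parses an incoming header value; returns empty if it is malformed. */
    static Optional<Fields> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero IDs are defined as invalid by the specification.
        if (traceId.equals("0".repeat(32)) || spanId.equals("0".repeat(16))) {
            return Optional.empty();
        }
        // The lowest bit of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return Optional.of(new Fields(traceId, spanId, sampled));
    }

    public static void main(String[] args) {
        String header = format(
            "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true);
        System.out.println("traceparent: " + header);
        System.out.println(parse(header));
    }
}
```

The receiving service starts its SERVER span with the parsed trace ID and uses the parsed span ID as its parentSpanId, which is how spans emitted by different processes end up in the same trace.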

The most surprising thing about OpenTelemetry is how little manual effort is often required for common frameworks. Many languages have auto-instrumentation agents or libraries that automatically wrap popular libraries (such as HTTP clients and servers, JDBC, and Kafka) without you needing to touch your application code. This means you can often get broad visibility with minimal code changes, for example by starting a Java application with the OpenTelemetry Java agent attached through the JVM’s -javaagent flag, or by installing the instrumentation packages in Python.

The next concept you’ll encounter is how to effectively query and visualize this data, which leads into understanding distributed tracing backends like Jaeger or Tempo, and the role of metrics and logs in a complete observability strategy.
