Grafana Tempo’s remote write functionality for span metrics is, at its core, a pipeline for ingesting trace data and deriving metrics from the spans it contains.
Let’s dive into how it actually works. Imagine you have a distributed tracing setup (Jaeger, an OpenTelemetry SDK, or another Tempo) generating spans. These spans carry rich information about requests as they traverse your services: duration, status, service name, operation name, and various attributes. These backends can send span data directly to Tempo, but storing raw spans is only half the story: a specialized pipeline can also transform span data into metrics. This is the "Span Metrics Pipeline," and the "remote write" part refers to pushing those derived metrics to a Prometheus-compatible backend.
Here’s a simplified look at a typical flow:
- Span Ingestion: Your tracing backend (e.g., an OpenTelemetry Collector configured with an `otlp` exporter) sends spans to Tempo’s distributor. Tempo can accept spans via the OpenTelemetry Protocol (OTLP), Jaeger’s Thrift/Protobuf, or Zipkin.
- Span Storage: Tempo stores these raw spans, making them searchable and viewable in Grafana.
- Metrics Extraction (The Pipeline): This is where the magic for span metrics happens. Tempo, or more commonly, an intermediary like the OpenTelemetry Collector, processes these incoming spans. It looks for specific attributes or the inherent properties of spans (like duration) and aggregates them into metrics.
- Example: a common derived metric is a request-duration histogram for spans that represent HTTP server requests (often named after the `http.server.duration` semantic convention, depending on your instrumentation). Other metrics might include counts of requests per service/operation, error rates (spans with an error status), or latency percentiles.
- Metric Export: These extracted metrics are then exported to a Prometheus-compatible backend, either scraped from the Collector’s `prometheus` exporter endpoint or pushed via Prometheus remote write.
This pipeline allows you to gain operational insights from your traces without needing to instrument your applications again specifically for metrics. You get metrics like request rates, error rates, and latency distributions directly from your existing trace data.
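To make that concrete, here are the kinds of RED-style queries these derived metrics enable in Grafana. The metric and label names below are illustrative: they assume the defaults of the OpenTelemetry Collector’s spanmetrics component as exposed through a Prometheus exporter, and they vary by version and configuration.

```promql
# Request rate per service
sum by (service_name) (rate(traces_span_metrics_calls_total[5m]))

# Error rate: fraction of spans that ended with an error status
sum by (service_name) (rate(traces_span_metrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
  /
sum by (service_name) (rate(traces_span_metrics_calls_total[5m]))

# p95 latency from the generated duration histogram
histogram_quantile(0.95,
  sum by (le, service_name) (rate(traces_span_metrics_duration_seconds_bucket[5m])))
```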
How to Set It Up (The Core Idea)
The most common and flexible way to leverage Tempo’s span metrics pipeline is not by having Tempo itself do the heavy lifting of metric generation from spans. Instead, you typically use the OpenTelemetry Collector.
Here’s the mental model:
- Tracing Backend: Generates spans.
- OpenTelemetry Collector (OTel Collector):
- Receives spans from the tracing backend through a receiver (e.g., `otlp`).
- Generates metrics from those spans using the `spanmetrics` connector (formerly a processor).
- Exports the generated metrics to a Prometheus-compatible endpoint.
- (Optionally) Exports the raw spans to Tempo for storage and debugging.
- Prometheus or Mimir: Scrapes (or receives via remote write) and stores the generated metrics.
- Grafana: Visualizes both traces (from Tempo) and metrics (from Prometheus/Mimir).
Live Example: OTel Collector Configuration
Let’s say you’re using the OTel Collector to receive spans from your applications (via OTLP) and you want to generate metrics from them.
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

connectors:
  # The spanmetrics connector (successor to the spanmetrics processor)
  # consumes spans from a traces pipeline and emits request/error/duration
  # metrics into a metrics pipeline.
  spanmetrics:
    # Buckets for the span-duration histogram, expressed as durations.
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s]
    # Span attributes to attach as metric labels, in addition to the
    # defaults (service.name, span.name, span.kind, status.code).
    dimensions:
      - name: http.method
      - name: http.route
      - name: http.status_code
    # CUMULATIVE is what a scraping Prometheus expects; DELTA is only
    # useful for backends that accept delta temporality.
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889" # Exposes the generated metrics on this port
  otlp/tempo:
    endpoint: "tempo:4317" # Also forward the raw spans to Tempo
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, otlp/tempo] # The connector ends the traces pipeline...
    metrics:
      receivers: [spanmetrics] # ...and starts the metrics pipeline
      exporters: [prometheus]
```
In this configuration:

- The `spanmetrics` connector bridges the two pipelines: it is listed as an exporter of the `traces` pipeline and as a receiver of the `metrics` pipeline.
- `aggregation_temporality: AGGREGATION_TEMPORALITY_CUMULATIVE` matches what a scraping Prometheus expects; delta temporality is only for push-based backends that support it.
- The connector automatically generates a call counter and a duration histogram for every span (exposed by the `prometheus` exporter under names like `traces_span_metrics_calls_total` and `traces_span_metrics_duration_seconds_bucket`/`_count`/`_sum`, depending on the configured namespace and Collector version).
- `dimensions` specifies which additional span attributes become labels on those metrics.
This collector would then expose metrics on http://localhost:8889/metrics. You’d configure Prometheus to scrape this endpoint.
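On the Prometheus side, a minimal scrape job for that endpoint could look like the following. The job name and target host are placeholders for your environment.

```yaml
scrape_configs:
  - job_name: otel-collector-spanmetrics
    scrape_interval: 15s
    static_configs:
      - targets: ["otel-collector:8889"]
```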
The "Aha!" Moment: Why This Isn’t Just Tempo
The most surprising truth about Tempo’s span metrics pipeline is that Tempo itself doesn’t have to generate the metrics. Tempo does ship a metrics-generator component that can derive span metrics server-side and remote-write the resulting series to a Prometheus-compatible backend, but a robust, flexible, and idiomatic alternative is the OpenTelemetry Collector. The Collector acts as the intelligent intermediary, transforming raw trace data into Prometheus-readable metrics before they ever reach a metrics storage system. This separation of concerns is key: Tempo is for traces, Prometheus is for metrics, and the OTel Collector bridges them.
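For completeness, Tempo’s server-side option looks roughly like this in the Tempo configuration. This is a sketch; field names and defaults vary between Tempo versions, and the Prometheus URL is a placeholder, so check the docs for your release.

```yaml
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      # Enable the span-metrics processor for all tenants
      processors: [span-metrics]
```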
This approach also means you can send your spans to Tempo for archival and analysis, and simultaneously extract metrics from them using the OTel Collector, all from the same source of spans.
The next problem you’ll likely encounter is figuring out how to correlate these metrics back to individual traces effectively in Grafana.
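One hedged head start on that correlation problem: the spanmetrics connector can attach exemplars (sampled trace IDs) to the histograms it generates, which lets Grafana link a latency panel directly to the originating trace in Tempo. Assuming a reasonably recent Collector, the relevant fragments look like this:

```yaml
connectors:
  spanmetrics:
    exemplars:
      enabled: true

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    # Exemplars are only exposed in the OpenMetrics format
    enable_open_metrics: true
```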