Grafana Tempo’s metrics generator can churn out RED metrics for you, and the real magic is realizing you don’t need separate metrics instrumentation to get them: the traces you’re already collecting contain everything required.
Let’s see this in action. Imagine you have an application instrumented with OpenTelemetry, sending traces to Tempo. Tempo can then derive RED metrics directly from those traces: Rate, Errors, and Duration.
Here’s a typical scenario. Your application is a service that handles user requests. You want to know:
- Rate: How many requests are coming in per second?
- Errors: What percentage of those requests are failing?
- Duration: How long is it taking to process those requests on average?
Tempo, being a trace storage backend, has all the raw data. The metrics generator is an optional component that aggregates this trace data into meaningful metrics as spans are ingested. It works from the span information in your traces: each span represents a single operation within your service.
When you configure Tempo’s metrics generator, you’re telling it how to interpret these spans. Spans are grouped into "operations", typically by service and span name, which usually map to specific API endpoints or internal functions. For each operation, the generator can then extract:
- Request count: Simply count the number of spans that match your operation definition.
- Error count: Count spans that have an `error` attribute set to `true` or specific error status codes.
- Latency: Measure the duration of spans that match your operation.
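Conceptually, that per-operation aggregation can be sketched in a few lines of Python. This is a toy model, not Tempo’s implementation; the `Span` shape and field names are made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    route: str           # e.g. "POST /users"
    duration_s: float    # span end minus start, in seconds
    error: bool          # derived from the status code or an error attribute

@dataclass
class RedAggregate:
    requests: int = 0
    errors: int = 0
    durations: list = field(default_factory=list)

def aggregate(spans):
    """Fold spans into per-operation RED aggregates."""
    out = {}
    for span in spans:
        agg = out.setdefault(span.route, RedAggregate())
        agg.requests += 1                      # Rate: count every matching span
        agg.errors += span.error               # Errors: count failed spans
        agg.durations.append(span.duration_s)  # Duration: record latency
    return out

spans = [
    Span("POST /users", 0.120, False),
    Span("POST /users", 0.450, True),
    Span("POST /users", 0.080, False),
]
red = aggregate(spans)["POST /users"]
print(red.requests, red.errors, max(red.durations))  # 3 1 0.45
```

In the real generator the same fold happens continuously as spans are ingested, and the results are emitted as Prometheus counters and histograms rather than in-memory lists.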
In Tempo this is the job of the span-metrics processor. A minimal configuration looks something like this (URLs and paths are deployment-specific, and the overrides syntax varies slightly between Tempo versions):

```yaml
metrics_generator:
  processor:
    span_metrics:
      # Promote these span attributes to metric labels so you can
      # slice by endpoint, e.g. "HTTP POST /users".
      dimensions:
        - http.method
        - http.route
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write

overrides:
  defaults:
    metrics_generator:
      # Enable the processor (per tenant, or for all tenants as here).
      processors: [span-metrics]
```

This configuration tells Tempo: "For every ingested span, increment a call counter (traces_spanmetrics_calls_total) and record the span’s duration in a latency histogram (traces_spanmetrics_latency), labeled by service, span name, status code, and the http.method and http.route attributes." Failures aren’t a separate metric: spans with an error status are counted under the status_code="STATUS_CODE_ERROR" label, so the error count is just a filtered view of the same counter.
Tempo then writes these metrics to any Prometheus-compatible backend via remote_write (the generated series are pushed, not scraped; Tempo’s own /metrics endpoint exposes only its operational metrics). From there you can query and visualize them in Grafana.
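With the span-metrics processor’s default metric names (a traces_spanmetrics_calls_total counter and a traces_spanmetrics_latency histogram), the three RED signals fall out as ordinary PromQL; the span_name value here is illustrative:

```promql
# Rate: requests per second for one operation
sum(rate(traces_spanmetrics_calls_total{span_name="POST /users"}[5m]))

# Errors: fraction of requests that failed
sum(rate(traces_spanmetrics_calls_total{span_name="POST /users", status_code="STATUS_CODE_ERROR"}[5m]))
/
sum(rate(traces_spanmetrics_calls_total{span_name="POST /users"}[5m]))

# Duration: p95 latency from the histogram buckets
histogram_quantile(0.95,
  sum(rate(traces_spanmetrics_latency_bucket{span_name="POST /users"}[5m])) by (le))
```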
What’s often overlooked is that Tempo doesn’t need to be the sole source of truth for your metrics. It can act as a powerful bridge, generating your primary RED metrics from your trace data, and then you can augment this with other metrics sources (like application-level Prometheus exporters) for more granular insights. The key is that the trace data already contains this information; Tempo just makes it accessible as metrics.
The actual calculation of the "rate" metric happens at query time: the generator emits a monotonically increasing call counter, and your monitoring system applies a windowed rate function to turn it into requests per second. For "errors," it’s a simple ratio of failed calls to total calls within that window. Duration metrics are exposed as histograms, allowing you to calculate average, p95, and p99 latencies in your monitoring system.
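That query-time math is simple enough to sketch. Assuming Prometheus-style cumulative histogram buckets (upper bound, cumulative count), a percentile is found by linear interpolation inside the bucket that crosses the target count, roughly what a histogram-quantile function does; the numbers below are illustrative:

```python
def error_ratio(failed, total):
    """Errors: failed requests as a fraction of all requests."""
    return failed / total if total else 0.0

def quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets,
    interpolating linearly inside the bucket that crosses the target."""
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # How far through this bucket's observations the target falls.
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return prev_bound

# Cumulative counts: 60 requests took <= 0.1s, 90 took <= 0.25s, 100 took <= 0.5s.
buckets = [(0.1, 60), (0.25, 90), (0.5, 100)]
print(error_ratio(5, 100))      # 0.05
print(quantile(0.95, buckets))  # ~0.375: midway through the 0.25-0.5s bucket
```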
The next step is to integrate these generated metrics with your alerting system to proactively identify issues.