The most surprising thing about Tempo’s Service Graph is that it doesn’t actually collect any metrics itself; it derives them from your traces.
Let’s see this in action. Imagine you have a simple HTTP service. When a request comes in, your application generates a trace. This trace has spans representing different operations, like "receive HTTP request," "lookup user in database," and "send HTTP response."
Here’s a snippet of what a trace might look like in Tempo, simplified:
{
  "traceID": "abc123xyz",
  "spans": [
    {
      "traceID": "abc123xyz",
      "spanID": "1",
      "parentSpanID": "",
      "operationName": "GET /users/{id}",
      "startTime": "2023-10-27T10:00:00Z",
      "duration": 50000000, // 50ms, expressed in nanoseconds
      "tags": {
        "http.method": "GET",
        "http.url": "/users/123",
        "http.status_code": 200
      }
    },
    {
      "traceID": "abc123xyz",
      "spanID": "2",
      "parentSpanID": "1",
      "operationName": "DB Query",
      "startTime": "2023-10-27T10:00:00.010Z",
      "duration": 20000000, // 20ms, expressed in nanoseconds
      "tags": {
        "db.system": "postgresql",
        "db.statement": "SELECT * FROM users WHERE id = 123"
      }
    }
  ]
}
The Service Graph in Grafana takes these traces and aggregates them into per-service request, error, and latency metrics. For a given service (identified by its name, typically the service.name resource attribute or a job tag), it looks at all incoming requests.
Latency: Tempo calculates latency by looking at the duration of the root span for each request. In the example above, the root span is GET /users/{id} with a duration of 50ms. If Tempo finds 100 such traces for your users-api service in a 5-minute window, it can calculate percentiles for latency: 50th percentile (median), 95th percentile, 99th percentile, etc.
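The percentile aggregation described above can be sketched in a few lines. This is an illustrative nearest-rank implementation, not Tempo's actual code; the function name and the nanosecond durations are assumptions matching the trace snippet earlier.

```python
# Sketch of percentile aggregation over root-span durations (illustrative,
# not Tempo's implementation). Durations are in nanoseconds, as in the
# trace snippet above; results are reported in milliseconds.
def latency_percentiles(durations_ns, percentiles=(50, 95, 99)):
    ordered = sorted(durations_ns)
    results = {}
    for p in percentiles:
        # Nearest-rank method: pick the sample at the p-th percentile rank.
        rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        results[f"p{p}"] = ordered[rank] / 1_000_000  # ns -> ms
    return results

# Four root spans from a 5-minute window: 50ms, 20ms, 80ms, 45ms.
window = [50_000_000, 20_000_000, 80_000_000, 45_000_000]
print(latency_percentiles(window))  # {'p50': 45.0, 'p95': 80.0, 'p99': 80.0}
```

Real systems use smarter sketches (histograms, t-digests) so they never hold every duration in memory, but the idea is the same: rank the window's durations and read off the percentiles.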
Error Rates: Error rates are derived from status codes or specific error tags. For HTTP services, an http.status_code in the 5xx range typically indicates an error. If a trace’s root span has an http.status_code of 500, it’s counted as an error. Tempo sums the total number of requests and the number of "error" requests within a time window to compute an error rate (e.g., (error_count / total_count) * 100).
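The error-rate formula is simple enough to show directly. This sketch assumes spans shaped like the JSON snippet above, with an http.status_code tag; it is an illustration of the counting logic, not Tempo's source.

```python
# Sketch of the error-rate calculation described above (illustrative,
# not Tempo's implementation): count root spans whose http.status_code
# is 5xx, then divide by the total request count in the window.
def error_rate(root_spans):
    total = len(root_spans)
    errors = sum(
        1 for s in root_spans
        if 500 <= s.get("tags", {}).get("http.status_code", 0) <= 599
    )
    return (errors / total) * 100 if total else 0.0

window = [
    {"tags": {"http.status_code": 200}},
    {"tags": {"http.status_code": 500}},
    {"tags": {"http.status_code": 200}},
    {"tags": {"http.status_code": 503}},
]
print(error_rate(window))  # 50.0
```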
How it works internally: Tempo itself stores the raw trace data, but the graph is not computed by scanning traces on the fly at query time. The "trace-to-metric" conversion happens at ingest: Tempo’s metrics-generator (its service-graphs processor, which can also run in Grafana Agent or the OpenTelemetry Collector) inspects spans as they arrive, derives request rate, error, and latency metrics for each service-to-service edge, and remote-writes them to a Prometheus-compatible metrics store. When you open the Service Graph in Grafana, it queries those pre-aggregated metrics with PromQL-style queries rather than re-reading the traces. The key point stands: the source of truth for the graph’s data is the trace data flowing through Tempo.
The "services" and "operations" you see in the graph are directly mapped from the service.name and operationName (or similar semantic conventions) within your trace spans. If your traces don’t have consistent service.name tags, your graph will be a mess of unknown or misidentified services.
The way Tempo’s Service Graph correlates requests between services is by examining spanID and parentSpanID relationships. When service-a calls service-b, the client span emitted by service-a has a spanID that appears as the parentSpanID of the server span emitted by service-b. This parent-child relationship across service boundaries is what builds the directed edges between nodes (services) in the graph.
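The edge-building step above can be sketched as follows. The span shape (a flat list with a service field) and the function name are assumptions for illustration; the real processor pairs client and server span kinds, but the parent-child matching is the core idea.

```python
# Sketch of deriving service-graph edges from parent/child span links
# (illustrative, not Tempo's service-graphs processor). An edge
# (caller, callee) is emitted whenever a child span belongs to a
# different service than its parent.
def service_edges(spans):
    by_id = {s["spanID"]: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.get("parentSpanID"))
        if parent and parent["service"] != span["service"]:
            edges.add((parent["service"], span["service"]))
    return edges

trace = [
    {"spanID": "1", "parentSpanID": "", "service": "frontend"},
    {"spanID": "2", "parentSpanID": "1", "service": "users-api"},
    {"spanID": "3", "parentSpanID": "2", "service": "postgres"},
]
print(sorted(service_edges(trace)))
# [('frontend', 'users-api'), ('users-api', 'postgres')]
```

Note that spans within the same service (e.g., an internal function call) produce no edge; only cross-service parent-child pairs appear in the graph.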
The most common pitfall is relying on default or inconsistent service naming. If your backend service sometimes reports its name as users-backend and other times as user-service, Tempo will treat these as two distinct services in the graph, making it impossible to get a holistic view of your user management system’s performance. You need to enforce a strict, consistent convention for the service.name attribute across all your instrumented applications.
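One pragmatic guard against this pitfall is a normalization pass over reported names before they reach your dashboards. Everything here is hypothetical: the alias table, the names, and the function are illustrations of the convention-enforcement idea, not a Tempo feature.

```python
# Hypothetical sanity check (not part of Tempo): map known aliases to a
# single canonical service.name so a split node in the graph can be
# caught before it misleads anyone.
CANONICAL = {
    "users-backend": "user-service",  # assumed legacy alias
    "user-service": "user-service",
}

def normalize(name):
    # Unknown names pass through unchanged so new services still appear.
    return CANONICAL.get(name, name)

reported = {"users-backend", "user-service", "payments"}
print(sorted({normalize(n) for n in reported}))  # ['payments', 'user-service']
```

The better long-term fix is upstream: set service.name once per application (for OpenTelemetry SDKs, the OTEL_SERVICE_NAME environment variable) so no normalization is needed at all.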
The next hurdle you’ll likely encounter is understanding how to filter and drill down into specific error types beyond just HTTP 5xx codes, which requires richer span tagging.