High cardinality in Grafana Tempo isn’t about the number of traces, but the variety of attribute values within those traces.
Let’s see it in action. Imagine you’re tracing requests for a simple user service.
```go
// User service trace
func getUser(ctx context.Context, userID string) {
	ctx, span := tracer.Start(ctx, "getUser")
	defer span.End()
	span.SetAttributes(
		attribute.String("user.id", userID),
		attribute.String("http.method", "GET"),
		attribute.String("http.url", "/users/"+userID), // Problematic!
	)
	// ... fetch user from DB ...
	span.SetAttributes(
		attribute.Int("db.rows_affected", 1),
	)
}
```
The http.url attribute here is where the trouble starts. If userID can be any arbitrary string (e.g., user123, user456, user_abc_xyz), each unique userID creates a new, distinct value for http.url. Tempo, like many tracing backends, indexes these attribute values to allow for efficient querying. When you have millions of unique values for a single attribute, that’s high cardinality.
The Problem: Query Performance and Storage Costs
High cardinality directly impacts Tempo’s ability to query traces. When you search for traces, Tempo needs to scan through indexed attribute values. If there are millions of unique values for an attribute like http.url, the index becomes massive, and queries become slow, potentially timing out. Furthermore, storing these extensive indexes consumes significant disk space, driving up operational costs.
The Solution: Attribute Normalization and Reduction
The core strategy for managing high cardinality is to normalize or reduce the number of unique attribute values. This means transforming values that are functionally the same but syntactically different into a consistent, limited set of values.
1. Identify High Cardinality Attributes:
The first step is to find out which attributes are causing the problem. Tempo gives you some signals to work with.

- Diagnosis: Query Tempo's search API via `curl` or directly in Grafana's Explore view, and watch for slow or failing searches:

```bash
curl -G "http://localhost:3200/api/search" \
  --data-urlencode 'query={service="user-service"}' \
  --data-urlencode 'start=1678886400000' \
  --data-urlencode 'end=1678972800000' \
  --data-urlencode 'limit=10000' \
  --data-urlencode 'direction=forwards' \
  --data-urlencode 'trace-id=...' # Optional: filter by specific trace ID
```

(Note: the `search` endpoint is for general trace searching. For explicit cardinality metrics, look at the Prometheus metrics Tempo itself exposes, such as `tempo_ingester_cardinality_labels_total` and `tempo_ingester_cardinality_label_values_total`.) More practically, you'll observe slow queries or high resource utilization on your Tempo instances. In Grafana, try querying for traces with specific, potentially high-cardinality attributes. If those queries are slow or fail, that's your signal.
2. Normalize Dynamic Values:
For attributes like http.url where parts are dynamic (e.g., userID), replace the dynamic part with a placeholder or a generic value.
- Diagnosis: Review your application code or service mesh configuration (e.g., Istio, Linkerd) that generates spans. Look for attributes populated with user-provided data, IDs, or other highly variable information.
- Fix: Modify your tracing instrumentation. Instead of setting `http.url` to `/users/12345`, set it to a more generic pattern. Or, if using a service mesh, configure it to emit normalized routes (for Istio, this might involve configuring Envoy filters).

```go
// In your Go application code:
span.SetAttributes(
	attribute.String("user.id", userID),
	attribute.String("http.method", "GET"),
	attribute.String("http.route", "/users/{id}"), // Use a route template
)
```

- Why it works: By using a fixed string like `/users/{id}` for all user lookups, you reduce millions of potential `http.url` values down to a single indexed value, drastically shrinking the cardinality.
3. Filter Out Unnecessary Attributes:
Some attributes might be useful for debugging in development but are too high in cardinality or simply not needed for long-term analysis in production.
- Diagnosis: Again, code review and an understanding of what data is being added to spans. Are you adding `request_id` or `correlation_id` as a span attribute? These are often unique per request.
- Fix: Remove the instrumentation that adds these high-cardinality attributes. Alternatively, configure your OpenTelemetry SDK or agent to drop specific attributes.

```go
// In your Go application code:
// REMOVE THIS LINE if 'request_id' is causing high cardinality
// span.SetAttributes(attribute.String("request_id", generateRequestID()))
```

```yaml
# Example OpenTelemetry Collector configuration snippet
processors:
  attributes/drop_high_cardinality:
    actions:
      - key: request_id
        action: delete
```

- Why it works: By not sending the high-cardinality attribute to Tempo at all, you prevent it from being indexed and stored, thereby eliminating the cardinality issue at its source.
4. Use Consistent Data Types and Values:
Ensure that similar concepts are represented by the same attribute key and, where possible, consistent value formats.
- Diagnosis: Look for variations in attribute names (e.g., `userId`, `user_id`, `UserID`) or value formats (e.g., `true`, `True`, `1` for boolean true).
- Fix: Standardize your attribute naming and value representation. Use the OpenTelemetry semantic conventions as a guide. For boolean values, consistently use `true` or `false` (as strings, or actual boolean types if supported by your instrumentation).

```go
// Consistent usage:
span.SetAttributes(
	attribute.String("user.id", userID),
	attribute.Bool("user.is_active", true),
)
```

- Why it works: This prevents Tempo from treating `user_id` and `userId` as different attributes, or `true` and `True` as different values for the same attribute. It consolidates the indexing.
5. Leverage Service Names and Operation Names:
These are fundamental to tracing and are typically low cardinality. Ensure they are used effectively.
- Diagnosis: Check your `service.name` and `span.name` attributes. Are they descriptive but not overly granular?
- Fix: Ensure your `service.name` accurately reflects the microservice, and `span.name` describes the operation (e.g., `HTTP GET /users/{id}`, `User.GetUser`). Avoid making `span.name` unique per request. (In OpenTelemetry, `service.name` is typically set once on the Resource rather than per span.)

```go
// Good practice:
ctx, span := tracer.Start(ctx, "HTTP GET /users/{id}")
span.SetAttributes(
	attribute.String("service.name", "user-service"),
	attribute.String("user.id", userID),
)
```

- Why it works: `service.name` and `span.name` are often the primary dimensions for filtering and aggregation. Keeping them low cardinality ensures these core operations remain fast.
6. Sampling (A Last Resort for Cardinality):
While primarily a strategy for reducing trace volume, aggressive sampling can indirectly help with cardinality by reducing the volume of attribute values being ingested and indexed. However, it is usually not the right primary fix for high-cardinality attribute values.
- Diagnosis: If you’ve exhausted other options and are still facing issues, consider if sampling is appropriate for your use case.
- Fix: Implement a tail-based or head-based sampling strategy using your OpenTelemetry SDK or Collector.
```yaml
# Example OpenTelemetry Collector tail-sampling processor
processors:
  tail_sampling:
    policies:
      - name: error-sampling
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: probabilistic-sampling
        type: probabilistic
        probabilistic:
          sampling_percentage: 10 # Sample 10% of traces
```

- Why it works: Fewer traces being sent means fewer attribute values being indexed and stored. However, this can lead to missing traces, so it's a trade-off.
The next challenge you’ll likely encounter is efficiently querying traces across a distributed system, even with low cardinality attributes.