High cardinality in Grafana Tempo isn’t about the number of traces, but the variety of attribute values within those traces.

Let’s see it in action. Imagine you’re tracing requests for a simple user service.

// User service trace
import (
	"context"

	"go.opentelemetry.io/otel/attribute"
)

func getUser(ctx context.Context, userID string) {
	// tracer is a trace.Tracer, e.g. obtained via otel.Tracer("user-service")
	ctx, span := tracer.Start(ctx, "getUser")
	defer span.End()

	span.SetAttributes(
		attribute.String("user.id", userID),
		attribute.String("http.method", "GET"),
		attribute.String("http.url", "/users/"+userID), // Problematic!
	)

	// ... fetch user from DB ...

	span.SetAttributes(
		attribute.Int("db.rows_affected", 1),
	)
}

The http.url attribute here is where the trouble starts. If userID can be any arbitrary string (e.g., user123, user456, user_abc_xyz), each unique userID creates a new, distinct value for http.url. Tempo, like many tracing backends, indexes these attribute values to allow for efficient querying. When you have millions of unique values for a single attribute, that’s high cardinality.

The Problem: Query Performance and Storage Costs

High cardinality directly impacts Tempo’s ability to query traces. When you search for traces, Tempo needs to scan through indexed attribute values. If there are millions of unique values for an attribute like http.url, the index becomes massive, and queries become slow, potentially timing out. Furthermore, storing these extensive indexes consumes significant disk space, driving up operational costs.
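
To make the failure mode concrete, this is the kind of exact-match TraceQL query that suffers (the attribute is the one from the example above):

    { span.http.url = "/users/12345" }

To answer it, Tempo must sift through every distinct value http.url has ever taken, so millions of one-off values mean a larger index, more I/O, and slower results.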

The Solution: Attribute Normalization and Reduction

The core strategy for managing high cardinality is to normalize or reduce the number of unique attribute values. This means transforming values that are functionally the same but syntactically different into a consistent, limited set of values.

1. Identify High Cardinality Attributes:

The first step is to find out which attributes are causing the problem. Tempo has a built-in way to report on this.

  • Diagnosis: Use Tempo's tag values API to see how many distinct values an attribute has accumulated. You can query it via curl or browse tag values in Grafana’s Explore view.

    curl -s "http://localhost:3200/api/search/tag/http.url/values" \
      | jq '.tagValues | length'
    

    (Note: Tempo caps the size of tag value responses (the max_bytes_per_tag_values_query override), so a very large count, or a visibly truncated value list, is itself a strong hint that the attribute is high cardinality.)

    More practically, you’ll observe slow queries or high resource utilization on your Tempo instances. In Grafana, try querying for traces with specific, potentially high-cardinality attributes. If queries are slow or fail, that’s your signal.
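
To survey cardinality across all tags at once, here is a rough shell sketch. It assumes Tempo's search-tags endpoints and jq; per-tag value lists may be truncated by Tempo's response-size limit, so treat the counts as lower bounds:

    # List every tag name, then count the distinct values Tempo returns for each
    for tag in $(curl -s "http://localhost:3200/api/search/tags" | jq -r '.tagNames[]'); do
      count=$(curl -s "http://localhost:3200/api/search/tag/$tag/values" | jq '.tagValues | length')
      echo "$tag: $count"
    done

Any tag whose count runs into the thousands is a candidate for the normalization and filtering steps below.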

2. Normalize Dynamic Values:

For attributes like http.url where parts are dynamic (e.g., userID), replace the dynamic part with a placeholder or a generic value.

  • Diagnosis: Review your application code or service mesh configuration (e.g., Istio, Linkerd) that generates spans. Look for attributes populated with user-provided data, IDs, or other highly variable information.
  • Fix: Modify your tracing instrumentation. Instead of recording the raw URL like /users/12345, record a route template (the OpenTelemetry semantic conventions use http.route for exactly this).
    // In your Go application code:
    span.SetAttributes(
    	attribute.String("user.id", userID),
    	attribute.String("http.method", "GET"),
    	attribute.String("http.route", "/users/{id}"), // Use a route template
    )
    
    Or, if you use a service mesh, configure it to emit normalized route names; for Istio, this means adjusting its Envoy tracing configuration. If your HTTP framework doesn’t expose route templates, a small helper like the one sketched after this list can do the collapsing in your own code.
  • Why it works: By using a fixed string like /users/{id} for all user lookups, you reduce millions of potential http.url values down to a single indexed value, drastically shrinking the cardinality.
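
If no route template is available at instrumentation time, a minimal sketch of such a helper follows; the patterns and the normalizeRoute name are hypothetical, so adapt them to your own API surface:

    import "regexp"

    // Illustrative route patterns; extend these to cover your API.
    var routePatterns = []struct {
    	re       *regexp.Regexp
    	template string
    }{
    	{regexp.MustCompile(`^/users/[^/]+$`), "/users/{id}"},
    	{regexp.MustCompile(`^/orders/[^/]+/items$`), "/orders/{id}/items"},
    }

    // normalizeRoute maps a raw request path to a low-cardinality template.
    func normalizeRoute(path string) string {
    	for _, p := range routePatterns {
    		if p.re.MatchString(path) {
    			return p.template
    		}
    	}
    	return "other" // unknown paths share one bucket instead of exploding cardinality
    }

You would then set attribute.String("http.route", normalizeRoute(path)) instead of recording the raw URL.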

3. Filter Out Unnecessary Attributes:

Some attributes might be useful for debugging in development but are too high in cardinality or simply not needed for long-term analysis in production.

  • Diagnosis: Again, review your instrumentation code and understand what data is being added to spans. Are you adding request_id or correlation_id as a span attribute? These are often unique per request.
  • Fix: Remove the instrumentation that adds these high-cardinality attributes.
    // In your Go application code:
    // REMOVE THIS LINE if 'request_id' is causing high cardinality
    // span.SetAttributes(attribute.String("request_id", generateRequestID()))
    
    Alternatively, configure your OpenTelemetry SDK or Collector to drop specific attributes. Remember that the processor must also be referenced in the Collector’s traces pipeline, as shown after this list.
    # Example OpenTelemetry Collector configuration snippet
    processors:
      attributes/drop_high_cardinality:
        actions:
          - key: request_id
            action: delete
    
  • Why it works: By not sending the high-cardinality attribute to Tempo at all, you prevent it from being indexed and stored, thereby eliminating the cardinality issue at its source.
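
Defining the processor alone does nothing; it also has to be referenced in the Collector’s traces pipeline. A minimal sketch, with placeholder receiver and exporter names:

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [attributes/drop_high_cardinality, batch]
          exporters: [otlp]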

4. Use Consistent Data Types and Values:

Ensure that similar concepts are represented by the same attribute key and, where possible, consistent value formats.

  • Diagnosis: Look for variations in attribute names (e.g., userId, user_id, UserID) or value formats (e.g., true, True, 1 for boolean true).
  • Fix: Standardize your attribute naming and value representation, using the OpenTelemetry semantic conventions as a guide. For booleans, prefer a real boolean attribute type (e.g., attribute.Bool) over string variants like "true" and "True". If you can’t fix every producer at once, the Collector can rename stray keys in flight, as sketched after this list.
    // Consistent usage:
    span.SetAttributes(
    	attribute.String("user.id", userID),
    	attribute.Bool("user.is_active", true),
    )
    
  • Why it works: This prevents Tempo from treating user_id and userId as different attributes, or true and True as different values for the same attribute. It consolidates the indexing.
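
As a stopgap while producers are being fixed, the Collector’s attributes processor can fold stray keys into the canonical one; here the userId variant from the diagnosis above is merged into user.id:

    processors:
      attributes/normalize_keys:
        actions:
          - key: user.id
            from_attribute: userId
            action: upsert  # copy the value under the canonical key
          - key: userId
            action: delete  # then drop the stray key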

5. Leverage Service Names and Operation Names:

These are fundamental to tracing and are typically low cardinality. Ensure they are used effectively.

  • Diagnosis: Check your service.name and span.name attributes. Are they descriptive but not overly granular?
  • Fix: Ensure your service.name accurately reflects the microservice, and span.name describes the operation (e.g., HTTP GET /users/{id}, User.GetUser). Avoid making span.name unique per request. Note that in OpenTelemetry, service.name belongs on the Resource, set once at SDK initialization rather than per span; see the sketch after this list.
    // Good practice: a templated, low-cardinality span name.
    ctx, span := tracer.Start(ctx, "HTTP GET /users/{id}")
    defer span.End()
    span.SetAttributes(
    	attribute.String("user.id", userID),
    )
    
  • Why it works: service.name and span.name are often the primary dimensions for filtering and aggregation. Keeping them low cardinality ensures these core operations remain fast.
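
For completeness, a minimal sketch of setting service.name once at SDK initialization with the OpenTelemetry Go SDK (the semconv version path may differ in your project):

    import (
    	"go.opentelemetry.io/otel/sdk/resource"
    	sdktrace "go.opentelemetry.io/otel/sdk/trace"
    	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
    )

    func newTracerProvider() *sdktrace.TracerProvider {
    	res := resource.NewWithAttributes(
    		semconv.SchemaURL,
    		semconv.ServiceNameKey.String("user-service"), // service.name lives on the Resource
    	)
    	return sdktrace.NewTracerProvider(sdktrace.WithResource(res))
    }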

6. Sampling (A Last Resort for Cardinality):

While primarily a strategy for reducing trace volume, aggressive sampling can indirectly help with cardinality by reducing how many unique attribute values reach Tempo in the first place. It is usually not the primary fix, though: the traces that survive sampling still carry the problematic attributes, so prefer normalization and filtering first.

  • Diagnosis: If you’ve exhausted other options and are still facing issues, consider if sampling is appropriate for your use case.
  • Fix: Implement a head-based or tail-based sampling strategy in your OpenTelemetry SDK or Collector. A Collector tail-sampling example follows, and an SDK head-sampling sketch appears after this list.
    # Example OpenTelemetry Collector tail_sampling processor
    processors:
      tail_sampling:
        policies:
          - name: error-sampling
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: probabilistic-sampling
            type: probabilistic
            probabilistic:
              sampling_percentage: 10 # keep ~10% of traces
    
  • Why it works: Fewer traces being sent means fewer attribute values being indexed and stored. However, this can lead to missing traces, so it’s a trade-off.
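
For the head-based variant, a sketch of a ratio sampler in the Go SDK (wire this into your tracer provider setup):

    import sdktrace "go.opentelemetry.io/otel/sdk/trace"

    tp := sdktrace.NewTracerProvider(
    	// Respect the parent's sampling decision; otherwise keep ~10% of new traces.
    	sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
    )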

The next challenge you’ll likely encounter is efficiently querying traces across a distributed system, even with low cardinality attributes.
