Tempo’s trace query performance typically degrades when the underlying object store, most often S3, becomes a bottleneck and Tempo’s internal caching cannot absorb the query load.

Common Causes and Fixes

1. Insufficient Object Store Cache Size: Tempo aggressively caches trace data locally to reduce S3 calls. If this cache fills up too quickly or is too small, it leads to constant S3 reads.

  • Diagnosis: Monitor Tempo’s memory usage and watch for a rising rate of cache misses (e.g. a tempo_cache_misses_total counter, if your version exposes one). Check the cache size setting in Tempo’s configuration.
  • Fix: Increase the cache size. For example, if it’s set to 100MB, try 512MB or 1GB. Exact key names vary by Tempo version, so verify the snippet below against your release’s configuration reference.
    cache:
      size: 512MB # Or 1GB, depending on available RAM
    
  • Why it works: A larger cache allows Tempo to hold more recently accessed trace data in memory, significantly reducing the need to fetch it from S3 for repeated queries.
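
To judge whether the cache is undersized, compute the hit ratio from the hit and miss counters your metrics system scrapes. A minimal sketch; the counter values here are illustrative, not real Tempo metric names:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0

# A sustained ratio well below ~0.8 on repeated queries suggests the cache
# is evicting data before it can be reused -> consider a larger size.
print(cache_hit_ratio(hits=800, misses=200))  # 0.8
```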

2. Inefficient Object Store Connection Pooling: Tempo uses HTTP connections to interact with object stores. If the connection pool is too small or misconfigured, it can lead to connection setup overhead and delays.

  • Diagnosis: Monitor Tempo’s network activity and object store metrics. Look for high latency on object store requests. Check the connection pool setting in Tempo’s configuration (shown below as object_store.s3.connection_pool_size; verify the exact key path against your version’s reference).
  • Fix: Increase the connection pool size. A common starting point is 100.
    object_store:
      s3:
        connection_pool_size: 100
    
  • Why it works: A larger connection pool allows Tempo to reuse existing connections to the object store, avoiding the overhead of establishing new connections for each request and improving throughput.
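
Tempo’s real pooling lives inside its storage client, but the effect can be sketched with a toy pool that tracks how many connections it actually opens:

```python
import queue

class ConnectionPool:
    """Toy pool illustrating why reuse beats per-request connection setup."""
    def __init__(self, size: int, connect):
        self._pool = queue.Queue(maxsize=size)
        self._connect = connect
        self.connects = 0          # how many real connections we opened
        for _ in range(size):
            self._pool.put(None)   # empty slots; connect lazily

    def acquire(self):
        conn = self._pool.get()
        if conn is None:           # slot never used: pay setup cost once
            conn = self._connect()
            self.connects += 1
        return conn

    def release(self, conn):
        self._pool.put(conn)       # return for reuse instead of closing

pool = ConnectionPool(size=2, connect=lambda: object())
for _ in range(100):               # 100 sequential requests...
    c = pool.acquire()
    pool.release(c)
print(pool.connects)               # 2 -- the pool's slots served all 100
```

Without the pool, each of the 100 requests would pay TCP and TLS setup; with it, setup cost is bounded by the pool size.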

3. High Cardinality Tagging in Traces: While useful, very high cardinality tags (e.g., unique user IDs in every trace) can explode the number of index entries Tempo needs to scan, making queries slow.

  • Diagnosis: Analyze your traces. Identify tags that have an extremely large number of unique values. Use Tempo’s query explorer to see how many series are scanned for typical queries.
  • Fix: Reduce the cardinality of your tags. This might involve sampling, using less granular tags (e.g. a user tier instead of a raw user ID), or configuring Tempo to skip indexing specific high-cardinality tags. If you use the tempo-distributed deployment, ensure your ingester and querier configurations reflect these decisions consistently; the exact settings for tag filtering vary by Tempo version, so consult your release’s configuration reference rather than copying keys from another setup.
  • Why it works: Fewer unique index entries mean less data to sift through when a query needs to find traces matching specific criteria, dramatically speeding up retrieval.
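
To find the offending tags, you can count distinct values per tag key over a sample of spans. A sketch, assuming spans are available as dicts with a "tags" mapping (the shape is illustrative, not Tempo’s internal representation):

```python
from collections import defaultdict

def tag_cardinality(spans):
    """Count distinct values per tag key across a sample of spans."""
    values = defaultdict(set)
    for span in spans:
        for key, value in span.get("tags", {}).items():
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}

sample = [
    {"tags": {"service": "checkout", "user_id": "u1"}},
    {"tags": {"service": "checkout", "user_id": "u2"}},
    {"tags": {"service": "billing",  "user_id": "u3"}},
]
# user_id's count grows with traffic, making it a candidate to drop
# from indexing or replace with a coarser tag.
print(tag_cardinality(sample))  # {'service': 2, 'user_id': 3}
```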

4. Inadequate Object Store Performance (S3 Throttling/Latency): The object store itself might be the bottleneck. S3 can throttle requests if you hit certain API limits, or network latency to S3 can be high.

  • Diagnosis: Check S3 metrics for elevated 4xx and 5xx error rates; throttling surfaces as 503 Slow Down responses. Monitor network latency from your Tempo pods to the S3 endpoint.
  • Fix:
    • For Throttling: S3 request limits apply per key prefix, so spread objects across more prefixes, retry throttled requests with exponential backoff, and optimize your access patterns; for sustained throttling, open a case with AWS support.
    • For Latency: Deploy Tempo instances in the same AWS region as your S3 bucket. Use S3 Transfer Acceleration if cross-region access is unavoidable, though this adds cost.
  • Why it works: Directly addresses the underlying infrastructure performance, ensuring Tempo can read data from its storage quickly and without being rate-limited.
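
The standard mitigation for 503 Slow Down responses is retrying with exponential backoff and full jitter, so a fleet of clients does not hammer S3 in lockstep. A sketch of the delay schedule (base and cap values are illustrative):

```python
import random

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0):
    """Exponential backoff with full jitter: each retry waits a random
    time in [0, min(cap, base * 2^attempt)] seconds."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))  # jitter spreads retries out
    return delays

print([round(d, 3) for d in backoff_delays(5)])
```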

5. Incorrect Object Store Configuration (Region, Endpoint): A simple but common mistake is having Tempo configured to talk to the wrong S3 region or an incorrect S3 endpoint, leading to increased latency or outright connection failures.

  • Diagnosis: Verify the region and endpoint settings in your Tempo object_store configuration against your actual S3 bucket’s location.
  • Fix: Correct the region and endpoint in your Tempo configuration to match your S3 bucket.
    object_store:
      s3:
        region: us-east-1 # Example: Ensure this matches your bucket's region
        endpoint: s3.amazonaws.com # Or your custom endpoint if applicable
    
  • Why it works: Ensures Tempo is communicating with the object store over the shortest, most direct network path, minimizing latency.
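
A quick sanity check is to verify that the configured endpoint agrees with the configured region. A heuristic sketch that covers only AWS public endpoints (custom or private endpoints are out of scope):

```python
def endpoint_matches_region(endpoint: str, region: str) -> bool:
    """Heuristic: the global endpoint s3.amazonaws.com routes to us-east-1;
    regional endpoints look like s3.<region>.amazonaws.com."""
    if endpoint == "s3.amazonaws.com":
        return region == "us-east-1"
    return endpoint == f"s3.{region}.amazonaws.com"

# A mismatch means every request crosses regions, adding latency.
print(endpoint_matches_region("s3.eu-west-1.amazonaws.com", "us-east-1"))  # False
```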

6. Tempo Version or Build Issues: Older versions of Tempo might have known performance regressions or lack optimizations that have been introduced in newer releases.

  • Diagnosis: Check the version of Tempo you are running. Review the Tempo release notes for performance-related bug fixes or improvements.
  • Fix: Upgrade to the latest stable version of Grafana Tempo.
    # Example using Helm to upgrade
    helm upgrade tempo grafana/tempo -n tempo --version <latest-version>
    
  • Why it works: Newer versions often include performance enhancements, bug fixes, and optimizations that can directly resolve slow query issues.
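
If you automate the version check, a simple semantic-version comparison is enough. The version strings below are illustrative, not real release lookups:

```python
def parse_version(v: str) -> tuple:
    """'2.4.1' -> (2, 4, 1); tuples compare component-by-component."""
    return tuple(int(part) for part in v.lstrip("v").split("."))

def needs_upgrade(running: str, latest: str) -> bool:
    return parse_version(running) < parse_version(latest)

print(needs_upgrade("2.1.0", "2.4.1"))  # True
```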

Once all of these are fixed, the next problem you’ll likely encounter moves to the ingestion side: a surge in Prometheus alerts on Tempo’s ingester metrics, indicating your ingestion rate may be pushing the limits of your configured storage or network.

Want structured learning?

Take the full Tempo course →