Tempo is dropping traces because they’re too big, and you’re missing data.
Here’s why that happens and how to fix it.
Tempo enforces a limit on the size of a single trace it will accept. When a trace exceeds this limit, Tempo discards the data, and the trace goes missing or arrives incomplete. The limit exists because Tempo batches and processes traces internally and must protect downstream systems and its own storage from unbounded traces. It is configurable: recent versions expose a per-tenant `max_bytes_per_trace` setting in the `overrides` block, with a default of roughly 5MB (the exact default varies by version). Raising it buys headroom, but oversized traces usually point to an instrumentation problem worth fixing at the source.
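If you do decide to raise the cap, it lives in Tempo's per-tenant overrides. A sketch, assuming a recent Tempo version; the value is illustrative, and newer releases nest the setting differently, so verify against your version's documentation:

```yaml
# Tempo configuration sketch — legacy flat overrides format.
# Newer versions (2.3+) nest this under overrides.defaults.global instead.
overrides:
  max_bytes_per_trace: 10000000  # bytes (~10MB); illustrative, not a recommendation
```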
The most common culprit is a single, long-running request that generates an enormous number of spans. Think of a deeply nested microservice call chain, or a single operation that’s highly chatty with many intermediate steps.
Common Causes and Fixes
- Excessive Spans from a Single Operation:
  - Diagnosis: Use `tempo-cli query --trace-id <trace-id>` to inspect traces. If you suspect a particular service or operation, query Tempo for traces originating from that service and check their sizes. Alternatively, if you have Prometheus metrics enabled for Tempo, look for `tempo_distributor_trace_size_bytes` and `tempo_distributor_too_large_traces_total`; high values for the latter indicate the problem.
  - Fix: This is a code-level problem.
    - Reduce Span Granularity: Instrument your code to create fewer, more meaningful spans. Instead of spanning every tiny operation within a function, span the function call itself or key logical steps.
    - Batching/Aggregation: If a single operation inherently produces many small events, consider whether some of them can be batched or aggregated into a single span at the parent level.
    - Sampling: Implement client-side or server-side sampling to send only a fraction of traces, especially from noisy operations.
  - Why it works: Fewer spans mean less data to serialize and transmit, keeping the trace size within limits.
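The sampling idea in particular fits in a few lines. This is a simplified, pure-Python illustration of deterministic head sampling — the same idea behind OpenTelemetry's `TraceIdRatioBased` sampler, though not its exact upstream algorithm: every service hashes the trace ID to the same keep/drop decision, so a trace is never half-sampled across services.

```python
import hashlib

def should_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-sampling decision keyed on the trace ID.

    Every service that sees the same trace ID reaches the same
    keep/drop verdict, so sampled traces stay complete.
    """
    # Hash the trace ID into a number in [0, 1) and compare to the ratio.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio

# Keep roughly 10% of traces from a noisy operation.
kept = sum(should_sample(f"{i:032x}", 0.10) for i in range(10_000))
```

Because the decision is a pure function of the trace ID, it can run in every service independently without any coordination.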
- Large Span Attributes/Tags:
  - Diagnosis: Examine the spans within a large trace (if you can retrieve a partial one) for extremely large string attributes: lengthy JSON payloads, stack traces embedded as tags, or large blobs of contextual data.
  - Fix:
    - Sanitize Attributes: Remove overly verbose or redundant attributes. Avoid embedding entire request/response bodies as tags.
    - Use Events/Logs: For large payloads or detailed event information, use span events (e.g., OpenTelemetry's `add_event` API) instead of span attributes. Events are designed to hold more data without inflating the core span size as dramatically.
    - Compression (Client-side): If you have control over the tracing SDK, explore whether it supports any form of attribute compression before sending.
  - Why it works: Reducing the size of individual span data points prevents the cumulative trace size from exceeding the limit.
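One way to apply the "sanitize attributes" advice mechanically is a small scrubbing helper run before attributes are attached to a span. A minimal sketch; the deny-listed key names and the 256-character cap are assumptions for illustration, not Tempo or OpenTelemetry defaults:

```python
MAX_ATTR_LEN = 256  # assumed cap — tune to your own size budget

# Hypothetical deny-list of keys that tend to carry huge payloads.
DROPPED_KEYS = {"http.request.body", "http.response.body", "stack_trace"}

def sanitize_attributes(attrs: dict) -> dict:
    """Drop known-noisy keys and truncate oversized string values
    before they are attached to a span."""
    clean = {}
    for key, value in attrs.items():
        if key in DROPPED_KEYS:
            continue  # never ship full bodies or stack traces as tags
        if isinstance(value, str) and len(value) > MAX_ATTR_LEN:
            value = value[:MAX_ATTR_LEN] + "...[truncated]"
        clean[key] = value
    return clean
```

Running this centrally (e.g., in a span processor) keeps the policy in one place instead of scattered across instrumentation sites.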
- Undesired `SpanKind` or Default Instrumentation:
  - Diagnosis: Some tracing libraries or frameworks instrument operations you don't need to trace at a granular level, especially if they are very frequent. Look at the types of spans being generated.
  - Fix:
    - Configure Tracing SDKs: Explicitly configure your tracing SDKs to exclude certain methods, libraries, or `SpanKind`s (like `SERVER` or `CLIENT` spans for very routine internal calls) that are generating excessive, low-value spans.
    - Custom Instrumentation: For critical or high-volume paths, write custom instrumentation that is more selective about what it spans.
  - Why it works: By turning off tracing for less important, high-frequency operations, you reduce the overall number of spans generated.
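The "be selective" approach can be sketched as a tiny wrapper that only opens spans for an allowlisted set of operations. Everything here — `start_span`, the operation names, the allowlist — is a stand-in for your real SDK's API, not an actual library call:

```python
import functools
from contextlib import contextmanager

RECORDED_SPANS = []  # stand-in for a real tracing backend

@contextmanager
def start_span(name):
    # Placeholder for your SDK's span-creation call.
    RECORDED_SPANS.append(name)
    yield

TRACED_OPERATIONS = {"checkout", "payment"}  # hypothetical allowlist

def maybe_traced(name):
    """Wrap only high-value operations in spans; routine internal
    calls run untraced and add nothing to the trace."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if name not in TRACED_OPERATIONS:
                return fn(*args, **kwargs)
            with start_span(name):
                return fn(*args, **kwargs)
        return wrapper
    return decorator

@maybe_traced("cache_lookup")   # high-frequency, low value: never spanned
def cache_lookup(key):
    return key

@maybe_traced("checkout")       # business-critical: always spanned
def checkout(order_id):
    return order_id
```

Many SDKs support the same effect declaratively (exclusion lists, disabled instrumentations), which is preferable when available.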
- Network Latency/Retries Amplifying Span Counts:
  - Diagnosis: In distributed systems, transient network issues or retries can cause a single logical operation to appear as multiple distinct spans, especially if the client creates new spans for each retry attempt.
  - Fix:
    - Client-side Retry Logic: Ensure your tracing instrumentation correctly associates retries with a single logical operation. Some SDKs have built-in support for this; if not, you may need to manually manage span contexts across retries.
    - Server-side Span Creation: On the receiving end, ensure a single request doesn't trigger a cascade of new spans for every small internal step if it's already handled by a parent span.
  - Why it works: Consolidating retry attempts under a single span context prevents the trace from growing disproportionately due to transient failures.
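Consolidated retries can look like the following sketch: one logical span covers all attempts, and each failure is recorded as an event on that span instead of opening a fresh span per try. `record_event` is a stand-in for your SDK's `add_event`-style call:

```python
def call_with_retries(fn, attempts=3, record_event=lambda msg: None):
    """Run every retry attempt inside the same logical span.

    Failures become span events rather than fresh spans, so transient
    errors don't multiply the span count of the trace.
    """
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            record_event(f"retry attempt={attempt} error={exc}")
            # backoff/sleep elided for brevity
    raise last_exc
```

In a real SDK you would open the span once around `call_with_retries` and pass `span.add_event` in as the callback.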
- Improper `TraceID` Propagation:
  - Diagnosis: If `TraceID`s are not correctly propagated between services, each service might start a new trace or create a new root span. This can lead to many disconnected, small traces or, in rare cases, a very large, fragmented trace if one service incorrectly collects spans from different logical operations under a single `TraceID`.
  - Fix:
    - Standardize Propagation: Ensure all services use a consistent tracing context propagation mechanism (e.g., W3C Trace Context headers or B3 headers).
    - Verify SDK Configuration: Double-check the tracing SDK configuration in each service to confirm it receives and forwards the correct propagation headers.
  - Why it works: Correct propagation ensures all spans belonging to a single logical operation share the same `TraceID`, preventing fragmentation and ensuring accurate trace assembly.
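The W3C Trace Context mechanism mentioned above boils down to a single header. A minimal sketch of building and parsing a version-00 `traceparent` header; real SDK propagators also handle `tracestate`, validation, and unknown versions:

```python
def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C Trace Context `traceparent` header, version 00:
    {version}-{trace-id: 32 hex}-{parent-id: 16 hex}-{trace-flags}."""
    assert len(trace_id) == 32 and len(span_id) == 16
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    """Return (trace_id, span_id, sampled) from a version-00 header."""
    version, trace_id, span_id, flags = header.split("-")
    return trace_id, span_id, flags == "01"
```

When every service forwards this header unchanged (apart from swapping in its own span ID as the parent), all spans assemble under one `TraceID`.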
- Tempo Distributor Configuration (Less Common for Size Limit):
  - Diagnosis: Beyond the per-tenant trace size limit itself, the `distributor` component has internal buffers and processing limits. If your Tempo instance is under extreme load or has very little memory/CPU, the distributor may struggle to process large traces efficiently, potentially dropping data, though this usually manifests as other errors first.
  - Fix:
    - Scale Tempo Components: Ensure your Tempo distributor, ingester, and querier components are adequately resourced (CPU, memory) and scaled for your traffic volume.
    - Check Tempo Logs: Look for distributor log errors such as "out of memory," "context canceled," or "request too large" (even if not directly about trace size).
  - Why it works: Adequate resources let Tempo's internal mechanisms handle incoming trace data, including larger traces, without failing.
The telltale signs are a surge in the `tempo_distributor_too_large_traces_total` metric or generic "dropped trace" entries in the distributor logs, together with a corresponding absence of traces in Grafana for the affected operations.
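To catch this before users notice missing traces, you can alert on that metric. A sketch of a Prometheus alerting rule, assuming the metric name discussed above is exported by your Tempo version:

```yaml
# Prometheus alerting rule sketch — verify the metric name against
# the metrics your Tempo version actually exports.
groups:
  - name: tempo-trace-size
    rules:
      - alert: TempoDroppingOversizedTraces
        expr: increase(tempo_distributor_too_large_traces_total[5m]) > 0
        for: 10m
        annotations:
          summary: "Tempo is rejecting traces that exceed the size limit"
```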