Grafana Tempo’s Write-Ahead Log (WAL) is not just a durability safety net; under high-throughput write workloads it can become a performance bottleneck in its own right.
Let’s see Tempo ingest some traces and watch its WAL in action. Imagine we’re sending traces from a busy microservice.
```shell
# Send a single OTLP/JSON trace to Tempo.
# This assumes Tempo's OTLP HTTP receiver is enabled on the default port 4318;
# in a real scenario your application's OpenTelemetry SDK generates this.
curl -X POST \
  http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [
      {
        "resource": {
          "attributes": [
            {"key": "service.name", "value": {"stringValue": "my-awesome-service"}}
          ]
        },
        "scopeSpans": [
          {
            "scope": {"name": "opentelemetry-java", "version": "1.0.0"},
            "spans": [
              {
                "traceId": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
                "spanId": "0102030405060708",
                "name": "process_request",
                "startTimeUnixNano": "1678886400000000000",
                "endTimeUnixNano": "1678886400100000000",
                "kind": "SPAN_KIND_SERVER"
              }
            ]
          }
        ]
      }
    ]
  }'
```
As this data hits the Tempo ingester, it doesn’t immediately write to its object storage. First, it writes to the WAL. This is a sequence of append-only files on disk. Think of it as a logbook where every operation is recorded before it’s finalized. This ensures that even if Tempo crashes mid-write, no data is lost because it’s already safely in the WAL. Once the data is in the WAL, Tempo acknowledges the write to the client. Later, background goroutines will read from the WAL and asynchronously push the data to object storage (like S3, GCS, or MinIO).
The problem is that the WAL lives on local disk and is append-only. For high-throughput ingestion it can become a bottleneck: the ingester must write every trace to the WAL and later read that data back to flush it to object storage. If the incoming write rate exceeds the rate at which data is persisted to object storage and cleaned out of the WAL, the WAL directory fills up and ingestion starts failing.
Tempo manages the WAL itself, in its tempodb package; no external WAL library is involved. Incoming trace data is appended to a head block in the WAL directory. When the head block reaches a configured size or age, it is cut and completed, and a background process flushes the completed block to object storage. Once a block has been successfully uploaded and verified, its local files become eligible for deletion.
The most common reason for WAL issues is simply an undersized WAL volume. When the directory fills up, the ingester starts rejecting new writes with disk-full errors ("no space left on device"). The WAL path is set by storage.trace.wal.path, commonly /var/tempo/wal (older examples use /tmp/tempo/wal, a particularly bad choice since /tmp is often small and volatile). If you’re ingesting a lot of data, a small volume can fill up quickly.
Diagnosis: Check the disk space usage of your Tempo WAL directory.

```shell
df -h /var/tempo/wal   # adjust the path if yours differs
```
Fix: Increase the size of the WAL directory or its underlying filesystem. A common recommendation for busy clusters is to keep at least 100GB free for the WAL, ideally on a dedicated, fast SSD. Configuration:

```yaml
storage:
  trace:
    wal:
      path: /var/tempo/wal  # ensure this path is on a volume with ample space
```

Why it works: More headroom lets WAL blocks accumulate during temporary traffic bursts or slower-than-expected object storage writes without the directory filling up.
Another frequent culprit is how often WAL blocks are cut. In the ingester, max_block_duration and max_block_bytes control when the current head block is completed and handed off for flushing. If blocks are cut too rarely relative to your write throughput, large amounts of unflushed data sit in the WAL; if they are cut too often, you generate many tiny blocks and extra flush operations.
Diagnosis: Review ingester.max_block_duration and ingester.max_block_bytes in your Tempo configuration.
Fix: For high-throughput environments, cut blocks at a steady cadence; a shorter max_block_duration keeps each flush small and the WAL draining continuously.
Configuration:

```yaml
ingester:
  max_block_duration: 30m      # cut the head block at least this often
  max_block_bytes: 524288000   # ...or once it reaches ~500MB
```

Why it works: Cutting and flushing blocks at a steady cadence means the WAL holds less unflushed data at any moment, turning occasional large flush bursts into a smooth stream of smaller uploads.
How long blocks linger on local disk also matters. The ingester does not delete a completed block until it has been successfully flushed to object storage, and failed flushes are retried, so a backend outage does not by itself cause data loss. After a successful flush, complete_block_timeout controls how much longer the block is kept locally, where it can still serve queries for very recent traces.
Diagnosis: Check ingester.complete_block_timeout in your Tempo configuration.
Fix: If the WAL volume is filling up with already-flushed blocks, lower complete_block_timeout; keep it at a moderate value (the default is on the order of 15 minutes) if queriers still need recent data from ingesters.
Configuration:

```yaml
ingester:
  complete_block_timeout: 15m  # how long flushed blocks are retained locally
```

Why it works: Once object storage durably holds a block, the local copy is only needed for querying very recent traces; trimming that retention frees WAL-volume space sooner, while the retry-until-flushed behavior still protects unflushed data.
Object storage performance itself is a major factor. If your object storage (S3, GCS, etc.) has high latency or low throughput, the process of uploading WAL segments will slow down. This causes WAL segments to accumulate on disk faster than they can be cleared.
Diagnosis: Monitor your object storage’s latency and throughput metrics. Look for increased PutObject latency or reduced write throughput.
Fix: Optimize your object storage. This might involve choosing a different storage class, ensuring network connectivity is optimal, or scaling up your object storage service.
Why it works: Faster and more reliable writes to object storage directly translate to quicker WAL segment processing and deletion, preventing the WAL disk from filling up.
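If you scrape Tempo’s metrics with Prometheus, backend latency can be watched directly. The metric name below (tempodb_backend_request_duration_seconds) is my assumption for a recent Tempo version, so verify it against your instance’s /metrics endpoint before alerting on it:

```promql
# p99 latency of Tempo's requests to the object storage backend,
# broken out by operation
histogram_quantile(
  0.99,
  sum by (le, operation) (
    rate(tempodb_backend_request_duration_seconds_bucket[5m])
  )
)
```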
The number of ingester replicas also plays a role. If you have too few ingester replicas for your write load, each replica might be overwhelmed, leading to a backlog in their local WAL.
Diagnosis: Monitor the write throughput per ingester replica and compare it against what a single replica can sustain. Flush metrics such as tempo_ingester_blocks_flushed_total and tempo_ingester_failed_flushes_total (names may vary by version) show whether individual replicas are falling behind.
Fix: Increase the number of ingester replicas.
Configuration: Adjust replicas in your Kubernetes deployment or equivalent.
Why it works: Distributing the write load across more ingester instances means each instance handles less data, reducing the pressure on its local WAL and improving overall ingestion capacity.
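On Kubernetes, ingesters typically run as a StatefulSet so each replica keeps its own WAL volume; scaling out means raising replicas and making sure the volume claim is sized for the per-replica load. A sketch (all names, sizes, and the storage class are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tempo-ingester
spec:
  replicas: 5                        # scaled up to spread the write load
  # ... selector, template, etc. elided ...
  volumeClaimTemplates:
    - metadata:
        name: wal
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd   # illustrative class name
        resources:
          requests:
            storage: 100Gi           # per-replica WAL headroom
```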
Finally, watch how long block flushes to object storage take. If flush durations consistently climb, or tempo_ingester_failed_flushes_total keeps growing, that points to object storage performance or the network path between Tempo and the backend.
Diagnosis: Graph the ingester flush metrics in Grafana and alert on sustained growth in failed flushes.
Fix: Investigate object storage performance and network connectivity. This might involve looking at cloud provider metrics for your object storage or using network troubleshooting tools.
Why it works: Addressing the root cause of slow flushes directly drains the backlog and prevents WAL disk exhaustion.
Once the WAL itself is healthy, the next errors you’ll likely see come from the distributor failing to push data to ingesters that are applying backpressure.