Tempo’s hash ring is how it distributes traces across all your Tempo instances, ensuring no single node gets overloaded and that traces can be found even if some nodes go down.
Let’s watch it in action. Imagine you have three Tempo instances: tempo-01, tempo-02, and tempo-03. When a trace comes in, Tempo needs to decide which instance should store it. It doesn’t just pick one randomly. Instead, it uses a consistent hashing algorithm.
Here’s a simplified look at what happens when a trace ID a1b2c3d4e5f6 arrives at tempo-01. Tempo calculates a hash of this trace ID. This hash value falls somewhere on a virtual ring. The ring is divided into segments, and each Tempo instance "owns" a certain number of these segments. Let’s say, for this specific hash, tempo-02 owns the segment where a1b2c3d4e5f6 lands. So, tempo-01 forwards this trace data to tempo-02.
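The lookup described above can be sketched in a few lines. This is a toy model in Python, not Tempo’s actual implementation (Tempo is written in Go, and its real hash function and token counts differ): each instance claims several virtual tokens on a 32-bit ring, and a key is owned by the first token clockwise from its hash.

```python
import bisect
import hashlib

def hash_key(s: str) -> int:
    """Map a string (e.g. a trace ID) to a position on a 32-bit ring.
    MD5 is an illustrative choice, not Tempo's actual hash function."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class Ring:
    """Toy consistent-hash ring: each instance owns several virtual
    tokens so its ownership is spread around the ring."""
    def __init__(self, instances, tokens_per_instance=32):
        self.tokens = []  # sorted (position, instance) pairs
        for inst in instances:
            for i in range(tokens_per_instance):
                self.tokens.append((hash_key(f"{inst}-{i}"), inst))
        self.tokens.sort()

    def lookup(self, key):
        """The first token clockwise from the key's hash owns the key."""
        h = hash_key(key)
        idx = bisect.bisect_left(self.tokens, (h, ""))
        if idx == len(self.tokens):  # wrap around past the last token
            idx = 0
        return self.tokens[idx][1]

ring = Ring(["tempo-01", "tempo-02", "tempo-03"])
owner = ring.lookup("a1b2c3d4e5f6")  # one of the three instances
```

Because the hash is deterministic, every instance that holds the same ring state computes the same owner for a given trace ID, which is what lets tempo-01 know to forward the trace.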
The real magic comes with replication. Tempo doesn’t just send the trace to one instance; it sends it to a configurable number of instances. If you’ve set replication_factor: 3, Tempo ensures that trace ID a1b2c3d4e5f6 is stored on three different instances. To pick them, the writer walks clockwise around the ring from the trace’s hash position, collecting the first three distinct instances it encounters. If that replica set works out to tempo-02, tempo-03, tempo-01 for this trace, the trace data is written to all three.
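Extending the toy ring, selecting a replica set is just the same clockwise walk continued until enough distinct instances have been collected (again a sketch, not Tempo’s real code):

```python
import bisect
import hashlib

def hash_key(s: str) -> int:
    """Illustrative 32-bit ring hash (not Tempo's actual function)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def build_tokens(instances, tokens_per_instance=32):
    """One sorted list of (position, instance) virtual tokens."""
    return sorted((hash_key(f"{inst}-{i}"), inst)
                  for inst in instances
                  for i in range(tokens_per_instance))

def replica_set(tokens, key, replication_factor=3):
    """Walk clockwise from the key's hash, collecting tokens until we
    have `replication_factor` *distinct* instances."""
    h = hash_key(key)
    start = bisect.bisect_left(tokens, (h, ""))
    replicas = []
    for step in range(len(tokens)):
        inst = tokens[(start + step) % len(tokens)][1]
        if inst not in replicas:
            replicas.append(inst)
        if len(replicas) == replication_factor:
            break
    return replicas

tokens = build_tokens(["tempo-01", "tempo-02", "tempo-03"])
print(replica_set(tokens, "a1b2c3d4e5f6"))
```

With three instances and replication_factor: 3 the set necessarily contains all three instances; the walk only matters once you have more instances than replicas.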
This replication is crucial for two main reasons: availability and consistency. If tempo-02 goes down, you can still retrieve the trace because tempo-03 and tempo-01 also have a copy. Consistency means that all replicas eventually agree on the data. Tempo uses a gossip protocol for the hash ring itself, allowing nodes to share information about who owns which part of the ring. When a new node joins or an existing node leaves, this information propagates, and the ring rebalances automatically.
The configuration for the hash ring is typically found within your Tempo configuration file, often under a distributor or ingester section (depending on your deployment and what component is managing the hash ring for data placement).
distributor:
  # ... other distributor settings
  hash_ring:
    instance_addr: "tempo-01:8080"  # The address other instances use to reach this one
    listen_addr: "0.0.0.0:8080"     # Address the hash ring gossip listens on
    # Number of replicas for each trace. A factor of 3 is common for high availability.
    replication_factor: 3
    # How often nodes gossip their ring state to others.
    gossip_interval: "1s"
    # How long an instance can go without heartbeating its ring state
    # before the other nodes consider it unhealthy and stop routing to it.
    heartbeat_timeout: "1m"
    # How many instances to consider for a hash ring "partition".
    # Lower values can lead to more uneven distribution but faster rebalancing.
    # Higher values lead to more even distribution but slower rebalancing.
    partition_size: 27
The replication_factor is the most direct lever you have for controlling redundancy. A higher factor means more storage and network overhead but better resilience. instance_addr is critical; it’s how other Tempo instances will find and communicate with this specific instance for hash ring purposes. If this is wrong, nodes won’t see each other, and the ring will break.
The heartbeat_timeout is a subtle but important setting. It dictates how long an instance can go without heartbeating its ring state before the other nodes consider it unhealthy and stop routing writes to it. If it’s too short, a healthy instance that pauses briefly (a GC pause, a network blip) will flap in and out of the ring, churning ownership for no reason. If it’s too long, the distributor keeps sending traces to an instance that is actually down, and those writes fail until the timeout finally expires. Write success itself is governed separately by quorum: with replication_factor: 3, the distributor considers a write successful once a majority of the replica set acknowledges it, so one slow or failed replica doesn’t block ingestion.
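The quorum arithmetic is worth making explicit. Assuming majority-quorum writes (the behavior in Grafana’s ring-based systems; verify against your Tempo version), the required acknowledgment count is:

```python
def write_quorum(replication_factor: int) -> int:
    """Minimum acknowledgments for a successful write:
    a strict majority of the replica set."""
    return replication_factor // 2 + 1

# With replication_factor: 3, a write succeeds with 2 acks,
# so one slow or failed replica is tolerated.
assert write_quorum(3) == 2
```

This is also why even replication factors buy little: replication_factor: 4 still tolerates only one failed replica (quorum of 3), at the cost of an extra copy.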
The actual mechanism for ensuring consistency across replicas, beyond the initial write acknowledgment, relies on the fact that all ingesters will eventually receive the same trace ID. If an ingester receives a trace ID that it believes should be owned by another ingester (based on the current ring state), it will forward it. This forwarding and eventual reconciliation ensure that all replicas for a given trace ID end up with the same data. The hash ring’s dynamic nature, with nodes joining and leaving, means that ownership of trace IDs shifts over time. When a node leaves, its responsibilities are redistributed to other nodes on the ring. This redistribution is what allows Tempo to remain available even during partial failures.
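The key property of consistent hashing during that redistribution is that only the departed node’s keys move; everything else keeps its owner. The toy ring from earlier demonstrates this (illustrative hash and token counts, not Tempo’s real values):

```python
import bisect
import hashlib

def hash_key(s: str) -> int:
    """Illustrative 32-bit ring hash (not Tempo's actual function)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def build_tokens(instances, tokens_per_instance=32):
    return sorted((hash_key(f"{inst}-{i}"), inst)
                  for inst in instances
                  for i in range(tokens_per_instance))

def owner(tokens, key):
    """First token clockwise from the key's hash owns the key."""
    h = hash_key(key)
    idx = bisect.bisect_left(tokens, (h, ""))
    return tokens[idx % len(tokens)][1]  # mod handles the wrap-around

keys = [f"trace-{n}" for n in range(1000)]
before = build_tokens(["tempo-01", "tempo-02", "tempo-03"])
after = build_tokens(["tempo-01", "tempo-03"])  # tempo-02 leaves

moved = sum(owner(before, k) != owner(after, k) for k in keys)
was_on_02 = sum(owner(before, k) == "tempo-02" for k in keys)
print(moved == was_on_02)  # True: only tempo-02's keys change owner
```

Removing tempo-02’s tokens means each of its keys simply slides clockwise to the next surviving instance, while keys owned by tempo-01 and tempo-03 never re-hash. That minimal-movement property is what makes the rebalancing cheap enough to do automatically.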
If you’re seeing issues with traces disappearing or not being found, double-check that instance_addr is correctly set on all your Tempo instances and that they can reach each other on the listen_addr specified for gossip. A common mistake is having internal cluster IPs or hostnames that aren’t resolvable or reachable between nodes.
The next challenge you’ll likely encounter is understanding how the "read path" uses the hash ring to locate traces across your distributed system.