Consistent hashing is a way to distribute data across a set of nodes such that when nodes are added or removed, the amount of data that needs to be moved is minimized.
Let’s see how this plays out in a real-world scenario. Imagine you have a distributed cache, like Memcached or Redis, serving requests from your web application. Without consistent hashing, if you add a new cache server, you’d have to re-map almost all your keys to the new server, potentially overwhelming your system with data migration. With consistent hashing, only a small fraction of keys need to be moved.
Here’s a simplified conceptual view of how it works. Instead of a simple modulo operation (hash(key) % num_servers) to determine which server a key belongs to, consistent hashing maps both keys and servers onto a conceptual ring.
        +--- Server A (10) ---+
        |                     |
  Server D (70)         Server B (30)
        |                     |
        +--- Server C (50) ---+
In this ring, we have servers A, B, C, and D. Each server is placed at one or more points on the ring based on its hash value. Keys are also hashed and placed on the ring. To find the server for a given key, you hash the key, find its position on the ring, and then move clockwise until you hit a server.
Let’s say we have keys with the following hash values:
- key1: 15
- key2: 40
- key3: 60
- key4: 85

Moving clockwise from each key's position:

- key1 (15) maps to Server B (30).
- key2 (40) maps to Server C (50).
- key3 (60) maps to Server D (70).
- key4 (85) wraps around past the largest position and maps to Server A (10).
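The clockwise rule can be sketched with a sorted list and binary search. This is a minimal illustration using the toy positions from the example, not a production implementation:

```python
import bisect

def lookup(ring, key_hash):
    """Find the owner of key_hash: the first server at or after it
    moving clockwise, wrapping past the largest position to the first."""
    positions = sorted(ring)
    idx = bisect.bisect_left(positions, key_hash)
    if idx == len(positions):  # past the last server: wrap to the start
        idx = 0
    return ring[positions[idx]]

ring = {10: "Server A", 30: "Server B", 50: "Server C", 70: "Server D"}
print(lookup(ring, 15))  # Server B
print(lookup(ring, 40))  # Server C
print(lookup(ring, 60))  # Server D
print(lookup(ring, 85))  # Server A (wraps around)
```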
Now, what happens if we add Server E at position 45?
        +--- Server A (10) ---+
        |                     |
  Server D (70)         Server B (30)
        |                     |
        |               Server E (45)
        |                     |
        +--- Server C (50) ---+
- key1 (15) still maps to Server B.
- key2 (40) now maps to Server E (45). This is the only key that had to move; it previously belonged to Server C.
- key3 (60) still maps to Server D.
- key4 (85) still maps to Server A.
This is the core benefit: minimal data reshuffling.
To make this more robust and avoid hot spots, consistent hashing typically uses "virtual nodes." Instead of a server having one point on the ring, it has many. For example, Server A might have virtual nodes at positions 10, 110, and 210 (if the ring goes from 0 to 360). This distributes the load more evenly.
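One common way to place virtual nodes is to hash the server name concatenated with a replica index. This is an illustrative sketch; the `"#"` separator, the SHA-1 choice, and the tiny ring size of 360 (matching the text's example; real systems use the full hash space) are assumptions, not a standard:

```python
import hashlib

def ring_positions(server, num_vnodes, ring_size=360):
    """Derive several ring positions for one physical server by hashing
    'name#replica' strings into [0, ring_size)."""
    positions = []
    for i in range(num_vnodes):
        digest = hashlib.sha1(f"{server}#{i}".encode()).hexdigest()
        positions.append(int(digest, 16) % ring_size)
    return positions

print(ring_positions("Server A", 3))
```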
The actual implementation involves choosing a good hash function (like MurmurHash3 or SHA-1) and a data structure to efficiently find the next server on the ring. A sorted list or a balanced binary search tree of server hash values is common. When a key is hashed, you search this structure for the smallest server hash value greater than or equal to the key’s hash value. If no such server hash is found (meaning the key’s hash is larger than all server hashes), it wraps around to the first server on the ring.
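Putting those pieces together, here is a compact sketch of such an implementation: a sorted list of virtual-node hashes searched with binary search, using SHA-1. The class and method names are illustrative, and error handling (empty ring, duplicate servers) is omitted for brevity:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Ring positions are SHA-1 hashes of 'server#replica' strings,
    kept in a sorted list for O(log n) lookups."""

    def __init__(self, vnodes=100):
        self.vnodes = vnodes
        self._positions = []   # sorted list of virtual-node hashes
        self._owners = {}      # virtual-node hash -> physical server

    def _hash(self, value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self.vnodes):
            pos = self._hash(f"{server}#{i}")
            bisect.insort(self._positions, pos)
            self._owners[pos] = server

    def remove(self, server):
        for i in range(self.vnodes):
            pos = self._hash(f"{server}#{i}")
            self._positions.remove(pos)
            del self._owners[pos]

    def get(self, key):
        """Smallest virtual-node hash >= hash(key), wrapping to the first."""
        idx = bisect.bisect_left(self._positions, self._hash(key))
        if idx == len(self._positions):
            idx = 0
        return self._owners[self._positions[idx]]

ring = ConsistentHashRing()
for server in ("A", "B", "C"):
    ring.add(server)
print(ring.get("user:42"))
```

Removing a server only reassigns the keys that server owned; every other key keeps its owner, which is the minimal-remapping property in action.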
The levers you control are the number of virtual nodes per physical node and the choice of hash function. More virtual nodes generally lead to better load distribution but increase the size of the ring’s lookup structure. The hash function’s quality impacts how evenly keys are spread.
A common misunderstanding is that consistent hashing guarantees perfect load balancing. It only guarantees minimal remapping upon node changes. Uneven distribution of keys themselves can still lead to imbalances, which is why virtual nodes are crucial and often a large number (e.g., 100-200) are used per physical server.
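The effect of virtual nodes on balance can be measured directly. This sketch (the key set, server names, and virtual-node counts are arbitrary illustrative choices) counts how many of 10,000 keys land on each of four servers with 1 versus 150 virtual nodes per server:

```python
import bisect
import hashlib
from collections import Counter

def build_ring(servers, vnodes):
    positions, owners = [], {}
    for s in servers:
        for i in range(vnodes):
            pos = int(hashlib.sha1(f"{s}#{i}".encode()).hexdigest(), 16)
            bisect.insort(positions, pos)
            owners[pos] = s
    return positions, owners

def owner(positions, owners, key):
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    idx = bisect.bisect_left(positions, h)
    return owners[positions[idx % len(positions)]]

servers = ["A", "B", "C", "D"]
keys = [f"key-{i}" for i in range(10_000)]
for vnodes in (1, 150):
    positions, owners_map = build_ring(servers, vnodes)
    counts = Counter(owner(positions, owners_map, k) for k in keys)
    print(vnodes, dict(counts))
```

With a single point per server the split is typically lopsided; with 150 virtual nodes the counts cluster much closer to 2,500 each.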
The next step in managing distributed data is often dealing with data replication for fault tolerance.