The most surprising thing about leader election in distributed systems is that it’s not about finding a leader, but about agreeing on who the leader is, even when parts of the system are failing.

Imagine you have a cluster of servers, and for certain operations, you need one designated "leader" to coordinate everything. If that leader crashes, the remaining servers need to quickly and reliably pick a new one. This is leader election.

Let’s look at Raft, a popular consensus algorithm in which leader election plays a central role. Raft works by dividing time into "terms." Each term has at most one leader. Servers start as "followers." When a follower times out waiting for a heartbeat from its leader, it becomes a "candidate" and starts a new term.
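The follower-to-candidate transition above can be sketched as a tiny state machine. This is a minimal illustration, not a real Raft implementation; the class and method names are made up for this example:

```python
from enum import Enum

class RaftState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

class RaftNode:
    """Hypothetical node tracking the state transitions described above."""
    def __init__(self):
        self.state = RaftState.FOLLOWER
        self.current_term = 0

    def on_election_timeout(self):
        # A follower that hears no heartbeat within its election timeout
        # starts a new term and campaigns for leadership.
        self.state = RaftState.CANDIDATE
        self.current_term += 1

node = RaftNode()
node.on_election_timeout()
print(node.state.name, node.current_term)  # CANDIDATE 1
```

Note that becoming a candidate always increments the term: terms are Raft's logical clock, and a higher term always overrides a lower one.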

A candidate then sends "RequestVote" RPCs to the other servers. If it receives votes from a majority of servers for the current term, it becomes the leader for that term. If it instead discovers that another server has already become leader (by receiving an AppendEntries RPC from it), it reverts to being a follower. If multiple candidates start elections in the same term and none gets a majority, the election times out, and each candidate starts a new term with a freshly randomized election timeout. This randomization is crucial: it makes it likely that one server times out first and wins before the others become candidates, so repeated split votes are rare.
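The voting rules above can be sketched in a few lines. This is a simplified sketch, assuming a dict-based node state and omitting Raft's log up-to-date check; the function names are illustrative:

```python
def handle_request_vote(state, candidate_term, candidate_id):
    """Sketch of a follower's RequestVote handler.
    `state` is a dict with 'current_term' and 'voted_for'."""
    if candidate_term < state["current_term"]:
        return False  # reject candidates from stale terms
    if candidate_term > state["current_term"]:
        state["current_term"] = candidate_term
        state["voted_for"] = None  # a new term resets the vote
    if state["voted_for"] in (None, candidate_id):
        state["voted_for"] = candidate_id  # at most one vote per term
        return True
    return False  # already voted for someone else this term

def wins_election(votes: int, cluster_size: int) -> bool:
    # A candidate needs a strict majority of the whole cluster.
    return votes > cluster_size // 2

follower = {"current_term": 3, "voted_for": None}
granted = handle_request_vote(follower, candidate_term=4, candidate_id="node-3")
print(granted, wins_election(3, 5))  # True True
```

The "at most one vote per term" rule is what guarantees a single leader per term: two candidates cannot both collect majorities, because any two majorities overlap in at least one server.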

Here’s a simplified Raft configuration snippet from a hypothetical system:

raft:
  cluster_id: "my-data-cluster-1"
  node_id: "node-3"
  listen_address: "192.168.1.3:2380"
  peers:
    - "192.168.1.1:2380"
    - "192.168.1.2:2380"
    - "192.168.1.3:2380"
  election_timeout_min: "150ms"
  election_timeout_max: "300ms"
  heartbeat_interval: "50ms"

In this setup, node-3 is part of a cluster identified by my-data-cluster-1. It listens for Raft communication on 192.168.1.3:2380 and knows about its peers. The election_timeout_min and election_timeout_max define the range for randomized timeouts, and heartbeat_interval is how often the leader sends "heartbeats" to followers. If a follower doesn’t receive a heartbeat within its election timeout, it assumes the leader is down and starts an election.
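The randomized timeout is simple to implement: each follower draws a fresh value from the configured range for every election round. A minimal sketch using the values from the hypothetical config above:

```python
import random

# Values from the hypothetical config above.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def next_election_timeout_ms() -> float:
    """Draw a fresh random timeout for each election round. Because every
    follower draws independently, one of them usually times out first,
    starts its election, and wins before the others become candidates."""
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)

timeout = next_election_timeout_ms()
print(ELECTION_TIMEOUT_MIN_MS <= timeout <= ELECTION_TIMEOUT_MAX_MS)  # True
```

Note the relationship between the settings: the heartbeat interval (50ms) is well below the minimum election timeout (150ms), so a healthy leader sends several heartbeats per timeout window and followers don't start spurious elections.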

Now, consider ZooKeeper, which also handles distributed coordination and leader election, but with a slightly different approach. ZooKeeper uses a protocol called ZAB (ZooKeeper Atomic Broadcast). ZAB is similar to Raft in its goal of achieving consensus but has its own nuances.

In ZooKeeper, servers also transition between states: LOOKING, FOLLOWING, LEADING, and OBSERVING. When a server starts up or detects a leader failure, it enters the LOOKING state. In this state, it exchanges vote notifications with its peers; each notification carries the sender's current vote, including the last transaction ID (zxid) it has seen and its server ID.

A server becomes the leader once a majority of the ensemble votes for it; servers converge on the candidate whose zxid is the most up-to-date. If multiple servers propose themselves as leader with the same highest zxid, a tie-breaking mechanism based on server ID is used.
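The comparison rule, highest zxid first and server ID as tie-breaker, maps neatly onto tuple ordering. A small sketch with made-up small numbers standing in for real zxids:

```python
# Each vote is a (zxid, server_id) pair. Python compares tuples
# element-by-element: zxid first, then server id as the tie-breaker,
# which is exactly the rule described above.
votes = [(5, 1), (7, 2), (7, 3)]

winner = max(votes)
print(winner)  # (7, 3): servers 2 and 3 tie on zxid, so server ID 3 wins
```

Preferring the highest zxid isn't arbitrary: it ensures the new leader already holds every transaction that could have been committed, so no acknowledged write is lost across the election.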

Here’s a snippet of ZooKeeper server configuration:

# ZooKeeper server.properties
tickTime=2000
initLimit=10
syncLimit=5
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888

In this server.properties file:

  • tickTime is the basic time unit, in milliseconds, used for heartbeats and timeouts.
  • initLimit is the time, measured in ticks, that followers have to connect and sync with the leader.
  • syncLimit is the time, in ticks, that a follower may take to sync with the leader before it is dropped.
  • server.X=host:port1:port2 defines each server in the ensemble. port1 (2888 here) is for follower-leader communication, and port2 (3888 here) is for leader election.
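Because initLimit and syncLimit are counted in units of tickTime, the effective timeouts are products of the two settings. Using the values from the server.properties above:

```python
# Values from the server.properties above.
tick_time_ms = 2000
init_limit_ticks = 10
sync_limit_ticks = 5

# initLimit and syncLimit are expressed in ticks, so the effective
# timeouts are multiples of tickTime:
init_timeout_ms = init_limit_ticks * tick_time_ms  # time to connect and sync
sync_timeout_ms = sync_limit_ticks * tick_time_ms  # allowed sync lag
print(init_timeout_ms, sync_timeout_ms)  # 20000 10000
```

So with this configuration a follower has 20 seconds to connect and catch up with the leader, and may fall at most 10 seconds out of sync before being dropped.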

The election process in ZooKeeper involves these ports. When a server is in the LOOKING state, it uses port2 (3888 in the example) to send and receive election packets. It broadcasts its vote (including its own zxid and server ID) and listens for votes from others.

One key aspect of ZooKeeper’s leader election is its reliance on a quorum, a majority of the servers in the ensemble. For ZooKeeper to make any progress, whether electing a leader or committing a transaction, a quorum must agree. This ensures that even if some servers are down, the remaining ones can still make decisions reliably. The election itself is no exception: a server assumes leadership only after a quorum of the ensemble has voted for it.
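The quorum arithmetic is worth making explicit, since it determines how many failures an ensemble survives. A short sketch (the function names are mine):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1

def failures_tolerated(ensemble_size: int) -> int:
    """How many servers can fail while a quorum still exists."""
    return ensemble_size - quorum_size(ensemble_size)

# The three-server ensemble from the config above:
print(quorum_size(3), failures_tolerated(3))  # 2 1
# A five-server ensemble tolerates two failures:
print(quorum_size(5), failures_tolerated(5))  # 3 2
```

This is also why ensembles have odd sizes: a fourth server raises the quorum from 2 to 3 without tolerating any additional failures.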

The real magic in these systems is how they handle network partitions. If the network splits, and a minority partition thinks it has elected a leader, that leader cannot reach a quorum of the entire ensemble. Therefore, it cannot make progress on state changes, preventing divergent states and ensuring consistency once the partition heals. The leader election is just the first step; maintaining leadership and agreeing on the state are the ongoing challenges.

The next challenge you’ll face is ensuring that once a leader is elected, all state changes are reliably replicated to a quorum of followers before being committed.
