The primary goal of consensus algorithms isn’t to achieve agreement, but to do so reliably in the face of network partitions and node failures.

Let’s see Raft in action. Imagine a simple distributed key-value store. We have three nodes: node1, node2, node3.

Initial State: All nodes are followers, waiting for a leader.

node1: {state: follower, term: 0, votedFor: null, log: []}
node2: {state: follower, term: 0, votedFor: null, log: []}
node3: {state: follower, term: 0, votedFor: null, log: []}
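The per-node state in this walkthrough can be sketched as a small Python record (field names follow the listing above; this is an illustrative sketch, not a complete Raft node):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NodeState:
    state: str = "follower"            # "follower", "candidate", or "leader"
    term: int = 0                      # latest term this node has seen
    voted_for: Optional[int] = None    # candidate voted for in `term`
    log: list = field(default_factory=list)  # replicated log entries

# Three fresh nodes, all followers at term 0:
cluster = {i: NodeState() for i in (1, 2, 3)}
print(cluster[1])
```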

A timeout occurs on node1. It transitions to candidate, increments its term, and votes for itself.

node1: {state: candidate, term: 1, votedFor: 1, log: []}
node2: {state: follower, term: 0, votedFor: null, log: []}
node3: {state: follower, term: 0, votedFor: null, log: []}

node1 sends RequestVote RPCs to node2 and node3.

Scenario 1: node1 becomes leader

node2 and node3 receive the RequestVote RPC. The candidate's term (1) is higher than their own (0), and its empty log is at least as up-to-date as theirs, so they adopt term 1 and grant their votes to node1, remaining followers.

node1: {state: leader, term: 1, votedFor: 1, log: []}
node2: {state: follower, term: 1, votedFor: 1, log: []}
node3: {state: follower, term: 1, votedFor: 1, log: []}
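The vote-granting decision node2 and node3 just made can be sketched as the receiver side of RequestVote. This is a simplified sketch, not full Raft; log entries are stored as (term, command) pairs here because the up-to-date check needs each entry's term:

```python
from types import SimpleNamespace

def handle_request_vote(node, cand_term, cand_id, cand_last_index, cand_last_term):
    """Receiver side of RequestVote (illustrative sketch, not full Raft)."""
    # A stale candidate is rejected outright.
    if cand_term < node.term:
        return (node.term, False)
    # A higher term always wins: adopt it and fall back to follower,
    # with no vote cast yet in the new term.
    if cand_term > node.term:
        node.term, node.voted_for, node.state = cand_term, None, "follower"
    # Grant at most one vote per term, and only to a candidate whose log
    # is at least as up-to-date as ours (compare last term, then length).
    last_term = node.log[-1][0] if node.log else 0
    up_to_date = (cand_last_term, cand_last_index) >= (last_term, len(node.log))
    if node.voted_for in (None, cand_id) and up_to_date:
        node.voted_for = cand_id
        return (node.term, True)
    return (node.term, False)

# node2 at term 0 with an empty log receives node1's RequestVote for term 1:
node2 = SimpleNamespace(state="follower", term=0, voted_for=None, log=[])
print(handle_request_vote(node2, cand_term=1, cand_id=1,
                          cand_last_index=0, cand_last_term=0))  # (1, True)
```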

node1 is now the leader for term 1. It starts sending AppendEntries heartbeats to node2 and node3 to maintain its leadership and signal that it’s alive.

Scenario 2: A network partition occurs

Suppose node1 and node2 can communicate, but node3 is isolated. node1 and node2 elect node1 as leader for term 1. node3 times out and becomes a candidate for term 2.

node1: {state: leader, term: 1, votedFor: 1, log: []}
node2: {state: follower, term: 1, votedFor: 1, log: []}
node3: {state: candidate, term: 2, votedFor: 3, log: []} // Partitioned, starts new term

node3 sends RequestVote RPCs to node1 and node2, but the partition means these RPCs never arrive, so neither node sees or answers them. (Note that a node never rejects an RPC merely because the sender's term is higher than its own; in Raft, a higher term always causes the receiver to adopt that term and step down to follower. Here the messages simply don't get through.)

node3 doesn't get a majority and times out, becoming a candidate again for term 3.
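node3's repeated timeouts follow the standard candidate transition, which can be sketched as below (randomized timeouts, which Raft uses to avoid split votes, are shown only as a helper):

```python
import random
from types import SimpleNamespace

def start_election(node, node_id):
    """Election-timeout transition: become a candidate for the next term."""
    node.state = "candidate"
    node.term += 1            # every new election starts a new term
    node.voted_for = node_id  # a candidate always votes for itself
    return node.term          # term to carry in its RequestVote RPCs

def election_timeout_ms():
    # Timeouts are randomized (the Raft paper suggests ~150-300 ms) so
    # nodes rarely time out simultaneously and split the vote.
    return random.uniform(150, 300)

# node3, cut off at term 1, times out twice in a row:
node3 = SimpleNamespace(state="follower", term=1, voted_for=1, log=[])
start_election(node3, 3)  # candidate for term 2
start_election(node3, 3)  # still no majority; candidate for term 3
print(node3.state, node3.term)  # candidate 3
```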

Meanwhile, node1 (the leader) receives a client request to set key=value.

node1 appends (set, key=value) to its log.

node1: {state: leader, term: 1, votedFor: 1, log: [(set, key=value)]}
node2: {state: follower, term: 1, votedFor: 1, log: []}
node3: {state: candidate, term: 3, votedFor: 3, log: []} // Still partitioned

node1 sends AppendEntries to node2 including the new entry.

  • node2 receives the AppendEntries. Its log is empty, so it appends the entry and replies with success.
node1: {state: leader, term: 1, votedFor: 1, log: [(set, key=value)]}
node2: {state: follower, term: 1, votedFor: 1, log: [(set, key=value)]}
node3: {state: candidate, term: 3, votedFor: 3, log: []} // Still partitioned

Now, node1 has replicated the entry to a majority (node1 and node2). It commits the entry and applies it to its state machine.
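The leader's commit decision can be sketched as follows: an index is committed once a majority of nodes (the leader plus its followers' matchIndex values) have reached it, and sorting those values gives the highest such index. The helper name is hypothetical:

```python
def majority_commit_index(leader_last_index, follower_match_indexes):
    """Highest log index replicated on a majority of the cluster (sketch)."""
    # The leader counts itself. Sort descending; the index reached by at
    # least floor(n/2)+1 nodes sits at position (majority - 1).
    indexes = sorted([leader_last_index, *follower_match_indexes], reverse=True)
    majority = len(indexes) // 2 + 1
    # Note: full Raft also requires log[index].term == currentTerm before
    # advancing commitIndex; that check is omitted here.
    return indexes[majority - 1]

# 3-node cluster: leader at index 1, node2 at 1, partitioned node3 at 0.
print(majority_commit_index(1, [1, 0]))  # 1 -> the entry is committed
```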

node1: {state: leader, term: 1, votedFor: 1, log: [(set, key=value)], commitIndex: 1, stateMachine: {key: value}}
node2: {state: follower, term: 1, votedFor: 1, log: [(set, key=value)], commitIndex: 0, stateMachine: {}}
node3: {state: candidate, term: 3, votedFor: 3, log: []} // Still partitioned

node1 then replies to the client indicating success. Its next heartbeat to node2 carries the updated commitIndex, so node2 commits the entry and applies it to its own state machine as well.

When the partition heals, node3's RequestVote for term 3 finally reaches node1 and node2. The higher term forces them to step down to followers and adopt term 3, but both deny the vote: node3's log is missing the committed entry, so it fails Raft's up-to-date check. A subsequent election at a higher term (say, term 4) is won by node1 or node2, whose logs do contain the entry. node3 then receives the new leader's heartbeats, settles into the follower role, and catches up through the AppendEntries consistency check.

The core problem these algorithms solve is consistency in a distributed system where messages can be delayed or lost, and nodes can fail. They do this by establishing a single, ordered log of operations that all nodes agree upon.

Raft, Paxos, and ZAB (used by ZooKeeper) are different implementations of this core idea. Raft is generally considered easier to understand and implement because it separates leadership election from log replication more distinctly than Paxos. ZAB is optimized for ZooKeeper’s specific use case of maintaining a consistent configuration and coordination service.

The key mechanism is that a leader must replicate an entry to a majority of nodes before it can be considered committed. This "majority" rule is what guarantees that once an entry is committed, it will never be lost, even if the leader fails. Any new leader must win votes from a majority, and that election quorum necessarily overlaps, in at least one node, with the majority that stored the committed entry; combined with the up-to-date voting rule, this guarantees the winner's log contains every committed entry.
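The majority-overlap argument can be checked mechanically: in an n-node cluster, any two majorities must share at least one node. A small brute-force check (illustrative only):

```python
from itertools import combinations

def quorums_always_intersect(n):
    """Check that every pair of majority quorums shares at least one node."""
    majority = n // 2 + 1
    quorums = list(combinations(range(n), majority))
    return all(set(a) & set(b) for a in quorums for b in quorums)

print(quorums_always_intersect(3), quorums_always_intersect(5))  # True True
```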

When a node receives any RPC (RequestVote or AppendEntries) carrying a higher term than its own, it must adopt that term and revert to follower. This is the fundamental mechanism that prevents stale leaders from causing inconsistencies. Conversely, each node grants at most one vote per term: if it has already voted in a given term, it rejects further vote requests for that same term.

The most surprising part of these algorithms is how they handle log divergence. A follower whose log conflicts with the leader's doesn't raise an error; the leader walks back to the latest point where the two logs agree, and the follower deletes everything after that point and appends the leader's entries. This isn't a failure mode; it's a feature. The leader's log is the source of truth, and followers are expected to conform.
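The divergence repair just described can be sketched as the follower side of AppendEntries: check that the logs agree up to the leader's previous entry, truncate anything after it, then append the leader's entries (entries as (term, command) pairs; a sketch, not the full protocol):

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Follower-side log repair for AppendEntries (sketch; 1-based indexes).

    prev_index/prev_term identify the entry in the leader's log that
    immediately precedes `entries`; entries are (term, command) pairs.
    """
    # Consistency check: reject if we don't hold the leader's previous entry.
    if prev_index > 0 and (len(follower_log) < prev_index
                           or follower_log[prev_index - 1][0] != prev_term):
        return False  # the leader retries with an earlier prev_index
    # Drop the conflicting suffix, then adopt the leader's entries.
    # (Full Raft truncates only from the first conflicting entry; this
    # unconditional truncation keeps the sketch short.)
    del follower_log[prev_index:]
    follower_log.extend(entries)
    return True

# A diverged follower whose second entry conflicts with the leader's log:
log = [(1, "set x=1"), (2, "set x=2")]
append_entries(log, prev_index=1, prev_term=1, entries=[(3, "set x=3")])
print(log)  # [(1, 'set x=1'), (3, 'set x=3')]
```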

The next hurdle is understanding how these algorithms handle leader failures and re-elections, and the implications for client request handling.
