The most surprising thing about Splunk Search Head Cluster (SHC) captain election is that it’s built on a distributed consensus algorithm (Raft) that prioritizes consistency over availability, meaning there are brief windows during an election when the cluster has no captain at all — and during those windows, no scheduled searches are dispatched and no configuration replication happens.

Let’s watch a captain election happen. Imagine you have a Splunk SHC with three search heads: sh1, sh2, and sh3. They’re all configured as members of the same cluster, fetching their configuration bundle from a deployer instance and talking to each other directly over their management ports (default 8089). No shared filesystem is involved; everything travels over the network.

When the cluster starts up, or if the current captain suddenly disappears (e.g., sh1 crashes), the remaining search heads enter an election.

Here’s a simplified view of what happens:

  1. No Captain: sh2 and sh3 stop receiving heartbeat messages from the captain. When a member goes a full (randomized) election timeout without hearing from any captain, it initiates an election.
  2. Candidacy: sh2 and sh3 both declare themselves as candidates. They increment an "election term" counter and broadcast "Vote Request" messages to each other.
  3. Voting: Each candidate asks the other members to vote for it. A search head grants its vote if the candidate’s election term is at least as new as its own, it hasn’t already voted in that term, and the candidate’s replicated state is at least as up to date as its own.
  4. Majority Wins: If a candidate receives votes from a majority of the search heads (in this case, 2 out of 3), it becomes the new captain. Let’s say sh2 wins.
  5. Leadership: sh2 starts sending out "Append Entries" messages (its heartbeat) to sh3. This establishes sh2 as the captain. sh3 now recognizes sh2 as the leader.

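The five steps above can be sketched in a few lines of Python. This is a teaching toy, not Splunk’s implementation: the member names, the single-threaded “RPC” calls, and the crash scenario are all illustrative.

```python
class Member:
    """A toy search-head-cluster member for illustrating Raft-style voting."""

    def __init__(self, name):
        self.name = name
        self.term = 0          # current election term
        self.voted_for = None  # candidate this member voted for in self.term

    def request_vote(self, candidate, candidate_term):
        # Grant the vote if the candidate's term is newer than ours and we
        # have not already voted in that term (steps 2-3 above).
        if candidate_term > self.term:
            self.term = candidate_term
            self.voted_for = candidate
            return True
        return False


def run_election(candidate, reachable, cluster_size):
    """Step 4: the candidate wins if it collects a majority of cluster_size."""
    candidate.term += 1
    candidate.voted_for = candidate.name
    votes = 1  # a candidate always votes for itself
    for m in reachable:
        if m is not candidate and m.request_vote(candidate.name, candidate.term):
            votes += 1
    return votes > cluster_size // 2


sh1, sh2, sh3 = Member("sh1"), Member("sh2"), Member("sh3")

# sh1 has crashed; sh2's election timeout fires first and it stands.
print(run_election(sh2, reachable=[sh2, sh3], cluster_size=3))  # True: 2 of 3 votes
print(sh3.voted_for)  # sh2 -- and sh3 cannot vote again in this term
```

Note that the majority is computed over the full cluster size, not just the reachable members: two crashed members out of three would leave the survivor unable to elect itself.
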
The System in Action (Simplified Configuration Snippets):

On each search head, server.conf carries the SHC settings in the [shclustering] stanza (not [clustering], which configures indexer clustering):

# On sh1, sh2, sh3 (mgmt_uri is unique per member)
[shclustering]
    # The label identifying this SHC
    shcluster_label = my_shc_cluster
    # The deployer this member fetches its configuration bundle from
    conf_deploy_fetch_url = https://deployer.example.com:8089
    # This member's own management URI
    mgmt_uri = https://sh1.example.com:8089
    # Shared secret for intra-cluster authentication (same on all members)
    pass4SymmKey = <your_shared_secret>

[replication_port://9887]

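In practice you rarely hand-edit these settings; the usual workflow is the CLI, which writes them for you. Hostnames, ports, and credentials below are illustrative:

```
# Run once on each member (mgmt_uri varies per search head)
splunk init shcluster-config -auth admin:changeme \
    -mgmt_uri https://sh1.example.com:8089 \
    -replication_port 9887 \
    -conf_deploy_fetch_url https://deployer.example.com:8089 \
    -shcluster_label my_shc_cluster
splunk restart

# Then, from any one member, bootstrap the first captain
splunk bootstrap shcluster-captain \
    -servers_list "https://sh1.example.com:8089,https://sh2.example.com:8089,https://sh3.example.com:8089" \
    -auth admin:changeme
```
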
You won’t find the SHC bundle settings in deploymentclient.conf; that file ties an instance to a deployment server, which manages forwarders, not SHC members (Splunk explicitly warns against pointing cluster members at a deployment server). Instead, apps are staged on the deployer and pushed to the cluster:

# On the deployer: stage apps under $SPLUNK_HOME/etc/shcluster/apps,
# then push the bundle to any live member
splunk apply shcluster-bundle -target https://sh1.example.com:8089 -auth admin:changeme

The Mental Model:

The SHC is built on the principle of distributed consensus. Think of it like a committee trying to agree on a decision, but the committee members can disappear or refuse to talk. The Raft algorithm is what allows them to reliably pick a leader and keep configurations synchronized even with network issues or node failures.

  • Captain: The single search head responsible for scheduling searches across the cluster, coordinating replication of search artifacts, and keeping knowledge objects in sync on every member. It’s the "source of truth" for cluster state.
  • Configuration Bundles: Apps and settings staged on the deployer (like adding an app or modifying inputs.conf) are packaged into a bundle and pushed to every member. Runtime changes users make through Splunk Web (saved searches, dashboards, lookups) are replicated separately by the captain.
  • Management Port: Every member exposes a management port (default 8089) that all the others can reach. Election traffic, heartbeats, and replication flow over these direct connections; no shared storage is required. Think of it as everyone being on the same conference call rather than writing on a shared whiteboard.
  • Heartbeats: The captain periodically sends a heartbeat message to every member over the management port. Each member tracks how fresh the last heartbeat is; if a member goes a full election timeout without hearing one, it assumes the captain is down and triggers an election.
  • Replication: Once a captain is elected, it brings every member up to date with the current replicated configuration. Members apply changes as they arrive; a member that has fallen too far behind performs a resync against the captain.
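
A member’s captain-failure detection boils down to a freshness check against a randomized timeout. A minimal sketch follows; the interval and timeout values are illustrative, not Splunk’s actual defaults.

```python
import random

HEARTBEAT_INTERVAL = 1.0  # seconds between captain heartbeats (illustrative)

def election_timeout(rng):
    # Each member randomizes its timeout so that members rarely expire
    # together, which keeps split votes (two simultaneous candidates) rare.
    return rng.uniform(2.0, 4.0)

def captain_alive(now, last_heartbeat_at, timeout):
    """A member considers the captain alive while heartbeats are fresh."""
    return (now - last_heartbeat_at) <= timeout

rng = random.Random(7)
timeout = election_timeout(rng)  # somewhere in [2.0, 4.0)

# Captain last heartbeated at t=10.0s.
print(captain_alive(11.5, 10.0, timeout))  # True: only 1.5s of silence
print(captain_alive(15.0, 10.0, timeout))  # False: 5.0s exceeds any timeout here
```

The randomization is the quiet hero: without it, all members would time out at the same instant, declare candidacy simultaneously, split the vote, and repeat.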

The Problem Solved:

Without an SHC, managing configurations across multiple search heads is a nightmare. You’d have to manually deploy apps, transforms.conf changes, etc., to each individual search head. An SHC centralizes this management. The captain ensures that every search head has the exact same configuration, making the cluster behave as a single, powerful search unit and simplifying administration drastically.

The One Thing Most People Don’t Know:

Captain election requires a majority of all configured members, not just the ones still running. In a three-member cluster, two votes are needed; if two members go down, the survivor can never elect itself captain, even though it can still serve ad hoc searches. Scheduled search dispatch and replication stop until a majority returns. This is why a two-member SHC is a trap (losing either member leaves the other permanently captain-less) and why Splunk recommends a minimum of three members. For disaster-recovery situations where a majority is gone for an extended time, you can temporarily convert a surviving member into a static captain, bypassing election entirely, and switch back to dynamic election once the cluster is healed.
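
Captain election needs a strict majority of all configured members, so the arithmetic is worth internalizing. A quick sketch:

```python
def votes_needed(cluster_size):
    # A captain needs a strict majority of all *configured* members,
    # including any that are currently down.
    return cluster_size // 2 + 1

for n in (2, 3, 5, 7):
    print(f"{n} members: {votes_needed(n)} votes to elect, "
          f"{n - votes_needed(n)} failures tolerated")
```

Note that a two-member cluster needs both votes, so it tolerates zero failures — worse than useless for high availability, since you’ve doubled the hardware and gained nothing.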

The next thing you’ll likely encounter is dealing with configuration conflicts or understanding how search jobs are distributed across the cluster.

Want structured learning?

Take the full Splunk course →