Vault’s HA cluster isn’t a hot-standby system where one node immediately takes over for another; it’s a leader-election system where nodes eventually converge on a single leader.
Let’s see it in action. Imagine you have three Vault servers: vault-01, vault-02, and vault-03.
# On vault-01 (current leader)
vault status
Key Value
--- -----
...
HA Enabled true
HA Cluster https://vault-01:8201
HA Mode active
...
Seal Type shamir
Sealed false
...
Version 1.15.5
...
If vault-01 suddenly dies, the remaining nodes, vault-02 and vault-03, start a race to become the new leader. This isn’t instantaneous; it involves timeouts and consensus.
# On vault-02 (after vault-01 dies)
vault status
# Output might show it's not yet the leader, or it might flip quickly.
Key Value
--- -----
...
HA Enabled true
HA Cluster https://vault-02:8201
HA Mode active
...
Seal Type shamir
Sealed false
...
Version 1.15.5
...
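Rather than eyeballing `vault status` on each node, a script can poll Vault's unauthenticated `sys/leader` endpoint and interpret the JSON it returns. Here's a minimal sketch of that interpretation step; the sample response below is illustrative, but the field names (`ha_enabled`, `is_self`, `leader_address`) follow the `sys/leader` API:

```python
import json

def describe_ha_status(body):
    """Interpret a response body from Vault's GET /v1/sys/leader endpoint."""
    info = json.loads(body)
    if not info.get("ha_enabled"):
        return "standalone (HA disabled)"
    if info.get("is_self"):
        return "this node is the active leader"
    return "standby; leader at %s" % info.get("leader_address")

# Illustrative response a standby might return after vault-01 dies
# and vault-02 wins the election:
sample = ('{"ha_enabled": true, "is_self": false, '
          '"leader_address": "http://vault-02:8200", '
          '"leader_cluster_address": "https://vault-02:8201"}')
print(describe_ha_status(sample))  # standby; leader at http://vault-02:8200
```

Polling each node's `sys/leader` in a loop during a failover test makes the election window visible: for a short interval, no node reports `is_self: true`.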
The core problem Vault HA solves is keeping secrets and audit logging available even when a single Vault instance fails. It does this through a leader-election mechanism, typically coordinated via Raft or Consul. In this active/standby configuration (which is really leader election), one node is elected the active node, the "leader," and handles all read/write operations. The other nodes, the standbys, stay connected to the same storage backend, redirect or forward client requests to the leader, and are ready to step in if the leader becomes unavailable.
Here’s how the leader election typically works under the hood:
- Heartbeats/Liveness Checks: Each Vault node in the HA cluster periodically sends heartbeat messages to a shared coordination service (like Consul or etcd) or directly to other Vault nodes if using Vault’s built-in Raft.
- Leader Election Protocol: If a node detects that the current leader is no longer sending heartbeats, it initiates an election. The election process is governed by the underlying consensus algorithm (Raft). Nodes vote for a leader, and a majority of nodes must agree on a single leader for the election to succeed.
- State Replication: Once a leader is elected, standby nodes synchronize their state with the new leader, so every node has an up-to-date view of Vault's data.
- Client Redirection: Clients (applications, users) are typically configured to send requests to a load balancer, which directs traffic to the current leader. If the leader fails, the load balancer's health checks against that node start failing, and traffic shifts to the newly elected leader. (Vault's sys/health endpoint returns different HTTP status codes for active and standby nodes, which is what load balancer health checks typically key on.)
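The voting step in the election protocol above can be sketched with a toy model. To be clear, this is an illustration of the Raft-style majority rule, not Vault's actual implementation; the timeout range and vote-granting behavior are simplified assumptions:

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # randomized per node, as in Raft

def elect_leader(live_nodes, cluster_size, seed=0):
    """Toy model of one election round: the node whose randomized election
    timeout fires first requests votes, and it wins only if the surviving
    nodes still form a majority of the full cluster."""
    if len(live_nodes) <= cluster_size // 2:
        return None  # no quorum: the cluster stays leaderless (and unavailable)
    rng = random.Random(seed)
    timeouts = {n: rng.uniform(*ELECTION_TIMEOUT_MS) for n in live_nodes}
    candidate = min(timeouts, key=timeouts.get)  # first timeout to fire
    # Simplification: every live node grants its vote to the first candidate.
    votes = len(live_nodes)
    return candidate if votes > cluster_size // 2 else None

# vault-01 has died; vault-02 and vault-03 still form a majority of 3.
print(elect_leader(["vault-02", "vault-03"], cluster_size=3))
# With two of three nodes down there is no quorum and no leader:
print(elect_leader(["vault-03"], cluster_size=3))  # None
```

The second call shows why three-node clusters are the practical minimum: losing two of three nodes leaves no majority, so no leader can be elected and writes stop entirely.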
The key configuration parameters for an HA cluster are found in the Vault server configuration file (e.g., vault.hcl). For a leader-election setup using Consul for discovery:
# Example vault.hcl for a leader-election HA cluster with Consul
listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1               # Lab/demo only -- always enable TLS in production
}

storage "consul" {
  address = "127.0.0.1:8500"    # Local Consul agent
  path    = "vault/"            # Prefix for Vault data in Consul KV
}
# For leader election, you don't explicitly designate a primary or standby.
# Leadership is handled by Vault's internal Raft or by Consul's session
# management. With Vault's integrated Raft storage, the configuration uses
# a storage "raft" stanza instead of "consul" or "etcd":
# Example using Vault's integrated Raft storage (now the recommended default)
# storage "raft" {
#   path    = "/opt/vault/data"
#   node_id = "vault-01"                        # Unique ID for each node
#   retry_join {
#     leader_api_addr = "http://vault-02:8200"  # API address of a known peer
#   }
#   retry_join {
#     leader_api_addr = "http://vault-03:8200"
#   }
# }
# HA configuration
api_addr     = "http://vault-01:8200"  # Address standbys redirect clients to for *this* node -- must be reachable, not loopback
cluster_addr = "https://vault-01:8201" # Address for server-to-server (cluster port) traffic
# HA mode is implicit: any HA-capable storage backend (Consul, integrated
# Raft, etcd) lets multiple nodes share the same storage and elect a leader
# among themselves. No roles are assigned in the configuration.
The most surprising aspect for many is that Vault doesn’t maintain a strict, pre-defined active/standby relationship like a traditional failover cluster. Instead, it relies on a distributed consensus mechanism to dynamically elect a leader. This means that when the leader fails, there’s a brief period of unavailability (the election timeout) before a new leader is promoted. The duration of this unavailability is influenced by network latency, the number of nodes, and the underlying consensus configuration.
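To build intuition for that unavailability window, it helps to add up its components. The numbers below are purely illustrative assumptions for a back-of-the-envelope estimate, not Vault or Raft defaults:

```python
# Back-of-the-envelope failover window (illustrative numbers, not Vault defaults):
heartbeat_interval_s = 1.0   # how often followers expect to hear from the leader
election_timeout_s   = 0.3   # randomized wait before a follower starts an election
voting_round_s       = 0.1   # one round-trip of vote requests and responses
standby_catchup_s    = 0.5   # new leader settles in; redirects start working

worst_case_unavailable = (heartbeat_interval_s + election_timeout_s
                          + voting_round_s + standby_catchup_s)
print("worst-case write unavailability ~= %.1fs" % worst_case_unavailable)
```

The exercise shows where tuning pays off: the heartbeat interval and election timeout usually dominate, and both trade failover speed against the risk of spurious elections on a flaky network.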
If you're using Consul for HA, Vault leverages Consul sessions to manage leadership. Each Vault node tries to acquire a leadership lock backed by a Consul session, and only the node holding the lock acts as the leader. When that session expires or is invalidated (due to a network partition or node failure), the other nodes race to acquire the lock and become the leader. This is why keeping Consul healthy and low-latency is paramount for Vault HA.
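The session mechanics can be sketched as a toy lock with a TTL. This is a deliberate simplification (real Consul sessions also involve lock-delay and health-check invalidation), but it captures the core handoff:

```python
class ConsulLockSim:
    """Toy model of session-based leadership: a single lock key whose
    holder must renew its session before the TTL expires."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node, now):
        # The lock is free if it was never held or the session expired.
        if self.holder is None or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

    def renew(self, node, now):
        # Only the current holder can extend its own session.
        if self.holder == node and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False

lock = ConsulLockSim(ttl=15.0)
assert lock.acquire("vault-01", now=0.0)      # vault-01 becomes leader
assert not lock.acquire("vault-02", now=5.0)  # lock still held by vault-01
# vault-01 dies and stops renewing; once the TTL lapses the session expires:
assert lock.acquire("vault-02", now=16.0)     # vault-02 takes over
print("leader:", lock.holder)                 # leader: vault-02
```

Note the gap between `now=5.0` and `now=16.0`: until the session TTL lapses, nobody can take the lock, which is exactly the unavailability window described above. A shorter TTL means faster failover but more false takeovers during transient Consul hiccups.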
The next concept you’ll likely encounter is how to configure clients and load balancers to seamlessly switch to the new leader once an election occurs.