Vault’s health endpoint is more than just a "is it alive?" check; it’s the primary signal that the entire distributed system is capable of serving requests reliably.

Here’s Vault running, actively serving secrets and responding to health checks:

```json
{
  "version": "1.15.0",
  "server_time_utc": 1704067200,
  "leader": "vault-0",
  "leader_since": 1704067190,
  "performance_standby": false,
  "performance_standby_group": "",
  "ha_enabled": true,
  "sealed": false,
  "standby": false,
  "replication_state": "none",
  "cluster_id": "abcdef12-3456-7890-abcd-ef1234567890",
  "raft_eligible_secondary_nodes": [],
  "raft_non_voter_nodes": [],
  "raft_voter_nodes": [
    "vault-0",
    "vault-1",
    "vault-2"
  ],
  "storage_type": "raft",
  "auto_unseal": {
    "status": "disabled",
    "progress": 0
  },
  "migration_status": {
    "in_progress": false,
    "target_storage_type": ""
  },
  "error": ""
}
```

This output is the result of hitting the /sys/health endpoint. By default a standby node answers with HTTP 429; pass ?standbyok=true if you want standbys to return 200 as well. The key fields here are leader, ha_enabled, sealed, and standby. If sealed is true, Vault is not serving requests; if standby is true, the node is healthy but forwards writes to the active leader rather than handling them itself. The error field is critical: if it's non-empty, something is fundamentally wrong.
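The field checks described above can be sketched as a small helper. This is an illustrative sketch, not part of any Vault SDK; the `classify_health` function and its verdict labels are my own, and the field names follow the example response above.

```python
def classify_health(health: dict) -> str:
    """Summarize a /sys/health JSON body into a one-word verdict.

    Hypothetical helper for illustration; checks fields in order of
    severity, mirroring the prose: error, then sealed, then standby.
    """
    if health.get("error"):
        return "error"    # a non-empty error field is always bad news
    if health.get("sealed"):
        return "sealed"   # sealed nodes serve no secrets at all
    if health.get("standby"):
        return "standby"  # healthy, but forwards writes to the leader
    return "active"

# The example output above describes an active, unsealed leader:
example = {"sealed": False, "standby": False, "ha_enabled": True, "error": ""}
print(classify_health(example))  # → active
```

Running this against each node in the cluster gives a quick at-a-glance picture before digging into metrics.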

Vault operates as a distributed system, primarily using the Raft consensus algorithm for its High Availability (HA) mode. In HA, one node is the designated leader, responsible for processing all write operations. The other nodes are standbys: they replicate the Raft log (as voters or non-voters), and any voting standby can step up as leader if the current leader fails. The health endpoint reflects the state of this Raft cluster: it tells you whether Vault is initialized and unsealed, and whether a stable Raft leader exists.

The telemetry endpoint, accessed via /sys/metrics, returns metrics in JSON by default, or in Prometheus text format when queried with ?format=prometheus. These metrics are invaluable for understanding the internal workings of Vault, from Raft latency to the performance of specific authentication backends and the success/failure rates of secret operations. You can configure the level of detail in the telemetry stanza, but the defaults provide a good overview of system health and performance.
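To give a feel for what the Prometheus text format looks like, here is a minimal parser sketch for one exposition line. A real scraper would use a Prometheus client library; this toy function and the sample metric line are for illustration only.

```python
def parse_prometheus_line(line: str):
    """Parse one Prometheus text-exposition line into (name, labels, value).

    Minimal sketch: handles the common `name{labels} value` and
    `name value` shapes, and skips comments and blank lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    metric, _, value = line.rpartition(" ")
    name, _, labels = metric.partition("{")
    return name, labels.rstrip("}"), float(value)

# Sample line in the shape /sys/metrics?format=prometheus emits:
sample = 'vault_core_unsealed{cluster="vault-cluster-abc"} 1'
print(parse_prometheus_line(sample))
```

In practice you would point Prometheus at the endpoint directly rather than parsing by hand; the sketch just shows what each scraped sample contains.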

Consider the interplay between Raft, sealing, and leadership. Vault must be unsealed to operate. While sealed, it can reach its storage backend, but the keys needed to decrypt that data are not held in memory, so it cannot serve secrets. In an HA setup, only the leader can accept write operations. If the leader is sealed, or if there's no stable leader (e.g., during a Raft election or when a majority of nodes are down), Vault will not be able to serve most requests. The health endpoint aggregates this information. The error field often contains the most direct clue when something is wrong, such as "Vault is sealed" or a message indicating there is no Raft leader.

When Vault is configured for HA, and you query /sys/health, you’re not just asking "is this process running?". You’re asking: "Is there an active Raft leader? Is that leader initialized and unsealed? Can it successfully participate in Raft consensus?" If the error field is populated, it’s usually because one of these conditions isn’t met. For instance, if ha_enabled is true but leader is empty and standby is true on all nodes, it signifies a Raft election is in progress or has failed.
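The health endpoint also encodes these states in its HTTP status code, which is what most load balancers key off. The codes below are Vault's documented defaults (each can be remapped with query parameters such as standbyok or sealedcode); the lookup table and helper function are my own illustration.

```python
# Default /sys/health HTTP status codes, per Vault's API documentation.
HEALTH_STATUS = {
    200: "initialized, unsealed, and active",
    429: "unsealed and standby",
    472: "disaster recovery mode replication secondary and active",
    473: "performance standby",
    501: "not initialized",
    503: "sealed",
}

def is_serving_writes(status_code: int) -> bool:
    """Only the active node (HTTP 200) accepts write operations."""
    return status_code == 200

print(HEALTH_STATUS[503], is_serving_writes(503))  # → sealed False
```

This is why an unauthenticated GET to /sys/health works as a load-balancer health check: the status code alone distinguishes active, standby, and sealed nodes.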

The telemetry endpoint offers granular insights that complement the health endpoint. For example, you might see rising latency on Raft commits (vault.raft.commitTime) or followers going long stretches without hearing from the leader (vault.raft.leader.lastContact). These metrics, when correlated with the health endpoint reporting no leader, help pinpoint whether the issue is network partitioning, resource starvation on a node, or a full Raft quorum failure. You can scrape these metrics with a Prometheus server and build dashboards to visualize trends and alert on anomalies.

It’s crucial to understand that ha_enabled being true doesn’t automatically mean HA is working; it just means the configuration is present. The health endpoint’s leader field is the definitive indicator of an active Raft leader. If you have ha_enabled = true and leader = "" across all nodes, your cluster is effectively down for writes. The telemetry counters vault.raft.state.candidate and vault.raft.state.leader increment each time a node transitions into those Raft states, giving a real-time view of election churn.
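The "down for writes" condition described above can be expressed as a one-line check over the health bodies gathered from every node. This is an illustrative helper, not a Vault API; it assumes the sealed/standby fields from the example response.

```python
def cluster_can_write(node_healths: list[dict]) -> bool:
    """True if at least one node is unsealed and not a standby,
    i.e. an active leader exists somewhere in the cluster.

    Illustrative check over /sys/health bodies collected per node.
    """
    return any(
        not h.get("sealed") and not h.get("standby") for h in node_healths
    )

# All nodes report standby=true: an election is in progress or has failed.
nodes = [
    {"sealed": False, "standby": True},
    {"sealed": False, "standby": True},
    {"sealed": False, "standby": True},
]
print(cluster_can_write(nodes))  # → False
```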

The most common reason for Vault to report an unhealthy state, specifically an error indicating Raft issues or no leader, is a network partition that prevents a majority of Raft voters from communicating, leaving the cluster without quorum. This can happen due to firewall misconfigurations, underlying network infrastructure problems, or resource exhaustion (CPU/memory) on one or more Vault nodes causing them to miss heartbeats. Comparing each node's committed and applied Raft log indexes, available through Raft telemetry and the autopilot state endpoint (/sys/storage/raft/autopilot/state), shows whether nodes are falling behind or disagree about the log, which is a strong indicator of Raft health.
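A simple way to spot a lagging node is to compare each node's applied log index against the cluster maximum. The function, the lag threshold, and the index values below are illustrative; in practice the indexes come from Raft telemetry or the autopilot state endpoint.

```python
def lagging_nodes(applied_indexes: dict[str, int], max_lag: int = 100) -> list[str]:
    """Flag nodes whose applied Raft log index trails the cluster
    maximum by more than max_lag entries.

    Hypothetical helper; the threshold of 100 entries is an arbitrary
    example, to be tuned for your write rate.
    """
    newest = max(applied_indexes.values())
    return [
        node for node, idx in applied_indexes.items()
        if newest - idx > max_lag
    ]

# vault-2 is 1800 entries behind: likely partitioned or starved.
print(lagging_nodes({"vault-0": 5000, "vault-1": 4990, "vault-2": 3200}))
# → ['vault-2']
```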

The health endpoint’s sealed status is independent of Raft leadership. A cluster can have a healthy Raft leader, but if that leader is sealed, Vault will still refuse to serve secrets. The auto_unseal status in the health output can provide clues if auto-unseal is configured but failing. For manual unsealing, the /sys/seal-status endpoint reports progress (how many unseal key shares have been provided) and t (the threshold required), and the server logs from the seal mechanism are vital for diagnosing these issues.
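As a sketch of reading that endpoint, the helper below renders a /sys/seal-status body into a human-readable line. The sealed, t, n, and progress fields are real fields of that endpoint; the summary format itself is my own.

```python
def unseal_summary(seal_status: dict) -> str:
    """Render a /sys/seal-status body as 'k of t shares provided'.

    Illustrative helper: `t` is the unseal threshold, `progress` the
    number of key shares submitted so far in this unseal attempt.
    """
    if not seal_status["sealed"]:
        return "unsealed"
    return f'{seal_status["progress"]} of {seal_status["t"]} shares provided'

# One of three required shares has been submitted so far:
print(unseal_summary({"sealed": True, "t": 3, "n": 5, "progress": 1}))
# → 1 of 3 shares provided
```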

When investigating a Vault HA cluster that’s reporting unhealthy, always start with the /sys/health endpoint on all nodes. Look for the error field and the presence of a leader. Then, dive into /sys/metrics for Raft-related metrics (latency, heartbeats, commit index) and any specific errors reported. Understanding the Raft protocol’s requirement for quorum (N/2 + 1 nodes) is fundamental to diagnosing why a leader might not be elected or might be unstable.
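The quorum arithmetic mentioned above is worth making concrete, since it answers most "why is there no leader?" questions. These two helpers are a plain illustration of the Raft majority rule, not a Vault API.

```python
def quorum(voters: int) -> int:
    """Raft quorum: a strict majority of voting nodes, N // 2 + 1."""
    return voters // 2 + 1

def has_quorum(voters: int, reachable: int) -> bool:
    """A leader can only be elected while a majority of voters
    can communicate with each other."""
    return reachable >= quorum(voters)

# 3-node cluster: needs 2 voters; surviving one node loss, not two.
# 5-node cluster: needs 3 voters; two unreachable nodes is still fine.
print(quorum(3), has_quorum(3, 2), has_quorum(5, 2))  # → 2 True False
```

This is why a 3-node cluster tolerates exactly one node failure, and why losing two nodes of five to a partition still leaves the majority side able to elect a leader.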

The replication_state field in the health endpoint, when Vault is configured for replication (a Vault Enterprise feature covering performance and disaster recovery replication), indicates the status of data synchronization between primary and secondary clusters. If replication is broken, the secondary clusters will not have the latest data, even if they are otherwise healthy and reachable. Monitoring the write-ahead-log (WAL) lag between primary and secondary, exposed through Vault's replication telemetry, is critical for catching replication failures early.
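Conceptually, replication lag is just the distance between the primary's newest WAL entry and the last entry the secondary has replayed. The helper and the index values below are illustrative inputs, not a specific Vault metric.

```python
def replication_lag(primary_wal: int, secondary_wal: int) -> int:
    """Number of write-ahead-log entries the secondary still has to
    replay. A steadily growing value means replication is falling
    behind or has broken entirely.

    Illustrative arithmetic; clamp at zero since a secondary can never
    be meaningfully 'ahead' of its primary.
    """
    return max(0, primary_wal - secondary_wal)

print(replication_lag(120_450, 120_200))  # → 250
```

Alerting on the trend of this value (rather than any single reading) distinguishes a transient burst of writes from a genuinely broken replication link.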

The next thing you’ll likely encounter is troubleshooting specific backend performance issues, which will require digging into the telemetry metrics related to individual auth or secret engines.

Want structured learning?

Take the full Vault course →