Vault’s Raft snapshot is the backbone of backup and recovery for Integrated Storage, but it’s more than a simple file copy. It captures the entire state of your cluster at a specific point in time: all secrets, configuration, leases, and the Raft metadata needed to rebuild consensus.
Let’s see it in action. Imagine a 3-node Vault cluster running Raft.
# On Vault Node 1 (Leader)
vault operator raft list-peers
This prints a table like:
Node           Address                  State       Voter
----           -------                  -----       -----
<node-id-1>    <node-1-address>:8201    leader      true
<node-id-2>    <node-2-address>:8201    follower    true
<node-id-3>    <node-3-address>:8201    follower    true
Note that the peer addresses use the cluster port (8201 by default), not the API port.
Now, let’s trigger a snapshot and examine it.
# On Vault Node 1 (Leader)
vault operator raft snapshot save vault_raft_snapshot.snap
This command doesn’t just copy files. The leader streams a consistent, point-in-time snapshot over the API: a compressed archive containing the serialized FSM state together with metadata such as the last log index and term it covers. The resulting vault_raft_snapshot.snap file is sufficient, on its own, to reconstruct the cluster’s data.
The core problem Raft snapshots solve is preventing Raft log bloat and enabling quick cluster recovery. Without snapshots, the Raft log would grow indefinitely, consuming disk space and slowing down consensus. When a new node joins or an existing node restarts, it needs a way to catch up without replaying every single log entry since the cluster’s inception. The snapshot provides this efficient starting point.
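To make the log-bloat problem concrete, here is a toy model in plain Python (not Vault's actual Go internals) of how taking a snapshot lets a node truncate its log. The class and field names are invented for illustration:

```python
# Toy model of Raft log compaction via snapshots. Illustrative only:
# real nodes persist the log and snapshots to disk via hashicorp/raft.

class ToyRaftNode:
    def __init__(self):
        self.state = {}          # the FSM: key/value state built from the log
        self.log = []            # list of (index, command) entries
        self.snapshot = None     # (last_included_index, copy of state)

    def append(self, index, key, value):
        self.log.append((index, (key, value)))
        self.state[key] = value  # apply the command to the FSM

    def take_snapshot(self):
        last_index = self.log[-1][0]
        self.snapshot = (last_index, dict(self.state))
        # Entries covered by the snapshot are now redundant: compact them.
        self.log = [e for e in self.log if e[0] > last_index]

node = ToyRaftNode()
for i in range(1, 1001):
    node.append(i, f"secret/{i}", f"v{i}")

node.take_snapshot()
print(len(node.log))     # 0 -- the log was truncated
print(node.snapshot[0])  # 1000 -- last index the snapshot covers
```

A node joining later can be handed the snapshot directly instead of the thousand entries that produced it, which is exactly the catch-up shortcut described above.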
Internally, Vault uses HashiCorp’s raft library (hashicorp/raft). When a snapshot is taken, the library serializes the current state machine along with its metadata. This includes:
- The FSM: the actual state of your Vault cluster – all the data, configuration, leases, etc.
- Metadata: the index and term of the last log entry the snapshot covers, plus the cluster membership (peer configuration) at that point.
Notably, the snapshot does not contain the Raft log itself. Once a snapshot covers a prefix of the log, those entries are redundant and can be compacted away.
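A simplified picture of what a snapshot bundles together, sketched in Python with field names invented for clarity (hashicorp/raft's actual structures differ):

```python
# Illustrative shape of a Raft snapshot: FSM state plus the metadata
# that situates it in the log. Field names are made up for this sketch.

snapshot = {
    "metadata": {
        "last_index": 1042,          # last log entry the snapshot covers
        "last_term": 7,              # term of that entry
        "configuration": [           # cluster membership at snapshot time
            {"id": "node1", "voter": True},
            {"id": "node2", "voter": True},
            {"id": "node3", "voter": True},
        ],
    },
    "fsm_state": {
        # serialized Vault state: secrets, config, leases, ...
    },
}

# Any log entry at or below last_index is covered and can be compacted.
def is_compactable(entry_index, snap):
    return entry_index <= snap["metadata"]["last_index"]

print(is_compactable(1000, snapshot))   # True
print(is_compactable(1043, snapshot))   # False
```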
To restore, you don’t copy the snapshot into a data directory by hand. You bring up a running, initialized, and unsealed Vault cluster and stream the snapshot back through the API:
# On the new or recovered cluster, with Vault running, initialized, and unsealed
export VAULT_ADDR=https://<node-1-address>:8200
vault login   # a token with sufficient (sudo) privileges is required

# Restore the snapshot. Use -force when the target cluster was not
# initialized from the same storage as the snapshot's source.
vault operator raft snapshot restore -force vault_raft_snapshot.snap
During the restore, Vault replaces the contents of its FSM with the snapshot’s state, replicates it to the other peers, and the cluster resumes from the snapshot’s last included index. One important caveat: the snapshot carries the keyring of the cluster it was taken from, so after restoring onto a different cluster you must unseal with the unseal (or recovery) keys of the original cluster.
The most surprising thing is how little of the log the snapshot actually needs. A snapshot records only the index and term of the last entry it covers, not the entries themselves. On a local restart, a node loads its most recent snapshot and then replays, from its on-disk log store, only the entries whose index is greater than the snapshot’s last included index. And when a follower has fallen so far behind that the leader has already compacted the entries it needs, the leader ships it a full snapshot and resumes ordinary replication from there. Either way the result is consistent: state up to the snapshot, plus every operation committed after it.
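The restart path can be sketched with the same toy model as before: load the snapshot's state, then re-apply only the log suffix. This is a conceptual illustration, not hashicorp/raft's code:

```python
# Toy model of recovery on restart: restore the latest snapshot, then
# replay only log entries after the snapshot's last included index.

def recover(snapshot, log):
    """snapshot: (last_included_index, state dict); log: [(index, (key, value))]."""
    last_included, snap_state = snapshot
    state = dict(snap_state)
    replayed = 0
    for index, (key, value) in log:
        if index > last_included:    # earlier entries are already in the snapshot
            state[key] = value
            replayed += 1
    return state, replayed

# Snapshot covers entries up to index 3; the log still holds 2..5.
snapshot = (3, {"a": "1", "b": "2", "c": "3"})
log = [(2, ("b", "2")), (3, ("c", "3")), (4, ("d", "4")), (5, ("a", "9"))]

state, replayed = recover(snapshot, log)
print(replayed)     # 2 -- only entries 4 and 5 were re-applied
print(state["a"])   # 9 -- later entries override snapshot state
```

Note how entry 5 overwrites a key the snapshot already contained: replaying the suffix is what keeps the restored state consistent with the cluster's full history.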
Understanding this interplay between FSM state and the log segment within the snapshot is key to effective disaster recovery and cluster management.
The next logical step after mastering snapshots is understanding Raft log compaction and how it interacts with snapshotting to manage disk space.