Cross-region replication in Vault is surprisingly complex because it’s not a simple data copy; it’s an active-passive failover mechanism (a Vault Enterprise feature) that requires careful coordination to prevent split-brain scenarios and data loss.
Let’s see it in action. Imagine you have a primary Vault cluster in us-east-1 and a standby in eu-west-1.
Here’s how you might set up the primary Vault server. The base cluster settings live in the config file; replication itself is enabled at runtime through the API:
listener "tcp" {
  address     = "10.1.1.1:8200"
  tls_disable = 1 // demo only; enable TLS in production
}

storage "raft" {
  path    = "/vault/data"
  node_id = "vault-01"
  retry_join {
    leader_api_addr = "http://10.1.1.1:8200"
  }
}

Replication is not a config-file stanza; it is enabled via the API once the primary is initialized and unsealed. Here we enable disaster recovery (DR) replication and generate an activation token for the standby:

$ vault write -f sys/replication/dr/primary/enable
$ vault write sys/replication/dr/primary/secondary-token id="eu-west-1-standby"

The second command returns a wrapped activation token; hold on to it for the standby.
And on the standby Vault server:
listener "tcp" {
  address     = "10.2.2.2:8200"
  tls_disable = 1 // demo only; enable TLS in production
}

storage "raft" {
  path    = "/vault/data"
  node_id = "vault-standby-01"
  retry_join {
    leader_api_addr = "http://10.2.2.2:8200"
  }
}

Then activate the standby as a DR secondary using the token generated on the primary (note: this wipes the standby's existing storage):

$ vault write sys/replication/dr/secondary/enable token="s.xxxxxxxxxxxxxxxxxxxxxxx"
The problem this solves is ensuring that your secrets remain available even if your primary datacenter experiences a catastrophic failure. Vault replication lets you maintain a geographically distributed, highly available copy of your secrets management system. Internally, each cluster runs Raft consensus among its own nodes, while replication ships logical state changes (write-ahead log entries, not data dumps) from the primary to the secondary. This replication is asynchronous, so there is a small window where the standby may lag behind the primary. Crucially, the activation token is generated on the primary and handed to the standby; the standby uses it to authenticate and establish the replication stream back to the primary's cluster port (8201 by default).
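To make the asynchronous window concrete, here is a minimal Python sketch that estimates lag by comparing WAL positions from the two clusters. The field names (last_wal on the primary, last_remote_wal on the secondary) follow Vault's documented sys/replication/status output, but the payloads below are trimmed, hypothetical examples, not real API responses:

```python
def replication_lag(primary_status: dict, secondary_status: dict) -> int:
    """Estimate DR replication lag in WAL entries.

    Expects the `data` payloads from each cluster's
    GET /v1/sys/replication/status response; `last_wal` and
    `last_remote_wal` follow Vault's documented status fields.
    """
    primary_wal = primary_status["dr"]["last_wal"]
    secondary_wal = secondary_status["dr"]["last_remote_wal"]
    # Clamp at zero: the secondary can briefly report a position ahead
    # of a stale primary reading.
    return max(0, primary_wal - secondary_wal)


# Hypothetical status payloads, trimmed to the relevant fields:
primary = {"dr": {"mode": "primary", "last_wal": 1500}}
secondary = {"dr": {"mode": "secondary", "last_remote_wal": 1420}}
print(replication_lag(primary, secondary))  # prints 80
```

A lag that grows without bound usually means the standby has fallen out of WAL streaming and needs a full resync.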
The exact levers you control are primarily around network connectivity and authentication. The primary's cluster address must be reachable from the standby, and the activation token must be valid and unexpired (these tokens carry a short TTL, so use them promptly). You also choose the replication type: DR replication copies everything, including tokens and leases, so a promoted secondary can serve clients immediately, while performance replication copies configuration and secrets but not leases. The replication stream should run over TLS in production, and with performance replication you can additionally filter which mounts are replicated.
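As a sketch of the TLS piece, the demo's tls_disable = 1 would be replaced with certificate settings in the listener stanza; the file paths here are placeholders, not real defaults:

```hcl
listener "tcp" {
  address         = "10.1.1.1:8200"
  tls_cert_file   = "/etc/vault/tls/vault.crt" // placeholder path
  tls_key_file    = "/etc/vault/tls/vault.key" // placeholder path
  tls_min_version = "tls12"
}
```

Both clusters need certificates the other side trusts, since the standby dials the primary's cluster port to establish the stream.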
One critical aspect often overlooked is the initial synchronization process. When you first set up replication, or if the standby falls too far behind, Vault initiates a full state transfer. This isn't a simple rsync; it's a Merkle-tree-driven sync in which the standby determines which entries it is missing and fetches them from the primary. If network interruptions occur during this phase, the synchronization can stall, leaving the standby an outdated replica. You can monitor progress with vault read sys/replication/status on either cluster.
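A monitoring check can classify the secondary's sync phase from that status output. The state values (stream-wals, merkle-diff, merkle-sync) follow Vault's documented replication states; the payload below is a trimmed, hypothetical example:

```python
def sync_state_summary(secondary_status: dict) -> str:
    """Classify a DR secondary's sync phase from sys/replication/status.

    `stream-wals` means steady-state streaming; `merkle-diff` and
    `merkle-sync` indicate a full or partial state transfer underway.
    """
    state = secondary_status["dr"].get("state", "idle")
    if state == "stream-wals":
        return "healthy: streaming WALs from the primary"
    if state in ("merkle-diff", "merkle-sync"):
        return "resyncing: state transfer in progress, expect lag"
    return "attention: unexpected state " + state


print(sync_state_summary({"dr": {"mode": "secondary", "state": "merkle-sync"}}))
```

A secondary stuck in merkle-diff or merkle-sync for a long time is the stalled-synchronization symptom described above and usually points to network trouble between the clusters.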
To promote a DR standby to primary, you execute vault write -f sys/replication/dr/secondary/promote on the standby, supplying a DR operation token generated ahead of time. (Performance secondaries are promoted analogously via sys/replication/performance/secondary/promote.)
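For a planned failover, it is worth gating promotion on the standby actually being caught up. Here is a sketch of such a check; the field names follow Vault's status output, while the lag threshold is an illustrative assumption. In a real outage the primary is unreachable and this check cannot run, so promotion then rests on the last observed lag:

```python
def safe_to_promote(primary_status: dict, secondary_status: dict,
                    max_lag: int = 10) -> bool:
    """Gate a planned failover: promote only a secondary that is
    streaming WALs and nearly caught up with the primary.
    """
    dr = secondary_status["dr"]
    # Refuse while a full resync (merkle-diff/merkle-sync) is underway.
    if dr.get("state") != "stream-wals":
        return False
    lag = primary_status["dr"]["last_wal"] - dr["last_remote_wal"]
    return lag <= max_lag
```

For example, safe_to_promote({"dr": {"last_wal": 100}}, {"dr": {"state": "stream-wals", "last_remote_wal": 95}}) passes, while the same secondary mid-resync does not.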