Vault Prometheus Metrics: Scrape and Alert on Vault
Prometheus metrics are the unsung heroes of modern observability, and when it comes to HashiCorp Vault, they unlock a level of operational insight you probably didn’t know you were missing.
Let’s see Vault’s metrics in action. Imagine you have a Vault server running with telemetry enabled. The /v1/sys/metrics endpoint returns JSON by default and requires a Vault token (unless you configure unauthenticated access, shown below), so curl it with a token and ask for the Prometheus exposition format explicitly:

curl -H "X-Vault-Token: $VAULT_TOKEN" "http://127.0.0.1:8200/v1/sys/metrics?format=prometheus"
This will dump a firehose of time-series data. For example, you might see lines like:
vault_http_requests_total{method="POST",path="/v1/auth/token/create",status="200"} 12345
vault_raft_state 2
vault_performance_standby_sync_duration_seconds{path="us-west-2/data/myapp/config"} 0.05
This isn’t just noise; it’s a direct window into Vault’s internal state and performance. vault_http_requests_total tells you how many requests are hitting specific API endpoints, vault_raft_state indicates if your Vault cluster is healthy (2 means leader), and vault_performance_standby_sync_duration_seconds shows replication lag.
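To make the exposition format concrete, here is a minimal sketch of how such a line breaks down into a metric name, labels, and a value. This is a toy parser for the simple lines shown above, not an official client-library parser (real scrapers handle escaping, comments, and histograms too):

```python
import re

# Matches simple exposition lines: name{label="value",...} number
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                     r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def parse_metric(line):
    """Return (name, labels dict, float value) for one exposition line.

    A sketch only: assumes no commas or escapes inside label values,
    which holds for the sample lines above but not in general.
    """
    m = LINE_RE.match(line.strip())
    if not m:
        raise ValueError(f"unparseable metric line: {line!r}")
    labels = {}
    if m.group('labels'):
        for pair in m.group('labels').split(','):
            key, _, raw = pair.partition('=')
            labels[key.strip()] = raw.strip().strip('"')
    return m.group('name'), labels, float(m.group('value'))

# One of the sample lines from above:
print(parse_metric('vault_raft_state 2'))  # ('vault_raft_state', {}, 2.0)
```

Prometheus does essentially this on every scrape: each line becomes one sample in a time series identified by the name plus its label set.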
The problem Vault metrics solve is moving beyond simple "is it up?" checks to understanding how it’s operating and why it might be failing. Traditional monitoring might tell you if Vault is responding to pings, but it won’t tell you if token creation is suddenly taking 500ms, or if your Raft cluster is losing leadership.
The core mechanism is simple: Vault exposes an HTTP endpoint (/v1/sys/metrics) that emits metrics in the Prometheus exposition format when queried with ?format=prometheus. Prometheus, a time-series database and monitoring system, is configured to "scrape" this endpoint at regular intervals. It pulls the data, stores it, and lets you query and visualize it. Alerting rules can then be defined on top of these stored metrics.
To get started, you enable Prometheus telemetry in your Vault configuration. Metrics are served by the regular API listener; a top-level telemetry stanza turns on the in-memory Prometheus sink, and a telemetry block inside the listener can allow Prometheus to scrape without a Vault token. For a single-server configuration, this looks like:

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1

  # Let Prometheus scrape /v1/sys/metrics without a Vault token
  telemetry {
    unauthenticated_metrics_access = true
  }
}

api_addr = "http://127.0.0.1:8200"

# Enable the Prometheus metrics sink
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
}
Then, in your Prometheus configuration (prometheus.yml), you add a scrape job. The format=prometheus query parameter is required; without it the endpoint returns JSON:

scrape_configs:
  - job_name: 'vault'
    metrics_path: /v1/sys/metrics
    params:
      format: ['prometheus']
    static_configs:
      - targets: ['vault-server-ip:8200'] # Replace with your Vault server's IP/hostname

If you did not enable unauthenticated metrics access, also supply a Vault token (Vault accepts it as a bearer token), for example via bearer_token_file pointing at a file containing the token.
Once Prometheus is scraping, you can query metrics. For instance, to see the rate of failed token creation requests:
rate(vault_http_requests_total{path="/v1/auth/token/create",status=~"5.."}[5m])
This query looks at the vault_http_requests_total metric, filters for requests to the token creation path with a 5xx status code, and then calculates the per-second rate over the last 5 minutes.
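Conceptually, rate() is just the increase of a counter divided by the length of the time window. A rough sketch of the arithmetic (ignoring counter resets and the boundary extrapolation the real rate() performs):

```python
def simple_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) samples.

    A simplification of PromQL rate(): ignores counter resets and
    Prometheus's extrapolation to the window boundaries.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Counter went from 12000 to 12300 over a 5-minute (300 s) window:
print(simple_rate([(0, 12000), (300, 12300)]))  # 1.0 (requests per second)
```

This is why rate() only makes sense on counters: it turns an ever-growing total into a per-second throughput figure you can compare and alert on.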
Alerting is where this really shines. You can set up an alert in Prometheus based on this rate:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093 # Your Alertmanager address

rule_files:
  - "alert.rules.yml"
# ... other configs
# In alert.rules.yml
groups:
  - name: vault_alerts
    rules:
      - alert: HighTokenCreationErrorRate
        expr: rate(vault_http_requests_total{path="/v1/auth/token/create",status=~"5.."}[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of token creation errors on Vault"
          description: "Vault is experiencing a high rate of errors (status 5xx) for token creation requests over the last 5 minutes. Current rate: {{ $value }}"
This alert will fire if the rate of token creation errors exceeds 1 per second for 5 consecutive minutes, giving you advance warning of authentication issues.
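The for: 5m clause is what prevents one noisy scrape from paging you. A hedged sketch of that state machine (assuming, for illustration, one evaluation per minute): the alert sits in "pending" while the expression is true, and only transitions to "firing" once it has held for the full duration.

```python
def alert_states(values, threshold=1.0, for_intervals=5):
    """Sketch of Prometheus 'for:' semantics.

    The alert fires only after the expression has been above the
    threshold for for_intervals consecutive evaluations (here one
    evaluation stands in for one minute). A simplification of the
    real evaluator, which works on timestamps, not counts.
    """
    states, streak = [], 0
    for v in values:
        streak = streak + 1 if v > threshold else 0
        if streak == 0:
            states.append('inactive')
        elif streak < for_intervals:
            states.append('pending')
        else:
            states.append('firing')
    return states

# Error rate climbs above 1/s at minute 3 and stays there:
rates = [0.2, 0.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5]
print(alert_states(rates))
```

A brief dip below the threshold resets the pending timer, so only sustained error rates reach Alertmanager.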
The one aspect of Vault metrics that often surprises people is the sheer granularity of the vault_raft_ family of metrics. It’s not just about knowing if the cluster is up, but understanding the health of the Raft consensus algorithm itself, which is the beating heart of Vault’s HA setup. Metrics like vault_raft_state, vault_raft_leader_changes_total, vault_raft_commit_duration_seconds, and vault_raft_log_size provide deep insight into Raft’s performance and stability. For instance, a consistently high vault_raft_commit_duration_seconds or frequent vault_raft_leader_changes_total can be early indicators of network issues or overloaded nodes that will eventually lead to Raft instability and Vault unavailability.
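Following the same alert-rule pattern, a sketch of a rule for leadership churn. This is hypothetical: it reuses the metric name mentioned above, and you should verify the exact name (and threshold) against what your Vault version actually exports from /v1/sys/metrics before relying on it:

```yaml
# Hypothetical addition to alert.rules.yml; confirm the metric name
# against your Vault version's metrics output first.
- alert: FrequentRaftLeaderChanges
  expr: increase(vault_raft_leader_changes_total[1h]) > 2
  labels:
    severity: critical
  annotations:
    summary: "Vault Raft leadership is unstable"
    description: "More than 2 Raft leader elections in the last hour."
```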
Beyond basic operational alerts, you can build sophisticated dashboards in Grafana, integrating Vault metrics with network and system metrics, to paint a complete picture of your Vault cluster’s health and performance.
The next step after mastering basic scraping and alerting is to explore the metrics related to specific Vault features like Seal/Unseal operations, PKI issuance, or replication status for more targeted monitoring.