Vitess’s Prometheus monitoring isn’t just about collecting metrics; it’s about turning what would otherwise be a black box into something you can peer into during live operation.

Let’s watch Vitess churn through some requests and see what Prometheus is telling us. Imagine we have a vtgate, a vtctld, and a vttablet running, and we’re firing off some SQL queries.

Here’s a snapshot of what vtgate might be exporting to Prometheus:

http_requests_total{code="200",method="POST",path="/query",vt_cell="zone1",vt_tablet_cell="zone1-0000000001",vt_tablet_hostname="zone1-tablet-01",vt_tablet_type="REPLICA"} 150234
http_requests_total{code="500",method="POST",path="/query",vt_cell="zone1",vt_tablet_cell="zone1-0000000001",vt_tablet_hostname="zone1-tablet-01",vt_tablet_type="REPLICA"} 5
vtgate_query_time_seconds_count{vt_cell="zone1",vt_tablet_cell="zone1-0000000001",vt_tablet_hostname="zone1-tablet-01",vt_tablet_type="REPLICA"} 150239
vtgate_query_time_seconds_sum{vt_cell="zone1",vt_tablet_cell="zone1-0000000001",vt_tablet_hostname="zone1-tablet-01",vt_tablet_type="REPLICA"} 75.1234

And here’s what vtctld might be exporting:

vtctld_command_count{command="ApplySchema",vt_cell="zone1",vt_tablet_hostname="zone1-vtctld-01"} 123
vtctld_command_duration_seconds_sum{command="ApplySchema",vt_cell="zone1",vt_tablet_hostname="zone1-vtctld-01"} 45.678

And a typical vttablet, together with the mysqld it manages:

mysql_up{vt_cell="zone1",vt_tablet_cell="zone1-0000000001",vt_tablet_hostname="zone1-tablet-01",vt_tablet_type="REPLICA"} 1
go_goroutines{vt_cell="zone1",vt_tablet_cell="zone1-0000000001",vt_tablet_hostname="zone1-tablet-01",vt_tablet_type="REPLICA"} 55
mysql_bytes_received_total{vt_cell="zone1",vt_tablet_cell="zone1-0000000001",vt_tablet_hostname="zone1-tablet-01",vt_tablet_type="REPLICA"} 1024000

Vitess monitoring gives you granular, near-real-time insight into the performance and health of a distributed database system. Traditional database monitoring often focuses on a single instance. Vitess, as a sharding middleware, adds layers: requests pass through vtgate and are routed to specific vttablet instances, while vtctld manages the cluster topology. Each of these components is instrumented, and Prometheus scrapes and aggregates their metrics for a holistic view.

The core idea is that each Vitess component (vtgate, vttablet, vtctld) exposes an HTTP endpoint, /metrics, that Prometheus can scrape. It is served on the component’s HTTP port, which depends on how the process was started; in the setup assumed here that is 8888 for vttablets and 8088 for vtgate and vtctld, so check your own deployment rather than relying on these numbers. The metrics are exported through the Prometheus Go client library (prometheus/client_golang), which Vitess’s own stats package feeds. They cover request counts, latencies, internal state, and even Go runtime statistics.
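
As a quick way to see what a component exposes before wiring up Prometheus, the following Python sketch fetches a /metrics page and filters it by metric-name prefix. The function names (metrics_url, filter_samples, scrape) are hypothetical, and the host and port in the usage comment are placeholders for your own deployment.

```python
from typing import List
from urllib.request import urlopen


def metrics_url(host: str, port: int) -> str:
    """Build the scrape URL for one component's HTTP port."""
    return f"http://{host}:{port}/metrics"


def filter_samples(body: str, prefix: str = "") -> List[str]:
    """Keep sample lines whose metric name starts with `prefix`; drop comments."""
    kept = []
    for line in body.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and line.startswith(prefix):
            kept.append(line)
    return kept


def scrape(host: str, port: int, prefix: str = "", timeout: float = 5.0) -> List[str]:
    """Fetch /metrics once over HTTP and return the matching sample lines."""
    with urlopen(metrics_url(host, port), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return filter_samples(body, prefix)


# Example usage (placeholder host and port; needs a running component):
# for line in scrape("localhost", 8088, prefix="go_"):
#     print(line)
```

This performs a single manual scrape; Prometheus does the same thing on a schedule for every target listed in its scrape configuration.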

You control what you see by configuring Prometheus to scrape these endpoints and then building Grafana dashboards that query Prometheus using PromQL. For example, to see the rate of failed queries handled by vtgate, you’d use a query like:

rate(http_requests_total{code=~"5..", path="/query"}[5m])

This returns the per-second rate of requests to the /query endpoint that came back with a 5xx status code, averaged over the last 5 minutes, with one result per label combination. Because every series carries a vt_cell label, wrapping the expression in sum by (vt_cell) (...) breaks the error rate down by cell.
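
To show what rate() is doing with a counter, here is a simplified Python sketch. It is not Prometheus’s actual implementation: it skips the extrapolation to the window boundaries that PromQL applies, and the function name simple_rate and the sample values are made up for illustration. It does keep the important idea that a counter dropping in value is treated as a reset rather than a negative increase.

```python
from typing import List, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    Samples must be ordered by timestamp. A value lower than its predecessor
    is treated as a counter reset (for example, a process restart).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter restarted from zero: everything it shows now is new.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


# Hypothetical 5xx counter scraped every 60 s across a 5-minute window.
window = [(0, 0), (60, 1), (120, 1), (180, 3), (240, 4), (300, 5)]
print(simple_rate(window))  # 5 errors over 300 s

# Same idea with a restart between the second and third scrape.
with_reset = [(0, 10), (60, 12), (120, 1), (180, 3)]
print(simple_rate(with_reset))  # increases of 2, 1 and 2 over 180 s
```

The reset handling is why rate() should be applied to raw counters before any aggregation such as sum.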

A less obvious feature of Vitess metrics is how many of them are histograms (like vtgate_query_time_seconds) or summaries rather than simple counters. A histogram exports a set of cumulative bucket counters (the _bucket series, one per le upper bound) alongside _sum and _count, so it describes a distribution of values and lets PromQL estimate quantiles instead of only an average or a total. For example, histogram_quantile(0.95, sum(rate(vtgate_query_time_seconds_bucket[5m])) by (le, vt_cell, vt_tablet_cell)) estimates the 95th percentile query latency for each tablet, based on the observations from the last 5 minutes. This is crucial for understanding user experience, as averages can hide outliers that significantly impact some users.
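
The quantile is an estimate because Prometheus only knows how many observations fell into each bucket. The Python sketch below mimics the linear interpolation that histogram_quantile is documented to use for classic histograms, under simplifying assumptions: bucket bounds are positive (as with latencies), and the bucket layout and counts are invented for the example.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (le, count) histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs and must include
    the +Inf bucket. Values are assumed to be spread evenly inside a bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    if count == below:
        return lower
    # Interpolate linearly between the bucket's lower and upper bound.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical latency buckets in seconds with cumulative counts.
latency_buckets = [(0.005, 900), (0.01, 950), (0.05, 990), (math.inf, 1000)]
p50 = histogram_quantile(0.50, latency_buckets)
p95 = histogram_quantile(0.95, latency_buckets)
p999 = histogram_quantile(0.999, latency_buckets)
print(p50, p95, p999)
```

Here the 95th percentile lands at about 0.01 s, and the 99.9th percentile is capped at 0.05 s because it falls in the +Inf bucket. That cap is why bucket boundaries should bracket the latencies you actually care about.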

The next concept you’ll dive into is alerting based on these metrics, defining thresholds that trigger notifications when something goes wrong.
