Vitess health checks don’t just tell you if a tablet is alive; they actively decide how quickly a faulty tablet gets taken out of rotation.

Let’s watch a tablet try to die.

Imagine vtgate talking to vttablet instances. When vttablet is healthy, it responds to requests. When it’s not, it might stop responding, or start responding slowly. vtgate needs to know when to stop sending traffic to a failing vttablet. It does this with periodic health checks. These checks are usually gRPC calls to the vttablet’s health endpoint.

Here’s the crucial part: vtgate doesn’t just check once and give up. It retries. And it has a timeout for each individual check. The interval between these checks, combined with the retry count and timeout, determines how quickly vtgate declares a vttablet unhealthy.

Let’s see it in action. Suppose a vttablet process becomes unresponsive.

# On a vtgate pod (assuming kubectl access)
# First, find the relevant vttablet endpoints from vtgate logs
kubectl logs <vtgate-pod-name> -c vtgate | grep "health check"

# You'll see lines like this, indicating a health check to a specific vttablet

I0801 10:00:00.123456 1 health_check.go:150] healthcheck: checking <cell>.<keyspace>.<shard>.<tablet-alias>

Now, let’s look at the vttablet side for its configuration.

# On a vttablet pod
# Check the startup parameters or environment variables for health check settings

cat /etc/vttablet/<keyspace>-<shard>-<uid>.json | grep health

# Or check the command line arguments if running directly
ps aux | grep vttablet

# You'll be looking for flags like:
# --grpc_health_check_interval
# --grpc_health_check_timeout
# --grpc_health_check_max_retries

By default, Vitess uses a grpc_health_check_interval of 10s, a grpc_health_check_timeout of 1s, and a grpc_health_check_max_retries of 3.

What does this mean mechanically? If a vttablet becomes unhealthy, vtgate will:

  1. Send a health check. The attempt times out after 1s.
  2. Retry. The second attempt also times out after 1s.
  3. Retry again. The third attempt times out after 1s.
  4. After the third consecutive failure (about 3 seconds of failed attempts, plus some overhead), vtgate marks the vttablet as unhealthy.
  5. vtgate then waits 10s before the next scheduled health check cycle.

So, in the worst case, a vttablet that just became unhealthy can keep receiving traffic for up to about 13s: as much as 10s waiting for the next check cycle to start, plus 3s of timed-out attempts, plus a little scheduling overhead. This is often acceptable.

But what if your Service Level Agreement (SLA) demands faster failover? If you need to detect and remove a failing tablet within, say, 5 seconds, you need to tune these parameters.

To achieve a faster failover, you’d decrease the grpc_health_check_interval and potentially decrease the grpc_health_check_timeout and grpc_health_check_max_retries.

Let’s say you want a worst-case detection time of under 5 seconds.

Example Tuning:

  • Goal: Detect failure within ~4 seconds.
  • Target Parameters:
    • --grpc_health_check_interval=2s
    • --grpc_health_check_timeout=500ms (0.5s)
    • --grpc_health_check_max_retries=3

How to apply this:

You’ll typically set these as command-line flags when starting vttablet or via environment variables in your Kubernetes deployment.

For Kubernetes deployments, you’d modify your vttablet Deployment or StatefulSet YAML:

# ...
spec:
  template:
    spec:
      containers:
      - name: vttablet
        image: "your-vitess-image"
        command: ["/usr/bin/vttablet"]
        args:
        - "--grpc_health_check_interval=2s"
        - "--grpc_health_check_timeout=500ms"
        - "--grpc_health_check_max_retries=3"
        # ... other vttablet args
# ...

Why this works mechanically:

With these new settings:

  1. vtgate sends a health check. It times out after 500ms.
  2. The first retry times out after 500ms.
  3. The second retry times out after 500ms.
  4. After the third consecutive failure (1.5s of failed attempts), vtgate marks the vttablet as unhealthy.
  5. vtgate then waits 2s before the next scheduled health check cycle.

The total time to detect failure is now approximately 1.5s (failed checks) + 2s (interval) = 3.5s. This falls comfortably within your 4-second target.

Important Considerations:

  • Overhead: Aggressively lowering these values increases the load on both vtgate and vttablet. Each vttablet will be pinged more frequently, and vtgate will spend more CPU on managing these checks. Ensure your infrastructure can handle it.
  • Network Jitter: Very short timeouts can lead to false positives due to transient network issues or a briefly overloaded vttablet.
  • SLA vs. Stability: Find the right balance. If your SLA is extremely strict (e.g., sub-second failover), you might need to look at other mechanisms like load balancer health checks or application-level health indicators, as Vitess’s gRPC health checks have inherent latency.

The next thing you’ll likely encounter is how vtgate uses this health information to implement more sophisticated load balancing and failover strategies, such as the smoothing_factor in its load balancing.

Want structured learning?

Take the full Vitess course →