Traefik’s service load balancer doesn’t just blindly pick a backend; once you configure health checks, it actively avoids sending traffic to dead servers by periodically probing them over HTTP.
Consider Traefik routing traffic to a backend service, call it my-app, which has three instances running at 10.0.0.1:8080, 10.0.0.2:8080, and 10.0.0.3:8080.
```yaml
# Traefik static configuration
entryPoints:
  web:
    address: ":80"

providers:
  docker:
    exposedByDefault: false
```

```yaml
# Traefik dynamic configuration
http:
  routers:
    my-app-router:
      rule: "Host(`myapp.example.com`)"
      service: "my-app-service"

  services:
    my-app-service:
      loadBalancer:
        servers:
          - url: "http://10.0.0.1:8080"
          - url: "http://10.0.0.2:8080"
          - url: "http://10.0.0.3:8080"
        # Health check configuration
        healthCheck:
          path: "/health"
          interval: "10s"
          timeout: "3s"
          scheme: "http"
```
When a request for myapp.example.com arrives, Traefik’s my-app-service looks at its list of servers. Before sending the request, it consults its health check status for each server. If 10.0.0.2:8080 has failed its health check recently, Traefik will temporarily exclude it from the pool of available servers and try 10.0.0.1:8080 or 10.0.0.3:8080 instead. This ensures that users primarily interact with healthy instances of your application.
The core problem Traefik’s load balancer solves is resilience. Without it, if one instance of your my-app service crashed, Traefik would continue sending traffic to it, leading to a degraded user experience or outright failures for those unlucky users. The health check is the mechanism that allows Traefik to dynamically adapt to the real-time availability of your backend instances.
Internally, Traefik maintains a list of backend servers for each service. For each server, it keeps track of its health status. This status is updated by periodically performing the configured health check. The health check is a small request (like an HTTP GET to /health) sent to the backend server at a defined interval. If the server responds within the specified timeout with a success status code (typically 2xx or 3xx), the server is considered healthy. If it fails to respond, or responds with an error status code, it’s marked as unhealthy.
When a new request comes in, Traefik’s load balancer algorithm (e.g., round-robin, least connection) selects a server only from the set of currently healthy servers. This selection happens very rapidly for each incoming request. The health check runs in the background, asynchronously, so it doesn’t block incoming traffic. Traefik will try to re-evaluate the health of an unhealthy server at the next interval. Once a server becomes healthy again, it’s automatically added back into the pool of available servers.
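The selection step, round-robin restricted to the healthy set, can be sketched as follows. Again this is a conceptual model, not Traefik’s source; `Pool` and `Pick` are names made up for the sketch:

```go
package main

import "fmt"

// Server is one backend in the load-balancer pool.
type Server struct {
	URL     string
	Healthy bool
}

// Pool performs round-robin selection over only the healthy servers.
type Pool struct {
	servers []Server
	next    int
}

// Pick returns the next healthy server's URL, advancing the round-robin
// cursor past unhealthy entries. It returns "" if no server is available.
func (p *Pool) Pick() string {
	for i := 0; i < len(p.servers); i++ {
		s := p.servers[p.next%len(p.servers)]
		p.next++
		if s.Healthy {
			return s.URL
		}
	}
	return "" // every backend failed its health check
}

func main() {
	p := &Pool{servers: []Server{
		{"http://10.0.0.1:8080", true},
		{"http://10.0.0.2:8080", false}, // failed its last health check
		{"http://10.0.0.3:8080", true},
	}}
	// Requests alternate between .1 and .3; .2 is skipped until it recovers.
	for i := 0; i < 4; i++ {
		fmt.Println(p.Pick())
	}
}
```

Because the health status is just a flag the checker flips in the background, a recovered server re-enters rotation automatically on its next `Pick`.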
The scheme, path, interval, and timeout are your primary levers. scheme (http or https) dictates how Traefik will communicate with the backend for health checks. path is the specific endpoint on your backend that Traefik will query; this should be an endpoint designed to quickly indicate service health (e.g., returning 200 OK if the service is running and can connect to its database). interval is how often Traefik probes; shorter intervals mean faster detection of failures but more load on your backends. timeout is how long Traefik waits for a response before considering the check failed; this needs to be longer than your backend’s typical response time for a health check but short enough to avoid long waits during outages.
A common pitfall is setting the timeout longer than the interval. If your interval is 5s and your timeout is 10s, a new probe can fire while the previous one is still waiting to time out, so checks overlap and a server’s reported status can lag behind or flap even if it’s only experiencing brief network latency. A more sensible configuration keeps the timeout well below the interval, for example interval: "10s" and timeout: "3s".
The next concept you’ll likely encounter is how Traefik handles different load balancing strategies beyond simple round-robin, such as sticky sessions or weighted load balancing.