A circuit breaker in Traefik doesn’t take your whole site offline; it stops forwarding requests to a service that is demonstrably failing and answers them with a fast fallback response instead, preventing a cascade of errors and protecting both your users and your backend services.
Let’s see it in action. Imagine a simple web service, traefik/whoami, listening on port 80 inside its container, with Traefik routing traffic to it.
# docker-compose.yml
version: '3.7'
services:
  whoami:
    image: traefik/whoami
    # No published port: whoami is reached through Traefik only, which also
    # avoids a clash with the dashboard on host port 8080.
    networks:
      - traefik-net
  traefik:
    image: traefik:v2.10
    ports:
      - "80:80"
      - "8080:8080" # For Traefik's dashboard
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.yml:/etc/traefik/traefik.yml:ro
      - ./dynamic_conf:/etc/traefik/dynamic_conf
    networks:
      - traefik-net
networks:
  traefik-net:
    external: true # Create it first: docker network create traefik-net
And the Traefik configuration:
# traefik.yml
log:
  level: INFO
api:
  dashboard: true
  insecure: true # For demo purposes, do not use in production
entryPoints:
  web:
    address: ":80"
providers:
  docker:
    exposedByDefault: false
  file:
    directory: /etc/traefik/dynamic_conf
    watch: true
Now, let’s define a simple dynamic configuration that enables the circuit breaker for our whoami service.
# dynamic_conf/circuitbreaker.yml
http:
  routers:
    whoami-router:
      rule: "Host(`localhost`)"
      service: "whoami-service"
      entryPoints:
        - web
      # Attach the circuit breaker to this route
      middlewares:
        - whoami-breaker
  services:
    whoami-service:
      loadBalancer:
        servers:
          - url: "http://whoami:80"
  middlewares:
    # In Traefik v2, the circuit breaker is a middleware, not a service option
    whoami-breaker:
      circuitBreaker:
        # Trip when more than 50% of responses are 5xx or are network errors
        expression: "ResponseCodeRatio(500, 600, 0, 600) > 0.50 || NetworkErrorRatio() > 0.50"
        # Evaluate the expression every 10 seconds (default: 100ms)
        checkPeriod: 10s
        # After tripping, answer with the fallback for 30 seconds
        fallbackDuration: 30s
        # Then ramp traffic back up to the service over 10 seconds
        recoveryDuration: 10s
When you run these configurations and then stop the whoami container, Traefik won’t immediately start returning 503 errors for every request. Instead, it keeps forwarding requests to the whoami service, and each one fails with a gateway error. Every checkPeriod (10 seconds here), Traefik evaluates the expression against its recent request metrics. Once more than 50% of responses are 5xx responses or network errors, the circuit breaker "trips" and enters the open state.
From this point on, for the duration specified by fallbackDuration (30 seconds), Traefik will not send any requests to the whoami service. Instead, it will immediately return 503 Service Unavailable. This stops Traefik from hammering a dead service, which could exhaust resources on the backend or trigger further cascading failures. Once fallbackDuration elapses, the breaker enters a recovering state: over recoveryDuration (10 seconds), Traefik sends a linearly increasing share of requests back to the service. If the service fails during recovery, the breaker opens again for another fallbackDuration. If it behaves normally for the whole recovery period, the breaker closes and normal traffic flow resumes.
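The trip-and-recover cycle is easier to reason about as a small state machine. The following is a minimal Python sketch of a three-state breaker (closed, open, recovering), assuming a failure-ratio threshold over a fixed window of recent requests, a fixed open period, and a linear ramp-up during recovery, which is how Traefik's documentation describes the recovering state. The class name, its parameters, and the window size are illustrative assumptions; this is not Traefik's implementation or API.

```python
import random
from collections import deque


class BreakerSketch:
    """Toy circuit breaker: closed -> open -> recovering -> closed."""

    def __init__(self, threshold=0.5, window=10, fallback_s=30.0, recovery_s=10.0):
        self.threshold = threshold            # failure ratio that trips the breaker
        self.window = deque(maxlen=window)    # recent outcomes, True = failed request
        self.fallback_s = fallback_s          # how long the breaker stays open
        self.recovery_s = recovery_s          # how long the ramp-up lasts
        self.state = "closed"
        self.since = 0.0                      # time of the last state change

    def _enter(self, state, now):
        self.state, self.since = state, now
        self.window.clear()

    def handle(self, now, backend_ok, rng=random.random):
        """Return the status code a client would see at time `now` (seconds)."""
        if self.state == "open":
            if now - self.since < self.fallback_s:
                return 503                    # short-circuit: backend is not contacted
            self._enter("recovering", now)
        if self.state == "recovering":
            elapsed = now - self.since
            if elapsed >= self.recovery_s:
                self._enter("closed", now)
            elif rng() >= elapsed / self.recovery_s:
                return 503                    # still shedding part of the traffic
        # From here on the request is forwarded to the backend.
        failed = not backend_ok
        self.window.append(failed)
        if failed and self.state == "recovering":
            self._enter("open", now)          # a failure during recovery re-opens
        elif (len(self.window) == self.window.maxlen
              and sum(self.window) / len(self.window) > self.threshold):
            self._enter("open", now)          # ratio over a full window is too high
        return 502 if failed else 200
```

Feeding this sketch one failing request per second against a dead backend yields gateway errors until the window fills, then 503 responses for the next 30 seconds without the backend being touched at all.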
The circuit breaker is a critical component of a resilient microservices architecture, allowing you to gracefully handle temporary backend failures without impacting the overall availability of your system. It’s a proactive measure that shifts the focus from reacting to failures to preventing them from spiraling out of control.
The most counterintuitive aspect of Traefik’s circuit breaker is that it relies on a defined "failure rate" rather than a strict number of consecutive failures to trip. This means a service could experience a few isolated, intermittent issues without triggering the breaker, preserving availability. However, if those issues become more persistent and start affecting a significant portion of requests, the breaker will intervene.
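To see why the distinction matters, here is a small Python sketch that compares the two policies on hypothetical request traces. Both helper functions and both traces are made up for illustration; the 10-request window and the 50% threshold are assumptions chosen to match the example in this article.

```python
def trips_on_consecutive(outcomes, limit):
    """True if `limit` failures ever occur back to back."""
    streak = 0
    for failed in outcomes:
        streak = streak + 1 if failed else 0
        if streak >= limit:
            return True
    return False


def trips_on_ratio(outcomes, window, threshold):
    """True if the failure ratio of any full sliding window exceeds `threshold`."""
    for end in range(window, len(outcomes) + 1):
        recent = outcomes[end - window:end]
        if sum(recent) / window > threshold:
            return True
    return False


# True marks a failed request.
# A short burst of three failures, then a healthy service (30% failures overall).
flaky = [False, False, True, True, True, False, False, False, False, False] * 3
# Persistent trouble: 60% failures, but never three in a row.
degraded = [True, False, True, True, False, True, False, True, True, False] * 3

print("flaky:   ", trips_on_consecutive(flaky, 3), trips_on_ratio(flaky, 10, 0.5))
print("degraded:", trips_on_consecutive(degraded, 3), trips_on_ratio(degraded, 10, 0.5))
```

A consecutive-failure rule reacts to the brief burst and misses the steadily degraded service; the ratio rule does the opposite, which is the behavior you usually want from a breaker.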
What most people don’t realize is how granularly you can control the circuit breaker’s behavior. The trip condition isn’t a fixed rule; it’s your statement to Traefik of what counts as a "failed" request from the perspective of the backend service itself: which responses count, what share of traffic they must reach, and how long the breaker stays open afterwards. This allows you to tailor the breaker’s sensitivity to the specific error profiles of your applications. For instance, an API that sometimes returns a 503 on overload but otherwise functions fine can be given a higher failure threshold and a shorter open period than a critical service that must never return a 500.
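As a concrete illustration of tailoring what counts as a failure, the sketch below mimics a response-code ratio in plain Python. It is modeled on the ResponseCodeRatio(from, to, dividedByFrom, dividedByTo) function from Traefik's circuit breaker expression syntax, where the lower bound of each range is inclusive and the upper bound exclusive. The Python function name and the sample of responses are assumptions made for this example.

```python
def response_code_ratio(codes, frm, to, div_from, div_to):
    """Share of responses in [frm, to) among those in [div_from, div_to)."""
    numerator = sum(1 for c in codes if frm <= c < to)
    denominator = sum(1 for c in codes if div_from <= c < div_to)
    # Nothing to divide by means nothing has failed yet.
    return numerator / denominator if denominator else 0.0


# Hypothetical sample: six successes, three overload 503s, one genuine 500.
observed = [200] * 6 + [503] * 3 + [500]

# Strict profile: every 5xx response counts as a failure.
any_5xx = response_code_ratio(observed, 500, 600, 0, 600)
# Tolerant profile: only a plain 500 counts, overload 503s are ignored.
only_500 = response_code_ratio(observed, 500, 501, 0, 600)

print(f"any 5xx: {any_5xx:.0%}, only 500: {only_500:.0%}")
```

With a 30% threshold, the strict profile would trip on this sample and the tolerant one would not, even though both look at exactly the same traffic.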
Once your circuit breaker is functioning correctly, the next logical step is to implement more sophisticated health checking mechanisms that Traefik can leverage.