Traefik’s retry middleware is your second chance when your backend services are having a bad day.

Let’s see it in action. Imagine you have a simple "echo" service that just returns whatever you send it, but occasionally, it fails.

http:
  routers:
    echo-router:
      rule: "Host(`echo.localhost`)"
      service: "echo-service"
      middlewares:
        - "retry-middleware"

  services:
    echo-service:
      loadBalancer:
        servers:
          - url: "http://localhost:8080"

  middlewares:
    retry-middleware:
      retry:
        attempts: 3
        initialInterval: "100ms"

Now, if localhost:8080 is down, Traefik won't immediately give up. Instead, it retries the request according to the retry-middleware configuration: the connection attempt fails, Traefik waits roughly 100ms (the initialInterval), retries, waits a little longer, and retries again, up to 3 attempts in total. If the backend still isn't reachable after the last attempt, Traefik returns an error to the client (typically a 502 Bad Gateway). One caveat worth knowing up front: the retry middleware kicks in only when the backend does not reply at all. As soon as the server answers, even with a 503 Service Unavailable, Traefik forwards that response and stops retrying.
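To watch this absorb a transient failure without standing up Traefik, here's a small Python simulation (my own sketch, not Traefik's code; names like `FlakyBackend` and `forward_with_retry` are invented for illustration). The backend refuses its first two connections, and the retry loop mirrors the middleware's behavior of retrying only when no response arrives:

```python
import time

class FlakyBackend:
    """Simulated echo backend: refuses the first `fail_count` connections,
    then echoes the request body back."""
    def __init__(self, fail_count=2):
        self.fail_count = fail_count
        self.calls = 0

    def handle(self, body):
        self.calls += 1
        if self.calls <= self.fail_count:
            raise ConnectionError("connection refused")
        return 200, body  # echo service: return whatever was sent

def forward_with_retry(backend, body, attempts=3, initial_interval=0.1):
    """Retry loop in the spirit of Traefik's retry middleware:
    retry only when no response arrives, waiting longer between tries."""
    interval = initial_interval
    last_error = None
    for attempt in range(attempts):
        try:
            return backend.handle(body)
        except ConnectionError as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(interval)
                interval *= 2  # back off before the next try
    return 502, f"bad gateway: {last_error}"

backend = FlakyBackend(fail_count=2)
status, body = forward_with_retry(backend, "hello")
print(status, body)  # 200 hello: the two refused connections were absorbed
```

The client never sees the two failed connection attempts; from its point of view, the request simply succeeded, just a few hundred milliseconds slower.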

This solves the problem of transient backend failures. Network glitches, a brief service restart, or a temporary overload on your backend can all cause requests to fail. Without retries, these failures would be immediately exposed to your users, leading to a poor experience. With retries, Traefik can absorb these small hiccups, making your overall application more resilient.

Internally, when Traefik forwards a request and the connection to the backend fails, it doesn't immediately surface that error to the client. It holds on to the original request, waits, and tries again. The wait times follow an exponential backoff series that starts at initialInterval; according to Traefik's documentation, the maximum interval is capped at twice the initialInterval. This backoff prevents hammering a struggling backend with rapid-fire retries. The attempts parameter dictates how many times in total Traefik will try to reach the backend before giving up.
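The interval math is easy to sketch. Assuming the documented doubling-with-cap behavior (real backoff implementations usually add random jitter as well, omitted here), the waits before each retry look like this:

```python
def backoff_intervals(initial, retries):
    """Wait time before each retry: doubles from `initial`,
    capped at twice `initial` (per the documented cap; jitter omitted)."""
    intervals = []
    current = initial
    for _ in range(retries):
        intervals.append(current)
        current = min(current * 2, initial * 2)
    return intervals

print(backoff_intervals(0.1, 3))  # [0.1, 0.2, 0.2]
```

With initialInterval at 100ms, the series settles quickly at 200ms, so the total added latency of a fully exhausted retry budget stays small and predictable.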

Note what is not configurable here: Traefik's retry middleware has no option to choose which errors trigger a retry. It retries only on network-level failures, that is, when Traefik gets no response from the backend at all. HTTP error status codes don't count; a 500 or a 503 is a reply, and it is passed straight through to the client without another attempt.
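That decision rule fits in a few lines. A sketch of the documented behavior (a simulation, not Traefik's source):

```python
def should_retry(outcome):
    """Retry only when no response arrived at all (modeled here as a
    ConnectionError). Any status code means the backend answered, and
    that answer is forwarded as-is, success or not."""
    return isinstance(outcome, ConnectionError)

print(should_retry(ConnectionError("connection refused")))  # True
print(should_retry(503))  # False: the 503 goes straight to the client
print(should_retry(200))  # False: nothing to retry
```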

What most people miss is exactly this distinction between connection failures and error responses. If you need retries on specific status codes like 503 or 504, the retry middleware alone won't provide them. There's an upside to this conservatism, though: blindly replaying requests that reached the backend risks duplicating non-idempotent operations (think of a POST that partially succeeded before returning a 500), while retrying only requests that got no response at all is far less risky.

The next step in building resilience is often circuit breaking: Traefik's circuitBreaker middleware stops sending requests to a backend that is consistently failing, instead of paying the retry cost on every single request.
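As a taste, a minimal circuit-breaker middleware might look like this (the middleware name and the 30% threshold are arbitrary examples; NetworkErrorRatio is one of the metrics Traefik's circuitBreaker expression language provides):

http:
  middlewares:
    echo-circuit-breaker:
      circuitBreaker:
        expression: "NetworkErrorRatio() > 0.30"

Attach it to the router alongside retry-middleware, and Traefik trips the breaker once more than 30% of requests hit network errors, failing fast until the backend recovers.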

Want structured learning?

Take the full Traefik course →