The most surprising truth about blue-green deployments is that they don’t actually eliminate downtime; they just shift it and make it predictable and controllable.

Let’s see it in action. Imagine we have a live application running on a set of servers we’ll call "Blue." This is our production environment.

# Example: Current production servers (Blue)
kubectl get pods -l app=my-app,environment=production -o wide
# NAME                      READY   STATUS    RESTARTS   AGE   IP              NODE
# my-app-prod-1-abcde       1/1     Running   0          2d    10.244.1.10     worker-node-1
# my-app-prod-2-fghij       1/1     Running   0          2d    10.244.1.11     worker-node-2
# my-app-prod-3-klmno       1/1     Running   0          2d    10.244.1.12     worker-node-3

# Example: Production Ingress pointing to Blue
kubectl get ingress production-ingress -o yaml
# ...
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-prod-service
            port:
              number: 80
# ...
---
# Example: Production Service pointing to Blue pods
kubectl get service my-app-prod-service -o yaml
# ...
spec:
  selector:
    app: my-app
    environment: production
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
# ...

Now, we prepare a new version of our application, "Green." This new version runs on a separate, identical set of infrastructure. It’s completely isolated from the live "Blue" environment.

# Example: New version (Green) deployed but not yet live
kubectl get pods -l app=my-app,environment=staging -o wide
# NAME                      READY   STATUS    RESTARTS   AGE   IP              NODE
# my-app-staging-1-pqrst    1/1     Running   0          5m    10.244.2.15     worker-node-4
# my-app-staging-2-uvwxy    1/1     Running   0          5m    10.244.2.16     worker-node-5
# my-app-staging-3-zabcd    1/1     Running   0          5m    10.244.2.17     worker-node-6

# Example: Staging Service pointing to Green pods
kubectl get service my-app-staging-service -o yaml
# ...
spec:
  selector:
    app: my-app
    environment: staging
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
# ...
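For completeness, the "Green" pods above could come from a Deployment like this. This is a minimal sketch; the Deployment name, image tag, and replica count are illustrative assumptions, not taken from the cluster output above.

```shell
# Deploy the Green environment alongside Blue. Only the labels and the
# image tag differ from the Blue Deployment; everything else is identical.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-staging
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      environment: staging
  template:
    metadata:
      labels:
        app: my-app
        environment: staging
    spec:
      containers:
      - name: my-app
        image: my-app:2.0.0   # the new version under test (illustrative tag)
        ports:
        - containerPort: 8080
EOF
```

Because the `environment: staging` labels match the staging Service's selector, these pods become reachable through `my-app-staging-service` as soon as they are ready, while the production Ingress continues to route to Blue.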

We can test "Green" thoroughly: integration tests, performance tests, even limited internal user acceptance testing, all without impacting the "Blue" production traffic. We can hit myapp.example.com and see the old version, then use a dedicated internal hostname, or curl with an explicit Host header, to reach the "Green" environment and verify it works.
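One way to verify Green directly is to bypass the production Ingress entirely. A sketch, assuming the app exposes some endpoint to check (the `/version` path here is a hypothetical example):

```shell
# Hit the live (Blue) environment through the public hostname.
curl -s https://myapp.example.com/version

# Reach the Green pods directly via the staging Service, bypassing the
# production Ingress: forward a local port to the Service, then curl it.
kubectl port-forward service/my-app-staging-service 8080:80 &
curl -s http://localhost:8080/version
```

The port-forward approach is handy because it requires no extra Ingress rules; alternatively, a dedicated internal hostname routed to the staging Service gives testers a normal URL to use.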

The goal is to have a completely ready, tested, and verified "Green" environment sitting idle, waiting for the switch.

The magic happens at the traffic routing layer. This is often an Ingress controller, a load balancer, or a service mesh. The key is that this router is configured to send all incoming traffic for myapp.example.com to either the "Blue" service or the "Green" service.

To perform the "release," we simply update the router’s configuration.

# The critical step: Update the Ingress to point to the Green service.
# This is a simplified example; in reality, you might use a controller like Nginx Ingress
# or a service mesh like Istio.

# Option 1: If using a Service that selects based on labels, and you have separate services for Blue/Green
# First, ensure the Green service is ready and pointing to Green pods.
# Then, update the Production Ingress to point to the *Green* service.
# This might involve editing the ingress resource:
# kubectl edit ingress production-ingress
# ...
#   rules:
#   - host: myapp.example.com
#     http:
#       paths:
#       - path: /
#         pathType: Prefix
#         backend:
#           service:
#             name: my-app-staging-service # <-- Switched from my-app-prod-service
#             port:
#               number: 80
# ...

# Option 2: Switch the selector on a single Service (a service-level switch rather than a router-level one)
# For a strict blue-green, you'd typically keep two distinct services and switch the router.
# If you were to flip the selector on the single production Service:
# kubectl patch service my-app-prod-service -p '{"spec":{"selector":{"app":"my-app","environment":"staging"}}}'
# This works, but it's harder to audit and roll back; the usual blue-green pattern is to switch the *backend* of the router.
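In a script, the `kubectl edit` step above can be done as a single declarative patch. A sketch; the JSON pointer assumes the Ingress has exactly one rule and one path, as in the example shown earlier:

```shell
# Repoint the production Ingress at the Green (staging) Service in one step.
# The path /spec/rules/0/http/paths/0/... matches the single-rule,
# single-path Ingress from the example above.
kubectl patch ingress production-ingress --type=json -p='[
  {
    "op": "replace",
    "path": "/spec/rules/0/http/paths/0/backend/service/name",
    "value": "my-app-staging-service"
  }
]'
```

Using a JSON patch rather than an interactive edit makes the switch repeatable and easy to wire into a deployment pipeline.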

The router immediately starts sending new incoming requests to the "Green" environment. Existing connections to "Blue" might still be active for a short period, but no new requests will land there. This is the "downtime" – a brief window where some users might see the old version if their connection is long-lived, or if the transition isn’t perfectly atomic. However, there’s no period where no requests are being served.

Once we’re confident "Green" is stable and handling traffic, we can decommission or update the "Blue" environment for the next release. If something goes wrong with "Green," we can immediately switch the router back to "Blue" in seconds, effectively rolling back the deployment.
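The rollback is just the same switch in reverse, with the same single-rule assumption as the Ingress example above:

```shell
# Emergency rollback: repoint the production Ingress back at the Blue Service.
kubectl patch ingress production-ingress --type=json -p='[
  {
    "op": "replace",
    "path": "/spec/rules/0/http/paths/0/backend/service/name",
    "value": "my-app-prod-service"
  }
]'
```

This only works if Blue is still running, which is a strong argument for keeping the old environment warm until Green has proven itself under real traffic.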

The key mechanism is that the application instances themselves are never modified while serving live traffic. You deploy a new set of instances ("Green"), test them independently, and then atomically switch traffic.

What most people don’t realize is the crucial role of session stickiness or long-lived connections. If your application relies heavily on sticky sessions or has users with very long-lived websockets or background tasks, simply switching the router might leave those users on the "Blue" environment. A robust blue-green strategy needs to account for gracefully draining these connections from the old environment before it’s fully decommissioned, or at least be aware that a small subset of users might continue interacting with the old version for a while. This is why "zero-downtime" is often "near-zero-downtime" or "controlled-downtime."
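The drain-then-decommission step can be sketched as a simple script. This is entirely illustrative: the drain window and the Deployment name `my-app-prod` are assumptions, and in practice you would gate the teardown on connection metrics rather than a fixed sleep:

```shell
# Stop sending new traffic to Blue first (the Ingress switch), then give
# long-lived connections a grace window before tearing Blue down.
DRAIN_WINDOW=300  # seconds; tune to your longest expected session/websocket

echo "Waiting ${DRAIN_WINDOW}s for Blue connections to drain..."
sleep "${DRAIN_WINDOW}"

# Scale Blue to zero rather than deleting it outright, so an emergency
# rollback only needs a scale-up plus the Ingress switch.
kubectl scale deployment my-app-prod --replicas=0
```

Scaling to zero instead of deleting keeps the Deployment object (and its configuration) around, shortening the path back if a problem surfaces later.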

The next logical step after mastering blue-green deployments is understanding how to implement automated health checks and automated rollback triggers based on real-time traffic metrics.

Want structured learning?

Take the full SRE course →