A canary deployment doesn’t actually rely on birds. Like the canary in a coal mine, it’s an early-warning system: you roll out a new software version by sending a small percentage of traffic to it first, so problems surface before the full rollout.

Let’s see this in action. Imagine we have a web service running on Kubernetes. We’re using a service mesh like Istio for traffic management, which makes canary deployments incredibly flexible.

Here’s our initial setup: a Service fronting the app, and our current stable Deployment, v1:

apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  selector:
    app: my-app # Intentionally no version label: the Service must match pods of every version so Istio subsets can route between them
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

And our Deployment for v1:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: v1
  template:
    metadata:
      labels:
        app: my-app
        version: v1
    spec:
      containers:
      - name: app
        image: my-registry/my-app:v1
        ports:
        - containerPort: 8080

Now, we want to deploy a new version, v2. We create a new Deployment for it:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v2
spec:
  replicas: 2 # Fewer replicas are enough: the canary will only receive a small share of traffic
  selector:
    matchLabels:
      app: my-app
      version: v2
  template:
    metadata:
      labels:
        app: my-app
        version: v2
    spec:
      containers:
      - name: app
        image: my-registry/my-app:v2
        ports:
        - containerPort: 8080

With Istio, we don’t change the Service at all. Instead, we use a VirtualService and a DestinationRule to control the traffic split. Note that the split follows the weights we configure, not the relative replica counts of the two Deployments.

First, a DestinationRule to define our two versions:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app-dest
spec:
  host: my-app-service # The Kubernetes Service name
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2

This tells Istio that my-app-service can have traffic routed to subsets labeled v1 and v2.

Now, the VirtualService to manage the traffic split. Initially, 100% goes to v1:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app-vs
spec:
  hosts:
  - my-app-service
  http:
  - route:
    - destination:
        host: my-app-service
        subset: v1 # All traffic goes to v1
      weight: 100

To start the canary, we update the VirtualService to send a small percentage, say 5%, to v2:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app-vs
spec:
  hosts:
  - my-app-service
  http:
  - route:
    - destination:
        host: my-app-service
        subset: v1
      weight: 95 # 95% to v1
    - destination:
        host: my-app-service
        subset: v2
      weight: 5 # 5% to v2
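Conceptually, what Istio does with those weights is a weighted random choice per request. This toy Python sketch (an illustration of the math, not Istio’s implementation) shows a 95/5 weighting sending roughly 5% of requests to v2:

```python
import random

def route(weights):
    """Pick a destination subset for one request, proportional to its weight."""
    subsets = list(weights)
    return random.choices(subsets, weights=[weights[s] for s in subsets])[0]

random.seed(7)
total = 100_000
hits = sum(1 for _ in range(total) if route({"v1": 95, "v2": 5}) == "v2")
print(f"v2 share: {hits / total:.3f}")  # roughly 0.050
```

At 5%, a broken v2 touches only a small slice of users — that slice is the canary.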

Now, 5% of incoming traffic to my-app-service hits v2. We monitor metrics (error rates, latency, resource usage) for both versions over the same window. If v2 shows a significantly higher error rate or latency, we roll back immediately by setting v1’s weight back to 100 and v2’s to 0.
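Concretely, “monitoring both versions” means comparing their error rates over the same window. A minimal sketch of that comparison — the counters here are made up; in practice they’d come from your metrics backend (e.g. Istio’s standard request metrics in Prometheus):

```python
def error_rate(requests, errors):
    return errors / requests if requests else 0.0

# Hypothetical counters for one observation window
v1 = {"requests": 9500, "errors": 19}  # stable baseline
v2 = {"requests": 500, "errors": 11}   # canary

r1 = error_rate(v1["requests"], v1["errors"])
r2 = error_rate(v2["requests"], v2["errors"])
print(f"v1: {r1:.1%}, v2: {r2:.1%}")  # v1: 0.2%, v2: 2.2%

# Roll back if the canary is much worse than stable or above an absolute floor
if r2 > max(5 * r1, 0.01):
    print("rollback")
```

Comparing the canary against the live baseline, rather than against a fixed target alone, filters out problems that affect both versions (e.g. a flaky downstream dependency).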

If v2 looks good, we gradually increase its weight. For example, to 25%:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app-vs
spec:
  hosts:
  - my-app-service
  http:
  - route:
    - destination:
        host: my-app-service
        subset: v1
      weight: 75
    - destination:
        host: my-app-service
        subset: v2
      weight: 25

We continue this process, increasing the percentage of traffic to v2 (e.g., 50%, 75%, 100%), pausing at each stage to observe. Once v2 handles 100% of traffic and metrics remain stable, v2 is the new stable version. We can then delete the my-app-v1 Deployment, drop the v1 subset from the DestinationRule, and simplify the VirtualService to route only to v2.
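The promotion ladder itself can be automated. Here’s a hedged Python sketch of a driver loop: `check_health` and `set_weights` are hypothetical hooks (in practice they would query your metrics backend and patch the VirtualService), but the control flow — promote while healthy, abort to stable on the first failure — is the essence of progressive delivery:

```python
STAGES = [5, 25, 50, 75, 100]  # percentage of traffic sent to the canary

def run_canary(check_health, set_weights, stages=STAGES):
    """Walk the canary through each traffic stage; abort on the first failure.

    check_health() -> bool  hypothetical: True if canary metrics look good
    set_weights(v1, v2)     hypothetical: patches the VirtualService weights
    """
    for v2_weight in stages:
        set_weights(100 - v2_weight, v2_weight)
        if not check_health():
            set_weights(100, 0)  # roll back: all traffic to stable
            return "rolled-back"
    return "promoted"

# Example: a canary that degrades once it takes more than 25% of traffic.
weights_log = []
current = {"v2": 0}
def set_weights(v1, v2):
    current["v2"] = v2
    weights_log.append((v1, v2))
def check_health():
    return current["v2"] <= 25

print(run_canary(check_health, set_weights))  # rolled-back
print(weights_log)  # [(95, 5), (75, 25), (50, 50), (100, 0)]
```

Tools like Argo Rollouts and Flagger implement exactly this loop on top of Istio, so you rarely need to hand-roll it.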

The core problem canary deployments solve is reducing the blast radius of bad deployments. Instead of an all-or-nothing release that can take down your entire service, you expose a tiny fraction of users to potential issues, giving you time to catch and fix problems before they impact everyone. This is achieved by having two versions of your application running concurrently and using a traffic management layer (like a service mesh or an API gateway) to gradually shift traffic from the old version to the new one.

A common misconception is that canary deployments are only for large-scale, high-traffic systems. In reality, they are incredibly valuable for teams of any size as they provide a safety net. Even with thorough testing, unforeseen issues can arise in production due to unique traffic patterns, external dependencies, or system load.

The real power of this approach lies in its automation potential. When combined with robust monitoring and alerting, you can configure automated rollback mechanisms. If v2’s error rate exceeds a predefined threshold (e.g., 1% for 5 minutes), the VirtualService can be automatically updated to send 0% traffic to v2, effectively reverting to v1 without manual intervention.
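The rollback condition itself — “error rate above threshold for a sustained window” — is easy to get wrong if you trigger on a single bad sample. A sketch of a sustained-threshold detector (the 1% / five-sample numbers mirror the example above; how samples are collected is assumed):

```python
from collections import deque

class RollbackDetector:
    """Fires only when every sample in the window exceeds the threshold,
    so one noisy data point doesn't trigger a rollback."""
    def __init__(self, threshold=0.01, window=5):
        self.threshold = threshold           # e.g. 1% error rate
        self.samples = deque(maxlen=window)  # e.g. 5 one-minute samples

    def observe(self, error_rate):
        self.samples.append(error_rate)
        return (len(self.samples) == self.samples.maxlen
                and all(s > self.threshold for s in self.samples))

d = RollbackDetector()
readings = [0.002, 0.03, 0.004, 0.02, 0.02, 0.02, 0.02, 0.02]
fired_at = [i for i, r in enumerate(readings) if d.observe(r)]
print(fired_at)  # the spike at index 1 alone never fires; only the sustained run does
```

Requiring the whole window to breach the threshold trades a few minutes of detection latency for far fewer false rollbacks.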

The next concept you’ll likely encounter is blue-green deployments, which pursue the same goal with a different execution strategy: two full environments swapped in a single cutover, rather than a gradual traffic shift.
