The core idea behind modern deployments is to minimize downtime and risk by not replacing the entire running system with a new version all at once.
Let’s see this in action. Imagine we have a web service running on three servers, each serving traffic.
Server 1: v1.0 (Active)
Server 2: v1.0 (Active)
Server 3: v1.0 (Active)
We want to deploy v1.1.
Rolling Deployments
This is the most common strategy. We update one server at a time, replacing v1.0 with v1.1, and then move to the next.
1. Stop traffic to Server 1.
2. Update Server 1 to `v1.1`.
3. Start traffic to Server 1.
4. Stop traffic to Server 2.
5. Update Server 2 to `v1.1`.
6. Start traffic to Server 2.
7. Stop traffic to Server 3.
8. Update Server 3 to `v1.1`.
9. Start traffic to Server 3.
After Step 3:
Server 1: v1.1 (Active)
Server 2: v1.0 (Active)
Server 3: v1.0 (Active)
What problem does this solve? It prevents a single point of failure during the deployment. If v1.1 has a critical bug, only a subset of users (those hitting the updated server) are affected, and we can quickly roll back by updating that server back to v1.0.
How it works internally: Load balancers or orchestration systems (like Kubernetes) manage which servers receive traffic. They drain connections from a server before it’s updated and then add it back into the pool once the update is complete and healthy. Health checks are crucial here; if an updated server fails its health check, the load balancer will stop sending traffic to it, preventing it from impacting users.
Levers you control:
- Batch size: How many servers to update at once. A batch size of 1 is the most conservative. A larger batch size is faster but riskier.
- Health check thresholds: How many successful health checks are required before considering a server healthy.
- Max unavailable: The maximum number of servers that can be down simultaneously during the deployment.
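The loop and the levers above can be sketched as a tiny orchestration routine. Everything here (the `Server` class, `deploy_rolling`, the health-check behavior) is illustrative, not a real orchestrator's API:

```python
class Server:
    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.in_pool = True  # receiving traffic from the load balancer

    def health_check(self):
        # In reality this would hit an HTTP endpoint like /healthz;
        # here we just assume an updated server comes up healthy.
        return self.version is not None


def deploy_rolling(servers, new_version, batch_size=1, required_healthy_checks=3):
    """Update servers in batches, draining each batch and re-adding it only
    after it passes the required number of consecutive health checks."""
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for s in batch:
            s.in_pool = False        # drain: stop routing new traffic here
            s.version = new_version  # replace the running version
        for s in batch:
            if all(s.health_check() for _ in range(required_healthy_checks)):
                s.in_pool = True     # healthy: back into the pool
            else:
                # Failed health checks: roll this server back and halt the rollout.
                s.version = "v1.0"
                raise RuntimeError(f"{s.name} failed health checks; deployment halted")


fleet = [Server(f"server-{n}", "v1.0") for n in range(3)]
deploy_rolling(fleet, "v1.1", batch_size=1)
print([s.version for s in fleet])  # ['v1.1', 'v1.1', 'v1.1']
```

A batch size of 1 reproduces the one-at-a-time walkthrough above; raising `batch_size` trades safety for speed, exactly the lever described.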
Blue-Green Deployments
This strategy involves having two identical production environments, "Blue" (current version) and "Green" (new version).
1. Deploy `v1.1` to the Green environment. All servers in Green are now running `v1.1`; the Blue environment is still running `v1.0`.
2. Test the Green environment thoroughly. This is a great time to run integration tests, load tests, and even allow a small percentage of internal users to access it.
3. Switch traffic from Blue to Green. This is typically done by updating a DNS record or a load balancer configuration.
Before Switch:
Blue (v1.0) - Active
Green (v1.1) - Staging
After Switch:
Blue (v1.0) - Staging
Green (v1.1) - Active
What problem does this solve? It provides near-zero downtime and instant rollback. If something goes wrong in Green after the switch, you can immediately switch traffic back to the stable Blue environment.
How it works internally: The magic is in the traffic routing. A router (DNS, load balancer, API Gateway) directs all incoming requests to either the Blue or Green environment. When you’re ready, you simply flip the switch on the router. The entire old version remains available as a hot standby, ready to take over instantly.
Levers you control:
- Test environment provisioning: Ensuring the Green environment is truly identical to Blue.
- Traffic switching mechanism: The speed and reliability of your DNS or load balancer updates.
- Post-switch monitoring: How quickly you detect issues in the Green environment after traffic is live.
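The "flip the switch" mechanism can be sketched as a router holding a single pointer to the active environment. The `Router` class is a stand-in for a DNS record or load balancer target group, not a real API:

```python
class Router:
    """Minimal stand-in for a DNS record or load balancer target group."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.active = "blue"  # all traffic starts on Blue

    def route(self, request):
        # Every incoming request goes to whichever environment is active.
        return self.environments[self.active]

    def switch_to(self, color):
        # The "flip": a single pointer change, which is why rollback
        # is exactly as fast as rolling forward.
        self.active = color


blue = {"name": "blue", "version": "v1.0"}
green = {"name": "green", "version": "v1.1"}
router = Router(blue, green)

print(router.route("GET /")["version"])  # v1.0 (before switch)
router.switch_to("green")                # cut over to v1.1
print(router.route("GET /")["version"])  # v1.1 (after switch)
router.switch_to("blue")                 # instant rollback if Green misbehaves
print(router.route("GET /")["version"])  # v1.0
```

Because the old environment stays running as a hot standby, the rollback path is the same one-line operation as the cutover.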
Canary Deployments
This is a more sophisticated approach where you gradually roll out the new version to a small subset of users or servers before making it available to everyone.
1. Deploy `v1.1` to a small set of servers (e.g., 5% of your capacity). These servers are now the "canaries."
2. Monitor these canary servers closely. Watch error rates, latency, and business metrics.
3. If the canaries are healthy, gradually increase the percentage of traffic going to `v1.1` (e.g., 10%, 25%, 50%, 100%).
4. If issues arise at any stage, immediately roll back by redirecting traffic away from the canaries and back to the stable `v1.0` servers.
Initial Canary:
Server 1: v1.1 (Canary, serving the 5% canary slice)
Server 2: v1.1 (Canary, serving the 5% canary slice)
Server 3: v1.0 (Active, serving the remaining 95%)
Server 4: v1.0 (Active, serving the remaining 95%)
... and so on for all your servers
What problem does this solve? It minimizes the blast radius of bugs to a small user group, allowing for real-world testing with minimal impact. It’s excellent for detecting subtle performance regressions or user experience issues that might not appear in staging.
How it works internally: This typically involves sophisticated traffic splitting at the load balancer or API Gateway level. You configure rules to direct specific requests (e.g., based on user ID, a cookie, or a header) to the new version, while the majority of traffic continues to flow to the old version. The monitoring and alerting systems are paramount, as they must detect anomalies in the canary group very quickly.
Levers you control:
- Canary size: The initial percentage of traffic or servers receiving the new version.
- Rollout schedule: The pace at which you increase traffic to the new version.
- Targeting rules: How you define which users or requests see the new version (e.g., by user ID, geographic location, HTTP header).
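A common way to implement the percentage split is to hash a stable request attribute (such as the user ID) into a fixed bucket, so the same user consistently sees the same version while the rollout percentage climbs. A minimal sketch, with made-up function names:

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100


def pick_version(user_id: str, canary_percent: int) -> str:
    # Users whose bucket falls below the threshold see the canary.
    return "v1.1" if bucket(user_id) < canary_percent else "v1.0"


# Ramping the rollout is just raising the threshold: 5 -> 25 -> 100.
users = [f"user-{n}" for n in range(1000)]
for pct in (5, 25, 100):
    share = sum(pick_version(u, pct) == "v1.1" for u in users) / len(users)
    print(f"{pct}% target -> {share:.0%} of users on v1.1")
```

Because the hash is deterministic, a user who lands on `v1.1` at 5% stays on `v1.1` as the rollout widens, which avoids flip-flopping users between versions mid-session.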
In canary deployments, the ability to route traffic based on arbitrary request attributes is what makes it so powerful. You can, for instance, direct all requests from internal employees or users with a specific feature flag enabled to the new version, effectively creating a "dogfooding" environment within production. This allows for validation by a targeted, often more technical, audience before broader exposure.
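Attribute-based targeting like this can sit as a rules check in front of the percentage split. The `X-Employee` header below is hypothetical; real systems typically key off authenticated identity or a feature-flag service:

```python
def route_request(headers: dict, user_id: str, canary_percent: int) -> str:
    # Dogfooding rule: internal employees always get the new version.
    # "X-Employee" is a made-up header for illustration only.
    if headers.get("X-Employee") == "true":
        return "v1.1"
    # Everyone else falls through to the percentage-based split.
    # NOTE: Python's built-in hash() is randomized per process; a real
    # implementation would use a stable hash such as SHA-256.
    bucket = hash(user_id) % 100
    return "v1.1" if bucket < canary_percent else "v1.0"


print(route_request({"X-Employee": "true"}, "alice", 0))  # v1.1: employee override
print(route_request({}, "bob", 100))                      # v1.1: full rollout
print(route_request({}, "bob", 0))                        # v1.0: rollout not started
```

The override rule fires before any percentage math, so internal users validate `v1.1` in production even while external traffic is still at 0%.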
The next evolution you’ll likely encounter is feature flagging, which decouples code deployment from feature release.