Designing for global high availability isn’t about making sure your service never goes down; it’s about ensuring that when a disaster strikes one region, your users in other regions don’t even notice. The surprising truth is that true global HA often means accepting some downtime for users in the affected region in the worst case, while guaranteeing continuity everywhere else.

Let’s see this in action. Imagine a simple web application served from two AWS regions, us-east-1 and eu-west-1.

# Global DNS configuration (e.g., AWS Route 53), shown as CloudFormation YAML.
# Health checks are crucial here.
Resources:
  WwwRecordSetGroup:
    Type: AWS::Route53::RecordSetGroup
    Properties:
      HostedZoneName: example.com.
      RecordSets:
        - Name: www.example.com.
          Type: A
          AliasTarget:
            DNSName: dualstack.elb-us-east-1.example.com.
            HostedZoneId: Z1PA6795UKMFR9 # Example ELB hosted zone ID for us-east-1
          # Failover routing policy
          Failover: PRIMARY
          SetIdentifier: primary-us-east-1
          HealthCheckId: !Ref USHealthCheck # Reference to a health check for us-east-1

        - Name: www.example.com.
          Type: A
          AliasTarget:
            DNSName: dualstack.elb-eu-west-1.example.com.
            HostedZoneId: Z2NY552589955L # Example ELB hosted zone ID for eu-west-1
          Failover: SECONDARY
          SetIdentifier: secondary-eu-west-1
          HealthCheckId: !Ref EUHealthCheck # Reference to a health check for eu-west-1

# Example health check configuration. Route 53 metric-based health checks
# reference a CloudWatch alarm by name, so the alarm is defined alongside it.
Resources:
  USHealthyHostAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: us-east-1-healthy-hosts
      Namespace: AWS/ApplicationELB
      MetricName: HealthyHostCount
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 1
      # Alarm fires when the minimum healthy host count drops below 1.
      ComparisonOperator: LessThanThreshold
      Dimensions:
        - Name: LoadBalancer
          Value: app/my-alb-us-east-1/abcdef1234567890

  USHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: CLOUDWATCH_METRIC
        AlarmIdentifier:
          Name: us-east-1-healthy-hosts
          Region: us-east-1
        InsufficientDataHealthStatus: Unhealthy
        # This health check is associated with the primary record set and
        # reports unhealthy (triggering failover) while the alarm is in
        # the ALARM state.
      # ... (other properties for health check)

In this setup, www.example.com resolves to the us-east-1 ELB by default (Failover: PRIMARY). Route 53 continuously monitors the health of us-east-1 via the CloudWatch alarm. If that alarm fires (e.g., HealthyHostCount drops to 0 for 2 consecutive minutes), Route 53 automatically starts answering new DNS queries with the eu-west-1 ELB (Failover: SECONDARY). Users already hitting eu-west-1 are unaffected. The critical piece is the health check mechanism: it is the eyes and ears of your global failover.
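The failover decision itself is easy to model. Here is a minimal sketch in Python (a toy model of failover routing, not AWS's implementation; the record and health dictionaries are invented for illustration): answer with the primary target while its health check passes, otherwise fall back to the secondary.

```python
# Toy model of DNS failover routing: records carry a failover role,
# and the resolver consults health-check status before answering.

def resolve_failover(records, health):
    """records: list of dicts with 'role' and 'target';
    health: dict mapping target -> bool (health check passing)."""
    primary = next(r for r in records if r["role"] == "PRIMARY")
    secondary = next(r for r in records if r["role"] == "SECONDARY")
    if health.get(primary["target"], False):
        return primary["target"]
    return secondary["target"]

records = [
    {"role": "PRIMARY", "target": "elb-us-east-1.example.com"},
    {"role": "SECONDARY", "target": "elb-eu-west-1.example.com"},
]

# While us-east-1 is healthy, queries resolve there.
print(resolve_failover(records, {"elb-us-east-1.example.com": True}))
# After the health check fails, new queries resolve to eu-west-1.
print(resolve_failover(records, {"elb-us-east-1.example.com": False}))
```

The real system adds a wrinkle this sketch omits: resolvers cache answers for the record's TTL, so the switch is not instantaneous for all clients.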

The problem this solves is regional isolation. If a natural disaster, a major network outage, or a catastrophic deployment error takes down an entire AWS region, your application should continue to serve traffic from unaffected regions. This isn’t just about redundancy; it’s about geographic distribution of your service.

Internally, this relies on a combination of:

  1. Global DNS Routing: Services like AWS Route 53, Akamai GTM, or Cloudflare Load Balancing can direct traffic based on geographic location, latency, or health check status.
  2. Regional Deployments: Your application stack (servers, databases, load balancers) must be deployed identically and independently in multiple regions.
  3. Health Checks: Automated checks that verify the availability and functionality of your application in each region. These are the triggers for failover.
  4. Data Replication: This is often the hardest part. For stateful services, data must be replicated asynchronously or synchronously across regions to ensure consistency after a failover.
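The health checks in point 3 deserve to be "deep", probing real dependencies rather than just process liveness. A hedged sketch of what such a handler might return, where `check_database` and `check_cache` are hypothetical stand-ins for probes against the regional replica and cache:

```python
# Sketch of a "deep" health check: verify the dependencies the region
# actually needs, and surface a 503 so the DNS-level health check fails.

def check_database():
    # Stand-in: e.g., run `SELECT 1` against the regional replica.
    return True

def check_cache():
    # Stand-in: e.g., PING the regional cache cluster.
    return True

def deep_health():
    """Return (http_status, per-check detail) as a /healthz endpoint might."""
    checks = {"database": check_database(), "cache": check_cache()}
    status = 200 if all(checks.values()) else 503
    return status, checks

status, detail = deep_health()
print(status, detail)  # 200 {'database': True, 'cache': True}
```

Returning per-dependency detail alongside the status code makes it much easier to tell *why* a region was pulled out of rotation.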

The levers you control are primarily:

  • Health Check Granularity: What specific metrics or endpoints are you checking? A simple ping isn’t enough; you need to check if your application is actually responding correctly.
  • Failover Latency: How quickly does DNS switch regions? This is influenced by DNS TTLs, health check intervals, and the DNS provider’s propagation speed.
  • Data Consistency Model: Can you tolerate losing the last few seconds or minutes of writes after a failover (asynchronous replication with a nonzero RPO), or do you need synchronous, strongly consistent replication (which is much harder and more expensive across regions)?
  • Deployment Automation: How quickly can you spin up a new, healthy instance of your application in a different region if needed?
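The failover-latency lever lends itself to back-of-the-envelope arithmetic: the worst-case time clients keep hitting a dead region is roughly detection time plus DNS cache time. The numbers below are illustrative assumptions (30 seconds is Route 53's standard check interval; the TTL is whatever you set on the record):

```python
# Worst-case client-visible failover time, under assumed parameters.
health_check_interval_s = 30   # assumed standard health check interval
failure_threshold = 3          # consecutive failures before "unhealthy"
dns_ttl_s = 60                 # assumed TTL on the DNS record

detection_s = health_check_interval_s * failure_threshold
worst_case_s = detection_s + dns_ttl_s  # resolvers may cache up to the TTL
print(worst_case_s)  # 150 seconds
```

Shrinking either term has a cost: tighter check intervals risk flapping on transient blips, and very low TTLs increase load on your DNS provider and on resolvers.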

One aspect often overlooked is the "blast radius" of configuration changes. If a bad configuration is deployed globally simultaneously, it can break all regions. A multi-region strategy encourages regional deployments and phased rollouts, allowing you to isolate bad changes to a single region before they impact the entire fleet. This means your CI/CD pipeline needs to be region-aware, capable of deploying to us-east-1 and eu-west-1 independently.
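A region-aware rollout can be sketched as a simple loop (hypothetical `deploy` and `verify` hooks; a real pipeline would also bake in soak time between regions and automatic rollback):

```python
# Sketch of a phased, region-by-region rollout that limits blast radius:
# deploy to one region, verify its health, and halt before a bad change
# reaches the rest of the fleet.

def phased_rollout(regions, deploy, verify):
    """Deploy region by region; stop at the first unhealthy region.
    Returns the list of regions that received the change."""
    touched = []
    for region in regions:
        deploy(region)
        touched.append(region)
        if not verify(region):
            # Blast radius is limited to this region; roll back here
            # instead of propagating the change further.
            break
    return touched

# Example: the change breaks us-east-1, so eu-west-1 is never touched.
deployed = phased_rollout(
    ["us-east-1", "eu-west-1"],
    deploy=lambda region: None,            # stand-in for the real deploy step
    verify=lambda region: region != "us-east-1",
)
print(deployed)  # ['us-east-1']
```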

The next logical problem you’ll run into is managing stateful services across these regions, particularly ensuring data consistency during and after a failover.
