Network Resilience: Architecting for Zero Downtime

Redundancy and failover in SRE aren’t just about having a spare; they’re about deliberately breaking the network when nothing is wrong, so you know it will cope with the real thing.

Let’s watch a packet navigate a redundant network. Imagine two identical routers, Router A and Router B, connected to the same switch. A client machine, Client 1, is plugged into that switch.

Client 1 ---- Switch ---- Router A (Active)
                  |
                  ---- Router B (Standby)

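When Client 1 sends a packet destined for an external IP, it addresses the frame to its default gateway, and the switch forwards it to Router A, which is currently designated as the "active" gateway. Router A processes the packet and sends it out. If Router A suddenly dies, it is Router B, not the switch, that notices: with a protocol like VRRP (Virtual Router Redundancy Protocol), the standby router listens for the active router’s periodic advertisements and, after missing several in a row, takes over the "virtual IP" address that Client 1 is using as its gateway. Router B becomes active and announces itself with a gratuitous ARP so the switch relearns which port leads to the gateway, and subsequent packets from Client 1 are routed through it. With default timers the failover takes roughly three to four seconds; with sub-second timers it can drop to a few hundred milliseconds, often imperceptible to the end-user.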

The core problem this solves is single points of failure. In a traditional network, if a router, switch, or even a cable breaks, an entire segment of users can lose connectivity. Redundancy means that for every critical component, there’s a second one ready to step in. Failover is the mechanism by which the network automatically detects the failure and switches traffic to the backup.

Internally, protocols like VRRP, HSRP (Hot Standby Router Protocol), or GLBP (Gateway Load Balancing Protocol) are at play. These protocols allow multiple routers to share a single virtual IP address. One router is designated "active" (VRRP calls it the master) and handles traffic, while the others are "standby" (backups, in VRRP terms); GLBP additionally spreads clients across several forwarders. The routers exchange periodic "heartbeats"; in VRRP, only the master sends them. If the heartbeats from the active router stop arriving, a standby router assumes the active role and starts responding to ARP requests for the virtual IP.
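The election described above can be sketched in a few lines of Python. This is a toy model, not a protocol implementation: the `Router` class, the 192.0.2.x addresses, and Router B's priority of 100 are assumptions made for illustration. It follows the VRRP rule that the highest priority wins, with the higher interface address breaking ties.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass
class Router:
    name: str
    priority: int          # higher wins the election
    address: IPv4Address   # interface address, used only as a tie-breaker
    alive: bool = True     # False once its heartbeats stop arriving


def elect_active(routers):
    """Return the live router with the highest priority (highest IP breaks ties)."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.address))


# Hypothetical group matching the diagram: Router A is preferred.
router_a = Router("Router A", 150, IPv4Address("192.0.2.2"))
router_b = Router("Router B", 100, IPv4Address("192.0.2.3"))
group = [router_a, router_b]

before = elect_active(group).name
router_a.alive = False            # Router A dies; its heartbeats stop
after = elect_active(group).name
print(before, "->", after)        # Router A -> Router B
```

In a real deployment the "alive" flag is not a shared variable, of course; each standby infers it independently from the absence of heartbeats on the wire.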

The levers you control are primarily the configuration of these redundancy protocols. This includes:

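Priority: Assign a higher priority to the router you want to be the primary gateway, e.g. vrrp 1 priority 150 (Cisco IOS-style syntax; the VRRP default is 100).
Preemption: Decide whether a higher-priority router should automatically reclaim the active role when it comes back online, and how long it should wait first, e.g. vrrp 1 preempt delay minimum 60 (wait at least 60s so routing can converge before it takes over).
Timers: Tune how frequently heartbeats are sent and how long to wait before declaring a router down. In VRRP you set only the advertisement interval, e.g. vrrp 1 timers advertise 1 (advertise every 1s); the hold-down is derived as roughly three advertisement intervals plus a small skew. HSRP lets you set both values, e.g. standby 1 timers 10 30 (hello every 10s, hold down for 30s).
Tracking: Configure the routers to monitor the status of upstream or downstream interfaces or routes. If a critical link goes down, the router lowers its own priority and so voluntarily relinquishes its active role, even if it’s still functional, e.g. track 1 ip route 192.168.1.0 255.255.255.0 reachability (track a specific route), followed by vrrp 1 track 1 decrement 60 on the interface.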
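To see how these levers interact, here is a small worked calculation in Python. It assumes the VRRPv3 timer formulas (master-down interval = 3 × advertisement interval + skew time, where skew time = (256 − priority) × interval / 256); the function names, the peer priority of 100, and the tracking decrement of 60 are illustrative assumptions rather than values from any particular device.

```python
def skew_time(priority, advert_interval=1.0):
    """VRRPv3 skew time in seconds: a higher priority means a shorter extra wait."""
    return (256 - priority) * advert_interval / 256


def master_down_interval(priority, advert_interval=1.0):
    """Seconds a standby waits without hearing a heartbeat before taking over."""
    return 3 * advert_interval + skew_time(priority, advert_interval)


def effective_priority(configured, tracked_up, decrement):
    """Priority after tracking: subtract the decrement while the tracked object is down."""
    return configured if tracked_up else max(1, configured - decrement)


# A standby with priority 100 and 1s advertisements waits about 3.6s.
print(master_down_interval(100))                 # 3.609375

# Router A (priority 150) loses its tracked route; assumed decrement of 60.
a_priority = effective_priority(150, tracked_up=False, decrement=60)
print(a_priority)                                # 90
print(a_priority < 100)                          # True: the priority-100 peer now outranks it
```

The last line is why the decrement has to be larger than the gap between the two routers' priorities: a decrement of 40 would leave Router A at 110 and still active despite the failed uplink.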

Most people focus on the router-to-router redundancy, but it’s crucial to extend this to the switch layer as well. Using technologies like MLAG (Multi-chassis Link Aggregation) or stacking allows two physical switches to act as a single logical switch. This means a server can have NICs connected to both switches, and if one switch fails, the server still has connectivity.
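One way to check whether a design still hides a single point of failure is to model it as a graph, remove each device in turn, and test whether traffic can still get out. The sketch below does that in Python for a hypothetical topology; the node names (including the abstract "WAN" node) and the link lists are assumptions for illustration, and a pair of MLAG switches is modelled simply as two independent nodes.

```python
from collections import deque


def reachable(links, start, goal, removed):
    """Breadth-first search from start to goal, skipping the removed node."""
    if removed in (start, goal):
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(edges, start, goal):
    """Return the intermediate nodes whose loss disconnects start from goal."""
    links = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    middle = sorted(n for n in links if n not in (start, goal))
    return [n for n in middle if not reachable(links, start, goal, n)]


# Redundant routers, but only one switch.
one_switch = [("Server", "Switch 1"), ("Switch 1", "Router A"),
              ("Switch 1", "Router B"), ("Router A", "WAN"), ("Router B", "WAN")]
# Same design with a second switch and a second server NIC.
two_switches = one_switch + [("Server", "Switch 2"), ("Switch 2", "Router A"),
                             ("Switch 2", "Router B")]

print(single_points_of_failure(one_switch, "Server", "WAN"))     # ['Switch 1']
print(single_points_of_failure(two_switches, "Server", "WAN"))   # []
```

The model only covers whole-device failures; cables, power feeds, and shared control planes would need their own nodes to show up in the result.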

The most surprising thing about failover is that it’s often tested by deliberately breaking things in a controlled environment. SREs regularly simulate failures – unplugging cables, shutting down interfaces, or even rebooting active devices – to verify that the failover mechanisms work as expected and that the recovery time objectives (RTOs) are met. This proactive testing is what builds confidence in the system’s resilience.
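A failover drill is only useful if you measure it. A minimal sketch of that measurement in Python: given timestamped probe results recorded during a drill (for example, one ping every half second), find the longest window without a successful probe and compare it against the RTO. The probe data and the 5-second RTO below are made-up values for illustration.

```python
def longest_outage(probes):
    """Longest gap in seconds between the last success before a run of
    failures and the first success after it.

    probes: list of (timestamp_seconds, succeeded) tuples in time order.
    An outage that never recovers within the data is not counted.
    """
    longest = 0.0
    last_ok = None
    in_outage = False
    for timestamp, ok in probes:
        if ok:
            if in_outage and last_ok is not None:
                longest = max(longest, timestamp - last_ok)
            last_ok = timestamp
            in_outage = False
        else:
            in_outage = True
    return longest


# Hypothetical drill: the active router is shut down just after t=0.5s.
drill = [(0.0, True), (0.5, True), (1.0, False), (1.5, False), (2.0, False),
         (2.5, False), (3.0, False), (3.5, False), (4.0, False),
         (4.5, True), (5.0, True)]

RTO_SECONDS = 5.0                      # assumed recovery time objective
outage = longest_outage(drill)
meets_rto = outage <= RTO_SECONDS
print(outage, meets_rto)               # 4.0 True
```

Because the gap is measured between successful probes, the result is an upper bound on the real outage; a shorter probe interval tightens it.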

The next logical step after ensuring your core gateway and switch infrastructure is redundant is to consider redundant paths for your actual data flow, often through BGP or other dynamic routing protocols.
