Circuit breakers, when implemented correctly, don’t just prevent cascading failures; they actively participate in the graceful degradation of a system, allowing it to recover and remain partially functional rather than collapsing entirely.

Let’s see a circuit breaker in action. Imagine a service UserAuth that depends on another service UserProfile. If UserProfile becomes slow or unavailable, UserAuth should stop bombarding it with requests.

import requests
from pybreaker import CircuitBreaker

def get_user_profile_data(user_id):
    try:
        response = requests.get(
            f"http://user-profile-service:8080/users/{user_id}",
            timeout=5,  # bound each call so a hung dependency can't exhaust resources
        )
        response.raise_for_status()  # Raise an exception for bad status codes
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching user profile for {user_id}: {e}")
        raise # Re-raise to let the circuit breaker know something went wrong

# Configure the circuit breaker
# After 5 consecutive failures, the breaker will open.
# It will remain open for 60 seconds.
# After 60 seconds, it will transition to half-open.
breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

# Decorate the function that calls the potentially failing service
@breaker
def reliable_get_user_profile_data(user_id):
    return get_user_profile_data(user_id)

# --- Example Usage ---
user_id = 123

print("Attempting to get user profile data...")
try:
    profile_data = reliable_get_user_profile_data(user_id)
    print("Successfully retrieved profile data:", profile_data)
except Exception as e:
    print("Failed to retrieve profile data:", e)

# Simulate a failure in the UserProfile service (e.g., by stopping the service)
# After 5 calls to reliable_get_user_profile_data, the breaker will open.
# Subsequent calls will immediately raise CircuitBreakerError without even
# attempting to contact the UserProfile service.
print("\nSimulating UserProfile service failure...")
for _ in range(7): # Make more than fail_max calls
    try:
        profile_data = reliable_get_user_profile_data(user_id)
        print("Attempt successful (this shouldn't happen if service is down).")
    except Exception as e:
        print(f"Attempt failed as expected: {type(e).__name__}")
        # If the breaker is open, this will be pybreaker.CircuitBreakerError
        # If it's still closed but the service is down, it will be a requests.exceptions.RequestException

# After 60 seconds (in a real scenario, not here), the breaker would go to half-open
# and allow one request to test if the downstream service is back up.

The core problem circuit breakers solve is uncontrolled resource exhaustion during partial system outages. When one service fails, dependent services might continue to send it requests, consuming their own resources (threads, connections, memory) in the process. This can lead to a cascading failure, where the initial problem brings down the entire system. A circuit breaker acts as a safety valve, detecting failures and preventing further requests to the unhealthy service.

Internally, a circuit breaker operates in three states:

  1. Closed: The default state. Requests are allowed to pass through to the protected service. If a request fails, the failure count is incremented. If the failure count reaches fail_max, the breaker trips and moves to the Open state.
  2. Open: Requests are immediately rejected with an error (e.g., CircuitBreakerError) without attempting to call the protected service. After a specified reset_timeout, the breaker moves to the Half-Open state. This prevents the failing service from being overwhelmed and gives it time to recover.
  3. Half-Open: A single "test" request is allowed through. If this request succeeds, the breaker resets the failure count and moves back to Closed. If the test request fails, the breaker immediately returns to Open, restarting the reset_timeout timer. This state is crucial for automatically detecting when the downstream service has recovered.
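The three states above can be sketched as a small state machine. This is a toy illustration of the transitions, not pybreaker's actual implementation: it has no thread safety, allows only one probe by construction, and takes an injectable clock purely so the transitions can be exercised without real waiting.

```python
import time

class SimpleBreaker:
    """Toy three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, fail_max=5, reset_timeout=60, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable for testing
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"        # cooldown elapsed: allow one probe
            else:
                raise RuntimeError("circuit open")  # fail fast, skip the call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half-open":
            self.state = "open"                 # probe failed: re-open, restart timer
            self.opened_at = self.clock()
        else:
            self.failures += 1
            if self.failures >= self.fail_max:
                self.state = "open"             # threshold reached: trip
                self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0                       # any success resets the count
        self.state = "closed"
```

A library like pybreaker implements the same transitions with thread safety, listeners, and pluggable storage on top.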

The key levers you control are:

  • fail_max: The number of consecutive failures that will cause the breaker to trip (open). Setting this too low might cause unnecessary tripping on transient network blips. Setting it too high means the breaker stays closed for too long during a real outage. A common starting point is between 3 and 10, depending on the expected reliability of the dependency.
  • reset_timeout: The duration (in seconds) the breaker stays open before transitioning to half-open. This should be long enough for the downstream service to potentially recover or for an operator to intervene. If it’s too short, the breaker might repeatedly trip and open before the dependency is stable. Values like 30, 60, or 120 seconds are typical.
  • call_timeout (often part of the breaker configuration or the underlying HTTP client): The maximum time a single request is allowed to take before being considered a failure. This is critical for preventing requests from hanging indefinitely, which is a primary cause of resource exhaustion. A value slightly longer than your acceptable latency for the dependency is appropriate, perhaps 2-5 seconds.
  • exclude or include exceptions: You can configure which exception types count as failures. For example, you might exclude ConnectionRefusedError if, in your system, it signals a misconfiguration that warrants an immediate alert rather than a temporary breaker trip. Or you might count only specific ServiceUnavailable-style exceptions.

Many libraries, like pybreaker in Python or Resilience4j in Java, allow you to configure what constitutes a "failure." By default, it’s usually any exception raised by the protected code. However, you can often specify a predicate function to determine if a specific exception should count. For instance, you might decide that only HTTP 5xx errors from a downstream service should count towards tripping the breaker, while HTTP 4xx errors (client errors) should not, as they might indicate a problem with the calling service’s logic rather than the dependency’s availability. This granular control prevents the breaker from incorrectly opening due to your own service’s misbehavior.

The most surprising mechanical detail is how the reset_timeout interacts with the Half-Open state. It’s not just a cooldown period; it’s an active probe. The breaker doesn’t guess when the service is back; it tests it. If that single test request fails, the breaker immediately re-opens and restarts the reset_timeout timer. The practical effect is that an unstable, partially recovered service sees at most one probe per reset_timeout interval rather than a burst of real traffic every time it briefly responds, which is exactly what would happen if the breaker let multiple requests through while Half-Open.

Once a circuit breaker is correctly implemented and configured, the next logical step is to integrate distributed tracing to visualize the breaker’s state transitions across your microservices.
