The surprising truth about reliability patterns is that they often make systems less reliable if implemented incorrectly.

Let’s look at a simple service call: UserClient.getUser(userId). Normally, this works fine.

// Imagine this is part of a larger system processing orders
try {
    User user = userClient.getUser(order.getUserId());
    // ... process order using user data
} catch (UserNotFoundException e) {
    // User doesn't exist, handle gracefully
} catch (NetworkException e) {
    // Network issue, maybe retry?
}

Now, what if userClient is sometimes slow or unavailable?

Retries

The most common reaction is to add retries. If the call fails, try again a few times.

Problem: A transient network blip causes getUser to fail. The client retries, the blip clears, and the call succeeds. This seems good. But what if the service is available, just overloaded? Every retry adds load, so the “fix” can push a struggling service over the edge.
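As a sketch, a naive retry wrapper might look like this (the withRetries helper and its names are illustrative, not a specific library API):

```java
import java.util.function.Supplier;

public class NaiveRetry {
    // Retries every failure immediately — no backoff, no error classification.
    // This is exactly the pattern that amplifies load on an overloaded service.
    public static <T> T withRetries(Supplier<T> call, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and immediately try again
            }
        }
        throw last; // all attempts exhausted
    }
}
```

Note that this retries on every RuntimeException, including errors (like a nonexistent user) that will never succeed on retry. That indiscriminate retrying is part of the problem described below.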

Diagnosis: Monitor the retry count for specific downstream calls. If retry counts spike for a particular service, that’s your first clue. Tools like Prometheus with client-side instrumentation (e.g., Micrometer) are essential.

# Rate of failed getUser calls, grouped by error type
sum(rate(client_requests_total{method="getUser", result="error"}[5m])) by (error_type)

Common Causes & Fixes:

  1. Overwhelmed Downstream Service: The service you’re calling is drowning in requests.

    • Diagnosis: Check the downstream service’s CPU, memory, and queue lengths. Look for high latency and error rates on its side.
    • Fix: Implement exponential backoff. Instead of retrying immediately, wait longer between retries. A common strategy is initial_interval * (backoff_factor ^ attempt_number). For example, retries at 100ms, 200ms, 400ms, 800ms. Adding random jitter to each delay keeps clients from retrying in lockstep. This gives the overloaded service breathing room.
    • Why it works: Spreads out the load over time, preventing a thundering herd of synchronized retries.
  2. Network Instability: Actual intermittent network issues between services.

    • Diagnosis: Use ping and traceroute from the client pod to the server pod. Check network metrics in your Kubernetes cluster (e.g., dropped packets).
    • Fix: Increase the maximum number of retries, but combine it with a longer backoff. If you retry 5 times with 1-second backoff, it might be 5 seconds total. If you retry 5 times with 5-second backoff, it’s 25 seconds. This accounts for longer-lasting network glitches.
    • Why it works: Allows more time for temporary network partitions to resolve.
  3. Downstream Service Deployments/Restarts: The service is briefly unavailable during a rolling update.

    • Diagnosis: Correlate retry spikes with deployment events for the downstream service.
    • Fix: Set a reasonable maximum retry count (e.g., 3-5) and a generous backoff (e.g., 1-5 seconds between retries). The goal isn’t to wait out an entire deployment but to catch brief moments of unavailability.
    • Why it works: Catches the service when it briefly comes back online between pod restarts.
  4. Idempotency Issues: Your retry logic is causing duplicate operations on the server-side.

    • Diagnosis: Look for duplicate side effects in your downstream service logs (e.g., double charges, duplicate data writes). This is harder to detect from the client.
    • Fix: Ensure the operation being retried is idempotent. If it’s not, use unique request IDs and have the downstream service track them to prevent re-execution. E.g., POST /orders should have a unique X-Request-ID header, and the server should return 409 Conflict if it sees the same ID again.
    • Why it works: Prevents unintended side effects by ensuring operations can be applied multiple times without changing the outcome beyond the first application.
  5. Client-Side Resource Exhaustion: Too many concurrent retries are consuming client resources (threads, connections).

    • Diagnosis: Monitor client thread pools, connection counts, and CPU usage during periods of high downstream errors.
    • Fix: Limit the total number of concurrent retries allowed. For example, use a semaphore to ensure no more than 10 retries are active at any given moment, regardless of how many initial requests failed.
    • Why it works: Prevents the retry mechanism itself from becoming a failure point.
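The backoff formula from fix 1 can be sketched as a small helper. The cap parameter is an assumption on my part (unbounded exponential delays are rarely what you want), and production code would usually add random jitter on top:

```java
public class Backoff {
    // Computes initial_interval * (backoff_factor ^ attempt_number),
    // capped so delays do not grow without bound.
    public static long delayMillis(long initialMs, double factor, int attempt, long capMs) {
        double delay = initialMs * Math.pow(factor, attempt);
        return Math.min((long) delay, capMs);
    }
}
```

With initialMs = 100 and factor = 2.0, attempts 0 through 3 yield the 100ms, 200ms, 400ms, 800ms sequence described above.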
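The server-side idempotency tracking from fix 4 can be sketched like this. Here the server replays the stored result for a repeated request ID, a common alternative to returning 409 Conflict; the class and method names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class IdempotencyStore {
    // Maps request IDs (e.g., from an X-Request-ID header) to their results.
    private final Map<String, String> seen = new ConcurrentHashMap<>();

    // Runs the operation at most once per request ID; repeats get the
    // stored result instead of re-executing the side effect.
    public String execute(String requestId, Supplier<String> operation) {
        return seen.computeIfAbsent(requestId, id -> operation.get());
    }
}
```

A real implementation would persist the IDs (with expiry) rather than hold them in memory, but the shape is the same: the ID, not the payload, decides whether the operation runs.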

Timeouts

Retries are often paired with timeouts. If a request doesn’t complete within a certain duration, give up.

Problem: Without a timeout, a request could hang indefinitely, consuming client resources and potentially triggering retries that also hang.
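A minimal sketch of enforcing a client-side timeout with plain java.util.concurrent (a real HTTP client would expose this as configuration rather than hand-rolled Future handling):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutDemo {
    // Runs the task but gives up after timeoutMs, cancelling the hung work
    // so it cannot keep consuming a thread indefinitely.
    public static String callWithTimeout(Callable<String> task, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = executor.submit(task);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the hung call, free the thread
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```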

Diagnosis: Monitor the duration of your client requests. Look for requests that exceed your configured timeout.

# Rate of requests that timed out, per downstream service
sum(rate(client_requests_total{result="timeout"}[5m])) by (service)

Common Causes & Fixes:

  1. Under-provisioned Downstream Service: The service is too slow to respond within the timeout.

    • Diagnosis: Measure the P95 and P99 latency of the downstream service. If P99 latency is consistently higher than your client timeout, you have a problem.
    • Fix: Increase the timeout or (preferably) fix the downstream service’s performance. A common timeout for internal HTTP calls might be 500ms, while a P99 latency of 1 second indicates a problem. Adjusting the timeout to 1.5 seconds might mask the issue temporarily, but fixing the service is the real solution.
    • Why it works: Aligns client expectations with actual service performance.
  2. Network Latency: High network latency between services.

    • Diagnosis: Use ping with a large packet size or iperf3 to measure bandwidth and latency.
    • Fix: Increase the timeout. If the round-trip time (RTT) between your services is 300ms, a 500ms timeout might be too aggressive. Set it to 1000ms to allow for network jitter.
    • Why it works: Provides a buffer for the time it takes data to travel across the network.
  3. Incorrect Timeout Value: The timeout is simply set too low.

    • Diagnosis: Review the configuration. Is it 10ms? 50ms? For anything beyond a trivial local call, this is likely too aggressive.
    • Fix: Set a reasonable baseline timeout. For inter-service HTTP calls within a data center, 500ms to 2000ms is common. For external calls, it needs to be much higher.
    • Why it works: Matches the expected latency profile of the operation.
  4. Client-Side Bottlenecks: The client is slow to process the response, making it appear as if the server timed out.

    • Diagnosis: Profile the client application. Is it CPU-bound? Is its response handler slow?
    • Fix: Optimize the client-side code. This might involve better data structures, asynchronous processing, or more efficient algorithms.
    • Why it works: Ensures the client isn’t the bottleneck causing premature timeouts.

Bulkheads

This pattern isolates failures. If one part of your system fails, it doesn’t bring down everything else. Think of the compartments in a ship’s hull.

Problem: Imagine your OrderService needs data from UserService and ProductService. If UserService becomes completely unavailable and your retries/timeouts are poorly configured, your OrderService might:

  1. Start making thousands of requests to UserService.
  2. These requests hang, consuming threads/connections.
  3. Eventually, they time out, but the sheer volume of attempted requests overwhelms the OrderService’s thread pool.
  4. Now, OrderService can’t even serve requests that don’t involve UserService, because its resources are exhausted.

Diagnosis: Monitor resource utilization (thread pools, connection pools, CPU) for your service during partial downstream failures. See if the failure of one downstream dependency causes widespread resource exhaustion.

Common Causes & Fixes:

  1. Shared Resource Pools: A single thread pool or connection pool is used for all outgoing requests.

    • Diagnosis: Use your APM tool to trace requests. See if a slow call to UserService is blocking threads that are also needed for calls to ProductService.
    • Fix: Create separate thread pools (or connection pools) for different downstream dependencies. For example, one pool for UserService calls, another for ProductService calls. Libraries like Resilience4j (or the older Hystrix, now in maintenance mode) provide mechanisms for this.
    • Why it works: Limits the blast radius. If the UserService pool is saturated, the ProductService pool remains unaffected.
  2. Lack of Connection Limits: The client can open an unlimited number of connections to downstream services.

    • Diagnosis: Monitor the number of active connections from your service to specific downstream services.
    • Fix: Configure connection pool limits. For example, limit the HTTP client to a maximum of 50 connections to UserService.
    • Why it works: Prevents a single failing dependency from exhausting all available network sockets.
  3. No Circuit Breakers: When a downstream service fails repeatedly, retries and timeouts continue to hammer it.

    • Diagnosis: Observe high error rates and resource saturation on the client even after downstream errors have been occurring for a while.
    • Fix: Implement circuit breakers. After a certain number of failures within a time window (e.g., 10 failures in 1 minute), the circuit breaker "opens," and subsequent calls to that dependency fail immediately (without even attempting the network call) for a configured period. After the timeout, it enters a "half-open" state to test if the downstream service has recovered. Libraries like Resilience4j are key here.
    • Why it works: Stops overwhelming a failing service and prevents client resources from being wasted on calls destined to fail.
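A semaphore-based bulkhead, the core of fix 1, can be sketched in a few lines. Resilience4j’s bulkhead does essentially this with far more configurability; the class here is illustrative, and you would create one instance per downstream dependency:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class Bulkhead {
    private final Semaphore permits;

    public Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Rejects immediately when the compartment is full, instead of queuing
    // work that would tie up threads needed by other dependencies.
    public <T> T call(Supplier<T> operation) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full, rejecting call");
        }
        try {
            return operation.get();
        } finally {
            permits.release();
        }
    }
}
```

Because each dependency gets its own Bulkhead, saturating the UserService compartment leaves the ProductService compartment untouched.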
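And a stripped-down circuit breaker for fix 3. Consecutive-failure counting stands in for the windowed failure rate a real library like Resilience4j uses, and the thresholds are illustrative:

```java
import java.util.function.Supplier;

public class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized <T> T call(Supplier<T> operation) {
        if (openedAt >= 0) {
            if (System.currentTimeMillis() - openedAt < openMillis) {
                // Open: fail fast without touching the downstream service.
                throw new IllegalStateException("circuit open, failing fast");
            }
            openedAt = -1; // half-open: let one trial call through
        }
        try {
            T result = operation.get();
            consecutiveFailures = 0; // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // trip open
            }
            throw e;
        }
    }
}
```

Once tripped, calls fail in microseconds instead of burning a timeout each, which is what protects the client’s own thread and connection pools.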

When you combine these patterns thoughtfully – using timeouts to detect failures quickly, retries with exponential backoff to handle transient issues, and bulkheads (especially circuit breakers) to isolate failing dependencies – you build a much more resilient system.

The next error you’ll hit after fixing these is likely related to graceful degradation: what happens when a required dependency is permanently unavailable, and even retries and circuit breakers can’t help?

Want structured learning?

Take the full System Design course →