Game Days are designed to simulate real-world failure scenarios in a controlled environment, allowing teams to practice their incident response and remediation skills.

Here’s an SRE team running a Game Day, simulating a cascading failure in their microservices architecture:

```mermaid
graph TD
    A[User Request] --> B{API Gateway};
    B --> C{Auth Service};
    B --> D{Product Catalog Service};
    C --> E{User Database};
    D --> F{Inventory Service};
    F --> G{Order Service};
    G --> H{Payment Gateway};
    H --> I[External Payment Provider];
    F --> J[Product Cache];
    D --> K[Search Index];
```

Scenario: A sudden spike in traffic coupled with a deployment error in the Product Catalog Service causes it to become unresponsive.

Initial Impact: Users start seeing "Product not found" errors. The API Gateway begins to time out requests destined for the Product Catalog Service.

Cascading Failure:

  1. The API Gateway, unable to reach Product Catalog Service for product details, starts returning errors to users.
  2. The Inventory Service, which relies on Product Catalog Service for product metadata, also begins to fail when trying to validate products for inventory checks.
  3. The Order Service, dependent on Inventory Service for stock validation, then starts failing order placements.
  4. The Search Index, which pulls data from Product Catalog Service, becomes stale or empty, leading to search failures.
  5. The Product Cache may also become inconsistent if its invalidation mechanisms rely on the Product Catalog Service.

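The cascade above is just transitive dependency traversal. A minimal sketch of computing the "blast radius" of a failed service, using an illustrative dependency map that mirrors the diagram (the service names and edges are assumptions for this example, not a real topology):

```python
from collections import deque

# Hypothetical map: each service -> the services it directly depends on.
DEPENDS_ON = {
    "api_gateway":     ["auth", "product_catalog"],
    "inventory":       ["product_catalog"],
    "order":           ["inventory"],
    "search_index":    ["product_catalog"],
    "product_cache":   ["product_catalog"],
    "auth":            ["user_db"],
    "payment_gateway": ["external_payment_provider"],
}

def blast_radius(failed: str) -> set[str]:
    """Return every service that transitively depends on `failed`."""
    # Invert the edges: for each dependency, who calls it.
    callers: dict[str, list[str]] = {}
    for svc, deps in DEPENDS_ON.items():
        for dep in deps:
            callers.setdefault(dep, []).append(svc)
    # Breadth-first walk outward from the failed service.
    impacted, queue = set(), deque([failed])
    while queue:
        svc = queue.popleft()
        for caller in callers.get(svc, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted
```

Running `blast_radius("product_catalog")` on this map flags the API Gateway, Inventory, Order, Search Index, and Product Cache, while Auth and the payment path stay outside the radius, matching the scenario above.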
The Game Day Goal: The SRE team wants to observe how the Auth Service, User Database, Payment Gateway, and External Payment Provider are affected (or not affected) by this specific failure, and how quickly the team can detect, diagnose, and mitigate the Product Catalog Service issue.

Key Observables During the Game Day:

  • Latency: Increased latency on requests hitting the API Gateway, and on downstream services that still have healthy paths.
  • Error Rates: Sharp increase in 5xx errors from the API Gateway and services like Order Service and Inventory Service.
  • Resource Utilization: Spikes in CPU/memory on the API Gateway as it retries failed requests, or potentially on the Auth Service if it’s being hit excessively by users trying to re-authenticate due to perceived issues.
  • Alerting: Which alerts fire, their severity, and how quickly.
  • Monitoring Dashboards: How clear and actionable are the dashboards for identifying the root cause?
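The error-rate observable is the simplest to make concrete. A minimal sketch of the kind of check that should fire during the Game Day; the 5% threshold and the "fraction of 5xx responses" definition are illustrative choices, not a standard:

```python
def error_rate(statuses: list[int]) -> float:
    """Fraction of responses in the window that are 5xx."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if 500 <= s < 600) / len(statuses)

def should_alert(statuses: list[int], threshold: float = 0.05) -> bool:
    """Fire when the 5xx rate over the window exceeds the threshold."""
    return error_rate(statuses) > threshold
```

During the exercise, the team compares when this kind of rule fires against when the failure was actually injected; the gap is their detection latency.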

Internal Mechanics of a Game Day: A Game Day isn’t just about breaking things; it’s about controlled chaos. The "attack" is typically initiated by a dedicated "Game Day Master" or "Chaos Engineer" who injects failures. This can involve:

  • Disabling Services: Using tooling like Chaos Monkey or simply stopping service instances.
  • Introducing Latency: Using network manipulation tools (e.g., tc on Linux) to slow down communication between services.
  • Simulating Resource Exhaustion: Forcing services to consume excessive CPU or memory.
  • Corrupting Data: Injecting malformed data into databases or caches.
  • Blocking Traffic: Using firewall rules or load balancer configurations to block specific endpoints.
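At the application layer, latency injection can be sketched as a decorator that delays calls into a dependency, a stand-in for network-level tools like tc. The delay range and probability parameters here are illustrative defaults, not values from any real chaos tool:

```python
import random
import time
from functools import wraps

def inject_latency(min_ms: float = 100, max_ms: float = 500, probability: float = 1.0):
    """Decorator that sleeps before the wrapped call, simulating a slow
    dependency. `probability` controls what fraction of calls are delayed."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(random.uniform(min_ms, max_ms) / 1000)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Wrapping a client call in this decorator during a Game Day lets the team see whether their timeouts, retries, and latency alerts actually behave as expected.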

The SRE team on "defense" has access to their standard monitoring, logging, and tracing tools. They must work through the incident playbook, identify the faulty component, and apply fixes. The Game Day Master observes their actions and may even escalate the scenario based on the team’s response.
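One mitigation the defending team might reach for is a circuit breaker, so callers of the broken Product Catalog Service fail fast instead of stacking up timeouts that feed the cascade. A minimal sketch, with an illustrative failure threshold:

```python
class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    errors and then fails fast. Threshold is illustrative; real breakers
    also add half-open probing and time-based resets."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, fn, *args, **kwargs):
        if self.is_open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

Failing fast at the API Gateway or Inventory Service boundary is what stops step 1 of the cascade from dragging down steps 2 and 3.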

The True Power of Observability in Game Days: What most people miss is that the real win from Game Days isn’t just fixing the immediate problem, but understanding the gaps in observability that made diagnosis harder. You might discover that your tracing context isn’t propagating correctly across all services, or that a critical metric for service health is missing from your dashboard. The exercise reveals the blind spots in your ability to see what’s happening inside the system during a crisis.
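Trace-context propagation, the kind of gap this exercise often exposes, comes down to every hop forwarding an ID instead of minting a fresh one. An illustrative sketch (the header name is an assumption for this example; real systems typically use the W3C `traceparent` header):

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # illustrative header name, not a standard

def ensure_trace_id(incoming_headers: dict) -> dict:
    """Propagate the caller's trace ID, or start a new trace at the edge.

    A service that drops the incoming header here is exactly the kind of
    blind spot a Game Day surfaces: its spans become orphaned."""
    trace_id = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {**incoming_headers, TRACE_HEADER: trace_id}
```

If any service in the request path fails to carry the ID through, the trace for a failing order placement ends abruptly, and the team loses the thread right when they need it.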

The next step after successfully mitigating this Product Catalog Service failure is to simulate a database outage in the User Database.
