Incident Comms: Beyond Status Updates

Status pages are more than just a public-facing status indicator; they’re the primary tool for managing customer perception and reducing inbound noise during an incident. War rooms, on the other hand, are the high-bandwidth, real-time collaboration spaces where the actual incident resolution happens.

Let’s see a status page in action. Imagine a service outage.

{
  "incident_id": "INC-12345",
  "status": "identified",
  "created_at": "2023-10-27T10:00:00Z",
  "updated_at": "2023-10-27T10:15:00Z",
  "components": [
    {
      "name": "API Gateway",
      "status": "degraded"
    },
    {
      "name": "User Authentication Service",
      "status": "major_outage"
    },
    {
      "name": "Database Cluster",
      "status": "operational"
    }
  ],
  "updates": [
    {
      "created_at": "2023-10-27T10:00:00Z",
      "severity": "informational",
      "message": "We are currently investigating an issue affecting user login."
    },
    {
      "created_at": "2023-10-27T10:15:00Z",
      "severity": "major",
      "message": "We have identified an issue with our User Authentication Service causing login failures. Our team is actively working on a resolution."
    }
  ]
}

This JSON is a snapshot of an incident being tracked. The status field moves through states such as "investigating", "identified", and "resolved" as the incident progresses. The components array reports the health of each part of your system, and the updates array carries the narrative in chronological order. A good status page renders this as plain language: first "We are currently investigating an issue affecting user login," then "We have identified an issue with our User Authentication Service causing login failures. Our team is actively working on a resolution."
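To make that translation step concrete, here is a minimal Python sketch that renders a payload of this shape as plain text. The field names follow the example above; the label tables and the render_incident function are illustrative assumptions, not the API of any particular status page product.

```python
import json
from datetime import datetime

# Hypothetical display labels; the keys are the values used in the example payload.
STATUS_LABELS = {
    "investigating": "Investigating",
    "identified": "Identified",
    "resolved": "Resolved",
}
COMPONENT_LABELS = {
    "operational": "Operational",
    "degraded": "Degraded performance",
    "major_outage": "Major outage",
}


def render_incident(incident: dict) -> str:
    """Turn an incident payload into the plain text a status page might show."""
    status = STATUS_LABELS.get(incident["status"], incident["status"])
    lines = [f"{incident['incident_id']}: {status}", "", "Components:"]
    for component in incident["components"]:
        label = COMPONENT_LABELS.get(component["status"], component["status"])
        lines.append(f"  {component['name']}: {label}")
    lines += ["", "Updates (newest first):"]
    # ISO 8601 UTC timestamps in a single format sort correctly as plain strings.
    newest_first = sorted(incident["updates"], key=lambda u: u["created_at"], reverse=True)
    for update in newest_first:
        ts = datetime.fromisoformat(update["created_at"].replace("Z", "+00:00"))
        lines.append(f"  {ts:%Y-%m-%d %H:%M} UTC - {update['message']}")
    return "\n".join(lines)


# A trimmed payload in the same shape as the example.
PAYLOAD = """
{
  "incident_id": "INC-12345",
  "status": "investigating",
  "components": [
    {"name": "User Authentication Service", "status": "major_outage"}
  ],
  "updates": [
    {"created_at": "2023-10-27T10:00:00Z",
     "message": "We are currently investigating an issue affecting user login."}
  ]
}
"""

if __name__ == "__main__":
    print(render_incident(json.loads(PAYLOAD)))
```

Unknown status values fall through unchanged rather than raising, so a new state added upstream shows up raw on the page instead of breaking it. Whether that is the right trade-off depends on how strictly you want to control public wording.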

The core problem status pages solve is managing the information asymmetry between the engineers fixing the problem and the users experiencing it. Without a status page, users default to their imagination, which often conjures scenarios far worse than reality. This leads to a flood of support tickets, social media complaints, and general panic. A well-maintained status page channels this communication, providing a single source of truth that calms nerves and directs attention.

Internally, the status page is often driven by an incident management tool. When an SRE declares an incident, they select affected components and provide an initial summary. As the incident progresses, the SRE updates the status and adds more detailed, though still user-friendly, messages. The system then publishes these updates to the status page.
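The declare, update, publish flow can be sketched in a few lines of Python. This is a toy model under stated assumptions: the Incident class, declare_incident, and the forward-only transition table are hypothetical, and a real incident management tool will have its own data model and API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumption: status only moves forward. The article names the states, not the rules.
ALLOWED_TRANSITIONS = {
    "investigating": {"identified", "resolved"},
    "identified": {"resolved"},
    "resolved": set(),
}


@dataclass
class Incident:
    incident_id: str
    components: dict  # component name -> component status
    status: str = "investigating"
    updates: list = field(default_factory=list)

    def post_update(self, message, new_status=None, now=None):
        """Record an update, optionally change status, and return the payload to publish."""
        if new_status is not None and new_status != self.status:
            if new_status not in ALLOWED_TRANSITIONS[self.status]:
                raise ValueError(f"cannot move from {self.status} to {new_status}")
            self.status = new_status
        # `now` is assumed to be UTC; it is injectable so the flow can be tested.
        stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
        self.updates.append({"created_at": stamp, "message": message})
        return self.to_payload()

    def to_payload(self):
        """Build the dictionary a status page publisher would consume."""
        return {
            "incident_id": self.incident_id,
            "status": self.status,
            "components": [
                {"name": name, "status": status}
                for name, status in self.components.items()
            ],
            "updates": list(self.updates),
        }


def declare_incident(incident_id, components, summary, now=None):
    """Open an incident with its affected components and an initial summary."""
    incident = Incident(incident_id, dict(components))
    incident.post_update(summary, now=now)
    return incident
```

Every call to post_update returns the full payload, so the publishing step can stay a dumb "send whatever I am given" function and the incident object remains the single source of truth.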

The levers you control are the severity levels and the message content. Severity might range from minor (e.g., a single user experiencing a glitch) to major (e.g., widespread service disruption) to critical (e.g., complete service unavailability). The messages need to be factual, transparent, and empathetic. Avoid jargon. Explain what is happening and what is being done, without over-promising on resolution times unless you’re very confident.
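One way to keep those two levers honest is to encode them. The Python sketch below models the three severity levels as an ordered enum and runs a crude check on draft messages. The jargon list and the resolution-time pattern are hypothetical examples; a real team would maintain its own list, and a human should still review every message.

```python
import re
from enum import IntEnum


class Severity(IntEnum):
    """Ordered so that comparisons such as `sev >= Severity.MAJOR` work."""
    MINOR = 1      # e.g., a single user experiencing a glitch
    MAJOR = 2      # e.g., widespread service disruption
    CRITICAL = 3   # e.g., complete service unavailability


# Hypothetical internal terms that customers should not have to decode.
JARGON = ("oom", "failover", "p99", "rollback", "shard")

# Flags phrases like "in 10 minutes" or "within 2 hours" that commit to a timeline.
ETA_PATTERN = re.compile(r"\b(within|in)\s+\d+\s+(minutes?|hours?)\b", re.IGNORECASE)


def lint_message(message: str) -> list:
    """Return warnings for a draft status update; an empty list means no findings."""
    warnings = []
    for term in JARGON:
        if re.search(rf"\b{re.escape(term)}\b", message, re.IGNORECASE):
            warnings.append(f"jargon: {term}")
    if ETA_PATTERN.search(message):
        warnings.append("states a specific resolution time")
    return warnings
```

A draft such as "Auth pods hit OOM; failover done, fixed in 10 minutes" would be flagged three times, while the second update from the example payload passes cleanly. The check is deliberately advisory: it returns warnings rather than blocking, because a confident, specific ETA is sometimes the right call.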

War rooms, conversely, are about rapid, focused problem-solving. They are not for general discussion. A typical war room might be a dedicated Slack channel (e.g., #incident-inc12345-warroom), a Zoom call, or a combination. The key is a low-friction, high-bandwidth communication channel where the incident commander, engineers, and subject matter experts can quickly share findings, propose solutions, and assign tasks.
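Part of keeping friction low is making the channel predictable, so nobody has to ask where the war room is. A minimal Python sketch, assuming the naming pattern from the example above; war_room_channel and kickoff_message are hypothetical helpers, and nothing here calls a chat API.

```python
import re


def war_room_channel(incident_id: str) -> str:
    """Derive a channel name like #incident-inc12345-warroom from an incident ID."""
    # Keep only lowercase letters and digits so the name is safe in most chat tools.
    slug = re.sub(r"[^a-z0-9]", "", incident_id.lower())
    if not slug:
        raise ValueError("incident_id must contain at least one letter or digit")
    return f"#incident-{slug}-warroom"


def kickoff_message(incident_id: str, commander: str, summary: str) -> str:
    """Build the first message posted in the war room."""
    return (
        f"{incident_id} declared. Incident commander: {commander}. "
        f"Summary: {summary} "
        f"Post findings and actions in {war_room_channel(incident_id)}."
    )
```

For the example incident, war_room_channel("INC-12345") returns "#incident-inc12345-warroom". Deriving the name from the ID means anyone who knows the incident number can find the room without being invited first.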

What surprises most people about war rooms is how much they rely on a shared, often unspoken understanding of roles and responsibilities that is established before the incident begins. The incident commander doesn’t need to say "Alice, please investigate the database logs" if Alice is the primary database engineer and has implicitly taken on that role by joining the war room. This emergent order, supported by clear communication channels and a culture of ownership, is what makes fast diagnosis and remediation possible.

The next concept you’ll want to dive into is the incident postmortem, which analyzes what happened once the fire is out.
