Incident Severity: Sev1-Sev4 Explained for Engineers

SRE incident severity levels are less about the technical impact of an issue and more about the business impact, often measured by user-facing disruption.

Let’s look at a typical e-commerce site during a flash sale.

Imagine a scenario where a user tries to add an item to their cart, but the "Add to Cart" button spins indefinitely, never completing the action. This isn’t a system crash, but for a customer trying to snag a limited-time deal, it’s a critical failure.

{
  "incident_id": "INC-20231027-001",
  "severity": "SEV1",
  "title": "Users unable to add items to cart during Flash Sale",
  "status": "Investigating",
  "affected_services": ["cart-service", "product-catalog-service"],
  "start_time": "2023-10-27T10:00:00Z",
  "impact_description": "Users are reporting that the 'Add to Cart' button is unresponsive, preventing purchases. This is directly impacting revenue generation during a peak sales event."
}

When an incident like this occurs, SREs classify it to ensure the right resources and urgency are applied. The levels are a spectrum, but here’s a breakdown of how they’re typically used:

SEV1: Critical System Failure

This is the big one. A SEV1 incident means a core service is completely down or severely degraded, impacting a significant portion of users and causing direct, immediate business loss. Think of it as the "fire alarm" level.

What it looks like: The entire website is down, payment processing is failing for everyone, or a critical user flow (like checkout) is completely broken.
Business Impact: Massive revenue loss, severe brand damage, potential legal or regulatory issues.
Response: Immediate, all-hands-on-deck response. Pager alerts fire, war rooms are convened, and the highest priority is given to resolving the issue.

SEV2: Major Service Degradation

A SEV2 incident is serious but not a complete outage. A key feature might be broken for many users, or a core service is functioning but at a significantly reduced capacity, leading to widespread user frustration and some business impact.

What it looks like: Users can browse products but can’t add to their cart (outside a peak event; during a flash sale, as in our example, the same failure is treated as a SEV1), search functionality is returning incorrect results, or the login system is intermittently failing.
Business Impact: Significant user experience degradation, noticeable revenue impact, potential for customer churn.
Response: High urgency, dedicated SRE team focus, but perhaps not the entire company dropping everything. Resolution is a top priority.

SEV3: Minor Service Degradation or Isolated Impact

At this level, the issue affects a smaller subset of users or a less critical feature. It’s noticeable and problematic for those impacted, but the overall business operation continues with minimal disruption.

What it looks like: A specific page is loading slowly, a non-essential feature (like wishlists) is broken for some users, or an internal tool used by a specific department is experiencing issues.
Business Impact: Annoyance for a segment of users, minor potential for lost sales, but not a crisis.
Response: A dedicated SRE team will work on it, but it might be prioritized alongside other tasks. It doesn’t typically trigger widespread alerts.

SEV4: Cosmetic or Informational Issues

SEV4 incidents are the least severe. They are often minor bugs, typos, or issues that don’t directly impact functionality or revenue. They are important to fix for polish and long-term health, but not urgent.

What it looks like: A broken link on a marketing page, a misaligned button on a rarely used form, or an incorrect log message.
Business Impact: Negligible. May slightly affect user perception of quality if noticed.
Response: Handled by the SRE team during regular work hours, often batched with other low-priority tasks or bug fixes.
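
One way to make these levels actionable is to encode each one alongside the response it triggers. The sketch below is illustrative only: the `Severity` enum, the `pagesOnCall` flag, and the acknowledgement targets are hypothetical values chosen for this example, not industry standards, and every organization sets its own.

```java
import java.time.Duration;

// Hypothetical encoding of the four levels described above.
// The paging flags and acknowledgement targets are placeholder assumptions.
enum Severity {
    SEV1(true, Duration.ofMinutes(5)),   // all hands, war room
    SEV2(true, Duration.ofMinutes(15)),  // dedicated team, high urgency
    SEV3(false, Duration.ofHours(4)),    // worked alongside other tasks
    SEV4(false, Duration.ofDays(3));     // batched during regular hours

    private final boolean pagesOnCall;
    private final Duration ackTarget;

    Severity(boolean pagesOnCall, Duration ackTarget) {
        this.pagesOnCall = pagesOnCall;
        this.ackTarget = ackTarget;
    }

    boolean pagesOnCall() {
        return pagesOnCall;
    }

    Duration ackTarget() {
        return ackTarget;
    }

    // Levels are declared most severe first, so a lower ordinal is more severe.
    boolean isMoreSevereThan(Severity other) {
        return this.ordinal() < other.ordinal();
    }
}
```

Keeping the response policy next to the level means a reclassification (for example, SEV3 to SEV2) changes who gets paged without anyone having to remember the rules during the incident.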

The classification isn’t always black and white. A SEV3 issue can quickly escalate to a SEV2 if it starts affecting more users or if it’s discovered to be a symptom of a larger problem. Conversely, a SEV1 might be downgraded if the initial assessment was overly cautious and the actual impact is contained. The key is that the severity classification drives the response – how quickly we investigate, who we involve, and what resources are allocated.

A perhaps surprising aspect of incident severity is that it often depends on the time of day and the business context. A minor bug that causes a single user to fail checkout at 3 AM might be a SEV3. The exact same bug, during peak Black Friday shopping hours, could easily become a SEV1 because it hits far more customers and the immediate business loss is far higher.
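
This context dependence can be sketched by classifying on estimated revenue loss per minute rather than on the bug itself. Everything in the example is an assumption made for illustration: the `SeverityClassifier` class, its inputs, and the dollar thresholds are hypothetical, and a real classification would also weigh factors such as brand damage and regulatory exposure.

```java
// A minimal sketch: the same failure rate maps to different severities
// depending on how much traffic is flowing when it occurs.
final class SeverityClassifier {
    enum Level { SEV1, SEV2, SEV3, SEV4 }

    // Hypothetical thresholds, in dollars of lost revenue per minute.
    private static final double SEV1_LOSS_PER_MINUTE = 1000.0;
    private static final double SEV2_LOSS_PER_MINUTE = 100.0;

    static Level classify(double failureRate, double ordersPerMinute, double averageOrderValue) {
        if (failureRate < 0.0 || failureRate > 1.0) {
            throw new IllegalArgumentException("failureRate must be between 0 and 1");
        }
        // Estimated loss = share of failing orders * order volume * value per order.
        double lossPerMinute = failureRate * ordersPerMinute * averageOrderValue;
        if (lossPerMinute >= SEV1_LOSS_PER_MINUTE) {
            return Level.SEV1;
        }
        if (lossPerMinute >= SEV2_LOSS_PER_MINUTE) {
            return Level.SEV2;
        }
        if (lossPerMinute > 0.0) {
            return Level.SEV3;
        }
        return Level.SEV4;
    }
}
```

With these made-up numbers, a bug that fails 20% of checkouts gives `classify(0.2, 2, 50.0)` a result of `SEV3` at a quiet 2 orders per minute, while `classify(0.2, 500, 50.0)` returns `SEV1` at a peak of 500 orders per minute.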

Consider the "Add to Cart" problem again. An SRE might look at the cart-service logs and see a flood of java.lang.NullPointerException errors originating from the CartService.addItem method.

// Snippet from cart-service.log
2023-10-27 10:01:15.123 ERROR [http-nio-8080-exec-5] c.e.c.CartController - Error processing add to cart request:
java.lang.NullPointerException: null
    at com.example.cart.service.CartService.addItem(CartService.java:152)
    at com.example.cart.controller.CartController.addToCart(CartController.java:88)
    ...

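This NPE is happening because a required product_id parameter is unexpectedly null in some incoming requests. The immediate fix might be to add a null check so that a request with a missing product_id is rejected with a 400 Bad Request instead of failing with a server error. In the snippet below, the service throws an IllegalArgumentException; the controller layer still has to translate that exception into the 400 response.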

// Snippet from CartService.java
public void addItem(String userId, String productId, int quantity) {
    if (productId == null || productId.isEmpty()) { // Added null check
        throw new IllegalArgumentException("Product ID cannot be null or empty.");
    }
    // ... existing logic ...
}
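
The null check above only throws an exception; something still has to turn it into a 400 response. The sketch below shows that translation in plain Java with no web framework. `AddToCartHandler` and `CartServiceSketch` are hypothetical stand-ins for the real controller and service, and returning a bare status code is a simplification made for this example.

```java
// Stand-in for the real CartService; only the validation step is shown.
final class CartServiceSketch {
    void addItem(String userId, String productId, int quantity) {
        if (productId == null || productId.isEmpty()) {
            throw new IllegalArgumentException("Product ID cannot be null or empty.");
        }
        // Real cart logic is omitted in this sketch.
    }
}

// Hypothetical controller-layer handler that maps outcomes to HTTP status codes.
final class AddToCartHandler {
    private final CartServiceSketch cartService = new CartServiceSketch();

    int handle(String userId, String productId, int quantity) {
        try {
            cartService.addItem(userId, productId, quantity);
            return 200; // OK: the service accepted the request
        } catch (IllegalArgumentException e) {
            return 400; // Bad Request: the client sent invalid input
        }
    }
}
```

Here `new AddToCartHandler().handle("user-1", null, 1)` returns 400, while a request with a non-empty product ID returns 200. In a web framework, the same mapping is usually done by a shared exception handler rather than a try/catch in every endpoint.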

However, the real problem isn’t the NPE itself, but why product_id is sometimes null. This could be due to a recent deployment in the frontend application that’s failing to fetch product details correctly before sending the add-to-cart request, or a change in the product-catalog-service API that’s causing partial data to be returned. The SEV1 classification compels the team to investigate both possibilities simultaneously, not just patch the symptom in cart-service.

The next challenge you’ll face is determining the root cause of why the product_id is becoming null in the first place, which will likely involve diving into frontend logs or the product-catalog-service’s API responses.
