On-call is a rite of passage for SREs, and doing it well is more art than science.
Let’s see what a typical on-call rotation actually looks like in practice. Imagine a team of five SREs: Alice, Bob, Charlie, David, and Eve. Their on-call schedule is a simple weekly rotation, starting each Monday at 9 AM PST.
- Week 1: Alice
- Week 2: Bob
- Week 3: Charlie
- Week 4: David
- Week 5: Eve
- Week 6: Alice (cycle repeats)
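The rotation math is simple enough to sketch in a few lines of Python. The anchor date below is hypothetical (chosen as a Monday), and the function assumes a strict weekly handoff with no overrides:

```python
from datetime import datetime, timedelta

ENGINEERS = ["Alice", "Bob", "Charlie", "David", "Eve"]
# Hypothetical anchor: a Monday 9 AM on which Alice's week began.
ROTATION_START = datetime(2024, 1, 1, 9, 0)

def on_call(now: datetime) -> str:
    """Return the engineer on call at `now` under a weekly rotation."""
    weeks_elapsed = (now - ROTATION_START) // timedelta(weeks=1)
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]
```

For example, `on_call(datetime(2024, 1, 10, 12, 0))` falls in the second week of the cycle, so it returns `"Bob"`; six weeks after the anchor the cycle wraps back to Alice. Real schedulers layer overrides, time zones, and holiday swaps on top of this core modular arithmetic.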
This rotation is managed by an automated system, often integrated with tools like PagerDuty or Opsgenie. The configuration in PagerDuty might look something like this:
```hcl
# pagerduty_escalation_policy.tf
resource "pagerduty_escalation_policy" "sre_team" {
  name      = "SRE On-Call Rotation"
  teams     = [pagerduty_team.sre.id]
  num_loops = 0 # Do not repeat the escalation chain after the last rule

  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "user_reference"
      id   = pagerduty_user.alice.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "user_reference"
      id   = pagerduty_user.bob.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.charlie.id
    }
  }

  # ... and so on for David and Eve
}
```
In this PagerDuty setup, Alice is the primary on-call. If she doesn’t acknowledge an alert within 5 minutes, Bob is notified. If he doesn’t acknowledge within the next 10 minutes (15 minutes total from initial alert), Charlie gets paged, and so on. This multi-level escalation is crucial for ensuring alerts are addressed promptly.
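The cumulative timing of that chain is easy to get wrong, so it helps to compute it explicitly. The sketch below mirrors the delays in the policy above (names and values are from the example; the function itself is illustrative):

```python
# (responder, minutes to wait before escalating past them),
# mirroring the escalation policy above.
RULES = [("Alice", 5), ("Bob", 10), ("Charlie", 15)]

def paged_at(rules):
    """Return (responder, minutes after the initial alert they are paged)."""
    schedule, elapsed = [], 0
    for responder, delay in rules:
        schedule.append((responder, elapsed))
        elapsed += delay
    return schedule
```

Running `paged_at(RULES)` shows Alice paged at minute 0, Bob at minute 5, and Charlie at minute 15, matching the walkthrough above: each rule's delay is how long that responder has to acknowledge before the alert moves on.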
But what happens when an alert does fire? Let’s say a critical database is experiencing high latency, causing user-facing errors.
1. Alert Triggered: `db_latency_high` (threshold: 95th percentile latency > 500ms for 5 minutes).
2. Initial PagerDuty Notification: Alice receives a PagerDuty alert on her phone and in Slack. The alert includes:
   - Service: Production Database Cluster A
   - Severity: Critical
   - Summary: 95th percentile read latency for `db_latency_high` is 750ms.
   - Runbook Link: https://runbooks.example.com/db_latency_high
3. Acknowledgement: Alice acknowledges the alert within 3 minutes. This stops further escalation for this specific alert.
4. Diagnosis: Alice clicks the runbook link. The runbook guides her through initial checks:
   - Check CPU/memory: `kubectl top pod <db-pod-name> -n database`
   - Check network: `ping <db-service-ip>` from an application pod.
   - Check slow queries (on the database itself): `SELECT * FROM pg_stat_activity WHERE state = 'active' AND query_start < NOW() - INTERVAL '1 minute' ORDER BY query_start;`
5. Finding the Root Cause: Alice discovers a few active queries consuming high CPU and blocking other operations. One is a poorly optimized `SELECT *` on a massive table without a `WHERE` clause.
6. Mitigation/Recovery: The runbook suggests cancelling the offending query.
   - Command: `SELECT pg_cancel_backend(<pid_from_slow_query_check>);`
   - Expected result: database latency drops back to normal levels within seconds.
7. Post-Mortem & Follow-up: Alice creates a ticket to address the root cause: optimizing the slow query. This might involve adding an index, rewriting the query, or disabling the feature that generated it.
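The alert condition that started this incident, 95th percentile latency above 500ms, can be sketched as a simple percentile check. This is an illustration, not a real monitoring system: a production rule would also require the condition to hold for the full 5-minute window, and function names here are invented:

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples (ms)."""
    ordered = sorted(samples_ms)
    # Nearest-rank method: the ceil(0.95 * n)-th value, 1-indexed.
    rank = max(1, -(-len(ordered) * 95 // 100))
    return ordered[rank - 1]

def latency_alert(samples_ms, threshold_ms=500):
    """True if the p95 latency of the window breaches the threshold."""
    return p95(samples_ms) > threshold_ms
```

A window of mostly 100ms samples with a couple of 750ms+ outliers in the top 5% is enough to trip the alert, which is exactly why p95 is a common choice: it ignores rare stragglers but catches a sustained slow tail.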
The mental model for on-call is that of a gatekeeper and first responder. You’re not expected to have every answer, but you are expected to:
- Receive: Be available and reachable when pages occur.
- Acknowledge: Confirm you’ve seen the alert to prevent cascading pages.
- Diagnose: Use provided runbooks and your knowledge to understand the problem.
- Mitigate: Take immediate steps to stop the bleeding and restore service.
- Escalate: If you can’t fix it, or if it’s beyond your scope, escalate to the right person or team.
- Document: Create tickets for follow-up work and contribute to post-mortems.
The core of effective on-call is having excellent runbooks. A good runbook is not just a list of commands; it’s a decision tree. It guides the on-call engineer through identifying the problem, understanding its impact, and executing the correct mitigation steps. It should clearly state what actions are safe to perform without further consultation and when escalation is necessary.
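One way to make the decision-tree idea concrete is to represent a runbook as data: each step has a question and branches to the next step per answer, terminating in an action. All step names and actions below are illustrative, loosely based on the database incident above:

```python
# A runbook step: (prompt, {answer: next_step}) or (action, None) at a leaf.
RUNBOOK = {
    "start": ("Is CPU/memory saturated on the DB pod?",
              {"yes": "capacity", "no": "check_slow_queries"}),
    "check_slow_queries": ("Are long-running queries blocking others?",
                           {"yes": "cancel_query", "no": "escalate"}),
    "cancel_query": ("Cancel the offending query (safe action).", None),
    "capacity": ("Escalate to the capacity team.", None),
    "escalate": ("Escalate to the database team.", None),
}

def walk(runbook, answers):
    """Follow the tree using a dict of answers; return the terminal action."""
    step = "start"
    while True:
        prompt, branches = runbook[step]
        if branches is None:
            return prompt
        step = branches[answers[step]]
```

Note how the leaves encode exactly the property a good runbook needs: each terminal step is either an action explicitly marked safe to perform alone, or an instruction to escalate.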
Crucially, the escalation policy needs to be dynamic. As teams grow and services become more complex, a static rotation might not be sufficient. Implementing skills-based routing or on-call tiers can ensure that alerts reach the engineer with the most relevant expertise first, reducing MTTR. For instance, a critical alert for the payments service might bypass the general SRE rotation and go directly to a payments-focused SRE.
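At its simplest, skills-based routing is a lookup from the alerting service to a specialist rotation, with a fallback to the general one. The team names below are hypothetical:

```python
# Hypothetical service-to-rotation routing table; services without a
# specialist rotation fall back to the general SRE rotation.
ROUTES = {"payments": "payments-sre", "database": "db-sre"}

def route(alert_service: str) -> str:
    """Return the on-call rotation that should receive this alert."""
    return ROUTES.get(alert_service, "sre-general")
```

So an alert tagged `payments` pages the payments specialists directly, while anything unrecognized still lands on the general rotation rather than being dropped.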
The next challenge after mastering on-call rotations and escalations is effectively managing incidents themselves, which involves communication, coordination, and clear leadership during high-pressure situations.