Conventional wisdom holds that SRE teams are about reliability engineering, but the real magic is in how they structure the work itself to prevent burnout and sustain velocity.

Let’s look at a typical SRE team’s day, managing a service with a 99.99% availability target.

{
  "service_name": "user-auth-api",
  "version": "v2.3.1",
  "team_sre": "auth-reliability",
  "alerting_rules": [
    {
      "name": "high_latency_p99",
      "threshold": "500ms",
      "severity": "critical",
      "duration": "5m",
      "handler": "oncall_auth_reliability"
    },
    {
      "name": "error_rate_5xx",
      "threshold": "1%",
      "severity": "warning",
      "duration": "10m",
      "handler": "oncall_auth_reliability"
    }
  ],
  "incident_response": {
    "sla_acknowledgement": "5m",
    "sla_resolution": "30m"
  },
  "project_backlog": [
    {
      "id": "AUTH-123",
      "title": "Implement circuit breaker for downstream payment service",
      "status": "in_progress",
      "assignee": "sre_alice",
      "estimated_effort": "2d"
    },
    {
      "id": "AUTH-124",
      "title": "Automate database schema migration rollback",
      "status": "backlog",
      "assignee": null,
      "estimated_effort": "3d"
    }
  ],
  "oncall_schedule": {
    "primary": "week_1_4",
    "secondary": "week_2_3"
  }
}

When the high_latency_p99 alert fires (p99 latency above 500ms for 5 minutes), the on-call SRE, Alice, gets paged. She first checks the service dashboard in Datadog for recent deployments, traffic spikes, or resource saturation in the user-auth-api cluster. If the issue is transient, she may simply monitor. If it persists, she digs into the Splunk logs for errors or unusual patterns and, if a recent deployment is the suspected cause, checks the deployment history in Spinnaker and initiates a rollback. Her goal is to restore service health within the 30m resolution SLA and protect the 99.99% availability target.

The core problem SRE teams solve is the inherent tension between delivering new features rapidly and maintaining service stability. Without a dedicated team focused on reliability, the burden often falls haphazardly onto development teams, leading to rushed fixes, technical debt, and ultimately, burnout. SRE formalizes this by creating a buffer of engineers whose primary mandate is to prevent incidents, not just respond to them.

An SRE team’s responsibilities are typically split. About 50% of their time is dedicated to "Ops" work: responding to incidents, managing the on-call rotation, and handling toil (manual, repetitive tasks that should be automated). The other 50% is "Eng" work: building automation, improving monitoring, designing for reliability, and contributing to product development from a stability perspective. This 50/50 split is crucial for preventing burnout and ensuring that the team is actively reducing the need for Ops work over time.
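One way to keep that split honest is to track logged hours by category and flag when ops work crosses the 50% ceiling. A minimal sketch, assuming the team records time entries somewhere (the TimeEntry shape and OpsShare helper are hypothetical, not an existing tool):

```go
package main

import "fmt"

// TimeEntry records hours an SRE logged against either ops or eng work.
// The categories come from the 50/50 split described above; the tracking
// mechanism itself is an illustrative assumption.
type TimeEntry struct {
	Category string // "ops" or "eng"
	Hours    float64
}

// OpsShare returns the fraction of total logged hours spent on ops work.
func OpsShare(entries []TimeEntry) float64 {
	var ops, total float64
	for _, e := range entries {
		total += e.Hours
		if e.Category == "ops" {
			ops += e.Hours
		}
	}
	if total == 0 {
		return 0
	}
	return ops / total
}

func main() {
	week := []TimeEntry{{"ops", 22}, {"eng", 18}}
	share := OpsShare(week)
	fmt.Printf("ops share: %.0f%%\n", share*100)
	if share > 0.5 {
		// A sustained overage is the signal to prioritize toil reduction.
		fmt.Println("over budget: schedule toil-reduction work")
	}
}
```

The point is not precise timesheets; it is having a trend line that turns "we feel overloaded" into a concrete number the team can act on.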

The boundaries are just as important as the responsibilities. SRE teams do not own the service in the same way a development team does. They are guardians of reliability, working with development teams. If a development team is building a new feature that might impact latency, the SRE team will consult on design and testing, but the development team remains the ultimate owner of that feature’s reliability. The SRE team might set the SLOs (Service Level Objectives), like the 99.99% availability, and build the monitoring to track them, but the development team is responsible for meeting those SLOs. This prevents the SRE team from becoming a bottleneck or a dumping ground for "undesirable" engineering tasks.

The size of an SRE team is often a function of the number of services they support and the complexity of those services. A common ratio is one SRE for every 10-20 production services, but this varies wildly. A service with high traffic, complex dependencies, or a history of instability will demand more SRE attention. For a team like auth-reliability supporting a critical user-auth-api, they might have 3-5 engineers, ensuring adequate coverage for on-call and project work.

One common misconception is that SREs are just glorified sysadmins. The reality is that SREs are software engineers who apply software engineering principles to operations problems. They write code to automate tasks, build tooling, and even contribute to product code to improve observability and resilience. For example, when the user-auth-api experiences intermittent network issues between pods, an SRE might write a small Go program that runs in the background, periodically testing connectivity and logging detailed metrics, which is far more robust than manual ping tests. This program would then be deployed as a sidecar or a separate monitoring agent.

The next challenge you’ll typically face is defining clear SLOs and error budgets for the services your SRE team supports.
