The most surprising thing about SRE cost vs. reliability trade-offs is that the most expensive systems are often not the most reliable ones.
Let’s look at a hypothetical e-commerce platform. Imagine a critical "checkout" service. We want it to be highly available, let’s say 99.99% uptime. This translates to about 52 minutes of downtime per year.
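The arithmetic behind that figure is worth making explicit. A quick sketch (plain Python, no SRE tooling assumed) converts an availability SLO into an annual downtime budget:

```python
# Convert an availability SLO into an annual downtime (error) budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_budget_minutes(slo: float) -> float:
    """Minutes per year a service at this SLO may be unavailable."""
    return (1 - slo) * MINUTES_PER_YEAR

print(f"99.99%: {downtime_budget_minutes(0.9999):.1f} min/year")  # ~52.6 min
print(f"99.9%:  {downtime_budget_minutes(0.999):.1f} min/year")   # ~526 min, i.e. ~8.7 hours
```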
Here’s how a transaction might flow:
- User clicks "Buy": the request hits the API Gateway.
- API Gateway: routes to the `CheckoutService`.
- `CheckoutService`:
  - Validates cart contents (calls `CartService`).
  - Checks inventory (calls `InventoryService`).
  - Processes payment (calls the external `PaymentGateway` API).
  - Creates the order record (writes to `OrderDB`).
  - Notifies shipping (calls `ShippingService`).

Each dependency, in turn:

- `CartService`: reads cart data from `CartCache` (Redis).
- `InventoryService`: reads and writes inventory counts in `InventoryDB` (Postgres).
- `PaymentGateway`: external, third-party payment processor.
- `OrderDB`: primary database for orders.
- `ShippingService`: publishes an event to a Kafka topic for downstream processing.
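To make the call chain concrete, here is a minimal orchestration sketch. Every class and method name is an illustrative stand-in, not a real API:

```python
# Hypothetical sketch of the CheckoutService call chain described above.
# All class and method names are illustrative stand-ins, not a real API.
class CartService:
    def get_items(self, user_id): return [{"sku": "ABC-1", "qty": 2}]

class InventoryService:
    def reserve(self, items): pass  # would read/write InventoryDB (Postgres)

class PaymentGateway:
    def charge(self, user_id, items): pass  # external third-party call

class OrderDB:
    def create(self, user_id, items): return {"order_id": "o-123", "items": items}

class ShippingService:
    def publish(self, order): pass  # would publish to a Kafka topic

def checkout(user_id):
    cart, inv, pay, db, ship = (CartService(), InventoryService(),
                                PaymentGateway(), OrderDB(), ShippingService())
    items = cart.get_items(user_id)    # 1. validate cart contents
    inv.reserve(items)                 # 2. check inventory
    pay.charge(user_id, items)         # 3. process payment
    order = db.create(user_id, items)  # 4. create order record
    ship.publish(order)                # 5. notify shipping
    return order
```

The point of the sketch is the failure surface: each of the five steps is a separate dependency whose unreliability eats into the checkout SLO.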
Now, what does "spending error budget wisely" mean in this context? It means being deliberate about when and how we accept unreliability to save money, and ensuring those savings don’t jeopardize our SLOs.
Consider the InventoryService. If it’s a bottleneck and we need to scale it up massively with expensive, high-CPU instances to meet peak demand, that’s a direct cost. But what if it’s occasionally slow or unavailable?
- **Scenario A: High Reliability, High Cost**
  - We provision 20 `InventoryService` instances, each with 8 vCPUs and 32 GB RAM, running on dedicated hardware.
  - We use a highly available, multi-master `InventoryDB` with active-active replication.
  - Cost: very high.
  - Reliability: excellent, even under extreme load.
- **Scenario B: Moderate Reliability, Moderate Cost (Error Budget in Play)**
  - We provision 10 `InventoryService` instances, each with 4 vCPUs and 16 GB RAM, on standard instances.
  - We use a single-master `InventoryDB` with read replicas.
  - Cost: significantly lower.
  - Reliability: good, but may see occasional latency or brief unavailability during peak load or maintenance.
If our SLO for InventoryService is 99.9% (about 8.7 hours of downtime per year), we have a larger error budget than the CheckoutService’s 99.99%. This means we can afford for InventoryService to be unavailable for longer periods. We might choose Scenario B, accepting that some small percentage of transactions might fail or be delayed due to inventory checks. The cost savings from using fewer, smaller instances and a less complex database are substantial. We then spend that saved money elsewhere, perhaps on better monitoring, more robust alerting for critical services, or on engineers to quickly resolve the actual incidents that consume our error budget.
The key is to align the cost of reliability with the impact of unreliability on the user experience and business goals. If a 5-second delay in inventory check means a user abandons their cart, that’s a critical failure. If it means they see "out of stock" for a minute and then refresh to see it’s back, that might be acceptable within our error budget.
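One way to keep a slow inventory check from stalling checkout for 5 seconds is a hard timeout with a degraded fallback. A sketch, assuming a hypothetical `check_inventory_slowly` call and an illustrative 500 ms timeout (neither is part of the platform described above):

```python
# Sketch: bound the inventory check so a slow InventoryService degrades
# gracefully instead of blocking checkout. All names here are hypothetical.
import concurrent.futures

def check_inventory_slowly(sku):
    # Stand-in for a call to InventoryService; may be slow under load.
    return {"sku": sku, "in_stock": True}

def inventory_status(sku, timeout_s=0.5, fetch=check_inventory_slowly):
    """Return the inventory result, or a degraded 'unknown' answer on timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch, sku)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # This path consumes error budget, but the user is never stuck waiting.
        return {"sku": sku, "in_stock": None, "degraded": True}
    finally:
        pool.shutdown(wait=False)  # don't block on the straggling call
```

Whether the degraded answer shows "availability unknown" or optimistically allows the purchase is a product decision; the SRE decision is that the latency bound is enforced.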
Let’s say our InventoryService runs on Kubernetes and is backed by a PostgreSQL database. A common Deployment and Service might look like this:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inventory-service
spec:
  replicas: 5  # We'll start with 5 instances
  selector:
    matchLabels:
      app: inventory
  template:
    metadata:
      labels:
        app: inventory
    spec:
      containers:
        - name: inventory
          image: my-repo/inventory-service:v1.2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"    # 0.5 vCPU
              memory: "1Gi"  # 1 GB RAM
            limits:
              cpu: "1000m"   # 1 vCPU
              memory: "2Gi"  # 2 GB RAM
          env:
            - name: DATABASE_URL
              # In a real deployment, pull credentials from a Secret rather
              # than hard-coding them in the manifest.
              value: "postgresql://user:password@inventory-db.default.svc.cluster.local:5432/inventory"
---
apiVersion: v1
kind: Service
metadata:
  name: inventory-service
spec:
  selector:
    app: inventory
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
```
If this setup is too expensive but we still need to meet a 99.9% SLO, we might scale down the replicas and adjust resources:
```yaml
# ... (previous metadata and selector)
  template:
    # ... (previous metadata)
    spec:
      containers:
        - name: inventory
          image: my-repo/inventory-service:v1.2.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"      # 0.25 vCPU
              memory: "512Mi"  # 0.5 GB RAM
            limits:
              cpu: "500m"      # 0.5 vCPU
              memory: "1Gi"    # 1 GB RAM
# ... (rest of the service definition)
```
We might also downgrade the InventoryDB from a db.r5.xlarge instance to a db.r5.large, saving hundreds of dollars a month. The trade-off: the smaller instance may hit its IOPS limits during intense periods, causing latency that consumes our error budget.
The core principle is that reliability is not one-size-fits-all: you don’t need 99.9999% uptime for every single component. By accepting slightly lower reliability for less critical or less user-facing services, you free up budget (both money and engineering effort) to invest in the truly critical paths, keeping them exceptionally stable. It also means that when an incident does occur, the team is better equipped to handle it, because they aren’t constantly firefighting across a uniformly over-engineered system.
The next challenge is understanding how to dynamically adjust resources based on observed load and reliability metrics, rather than static configurations.
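Kubernetes’ Horizontal Pod Autoscaler is the standard starting point for that. A sketch targeting the Deployment above (the 70% CPU target and the replica bounds are illustrative assumptions, not tuned values):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inventory-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inventory-service
  minReplicas: 5    # floor chosen to protect the 99.9% SLO (illustrative)
  maxReplicas: 20   # ceiling caps cost during traffic spikes (illustrative)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # scale out when avg CPU exceeds 70% of requests
```

This only closes the loop on load, not on reliability; tying scaling (or alerting) to SLO burn rate is the harder follow-on step.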