The most surprising thing about SRE cost optimization is that the biggest wins often come from doing less of what you’re already doing, not from finding a magic bullet technology.
Let’s watch a typical service, user-auth, scale up and down on Kubernetes. This service has a deployment, user-auth-deployment, and a horizontal pod autoscaler (HPA), user-auth-hpa.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-auth-deployment
spec:
  replicas: 3  # Starting point
  selector:
    matchLabels:
      app: user-auth
  template:
    metadata:
      labels:
        app: user-auth
    spec:
      containers:
        - name: user-auth
          image: my-registry/user-auth:v1.2.0
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "400m"
              memory: "512Mi"
```
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-auth-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-auth-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # Target 70% of requested CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # Target 80% of requested memory
```
When traffic spikes, user-auth pods start consuming more CPU. The HPA polls the Kubernetes resource metrics API (served by metrics-server) for the average CPU utilization across all user-auth pods, measured as a percentage of their *requested* CPU. If it exceeds 70%, the HPA increases replicas in user-auth-deployment; when utilization drops, it scales back down. The memory metric is evaluated the same way, and the HPA acts on whichever metric demands more replicas.
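The scale-up decision itself is simple arithmetic: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A quick sketch with illustrative numbers (3 replicas averaging 90% utilization against the 70% target above):

```shell
# HPA formula: desiredReplicas = ceil(currentReplicas * currentMetric / target)
# Illustrative numbers: 3 replicas at 90% average utilization, 70% target.
awk 'BEGIN {
  replicas = 3; utilization = 90; target = 70
  desired = replicas * utilization / target   # 3.857...
  printf "%d\n", (desired == int(desired)) ? desired : int(desired) + 1
}'
# prints 4: the HPA would scale user-auth-deployment from 3 to 4 replicas
```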
This automatic scaling prevents over-provisioning during low-traffic periods and ensures availability during peaks, but it’s only part of the story. The real cost optimization comes from understanding the drivers of that scaling and resource consumption.
The core problem SRE cost optimization solves is the mismatch between provisioned capacity and actual demand, leading to wasted spend on idle or underutilized resources. It’s about aligning infrastructure costs directly with the value your services deliver. This isn’t just about "turn it off when nobody’s using it" – it’s a systematic approach to resource management.
Here’s how to build the mental model:
- Visibility is Paramount: You can’t optimize what you can’t see. This means detailed metrics on resource usage (CPU, memory, network, disk) per pod, per deployment, per namespace, and crucially, per service. Tools like Prometheus, Grafana, and cloud provider monitoring suites are your eyes. You need to see request latency, error rates, and throughput alongside resource consumption.
- Cost Allocation: Tie infrastructure costs back to specific services or teams. Cloud provider billing reports, often augmented with Kubernetes labels and annotations, are essential. If you can’t assign a dollar amount to a service’s infrastructure, you can’t effectively optimize it.
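In practice, cost allocation starts with consistent labels on every workload so billing and usage roll up by owner. A minimal sketch (the `team` and `cost-center` label keys here are illustrative conventions, not anything Kubernetes requires):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-auth-deployment
  labels:
    app: user-auth
    team: identity          # hypothetical owning team
    cost-center: "cc-1234"  # hypothetical billing code
spec:
  template:
    metadata:
      labels:
        app: user-auth
        team: identity
        cost-center: "cc-1234"  # propagated to pods so usage aggregates by label
```

Cloud cost tools and billing exports can then group spend by these labels instead of by opaque node-hours.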
- Right-Sizing Resources: This is the bread and butter. The `requests` and `limits` in your pod specs are critical. `requests` tell the Kubernetes scheduler how much CPU/memory to reserve for a pod and drive scheduling decisions; `limits` define the maximum it may consume. If `requests` are too high, you waste resources. If `limits` are too low, you risk CPU throttling or OOMKills.
  - Diagnosis: Use tools like `kube-state-metrics` and the `vpa-recommender` (Vertical Pod Autoscaler). `kubectl top pods -n <namespace>` gives current usage. `kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.requests.cpu}{"\t"}{.spec.containers[0].resources.limits.cpu}{"\n"}{end}'` shows current requests and limits.
  - Action: If `user-auth` pods consistently use 50m CPU but have `requests: {cpu: "200m"}`, you can reduce `requests` to `100m`. This frees up the node for other pods. If pods are hitting `limits: {cpu: "400m"}` and showing high CPU, you may need to raise the limit or scale out more pods.
- Autoscaling Tuning: The HPA `target` values are key, and `averageUtilization` is measured against the pods' `requests`, not their actual capacity. With a target of 70%, the HPA scales up when average usage reaches 70% of requested CPU. If `requests` are set too low, pods cross that threshold while consuming very little absolute CPU, causing unnecessary scaling; if `requests` are too high, utilization stays artificially low and the HPA may fail to scale up under real load.
  - Diagnosis: Observe HPA status and events (`kubectl get hpa -n <namespace> user-auth-hpa -o yaml`) and compare the `target` utilization with actual pod CPU usage metrics from Prometheus.
  - Action: If the HPA keeps scaling up while absolute CPU usage is modest, raise the pods' `requests` to reflect real consumption, or raise the `target`. If pods are saturated but utilization never reaches the target, lower the `requests`.
- Node Utilization: Even if your pods are well-sized, the underlying nodes might be over-provisioned. If your nodes consistently run at 30% CPU utilization, you can likely reduce the node count.
  - Diagnosis: Monitor node resource utilization across the cluster. The `descheduler` (for example its `LowNodeUtilization` strategy) can help identify underutilized nodes and evict pods so they can be consolidated.
  - Action: Reduce the number of nodes in your node group, or tighten your cluster autoscaler's scale-down configuration.
- Spot Instances/Preemptible VMs: For stateless, fault-tolerant workloads, using cheaper, interruptible spot instances can dramatically cut costs.
- Action: Configure your cluster autoscaler or node groups to provision spot instances. Ensure your applications can handle pods being terminated.
- Storage Optimization: Unused or oversized persistent volumes (PVs) are a common hidden cost.
- Diagnosis: Regularly audit PVs. Identify PVs that are not attached to any pods or are significantly larger than their actual usage.
- Action: Delete unused PVs. For oversized PVs, note that Kubernetes volume resizing is grow-only; shrinking a PV generally means migrating the data to a new, correctly sized PV and deleting the old one.
- Network Traffic: Egress traffic, especially from cloud providers, can be expensive. Optimize data transfer patterns.
- Action: Use private endpoints where possible, compress data, or use CDNs for static assets.
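For the right-sizing step above, the Vertical Pod Autoscaler can produce request recommendations without touching running pods. A sketch in recommendation-only mode (assumes the VPA components are installed in your cluster):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: user-auth-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-auth-deployment
  updatePolicy:
    updateMode: "Off"  # recommend only; never evict pods to apply changes
```

The recommended requests then appear in the VPA object's status, giving you data-driven values to copy into the Deployment manifest on your own schedule.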
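For the spot-instance approach, stateless workloads typically opt in through a node selector and a matching toleration. A sketch of a Deployment pod template using GKE-style spot labels and taints as an example (the exact keys vary by cloud provider and node-pool setup):

```yaml
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-spot: "true"  # schedule only onto spot nodes
      tolerations:
        - key: cloud.google.com/gke-spot   # tolerate the spot node taint
          operator: Equal
          value: "true"
          effect: NoSchedule
```

Pair this with a PodDisruptionBudget and graceful shutdown handling so the workload survives spot reclamation.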
The counterintuitive part of cost optimization is that aggressively lowering resource requests can sometimes increase overall cluster cost. If you set requests too low across many pods, the scheduler bin-packs them onto fewer nodes, because on paper there is room. Their actual usage then saturates those nodes, causing CPU throttling and rising latency. And since HPA utilization is measured against requests, usage that is modest in absolute terms now reads as very high utilization, so the HPA scales out aggressively and the cluster autoscaler adds nodes to place the new replicas. You can end up paying more than in the over-provisioned state you started from. The key is to keep requests close to real steady-state usage: high enough to give the scheduler and autoscalers honest signals, low enough to avoid idle reservation.
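The failure mode above can be made concrete with toy numbers, assuming hypothetical 2-vCPU (2000m allocatable) nodes:

```shell
# Toy numbers: 20 pods, each requesting 100m CPU but actually using 400m,
# on nodes with 2000m allocatable CPU. Node counts are ceil(total / per-node).
awk 'BEGIN {
  pods = 20; request = 100; usage = 400; node_cpu = 2000
  nodes_by_request = int((pods * request + node_cpu - 1) / node_cpu)
  nodes_by_usage   = int((pods * usage   + node_cpu - 1) / node_cpu)
  printf "scheduler packs onto %d node(s); real load needs %d\n",
         nodes_by_request, nodes_by_usage
}'
# prints: scheduler packs onto 1 node(s); real load needs 4
```

The gap between the one node the scheduler sees and the four the workload actually needs is what drives the throttling, scale-out churn, and surprise node count.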
The next frontier after optimizing basic resource utilization and scaling is often implementing efficient data tiering and caching strategies to reduce expensive I/O operations and network egress.