The most surprising thing about SRE cost optimization is that the biggest wins often come from doing less of what you’re already doing, not from finding a magic bullet technology.

Let’s watch a typical service, user-auth, scale up and down on Kubernetes. This service has a deployment, user-auth-deployment, and a horizontal pod autoscaler (HPA), user-auth-hpa.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-auth-deployment
spec:
  replicas: 3 # Starting point
  template:
    spec:
      containers:
      - name: user-auth
        image: my-registry/user-auth:v1.2.0
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "400m"
            memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-auth-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-auth-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # Target 70% CPU utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80 # Target 80% memory utilization

When traffic spikes, user-auth pods start consuming more CPU. The HPA polls the Kubernetes resource metrics API (served by metrics-server, not kube-state-metrics) for the average CPU utilization — as a percentage of each pod's CPU request — across all user-auth pods. If it exceeds 70%, the HPA raises the replica count on user-auth-deployment; if utilization drops, it scales down. With multiple metrics configured, the HPA computes a desired replica count per metric and uses the largest, so the memory target acts as an independent trigger.

This automatic scaling prevents over-provisioning during low-traffic periods and ensures availability during peaks, but it’s only part of the story. The real cost optimization comes from understanding the drivers of that scaling and resource consumption.
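The scale-up decision above follows the standard HPA formula from the Kubernetes documentation. Sketched against the user-auth-hpa numbers (the usage figures are illustrative):

```yaml
# desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)
#
# With the manifest above (cpu request: 200m, target: 70%):
#   3 pods averaging 140m -> 70% utilization  -> ceil(3 * 70/70)  = 3 (steady state)
#   3 pods averaging 200m -> 100% utilization -> ceil(3 * 100/70) = 5 (scale out)
#   5 pods averaging 70m  -> 35% utilization  -> ceil(5 * 35/70)  = 3 (scale in)
#
# Scale-in is further damped by the HPA's scale-down stabilization window
# (300 seconds by default), so brief dips don't immediately shed pods.
```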

The core problem SRE cost optimization solves is the mismatch between provisioned capacity and actual demand, leading to wasted spend on idle or underutilized resources. It’s about aligning infrastructure costs directly with the value your services deliver. This isn’t just about "turn it off when nobody’s using it" – it’s a systematic approach to resource management.

Here’s how to build the mental model:

  1. Visibility is Paramount: You can’t optimize what you can’t see. This means detailed metrics on resource usage (CPU, memory, network, disk) per pod, per deployment, per namespace, and crucially, per service. Tools like Prometheus, Grafana, and cloud provider monitoring suites are your eyes. You need to see request latency, error rates, and throughput alongside resource consumption.
  2. Cost Allocation: Tie infrastructure costs back to specific services or teams. Cloud provider billing reports, often augmented with Kubernetes labels and annotations, are essential. If you can’t assign a dollar amount to a service’s infrastructure, you can’t effectively optimize it.
  3. Right-Sizing Resources: This is the bread and butter. The requests and limits in your pod specs are critical. requests are the amount of CPU/memory reserved for a pod and are what the Kubernetes scheduler uses for placement decisions; limits define the maximum the pod may consume. If requests are too high, you waste resources. If limits are too low, you risk CPU throttling (for CPU limits) or OOMKills (for memory limits).
    • Diagnosis: Use tools like kube-state-metrics and vpa-recommender (Vertical Pod Autoscaler). kubectl top pods -n <namespace> gives current usage. kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources.requests.cpu}{"\t"}{.spec.containers[0].resources.limits.cpu}{"\n"}{end}' shows current requests/limits.
    • Action: If user-auth pods consistently use 50m CPU but have requests: {cpu: "200m"}, you can reduce requests to 100m. This frees up the node for other pods. If pods are hitting limits: {cpu: "400m"} and showing high CPU, you might need to increase the limit or scale out more pods.
  4. Autoscaling Tuning: The HPA target values are key, and utilization is measured against requests, not actual capacity. With averageUtilization set to 70, the HPA scales up when pods average 70% of their requested CPU. If requests are set too low, even modest absolute usage reads as high utilization and triggers scale-ups you don't need; if requests are inflated, measured utilization stays artificially low and each replica reserves far more than it uses.
    • Diagnosis: Observe HPA events (kubectl get hpa -n <namespace> user-auth-hpa -o yaml) and compare the target utilization with actual pod CPU usage metrics from Prometheus.
    • Action: If the HPA keeps scaling up while absolute CPU usage is modest, requests are probably set too low relative to real demand — raise them so utilization reflects genuine load. If measured utilization never approaches the target while replicas sit at minReplicas, shrink the requests (not the target) so the scaling signal matches reality.
  5. Node Utilization: Even if your pods are well-sized, the underlying nodes might be over-provisioned. If your nodes are consistently running at 30% CPU utilization, you can likely reduce the number of nodes.
    • Diagnosis: Monitor node resource utilization in your cluster. Tools like descheduler can help identify underutilized nodes.
    • Action: Reduce the number of nodes in your node group or cluster autoscaler configuration.
  6. Spot Instances/Preemptible VMs: For stateless, fault-tolerant workloads, using cheaper, interruptible spot instances can dramatically cut costs.
    • Action: Configure your cluster autoscaler or node groups to provision spot instances. Ensure your applications can handle pods being terminated.
  7. Storage Optimization: Unused or oversized persistent volumes (PVs) are a common hidden cost.
    • Diagnosis: Regularly audit PVs. Identify PVs that are not attached to any pods or are significantly larger than their actual usage.
    • Action: Delete unused PVs. For oversized PVs, if your storage class supports it, resize them. Otherwise, migrate data to a correctly sized PV.
  8. Network Traffic: Egress traffic, especially from cloud providers, can be expensive. Optimize data transfer patterns.
    • Action: Use private endpoints where possible, compress data, or use CDNs for static assets.
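The cost-allocation step (2) above mostly comes down to consistent labeling. A sketch of the kind of labels that make billing reports attributable — the team and cost-center values here are hypothetical, and you'd map them to whatever grouping dimension your cloud billing export supports:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-auth-deployment
  labels:
    team: identity          # hypothetical owning team
    cost-center: "cc-1234"  # hypothetical billing dimension
spec:
  # (selector, replicas, containers as before)
  template:
    metadata:
      labels:
        team: identity      # pod-level labels are what most cost tools aggregate on
        cost-center: "cc-1234"
```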
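For the right-sizing in step 3, suppose monitoring showed user-auth steadily using around 50m CPU and 180Mi of memory (illustrative numbers). The container's resources stanza might then land here — a sketch, not a prescription; keep headroom above observed peaks:

```yaml
# Fragment of the user-auth container spec, re-sized against observed usage
resources:
  requests:
    cpu: "100m"      # ~2x an observed steady state of ~50m: burst headroom, far less idle reserve
    memory: "256Mi"  # memory requests should cover the observed peak
  limits:
    cpu: "400m"      # exceeding a CPU limit means throttling, not a kill, so generous is usually safe
    memory: "384Mi"  # kept comfortably above peak usage: exceeding a memory limit is an OOMKill
```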
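For the autoscaling tuning in step 4, the autoscaling/v2 API also exposes a behavior stanza for damping flappy scaling; the windows and policy values below are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-auth-hpa
spec:
  # (scaleTargetRef, minReplicas, maxReplicas, metrics as before)
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react to traffic spikes immediately
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 min of sustained low load before shedding pods
      policies:
      - type: Pods
        value: 1                        # shed at most one pod per minute
        periodSeconds: 60
```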
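Step 6 usually comes down to steering tolerant workloads onto spot capacity. A sketch for a GKE-style spot node pool — the label and taint keys are provider-specific assumptions (AWS node groups and Karpenter use different ones), so verify them for your platform:

```yaml
# Fragment of the Deployment's pod template spec
spec:
  nodeSelector:
    cloud.google.com/gke-spot: "true"   # GKE spot node label; provider-specific
  tolerations:
  - key: cloud.google.com/gke-spot      # tolerate the matching taint on spot nodes
    operator: Equal
    value: "true"
    effect: NoSchedule
  terminationGracePeriodSeconds: 25     # finish shutdown inside the short spot reclaim notice
```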
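Resizing in step 7 only works when the StorageClass permits it. A sketch, assuming an expansion-capable CSI driver (the provisioner shown is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-expandable
provisioner: ebs.csi.aws.com   # illustrative; any CSI driver that supports expansion
allowVolumeExpansion: true     # without this, PVC resize requests are rejected
reclaimPolicy: Delete
---
# Growing a PVC is then just editing spec.resources.requests.storage upward.
# Shrinking is not supported; downsizing means migrating data to a new, smaller PV.
```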

The counterintuitive part of cost optimization is that aggressively lowering resource requests can sometimes increase overall cluster cost temporarily. Under-requested pods pack densely onto fewer nodes, so actual usage saturates those nodes: pods get CPU-throttled or evicted under memory pressure, and the HPA — measuring utilization against the now-tiny requests — scales out far more replicas than the real load needs, which in turn drives the cluster autoscaler to add nodes. The key is to balance requests for efficient scheduling against limits and actual usage for efficient resource consumption.

The next frontier after optimizing basic resource utilization and scaling is often implementing efficient data tiering and caching strategies to reduce expensive I/O operations and network egress.

Want structured learning?

Take the full SRE course →