The most surprising thing about SRE capacity planning is that it’s not about predicting the future, but about understanding the present so well that the future becomes irrelevant.
Let’s watch a Kubernetes cluster handle a sudden load spike, demonstrating the principles of right-sizing. Imagine we have a web service deployed across three pods, each requesting 2 CPU cores and 4 GiB of memory.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: app-container
        image: nginx:latest
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
This Deployment tells Kubernetes to maintain three pods, each configured with requests for 2 CPU cores and 4 GiB of memory. The requests are what the Kubernetes scheduler uses to decide where to place the pods. The limits are the maximum resources a container is allowed to consume: exceeding the CPU limit gets the container throttled, while exceeding the memory limit gets it terminated.
Now, let’s simulate a load increase. We’ll use hey to fire 100,000 requests at our service from 1,000 concurrent workers.
hey -n 100000 -c 1000 http://my-web-app.default.svc.cluster.local
If our pods were under-requested, they might have been scheduled onto nodes that are already strained. When the load hits, the application inside the pod may not get the CPU it needs to process requests quickly. It starts queuing internally, latency spikes, and error rates climb. And if actual memory usage (not the requested amount) pushes the node into memory pressure, the kubelet may start evicting pods from that node to protect its own stability.
If the pods’ limits were set too low, they might hit their CPU or memory limits even under ordinary bursts. Hitting a CPU limit means the kernel’s scheduler throttles the process, artificially slowing it down. Exceeding a memory limit causes the Out-Of-Memory (OOM) killer to terminate the process. Even if the node has plenty of spare resources, the individual pod is artificially constrained.
Right-sizing means setting requests and limits to match the application’s actual, sustained needs under a typical peak load, with a small buffer. This ensures pods are scheduled onto healthy, adequately resourced nodes. It also prevents applications from consuming more than their fair share or being killed by the OOM killer unnecessarily. For our my-web-app, if monitoring shows it reliably uses around 1.5 CPU and 3 GiB of memory under peak load, we might keep requests at cpu: "2" and memory: "4Gi" and tighten limits to cpu: "3" and memory: "6Gi". This gives a small buffer without over-allocating.
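Applied to the manifest above, the adjusted resources stanza would look like this (the values come from the hypothetical monitoring numbers just described):

```yaml
resources:
  requests:
    cpu: "2"        # sustained p95 is ~1.5 cores; leave a small buffer
    memory: "4Gi"   # sustained peak is ~3 GiB
  limits:
    cpu: "3"        # room for short bursts, tightened from "4"
    memory: "6Gi"   # tightened from "8Gi" to prevent runaway consumption
```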
The problem this solves is avoiding both the "noisy neighbor" problem (one over-consuming pod impacting others) and the "starvation" problem (a pod not getting enough resources to function). It’s about efficient resource utilization and application stability.
The key to right-sizing is continuous monitoring and analysis of resource utilization metrics: CPU usage (not just a utilization percentage, but actual millicores or cores), memory usage (RSS, working set), network I/O, and disk I/O. Tools like Prometheus, Grafana, and the Kubernetes Metrics Server are your best friends here. Look at percentiles (p95, p99) of usage during peak periods, not just averages. Set your requests slightly above your p95 usage, and set your limits to a value that allows occasional, short bursts beyond your sustained peak while still preventing runaway consumption.
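As a concrete sketch of that analysis, the snippet below derives a request and limit from a window of CPU samples. The sample data is hypothetical (the kind of thing you would scrape from Prometheus), and the 10% request buffer and 1.5x limit headroom are illustrative choices, not fixed rules:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

def right_size(cpu_samples_millicores, request_buffer=1.10, limit_headroom=1.5):
    """Suggest a CPU request slightly above sustained p95 usage,
    and a limit that leaves headroom for short bursts."""
    p95 = percentile(cpu_samples_millicores, 95)
    request = round(p95 * request_buffer)   # small buffer over sustained p95
    limit = round(request * limit_headroom) # room for occasional bursts
    return p95, request, limit

# Hypothetical peak-hour CPU samples, in millicores:
samples = [1200, 1350, 1500, 1480, 1420, 1390, 1510, 1450, 1600, 1300]
p95, request, limit = right_size(samples)
print(f"p95={p95}m request={request}m limit={limit}m")
```

In practice you would feed this from a metrics query over several peak periods rather than a single window, and sanity-check the suggestion against known traffic patterns before rolling it out.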
Many teams focus only on CPU. They forget that memory pressure, even if it doesn’t trigger the OOM killer, can cause the kernel to start swapping or aggressively page out less-used memory, severely impacting application performance. If your application has a large memory footprint or handles many concurrent connections, memory right-sizing is just as critical as CPU.
The next challenge is understanding how to provision the underlying nodes themselves to meet the aggregate resource requests of all the pods you plan to run.