Capacity planning is less about predicting the future and more about understanding the rate at which your system’s future will arrive.
Let’s say you’re building a new microservice that processes user profile updates. Your current load is 100 requests per second (RPS), and each request takes 50 milliseconds (ms) of CPU time.
Here’s what that looks like in action. Imagine you have 10 instances of this service running, each with 2 CPU cores.
# Current State
Total RPS: 100
RPS per instance: 100 RPS / 10 instances = 10 RPS/instance
CPU time per request: 50 ms
Total CPU time used per instance: 10 RPS/instance * 50 ms/request = 500 ms of CPU per second
# CPU Utilization
CPU time available per instance: 2 cores * 1000 ms/core = 2000 ms per second
CPU utilization per instance: 500 ms / 2000 ms = 25%
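The arithmetic above can be wrapped in a small helper, which makes it easy to re-run as the inputs change. This is a minimal sketch; the function and parameter names are illustrative, not from any particular library.

```python
def cpu_utilization(total_rps, instances, cpu_ms_per_request, cores_per_instance):
    """Fraction of available CPU time consumed per instance, per second."""
    rps_per_instance = total_rps / instances
    cpu_ms_used = rps_per_instance * cpu_ms_per_request  # CPU-ms consumed each second
    cpu_ms_available = cores_per_instance * 1000         # each core provides 1000 CPU-ms/s
    return cpu_ms_used / cpu_ms_available

print(cpu_utilization(100, 10, 50, 2))  # 0.25
```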
This tells you you’re comfortably underutilized. But what happens next quarter when your user base doubles?
Your primary goal in capacity planning is to map your expected load to the resources required to serve it, ensuring performance and availability targets are met. This involves modeling the relationship between incoming requests, the work each request does, and the underlying infrastructure.
The Load Model: This is your best guess about future traffic. It’s not just an average; it’s about peaks and sustained high-volume periods. For our profile service, instead of just 100 RPS, we might model:
- Average Load: 100 RPS
- Peak Load: 500 RPS (e.g., during a marketing campaign)
- Growth Rate: 10% month-over-month
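A 10% month-over-month growth rate compounds, so it's worth projecting it forward rather than eyeballing it. A quick sketch, assuming growth applies uniformly to the average load:

```python
def project_load(current_rps, monthly_growth, months):
    """Compound the average load forward at a fixed month-over-month growth rate."""
    return current_rps * (1 + monthly_growth) ** months

# One quarter out at 10% month-over-month:
print(round(project_load(100, 0.10, 3), 1))  # 133.1
```

Note that peak load may grow faster or slower than average load; the two should be tracked separately.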
The Resource Model: This quantifies how much work each request requires from your system’s components (CPU, memory, network, disk I/O).
- CPU: We already know a profile update takes 50ms of CPU time.
- Memory: Each instance might need 512MB of RAM to operate.
- Network: Each request might send 1KB of data over the network.
- Disk: If it writes to a database, how many IOPS does that translate to?
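The resource model above can be captured as per-request costs and multiplied out by the request rate. The CPU and network figures come from the text; the one-database-write-per-request figure is a placeholder assumption, since the text leaves the IOPS question open.

```python
from dataclasses import dataclass

@dataclass
class RequestCost:
    """Per-request resource cost for the profile service.
    cpu_ms and network_kb are from the text; db_writes is a hypothetical placeholder."""
    cpu_ms: float = 50.0
    network_kb: float = 1.0
    db_writes: int = 1

def demand_at(rps, cost):
    """Aggregate resource demand per second at a given request rate."""
    return {
        "cpu_ms_per_s": rps * cost.cpu_ms,
        "network_kb_per_s": rps * cost.network_kb,
        "db_iops": rps * cost.db_writes,
    }

print(demand_at(500, RequestCost()))
```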
Now, let’s project our profile service. If we expect 500 RPS during peak, and each request still takes 50ms of CPU, how many instances do we need?
# Future State (Peak Load)
Target RPS: 500 RPS
CPU time per request: 50 ms
CPU time needed per instance: (500 RPS / number of instances) * 50 ms/request
# Let's assume we want to keep utilization below 70% to handle spikiness
Target CPU utilization per core: 70%
CPU available per core: 2 cores * 1000 ms/core = 2000 ms
Max CPU time per instance to stay at 70%: 2000 ms * 0.70 = 1400 ms
# How many RPS can one instance handle at 70% utilization?
Max RPS per instance: 1400 ms / 50 ms/request = 28 RPS/instance
# How many instances do we need for 500 RPS?
Required Instances: 500 RPS / 28 RPS/instance = 17.8 -> 18 instances
So, to handle a peak of 500 RPS while keeping CPU utilization at or below 70% on 2-core instances, you’d need 18 instances.
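The full sizing calculation condenses into one function. This is a sketch of the arithmetic in the block above, with illustrative names:

```python
import math

def required_instances(peak_rps, cpu_ms_per_request, cores_per_instance, target_util):
    """Instances needed to serve peak_rps without exceeding target CPU utilization."""
    # Usable CPU-ms per instance per second at the utilization ceiling
    ms_budget = cores_per_instance * 1000 * target_util
    max_rps_per_instance = ms_budget / cpu_ms_per_request
    # Round up: a fractional instance means you need one more whole instance
    return math.ceil(peak_rps / max_rps_per_instance)

print(required_instances(500, 50, 2, 0.70))  # 18
```

The `math.ceil` matters: rounding 17.8 down to 17 would put you over your utilization target at peak.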
This calculation is the core loop: Estimate Load -> Measure Resource Consumption -> Calculate Required Resources -> Provision Resources. You repeat this for every critical component.
The most surprising thing about this process is how often the bottleneck shifts unexpectedly. You might provision for CPU, only to find that your network egress or database connection pool is saturated first. It’s a continuous process of identifying the weakest link, reinforcing it, and then finding the next weakest link. This is why real-time monitoring and automated scaling are so crucial; they provide the feedback loop needed to adapt as the system operates.
Understanding the linearity of your system’s resource consumption is key. If doubling requests doesn’t quite double resource usage, you’ve found an efficiency. If it more than doubles, you’ve found an overhead or a shared-resource contention that needs investigation.
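One way to quantify this is to compare measured resource usage at two load levels against the perfectly linear prediction. A minimal sketch, assuming you have two such measurements:

```python
def linearity_ratio(usage_at_load, usage_at_double_load):
    """Observed scaling factor relative to perfect linear scaling.
    1.0 = linear; < 1.0 = sublinear (an efficiency);
    > 1.0 = superlinear (overhead or contention to investigate)."""
    return (usage_at_double_load / usage_at_load) / 2.0

# e.g. 500 CPU-ms/s at 10 RPS but 1200 CPU-ms/s at 20 RPS:
print(linearity_ratio(500, 1200))  # 1.2 -> superlinear, investigate
```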
Once you’ve planned for peak load, you’ll need to consider how to handle the "thundering herd" problem during deployments.