The Horizontal Pod Autoscaler (HPA) in Kubernetes, when configured for vLLM, doesn’t actually measure vLLM’s inference throughput directly; it relies on underlying metrics that correlate with vLLM’s load.
Let’s see this in action. Imagine we have a vLLM deployment serving requests, and we want it to scale up when it’s getting swamped.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "2"
            memory: "4Gi"
          requests:
            cpu: "1"
            memory: "2Gi"
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=meta-llama/Llama-2-7b-chat-hf"
        - "--host=0.0.0.0"
        - "--port=8000"
        - "--tensor-parallel-size=1"
        - "--max-model-len=4096"
        - "--gpu-memory-utilization=0.9"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-inference
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 70
```
This first HPA targets targetCPUUtilizationPercentage: 70. When average CPU utilization across the vLLM pods exceeds 70% of their requested CPU, the HPA adds replicas; when it falls far enough below the target, it removes them.
The problem is that vLLM is heavily GPU-bound. It uses CPU for some work (request parsing, tokenization, batching logic), but its real bottleneck is GPU compute and memory bandwidth. A CPU-based HPA can sit comfortably below its target while the GPU is saturated, so it may react too slowly to actual inference load, or not at all.

To scale vLLM effectively on Kubernetes, you need GPU metrics. The catch is that the HPA’s built-in Resource metric type only understands cpu and memory; GPU utilization is not available through the standard metrics server. The usual recipe is: install your GPU vendor’s device plugin (NVIDIA, AMD) so pods can request GPUs, run a metrics exporter such as NVIDIA’s DCGM exporter, and bridge its metrics into the custom metrics API with an adapter like the Prometheus Adapter.

Here’s a more appropriate HPA configuration using a per-pod GPU utilization metric served through the custom metrics API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL  # as exposed by NVIDIA's DCGM exporter; AMD has equivalents
      target:
        type: AverageValue
        averageValue: "80"  # target 80% average GPU utilization per pod
```

This HPA watches the per-pod GPU utilization reported through the custom metrics pipeline and targets an average value of 80. When average GPU utilization across the vLLM pods climbs above 80%, it triggers a scale-up by creating new pods; when it falls well below, it scales back down.
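One way to expose DCGM_FI_DEV_GPU_UTIL through the custom metrics API is a Prometheus Adapter rule. The following is a sketch, assuming Prometheus is already scraping the DCGM exporter and its series carry the namespace and pod labels of the workload they describe (label names vary with your exporter deployment):

```yaml
# Sketch of a prometheus-adapter rule (e.g., in the Helm chart's values.yaml).
# Assumes DCGM exporter series are labeled with the pod/namespace they refer to.
rules:
  custom:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "DCGM_FI_DEV_GPU_UTIL"
    metricsQuery: 'avg(DCGM_FI_DEV_GPU_UTIL{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

With this rule in place, the adapter answers HPA queries for DCGM_FI_DEV_GPU_UTIL on a per-pod basis.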
For this to work, your Kubernetes nodes must have the appropriate GPU device plugin installed and running, and your vLLM pods must actually request GPUs, since that is what schedules them onto GPU nodes in the first place. The deployment’s resources block should include your GPU resource (note that for extended resources like GPUs, requests and limits must be equal). For example:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1  # one GPU per pod
  requests:
    nvidia.com/gpu: 1  # must match the limit for extended resources
```
With a per-pod metric, the arithmetic is straightforward: the HPA averages the metric across all current pods and compares that average to the target. The desired replica count is ceil(currentReplicas × currentAverage / targetAverage). For example, if 2 pods are averaging 96% GPU utilization against a target of 80, the HPA computes ceil(2 × 96 / 80) = 3 and adds a pod; if the average later drops to 30, it computes ceil(3 × 30 / 80) = 2 and removes one.
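Regardless of which metric drives it, the HPA’s core calculation is the same: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), with a tolerance band (10% by default) inside which it leaves the replica count alone. A minimal sketch of that calculation, ignoring the real controller’s stabilization windows and min/max clamping:

```python
import math

def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the HPA scaling formula -- not the real controller,
    which also applies stabilization windows and min/max clamping."""
    ratio = current_value / target_value
    # Within the tolerance band, the HPA does not change the replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# Two pods averaging 96% GPU utilization against an 80% target:
print(desired_replicas(2, 96, 80))  # -> 3
# Three pods averaging 30% against the same target:
print(desired_replicas(3, 30, 80))  # -> 2
```

Note the asymmetry ceil introduces: the HPA rounds up, so it scales up slightly eagerly and sheds pods conservatively.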
The crucial insight is that the HPA doesn’t understand vLLM’s internal state (like queue length or tokens per second). It only sees metrics surfaced through the Kubernetes metrics APIs — CPU and memory from the metrics server, GPU utilization from whatever custom metrics pipeline you install — so scaling is only as good as the proxy metric you choose for inference load.
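That said, vLLM does expose its internal state if you go looking for it: the OpenAI-compatible server serves Prometheus metrics on /metrics, including gauges such as vllm:num_requests_waiting. If queue depth is a better signal for your workload than GPU utilization, the same custom-metrics pipeline can feed it to the HPA. A sketch of the metrics stanza, assuming your adapter exposes the gauge under this name (adapters often rewrite Prometheus metric names, so check yours):

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: vllm:num_requests_waiting  # queued requests per vLLM pod
    target:
      type: AverageValue
      averageValue: "10"  # scale up when pods average more than 10 waiting requests
```

Queue-depth scaling reacts to demand directly rather than to its downstream effect on the GPU, which can make it less noisy for bursty traffic.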
The next challenge you’ll face is tuning the target value (e.g., the 80% utilization target) and maxReplicas to find a sweet spot that balances performance, cost, and responsiveness without excessive flapping or over-provisioning.
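For the flapping problem specifically, autoscaling/v2 gives you a behavior section on the HPA. A sketch that keeps scale-up responsive while slowing scale-down — useful for vLLM, where new pods pay a long model-load cost:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # require 5 minutes of low load before shrinking
    policies:
    - type: Pods
      value: 1
      periodSeconds: 60  # then remove at most one pod per minute
  scaleUp:
    stabilizationWindowSeconds: 0  # scale up immediately on load
```

The asymmetric windows reflect the asymmetric costs: an idle extra pod wastes money for minutes, but a missing pod drops latency SLOs immediately.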