vLLM, the lightning-fast inference engine, can be a bit prickly when you try to wrangle it into production with Docker and Kubernetes.
Let’s get this vLLM Docker deployment humming in Kubernetes.
The core issue when vLLM misbehaves in a Kubernetes pod is usually a mismatch between what the vLLM process thinks it has for resources and what Kubernetes is actually giving it. This often manifests as the pod crashing with OOMKilled or simply not starting, but the root cause is deeper than just "not enough RAM."
Here are the usual suspects and how to fix them:
- **GPU memory allocation mismatch:** vLLM is heavily reliant on GPU memory. If Kubernetes doesn't correctly allocate and expose the GPU to the container, vLLM will try to use more memory than it actually has, leading to crashes.
  - **Diagnosis:** Inside the pod, run `nvidia-smi`. Compare the reported total GPU memory with what you expect for your node. If it's significantly less, or zero, this is the problem.
  - **Fix:** Ensure your Kubernetes node has the NVIDIA device plugin installed and running correctly, and that your pod's `resources.limits` section explicitly requests a GPU. Note that with the standard device plugin, whole GPUs are the schedulable unit; GPU *memory* is not separately requestable (MIG profiles are the usual route if you need fractional GPUs). For example:

    ```yaml
    resources:
      limits:
        nvidia.com/gpu: 1
    ```

  - **Why it works:** The NVIDIA device plugin registers GPU devices with Kubernetes, making them available for scheduling. Requesting `nvidia.com/gpu` in the pod spec tells Kubernetes to schedule the pod onto a node with an available GPU and, crucially, to configure the container runtime (containerd or CRI-O) to expose that GPU to the container. `nvidia-smi` inside the container should then report the correct total memory.
- **Over-aggressive batching and memory settings:** vLLM pre-allocates GPU memory for its KV cache at startup, sized by engine arguments such as `--gpu-memory-utilization` (default 0.9), `--max-num-seqs`, and `--max-num-batched-tokens`. (These are engine/command-line arguments; vLLM does not read `VLLM_MAX_NUM_SEQS` or `VLLM_MAX_BATCH_SIZE` environment variables.) If they're set too high for the available GPU memory, the engine fails to initialize.
  - **Diagnosis:** Check your pod logs for memory-allocation errors during vLLM startup. Look for CUDA "out of memory" messages or vLLM reporting that the KV cache doesn't fit.
  - **Fix:** Lower these values and start conservatively. On a 40 GB GPU, the defaults (`--gpu-memory-utilization 0.9`, `--max-num-seqs 256`) are reasonable starting points, but the right values depend heavily on your model, so you'll need to experiment:

    ```yaml
    args:
      - "--gpu-memory-utilization=0.85"
      - "--max-num-seqs=256"
    ```

  - **Why it works:** These parameters control how much GPU memory vLLM claims up front for model weights plus KV cache. Reducing them shrinks the initial footprint vLLM attempts to allocate, letting it start successfully within the memory Kubernetes actually exposes.
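To reason about whether a given batch setting can possibly fit, it helps to do the KV-cache arithmetic. The sketch below is a back-of-the-envelope estimate, not vLLM's internal accounting; the model-shape defaults are illustrative (roughly 7B-class) and fp16 storage is assumed:

```python
def kv_cache_tokens(
    total_gpu_bytes: int,
    weight_bytes: int,
    gpu_memory_utilization: float = 0.9,
    num_layers: int = 32,
    num_kv_heads: int = 32,
    head_dim: int = 128,
    dtype_bytes: int = 2,  # fp16/bf16
) -> int:
    """Estimate how many tokens of KV cache fit after loading weights.

    Per token, each layer stores one key and one value vector:
        2 * num_kv_heads * head_dim * dtype_bytes bytes per layer.
    """
    budget = int(total_gpu_bytes * gpu_memory_utilization) - weight_bytes
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return max(budget // per_token, 0)

# Example: 40 GiB GPU, ~13 GiB of fp16 weights for a 7B model
tokens = kv_cache_tokens(40 * 2**30, 13 * 2**30)  # ≈ 47k tokens under these assumptions
```

If your intended concurrent sequences times their typical length exceeds this budget, requests will queue or the engine will refuse to start, so lower the batching knobs or raise `--gpu-memory-utilization` cautiously.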
- **Container runtime GPU configuration:** The container runtime (Docker, containerd, CRI-O) needs to be configured to pass the GPU through to the container. This is usually handled by the NVIDIA Container Toolkit, but misconfiguration can occur.
  - **Diagnosis:** Check the Kubernetes node's container runtime logs. For containerd, inspect the service journal (`journalctl -u containerd`) or `/var/log/containerd/containerd.log` if your distribution writes one. Look for errors related to GPU device access or `nvidia-container-runtime`.
  - **Fix:** Ensure the NVIDIA container runtime is correctly installed and registered with your container engine. The easiest path is `nvidia-ctk runtime configure --runtime=containerd` (or `--runtime=docker`) from the NVIDIA Container Toolkit, which edits the config for you.
    - **containerd:** In `/etc/containerd/config.toml`, define an `nvidia` runtime and, if you want every container to get it, make it the default. Then restart containerd:

      ```toml
      [plugins."io.containerd.grpc.v1.cri".containerd]
        default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
        runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/bin/nvidia-container-runtime"
      ```

    - **Docker:** In `/etc/docker/daemon.json`, ensure:

      ```json
      {
        "default-runtime": "nvidia",
        "runtimes": {
          "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
          }
        }
      }
      ```

      Then restart Docker: `sudo systemctl restart docker`.
  - **Why it works:** This explicitly tells the container runtime to invoke the NVIDIA runtime when a container requests GPU access, which sets up the device mappings and environment variables CUDA needs.
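If you manage many nodes, a tiny check over the Docker config catches the most common omission (no `nvidia` runtime registered at all). A sketch with an invented helper name; it validates the JSON shape only, not that the runtime binary exists on disk:

```python
import json

def docker_gpu_runtime_ok(daemon_json_text: str) -> bool:
    """Check an /etc/docker/daemon.json payload for an 'nvidia' runtime entry.

    Returns True if an 'nvidia' runtime is registered; it does not verify
    that the binary at 'path' actually exists on the node.
    """
    cfg = json.loads(daemon_json_text)
    return "nvidia" in cfg.get("runtimes", {})

sample = (
    '{"default-runtime": "nvidia", "runtimes": {"nvidia": '
    '{"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}}}'
)
```

An equivalent check for containerd would need a TOML parser (`tomllib` in Python 3.11+), but the idea is the same: fail fast in CI or a node-audit script instead of at pod startup.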
- **Insufficient node resources (CPU/RAM):** While GPU memory is primary, vLLM also needs CPU and system RAM for its Python interpreter, host-side CUDA operations, and general processing. If the node is starved, pods can be evicted, or the vLLM process may hang or crash.
  - **Diagnosis:** Use `kubectl top nodes` to check CPU and memory utilization on the nodes where your vLLM pods are scheduled. Also check `kubectl describe pod <pod-name>` for an `OOMKilled` status or other resource-related events.
  - **Fix:**
    - Request more resources in the pod spec:

      ```yaml
      resources:
        requests:
          cpu: "2"
          memory: "8Gi"
        limits:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: 1
      ```

    - Scale up your Kubernetes nodes: add more nodes to the cluster, or upgrade existing ones with more CPU and RAM.
  - **Why it works:** Adequate requests let the scheduler place your pod on a node that can handle its non-GPU demands, and sensible limits keep the node from becoming overloaded, which could otherwise get the pod terminated by the system.
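When tuning these numbers, it's easy to mix up Kubernetes quantity suffixes (`Gi` is 2^30 bytes; a plain `G` would be 10^9). A tiny parser covering only the binary suffixes used above — a sketch, not the full Kubernetes quantity grammar:

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_quantity(q: str) -> int:
    """Parse a Kubernetes-style memory quantity like '8Gi' into bytes.

    Handles bare integers and the binary suffixes Ki/Mi/Gi/Ti only.
    """
    for suffix, factor in BINARY_SUFFIXES.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # bare byte count
```

This makes sanity checks like `parse_quantity(limit) >= parse_quantity(request)` trivial to script across a set of manifests.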
- **Outdated NVIDIA drivers or CUDA toolkit:** vLLM is built against specific CUDA versions and relies on compatible NVIDIA drivers. Incompatibilities here are a classic source of "it works on my machine but not in the cluster" issues.
  - **Diagnosis:** Check the NVIDIA driver version on your Kubernetes nodes with `nvidia-smi`. Inside the pod, check the CUDA version available to the container (often printed in vLLM's startup logs, or via `nvcc --version` if the CUDA toolkit is installed in the image).
  - **Fix:** Ensure the node's driver version is compatible with the CUDA version your vLLM container image was built against. NVIDIA's documentation provides compatibility matrices.
    - **Node driver:** Update drivers on your nodes if necessary.
    - **Container image:** Build your Docker image from a base image with a compatible CUDA toolkit, or ensure the toolkit installed in your image matches the driver. For example, if your nodes run driver 535.x, a CUDA 12.2 toolkit is a safe pairing.
  - **Why it works:** The CUDA toolkit and the NVIDIA driver work together to enable GPU functionality. Mismatched versions can cause subtle errors in GPU memory management, kernel execution, or device discovery, all of which can surface as vLLM failures.
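A check like the following can fail fast at container start instead of letting vLLM crash obscurely later. The minimum-driver table below is an assumption drawn from NVIDIA's published compatibility matrix at the time of writing; verify it against the current matrix before relying on it:

```python
# Minimum Linux driver per CUDA minor release (assumed values;
# confirm against NVIDIA's CUDA compatibility matrix).
MIN_DRIVER = {
    (12, 0): (525, 60, 13),
    (12, 1): (530, 30, 2),
    (12, 2): (535, 54, 3),
}

def driver_supports_cuda(driver: str, cuda: tuple[int, int]) -> bool:
    """Return True if a driver version string like '535.129.03'
    meets the assumed minimum for the given CUDA release."""
    installed = tuple(int(part) for part in driver.split("."))
    return installed >= MIN_DRIVER[cuda]
```

Wire this into an entrypoint script (feeding it the version from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`) and a mismatched node turns into a clear one-line error instead of a cryptic CUDA initialization failure.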
- **Incorrect Docker image base:** Using a stock Ubuntu or Debian base image and installing CUDA/vLLM manually can lead to subtle path and library-versioning issues that the official NVIDIA CUDA base images avoid.
  - **Diagnosis:** Examine your Dockerfile. Are you starting from `ubuntu:22.04` and then trying to install CUDA?
  - **Fix:** Start from an NVIDIA CUDA base image that matches your target CUDA version. For example:

    ```dockerfile
    FROM nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04

    # Install Python, pip, and other dependencies
    RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 \
        python3-pip

    # Install vLLM
    RUN pip3 install vllm

    # ... rest of your Dockerfile
    ```

  - **Why it works:** NVIDIA's base images ship with the CUDA toolkit and necessary libraries pre-installed and correctly configured, eliminating many dependency conflicts that arise when assembling the environment by hand.
The next error you'll likely hit after fixing these is `Readiness probe failed` if your application isn't exposing a health-check endpoint, or a `503 Service Unavailable` if the ingress controller can't reach a healthy pod.
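vLLM's OpenAI-compatible server exposes a `GET /health` endpoint you can point a readiness probe at. If you front vLLM with your own wrapper, a probe helper like this sketch (the function name and port are illustrative) is enough:

```python
import urllib.request
import urllib.error

def is_ready(url: str = "http://localhost:8000/health",
             timeout: float = 2.0) -> bool:
    """Return True only if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

In the pod spec itself, the native equivalent is an `httpGet` readiness probe against `/health` on the serving port, which avoids shipping any probe code in the image at all.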