vLLM, the lightning-fast inference engine, can be a bit prickly when you try to wrangle it into production with Docker and Kubernetes.
Let’s get this vLLM Docker deployment humming in Kubernetes.
The core issue when vLLM misbehaves in a Kubernetes pod is usually a mismatch between what the vLLM process thinks it has for resources and what Kubernetes is actually giving it. This often manifests as the pod crashing with OOMKilled or simply not starting, but the root cause is deeper than just "not enough RAM."
Here are the usual suspects and how to fix them:
- **GPU memory allocation mismatch:** vLLM is heavily reliant on GPU memory. If Kubernetes doesn't correctly allocate and expose the GPU to the container, vLLM will try to use more memory than it actually has, leading to crashes.
  - **Diagnosis:** Inside the pod, run `nvidia-smi`. Compare the reported total GPU memory with what you expect for your node. If it's significantly less, or zero, this is the problem.
  - **Fix:** Ensure your Kubernetes node has the NVIDIA device plugin installed and running correctly, and that your pod's `resources.limits` section explicitly requests a GPU. Note that with the standard device plugin, whole GPUs are the schedulable unit; GPU *memory* is not separately requestable (MIG profiles are the usual route if you need fractional GPUs). For example:

    ```yaml
    resources:
      limits:
        nvidia.com/gpu: 1
    ```

  - **Why it works:** The NVIDIA device plugin registers GPU devices with Kubernetes, making them available for scheduling. Requesting `nvidia.com/gpu` in the pod spec tells Kubernetes to schedule the pod onto a node with an available GPU and, crucially, to configure the container runtime (containerd or CRI-O) to expose that GPU to the container. `nvidia-smi` inside the container should then report the correct total memory.
- **Over-aggressive batching and memory settings:** vLLM pre-allocates GPU memory for its KV cache at startup, sized by engine arguments such as `--gpu-memory-utilization` (default 0.9), `--max-num-seqs`, and `--max-num-batched-tokens`. (These are engine/command-line arguments; vLLM does not read `VLLM_MAX_NUM_SEQS` or `VLLM_MAX_BATCH_SIZE` environment variables.) If they're set too high for the available GPU memory, the engine fails to initialize.
  - **Diagnosis:** Check your pod logs for memory-allocation errors during vLLM startup. Look for CUDA "out of memory" messages or vLLM reporting that the KV cache doesn't fit.
  - **Fix:** Lower these values and start conservatively. On a 40 GB GPU, the defaults (`--gpu-memory-utilization 0.9`, `--max-num-seqs 256`) are reasonable starting points, but the right values depend heavily on your model, so you'll need to experiment:

    ```yaml
    args:
      - "--gpu-memory-utilization=0.85"
      - "--max-num-seqs=256"
    ```

  - **Why it works:** These parameters control how much GPU memory vLLM claims up front for model weights plus KV cache. Reducing them shrinks the initial footprint vLLM attempts to allocate, letting it start successfully within the memory Kubernetes actually exposes.
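To reason about whether a given batch setting can possibly fit, it helps to do the KV-cache arithmetic. The sketch below is a back-of-the-envelope estimate, not vLLM's internal accounting; the model-shape defaults are illustrative (roughly 7B-class) and fp16 storage is assumed:

```python
def kv_cache_tokens(
    total_gpu_bytes: int,
    weight_bytes: int,
    gpu_memory_utilization: float = 0.9,
    num_layers: int = 32,
    num_kv_heads: int = 32,
    head_dim: int = 128,
    dtype_bytes: int = 2,  # fp16/bf16
) -> int:
    """Estimate how many tokens of KV cache fit after loading weights.

    Per token, each layer stores one key and one value vector:
        2 * num_kv_heads * head_dim * dtype_bytes bytes per layer.
    """
    budget = int(total_gpu_bytes * gpu_memory_utilization) - weight_bytes
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return max(budget // per_token, 0)

# Example: 40 GiB GPU, ~13 GiB of fp16 weights for a 7B model
tokens = kv_cache_tokens(40 * 2**30, 13 * 2**30)  # ≈ 47k tokens under these assumptions
```

If your intended concurrent sequences times their typical length exceeds this budget, requests will queue or the engine will refuse to start, so lower the batching knobs or raise `--gpu-memory-utilization` cautiously.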
- **Container runtime GPU configuration:** The container runtime (Docker, containerd, CRI-O) needs to be configured to pass the GPU through to the container. This is usually handled by the NVIDIA Container Toolkit, but misconfiguration can occur.
  - **Diagnosis:** Check the Kubernetes node's container runtime logs. For containerd, inspect the service journal (`journalctl -u containerd`) or `/var/log/containerd/containerd.log` if your distribution writes one. Look for errors related to GPU device access or `nvidia-container-runtime`.
  - **Fix:** Ensure the NVIDIA container runtime is correctly installed and registered with your container engine. The easiest path is `nvidia-ctk runtime configure --runtime=containerd` (or `--runtime=docker`) from the NVIDIA Container Toolkit, which edits the config for you.
    - **containerd:** In `/etc/containerd/config.toml`, define an `nvidia` runtime and, if you want every container to get it, make it the default. Then restart containerd:

      ```toml
      [plugins."io.containerd.grpc.v1.cri".containerd]
        default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
        runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/bin/nvidia-container-runtime"
      ```

    - **Docker:** In `/etc/docker/daemon.json`, ensure:

      ```json
      {
        "default-runtime": "nvidia",
        "runtimes": {
          "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
          }
        }
      }
      ```

      Then restart Docker: `sudo systemctl restart docker`.
  - **Why it works:** This explicitly tells the container runtime to invoke the NVIDIA runtime when a container requests GPU access, which sets up the device mappings and environment variables CUDA needs.
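If you manage many nodes, a tiny check over the Docker config catches the most common omission (no `nvidia` runtime registered at all). A sketch with an invented helper name; it validates the JSON shape only, not that the runtime binary exists on disk:

```python
import json

def docker_gpu_runtime_ok(daemon_json_text: str) -> bool:
    """Check an /etc/docker/daemon.json payload for an 'nvidia' runtime entry.

    Returns True if an 'nvidia' runtime is registered; it does not verify
    that the binary at 'path' actually exists on the node.
    """
    cfg = json.loads(daemon_json_text)
    return "nvidia" in cfg.get("runtimes", {})

sample = (
    '{"default-runtime": "nvidia", "runtimes": {"nvidia": '
    '{"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}}}'
)
```

An equivalent check for containerd would need a TOML parser (`tomllib` in Python 3.11+), but the idea is the same: fail fast in CI or a node-audit script instead of at pod startup.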
- **Insufficient node resources (CPU/RAM):** While GPU memory is primary, vLLM also needs CPU and system RAM for its Python interpreter, host-side CUDA operations, and general processing. If the node is starved, pods can be evicted, or the vLLM process may hang or crash.
  - **Diagnosis:** Use `kubectl top nodes` to check CPU and memory utilization on the nodes where your vLLM pods are scheduled. Also check `kubectl describe pod <pod-name>` for an `OOMKilled` status or other resource-related events.
  - **Fix:**
    - Request more resources in the pod spec:

      ```yaml
      resources:
        requests:
          cpu: "2"
          memory: "8Gi"
        limits:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: 1
      ```

    - Scale up your Kubernetes nodes: add more nodes to the cluster, or upgrade existing ones with more CPU and RAM.
  - **Why it works:** Adequate requests let the scheduler place your pod on a node that can handle its non-GPU demands, and sensible limits keep the node from becoming overloaded, which could otherwise get the pod terminated by the system.
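When tuning these numbers, it's easy to mix up Kubernetes quantity suffixes (`Gi` is 2^30 bytes; a plain `G` would be 10^9). A tiny parser covering only the binary suffixes used above — a sketch, not the full Kubernetes quantity grammar:

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def parse_quantity(q: str) -> int:
    """Parse a Kubernetes-style memory quantity like '8Gi' into bytes.

    Handles bare integers and the binary suffixes Ki/Mi/Gi/Ti only.
    """
    for suffix, factor in BINARY_SUFFIXES.items():
        if q.endswith(suffix):
            return int(q[: -len(suffix)]) * factor
    return int(q)  # bare byte count
```

This makes sanity checks like `parse_quantity(limit) >= parse_quantity(request)` trivial to script across a set of manifests.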
- **Outdated NVIDIA drivers or CUDA toolkit:** vLLM is built against specific CUDA versions and relies on compatible NVIDIA drivers. Incompatibilities here are a classic source of "it works on my machine but not in the cluster" issues.
  - **Diagnosis:** Check the NVIDIA driver version on your Kubernetes nodes with `nvidia-smi`. Inside the pod, check the CUDA version available to the container (often printed in vLLM's startup logs, or via `nvcc --version` if the CUDA toolkit is installed in the image).
  - **Fix:** Ensure the node's driver version is compatible with the CUDA version your vLLM container image was built against. NVIDIA's documentation provides compatibility matrices.
    - **Node driver:** Update drivers on your nodes if necessary.
    - **Container image:** Build your Docker image from a base image with a compatible CUDA toolkit, or ensure the toolkit installed in your image matches the driver. For example, if your nodes run driver 535.x, a CUDA 12.2 toolkit is a safe pairing.
  - **Why it works:** The CUDA toolkit and the NVIDIA driver work together to enable GPU functionality. Mismatched versions can cause subtle errors in GPU memory management, kernel execution, or device discovery, all of which can surface as vLLM failures.
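A check like the following can fail fast at container start instead of letting vLLM crash obscurely later. The minimum-driver table below is an assumption drawn from NVIDIA's published compatibility matrix at the time of writing; verify it against the current matrix before relying on it:

```python
# Minimum Linux driver per CUDA minor release (assumed values;
# confirm against NVIDIA's CUDA compatibility matrix).
MIN_DRIVER = {
    (12, 0): (525, 60, 13),
    (12, 1): (530, 30, 2),
    (12, 2): (535, 54, 3),
}

def driver_supports_cuda(driver: str, cuda: tuple[int, int]) -> bool:
    """Return True if a driver version string like '535.129.03'
    meets the assumed minimum for the given CUDA release."""
    installed = tuple(int(part) for part in driver.split("."))
    return installed >= MIN_DRIVER[cuda]
```

Wire this into an entrypoint script (feeding it the version from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`) and a mismatched node turns into a clear one-line error instead of a cryptic CUDA initialization failure.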
- **Incorrect Docker image base:** Using a stock Ubuntu or Debian base image and installing CUDA/vLLM manually can lead to subtle path and library-versioning issues that the official NVIDIA CUDA base images avoid.
  - **Diagnosis:** Examine your Dockerfile. Are you starting from `ubuntu:22.04` and then trying to install CUDA?
  - **Fix:** Start from an NVIDIA CUDA base image that matches your target CUDA version. For example:

    ```dockerfile
    FROM nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04

    # Install Python, pip, and other dependencies
    RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 \
        python3-pip

    # Install vLLM
    RUN pip3 install vllm

    # ... rest of your Dockerfile
    ```

  - **Why it works:** NVIDIA's base images ship with the CUDA toolkit and necessary libraries pre-installed and correctly configured, eliminating many dependency conflicts that arise when assembling the environment by hand.
The next error you'll likely hit after fixing these is `Readiness probe failed` if your application isn't exposing a health-check endpoint, or a `503 Service Unavailable` if the ingress controller can't reach a healthy pod.
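vLLM's OpenAI-compatible server exposes a `GET /health` endpoint you can point a readiness probe at. If you front vLLM with your own wrapper, a probe helper like this sketch (the function name and port are illustrative) is enough:

```python
import urllib.request
import urllib.error

def is_ready(url: str = "http://localhost:8000/health",
             timeout: float = 2.0) -> bool:
    """Return True only if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

In the pod spec itself, the native equivalent is an `httpGet` readiness probe against `/health` on the serving port, which avoids shipping any probe code in the image at all.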