The vLLM async engine lets you serve large language models with impressive throughput, but getting it into production feels like navigating a minefield of subtle misconfigurations.

Let’s look at a typical setup. Imagine you’ve got a beefy server, say, an NVIDIA A100 with 80GB of VRAM. The end goal is a 70B-parameter model, but let’s get the flags right on a smaller one first:

python -m vllm.entrypoints.api_server \
    --model facebook/opt-6.7b \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max-num-seqs 1024 \
    --gpu-memory-utilization 0.9 \
    --port 8000 \
    --host 0.0.0.0

This looks straightforward, but the devil is in the details, especially when you scale up or integrate with other services.
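
Once the server is up, a quick smoke test confirms the flags took effect. Here is a minimal client sketch using only the standard library; the `/generate` endpoint and the payload shape are assumptions based on vLLM’s legacy API server, so adjust them to match your deployment:

```python
import json
import urllib.request


def build_payload(prompt: str, max_tokens: int = 64, temperature: float = 0.0) -> dict:
    """Build the JSON body for the (assumed) /generate endpoint."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}


def generate(base_url: str, prompt: str) -> dict:
    """POST a prompt to the server and return the decoded JSON response."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```

With the server from the command above running, `generate("http://localhost:8000", "Hello")` should return the completion; if it hangs or errors, you are likely hitting one of the failure modes below.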

The "Too Many Open Files" Abyss

One of the most common, and frustrating, errors you’ll hit is OSError: [Errno 24] Too many open files. This isn’t a vLLM bug; it’s a fundamental operating system limit. vLLM, especially with its async nature and multi-process workers, can hold a lot of file descriptors open at once: network sockets for concurrent connections, pipes between worker processes, and memory-mapped weight files all count against the limit.

Diagnosis:

Run ulimit -n on the server where vLLM is running. If the output is low (e.g., 1024, 4096), that’s your culprit. You can also check the current open file count for the vLLM process:

lsof -p $(pgrep -f "vllm.entrypoints.api_server") | wc -l

Fix:

You need to increase the system-wide and user-specific limits. Edit /etc/security/limits.conf and add these lines:

* soft nofile 65536
* hard nofile 65536
root soft nofile 65536
root hard nofile 65536

Then, for the current shell session (the limits.conf change only takes effect on new logins), you can apply it with:

ulimit -n 65536

If you’re running vLLM within a Docker container, you’ll need to set these limits when running the container:

docker run --ulimit nofile=65536:65536 ... your_vllm_image ...

Why it works: This increases the maximum number of file descriptors (which includes network sockets, pipes, and actual files) that a process can open, preventing the OS from refusing new connections or operations.
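
You can also verify, and within the hard limit raise, the soft limit from inside the serving process itself, which helps when you can’t touch limits.conf. A sketch using Python’s standard resource module:

```python
import resource


def ensure_nofile_limit(target: int = 65536) -> int:
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.

    A process may raise its own soft limit up to the hard limit without
    root privileges. Returns the soft limit now in effect.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        soft = new_soft
    return soft


print("soft nofile limit:", ensure_nofile_limit())
```

Calling this early in a launcher script (before the engine starts accepting connections) gives you a clear log line instead of a mid-traffic Errno 24.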

The "CUDA out of memory" Ghost

This is the classic. You’ve loaded your model, and suddenly, requests start failing with CUDA error: out of memory. It’s not always about just fitting the model weights.

Diagnosis:

Monitor GPU memory usage with nvidia-smi. Look for spikes when requests come in, especially if you’re using a large batch size or long sequences.

Common Causes & Fixes:

  1. Insufficient gpu-memory-utilization:

    • Diagnosis: nvidia-smi shows VRAM left unused, yet vLLM reports few available KV cache blocks or preempts requests under load, because gpu-memory-utilization is set too low (e.g., 0.7) for your model and expected load.
    • Fix: Increase gpu-memory-utilization to 0.9 or 0.95.
      python -m vllm.entrypoints.api_server \
          --model ... \
          --gpu-memory-utilization 0.95 \
          ...
      
    • Why it works: This tells vLLM to aggressively reserve more of the GPU’s VRAM for its internal PagedAttention mechanism and KV cache, leaving less room for other processes but ensuring vLLM has what it needs.
  2. max-num-seqs Too High:

    • Diagnosis: You’re seeing OOM errors even with high gpu-memory-utilization. The KV cache size is directly proportional to max-num-seqs and sequence length.
    • Fix: Decrease max-num-seqs. For an 80GB A100 and a quantized 70B model, you might need to start with 512 or even 256 and tune upwards.
      python -m vllm.entrypoints.api_server \
          --model ... \
          --max-num-seqs 512 \
          ...
      
    • Why it works: Each sequence in the KV cache consumes memory. Reducing the maximum number of concurrent sequences directly cuts down the KV cache footprint.
  3. max-model-len Too High:

    • Diagnosis: Even with few sequences, you get OOM. The model’s maximum context window is larger than necessary for your use case.
    • Fix: Set max-model-len to your actual maximum expected sequence length. If your prompts and generations are usually under 2048 tokens, set it to 2048.
      python -m vllm.entrypoints.api_server \
          --model ... \
          --max-model-len 2048 \
          ...
      
    • Why it works: This limits the size of the KV cache per sequence, preventing excessively long (and memory-hungry) cache entries.
  4. Tensor Parallelism Misconfiguration:

    • Diagnosis: When using tensor-parallel-size > 1, memory might be unevenly distributed or insufficient on individual GPUs.
    • Fix: Ensure tensor-parallel-size matches your GPU count if you want to distribute the model across all GPUs. If you have 4 GPUs and want to use them all for one model instance:
      python -m vllm.entrypoints.api_server \
          --model ... \
          --tensor-parallel-size 4 \
          ...
      
    • Why it works: Distributes model weights and intermediate activations across multiple GPUs, lowering the VRAM requirement per GPU.
  5. Quantization (--quantization):

    • Diagnosis: You’re running a large model and can’t fit it even with aggressive gpu-memory-utilization.
    • Fix: Use quantization (e.g., awq, gptq). vLLM loads pre-quantized checkpoints, so point --model at an AWQ- or GPTQ-quantized version of the model; the bit width and group size are read from the checkpoint’s own quantization config, not from command-line flags.
      python -m vllm.entrypoints.api_server \
          --model ... \
          --quantization awq \
          ...
      
    • Why it works: Reduces the precision of model weights (e.g., from 16-bit floats to 4-bit integers), drastically cutting VRAM usage for the model weights themselves.
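
A back-of-the-envelope calculation makes knobs 2 and 3 concrete: per token, the cache stores a key and a value vector for every layer. Here is a sketch; the 80-layer / 8-KV-head / 128-dim numbers below are illustrative assumptions for a 70B-class model with grouped-query attention, not measured values:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, num_seqs: int, dtype_bytes: int = 2) -> int:
    """Worst-case KV cache size: 2 (K and V) * layers * kv_heads * head_dim
    bytes per token, times tokens per sequence, times concurrent sequences."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * seq_len * num_seqs


# Illustrative 70B-class model, fp16 cache, every sequence at full length:
gib = kv_cache_bytes(80, 8, 128, seq_len=2048, num_seqs=512) / 2**30
print(f"{gib:.1f} GiB")  # 320.0 GiB
```

PagedAttention assigns cache blocks on demand, so this is the ceiling reached only when every sequence hits max-model-len; still, it shows why halving either max-num-seqs or max-model-len halves your worst-case exposure.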

The "Bind Failed" Conundrum

You try to start the server, and it immediately fails with OSError: [Errno 98] Address already in use.

Diagnosis:

Another process is already listening on the port you specified (default is 8000).

Fix:

  1. Find and kill the rogue process:
    sudo lsof -i :8000
    sudo kill <PID_of_rogue_process>
    
  2. Choose a different port:
    python -m vllm.entrypoints.api_server \
        --model ... \
        --port 8001 \
        ...
    

Why it works: Ensures that only your vLLM instance is bound to the network port, allowing incoming requests to reach it.
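
If you’d rather not play whack-a-mole with ports, you can probe for a free one before launching. A sketch using the standard socket module (note the inherent race: another process could still grab the port between the probe and the server’s own bind):

```python
import socket


def find_free_port(preferred: int = 8000, host: str = "0.0.0.0") -> int:
    """Return `preferred` if it is bindable, else an OS-assigned free port."""
    for candidate in (preferred, 0):  # port 0 asks the OS to pick any free port
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind((host, candidate))
                return s.getsockname()[1]
        except OSError:
            continue
    raise RuntimeError("no free port found")


print("launching on port", find_free_port())
```

A launcher script can feed the result straight into --port, so a stale process on 8000 degrades into a log line instead of a crash.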

The "Model Not Found" Mirage

You point vLLM at a model name, and it throws an error like OSError: [Errno 2] No such file or directory: '/root/.cache/huggingface/hub/...'.

Diagnosis:

The Hugging Face transformers library can’t find the model locally, and it’s failing to download it.

Common Causes & Fixes:

  1. No Internet Access / Firewall:
    • Fix: Ensure the server has outbound internet access to Hugging Face’s model hub. Check firewalls and proxy settings.
  2. Incorrect Model Name:
    • Fix: Double-check the model identifier on Hugging Face Hub (e.g., facebook/opt-6.7b vs. facebook/OPT-6.7B). Case sensitivity matters.
  3. Hugging Face Token Issues:
    • Fix: If the model is private or gated, ensure your Hugging Face token is set correctly as an environment variable (HF_TOKEN) or via huggingface-cli login (which writes ~/.cache/huggingface/token).
      export HF_TOKEN="your_hf_token_here"
      python -m vllm.entrypoints.api_server --model ...
      
  4. Cache Directory Permissions:
    • Fix: Ensure the user running vLLM has write permissions to the Hugging Face cache directory (default: ~/.cache/huggingface/hub).

Why it works: Guarantees that vLLM can locate and load the model weights, either from a local cache or by downloading them from the remote repository.
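
A quick preflight check for causes 3 and 4, token visible and cache directory writable, can save a failed multi-gigabyte download. A sketch; the paths follow Hugging Face defaults, with HF_HOME as the override:

```python
import os
from typing import Optional


def _writable(path: str) -> bool:
    """Walk up to the nearest existing ancestor and test writability there,
    since the cache directory may not have been created yet."""
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:
            return False
        path = parent
    return os.access(path, os.W_OK)


def hf_preflight(cache_dir: Optional[str] = None) -> dict:
    """Report whether an HF token is visible and the cache dir is writable."""
    hf_home = os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    cache_dir = cache_dir or os.path.join(hf_home, "hub")
    return {
        "token_in_env": bool(os.environ.get("HF_TOKEN")),
        "token_file_exists": os.path.isfile(os.path.join(hf_home, "token")),
        "cache_dir": cache_dir,
        "cache_writable": _writable(cache_dir),
    }


print(hf_preflight())
```

Run it as the same user that will run vLLM; a `False` in either token field for a gated model, or in `cache_writable`, predicts the mirage before the server ever starts.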

The "Worker Died" Mystery

Sometimes, the API server starts, but the underlying engine workers crash silently or with cryptic messages.

Diagnosis:

Check the vLLM server logs closely. Look for exceptions related to CUDA initialization, model loading, or specific layers. Often, this points to a mismatch between your CUDA toolkit version and the PyTorch/vLLM build.

Fix:

  1. Ensure CUDA Toolkit Compatibility: vLLM is built against specific CUDA versions. Check the vLLM documentation for the recommended CUDA toolkit version for your installation.
    • Example: If vLLM requires CUDA 11.8, make sure nvcc --version shows 11.8. If not, install/configure the correct CUDA toolkit.
  2. PyTorch Version: Ensure your PyTorch version is compatible with your CUDA toolkit and vLLM.
    • Example: Install PyTorch with the correct CUDA version:
      pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
      
  3. Reinstall vLLM: Sometimes a clean install helps:
    pip uninstall vllm
    pip install vllm
    

Why it works: Ensures that the deep learning framework (PyTorch) and the underlying GPU driver/runtime (CUDA) are compatible, allowing the model to be correctly compiled and run on the GPU.
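
The version check in step 1 is easy to get wrong by eyeballing. A small helper can compare nvcc’s reported version against a PyTorch wheel tag; the parsing rules are assumptions about the usual `release 11.8` and `cu118` formats:

```python
import re


def nvcc_version(nvcc_output: str) -> str:
    """Extract 'major.minor' from `nvcc --version` output, e.g. '11.8'."""
    m = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if not m:
        raise ValueError("could not parse nvcc output")
    return f"{m.group(1)}.{m.group(2)}"


def wheel_tag_matches(cuda_version: str, wheel_tag: str) -> bool:
    """Check a toolkit version like '11.8' against a wheel tag like 'cu118'."""
    major, minor = cuda_version.split(".")
    return wheel_tag == f"cu{major}{minor}"


sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(nvcc_version(sample), wheel_tag_matches(nvcc_version(sample), "cu118"))
```

Feed it the real output of `nvcc --version` and the `cuXXX` tag from your `pip install` index URL, and a mismatch becomes an explicit failure rather than a silently dying worker.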

Once these are sorted, you’ll likely run into issues with request batching strategies and latency optimization for specific model architectures.

Want structured learning?

Take the full vLLM course →