A vLLM deployment isn’t just about loading a model; it’s a distributed system in which the inference server, the model weights, and the client requests are all independent moving parts.

Let’s walk through a production-ready vLLM setup, not just the docker run command.

The Core: Serving the Model

At its heart, vLLM is about efficient LLM inference. The vLLM Python library is the engine, and the OpenAI-compatible API server is the most common way to expose it.

Here’s a basic server startup:

python -m vllm.entrypoints.openai.api_server \
    --model lmsys/vicuna-7b-v1.5 \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --max-num-seqs 1024 \
    --gpu-memory-utilization 0.9
  • --model lmsys/vicuna-7b-v1.5: This points to the model weights. You can use Hugging Face model IDs or local paths.
  • --host 0.0.0.0: Makes the server accessible from any network interface. For production, you’ll likely want to bind to a specific IP or use a reverse proxy.
  • --port 8000: The port the API server will listen on.
  • --tensor-parallel-size 2: Crucial for larger models. This splits the model weights across multiple GPUs, enabling inference on models that wouldn’t fit on a single GPU. The optimal number depends on your GPU count and VRAM.
  • --max-num-seqs 1024: The maximum number of sequences vLLM will process concurrently in a single batch (the default is 256). This is a key parameter for throughput.
  • --gpu-memory-utilization 0.9: Sets the fraction of GPU memory that vLLM can use. High utilization is good for performance but leaves less room for other processes or unexpected memory spikes.
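With the server up, any OpenAI-style client can talk to it. Here is a minimal stdlib-only sketch against the /v1/completions endpoint, assuming the server above is listening on 127.0.0.1:8000 (the helper names are mine, not part of vLLM):

```python
import json
import urllib.request

def build_completion_request(model, prompt, max_tokens=128, temperature=0.7):
    """Build a payload for the OpenAI-compatible /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def complete(base_url, payload):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_completion_request("lmsys/vicuna-7b-v1.5", "Hello, world")
# complete("http://127.0.0.1:8000", payload)  # requires the running server
```

In production you would use the official openai client or an HTTP library with retries, but the wire format is exactly this.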

Productionizing the Server

Running that command directly in a terminal isn’t production-ready. You need process management, networking, and observability.

1. Process Management (Systemd)

Let’s use systemd to manage the vLLM server. Create a service file at /etc/systemd/system/vllm-server.service:

[Unit]
Description=vLLM OpenAI API Server
After=network.target

[Service]
User=your_user
Group=your_group
WorkingDirectory=/path/to/your/vllm/project
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
    --model lmsys/vicuna-7b-v1.5 \
    --host 127.0.0.1 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --max-num-seqs 1024 \
    --gpu-memory-utilization 0.9 \
    --log-level info
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
  • User and Group: Run the process as a non-root user.
  • WorkingDirectory: Where your virtual environment or project files reside.
  • ExecStart: The command itself. Note --host 127.0.0.1 for internal access, assuming a reverse proxy will handle external traffic.
  • --log-level info: Essential for debugging. vLLM logs are written to journald here.
  • Restart=on-failure: Ensures the service restarts if it crashes.

Enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable vllm-server.service
sudo systemctl start vllm-server.service
sudo systemctl status vllm-server.service
journalctl -u vllm-server.service -f
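For monitoring scripts, systemctl show emits machine-readable key=value pairs that are easier to parse than systemctl status. A minimal sketch against the vllm-server.service unit above (the helper names are mine):

```python
import subprocess

def service_state(unit):
    """Ask systemd for a unit's state via `systemctl show` key=value output."""
    out = subprocess.run(
        ["systemctl", "show", unit, "-p", "ActiveState,SubState,NRestarts"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_show_output(out)

def parse_show_output(text):
    """Parse systemctl show's key=value lines into a dict."""
    return dict(line.split("=", 1) for line in text.strip().splitlines() if "=" in line)

def is_healthy(state):
    """Healthy means the unit is actively running."""
    return state.get("ActiveState") == "active" and state.get("SubState") == "running"

sample = "ActiveState=active\nSubState=running\nNRestarts=0\n"
print(is_healthy(parse_show_output(sample)))  # True
```

A rising NRestarts count is an early warning that the server is crash-looping even while its current state reads active.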

2. Reverse Proxy (Nginx)

To expose the service securely and manage traffic, use a reverse proxy like Nginx.

# /etc/nginx/sites-available/vllm
server {
    listen 80;
    server_name your_domain.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # vLLM streams responses as server-sent events; disable buffering
        # so tokens reach the client as they are generated
        proxy_http_version 1.1;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}

Enable the site and test Nginx config:

sudo ln -s /etc/nginx/sites-available/vllm /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx

For HTTPS, you’ll want to set up SSL certificates (e.g., with Certbot).

3. GPU Configuration and Drivers

Ensure your NVIDIA drivers are installed and up-to-date. nvidia-smi is your best friend. vLLM relies heavily on CUDA.

nvidia-smi

Check the driver version and CUDA compatibility. If you’re using Docker, ensure your Docker installation is configured to use the NVIDIA Container Toolkit.
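For quick scripting on top of nvidia-smi, its CSV query mode is far easier to parse than the default table output. A small sketch (the field names in the parsed dict are my own):

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total,temperature.gpu"

def read_gpu_stats():
    """Poll nvidia-smi in machine-readable CSV mode, one line per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

def parse_gpu_csv(text):
    """Parse each GPU's CSV line into named integer fields."""
    stats = []
    for line in text.strip().splitlines():
        util, used, total, temp = (int(v.strip()) for v in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used,
                      "mem_total_mib": total, "temp_c": temp})
    return stats

sample = "87, 19023, 24576, 71\n"
print(parse_gpu_csv(sample))
```

This is essentially what exporters like dcgm-exporter do for you, with a proper metrics pipeline attached.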

Security Considerations

  • Authentication/Authorization: The OpenAI API server itself has no built-in authentication. You must implement this at the reverse proxy layer (e.g., Nginx with auth_request to an auth service) or an API gateway. Never expose an unauthenticated LLM endpoint directly.
  • Rate Limiting: Implement rate limiting in Nginx or your API gateway to prevent abuse and denial-of-service attacks.
  • Network Segmentation: Run the vLLM server in a private network if possible, only exposing it via the reverse proxy.
  • Input Sanitization: While vLLM is not a web application in the traditional sense, be mindful of prompt injection if your application builds prompts from user input: untrusted text can override your system prompt and steer the model’s output.
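Rate limiting itself usually lives in Nginx (limit_req) or an API gateway, but the underlying algorithm is worth understanding. A minimal token-bucket sketch, with an injectable clock for testing (the TokenBucket class is illustrative, not part of vLLM or Nginx):

```python
import time

class TokenBucket:
    """Per-client token bucket: refills at `rate` tokens/sec, bursts up to `capacity`."""
    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True if one request may pass, consuming a token."""
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=2.0, capacity=5.0, now=0.0)
print([bucket.allow(now=0.0) for _ in range(6)])  # five pass, sixth rejected
```

For LLM serving, consider limiting on tokens generated rather than requests: a single request with a huge max_tokens can cost as much as hundreds of small ones.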

Monitoring and Observability

  • System Metrics: Monitor CPU, memory, disk I/O, and network traffic on the server. htop, vmstat, iostat, netstat are basic tools.
  • GPU Metrics: This is critical. nvidia-smi provides real-time GPU utilization, memory usage, and temperature. For historical data, consider nvtop or Prometheus exporters like dcgm-exporter.
    • Key metrics to watch:
      • GPU Utilization: Should be high during inference.
      • Memory Used: vLLM pre-allocates up to the --gpu-memory-utilization fraction at startup, so steady high usage is normal; watch for other processes pushing total usage toward 100%, which risks OOM errors.
      • PCIe Throughput: Indicates how fast data is moving between CPU and GPU, important for tensor parallelism.
  • vLLM Logs: As configured with systemd, logs go to journalctl. Parse these for errors, warnings, and performance insights.
  • Application Metrics:
    • Request Latency: How long does it take from request to response?
    • Throughput: Requests per second (RPS).
    • Error Rate: HTTP 5xx errors from the vLLM server.
    • Token Generation Speed: Tokens per second.
    • Queue Depth: Number of requests waiting.
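Most of these application metrics can be derived from plain request logs. A minimal sketch of percentile latency and token-throughput calculations (the sample values are made up, and nearest-rank is just one of several percentile conventions):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[idx]

def tokens_per_second(total_tokens, wall_seconds):
    """Aggregate token generation speed over a measurement window."""
    return total_tokens / wall_seconds

latencies = [0.8, 1.1, 0.9, 2.5, 1.0, 1.2, 0.7, 3.1, 1.4, 1.0]
print(f"p50={percentile(latencies, 50):.2f}s  p95={percentile(latencies, 95):.2f}s")
print(f"gen speed: {tokens_per_second(12_800, 60):.1f} tok/s")
```

Track p95/p99 rather than the mean: continuous batching keeps median latency stable while tail latency degrades first under load.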

Recent vLLM versions expose Prometheus metrics at the server’s /metrics endpoint; you can also scrape Nginx access logs for request-level data.

A common pattern is to use Prometheus for metrics collection and Grafana for visualization. You’d typically set up a node_exporter for system metrics and dcgm-exporter for GPU metrics.

Advanced Configurations

  • Quantization: For larger models, consider quantization (e.g., AWQ, GPTQ) to reduce VRAM usage and potentially increase speed, often with a small accuracy trade-off. You’d load a quantized model like --model TheBloke/Llama-2-7B-Chat-AWQ.
  • KV Cache Optimization: The KV cache gets whatever GPU memory remains after the weights are loaded, up to the --gpu-memory-utilization fraction. If requests are being preempted or rejected because the KV cache is full, reduce --max-num-seqs to limit concurrent demand, raise --gpu-memory-utilization slightly, or use a smaller or quantized model.
  • Model Parallelism (tensor-parallel-size): For models larger than a single GPU, this is essential. For very large models, you might even need pipeline parallelism, which vLLM also supports but is more complex to configure.
  • Continuous Batching: vLLM’s core feature. It dynamically batches incoming requests to maximize GPU utilization. The --max-num-seqs parameter influences how many sequences can be active in the batch.
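To reason about --max-num-seqs and --gpu-memory-utilization together, it helps to estimate KV cache consumption per token. A back-of-the-envelope sketch, assuming a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128, fp16); the helper names are mine:

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes/elem."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_cached_tokens(cache_budget_gib, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """How many tokens fit in a given KV-cache budget."""
    per_token = kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes)
    return int(cache_budget_gib * 1024**3 / per_token)

# Llama-2-7B-like shape, fp16
print(kv_cache_bytes_per_token(32, 32, 128, 2))    # 524288 bytes ≈ 0.5 MiB per token
print(max_cached_tokens(8, 32, 32, 128, 2))        # tokens fitting in an 8 GiB budget
```

At roughly 0.5 MiB per token, an 8 GiB cache budget holds about 16K tokens total, which caps how many long-context sequences can actually run concurrently regardless of --max-num-seqs.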

The Next Step

After you’ve got your vLLM server running, secured, and monitored, the next logical step is to integrate it into a larger application architecture, perhaps involving a vector database for RAG or a complex orchestration layer for agentic behavior.

Want structured learning?

Take the full vLLM course →