A vLLM deployment isn’t just about loading a model; it’s a distributed system in which the inference server, the model weights, and the client requests are all independent moving parts.
Let’s walk through a production-ready vLLM setup, not just the docker run command.
The Core: Serving the Model
At its heart, vLLM is about efficient LLM inference. The vLLM Python library is the engine, and the OpenAI-compatible server is the common way to expose it.
Here’s a basic server startup:
python -m vllm.entrypoints.openai.api_server \
--model lmsys/vicuna-7b-v1.5 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--max-num-seqs 1024 \
--gpu-memory-utilization 0.9
- `--model lmsys/vicuna-7b-v1.5`: Points to the model weights. You can use Hugging Face model IDs or local paths.
- `--host 0.0.0.0`: Makes the server accessible from any network interface. For production, you’ll likely want to bind to a specific IP or use a reverse proxy.
- `--port 8000`: The port the API server will listen on.
- `--tensor-parallel-size 2`: Crucial for larger models. This splits the model weights across multiple GPUs, enabling inference on models that wouldn’t fit on a single GPU. The optimal number depends on your GPU count and VRAM.
- `--max-num-seqs 1024`: The maximum number of sequences (i.e., concurrent requests) vLLM will handle. This is a key parameter for throughput.
- `--gpu-memory-utilization 0.9`: The fraction of GPU memory vLLM can use. High utilization is good for performance but leaves less room for other processes or unexpected memory spikes.
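Once the server is up, any OpenAI-style client can talk to it. Here’s a minimal sketch using only the standard library; the base URL assumes the command above, and the endpoint path and payload shape follow the OpenAI chat API that vLLM mirrors:

```python
# Build (and optionally send) a request against the OpenAI-compatible
# chat completions endpoint exposed by the vLLM server started above.
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumes the server from the command above

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("lmsys/vicuna-7b-v1.5", "Summarize vLLM in one sentence.")
# To actually send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the official `openai` Python client also works if you point its `base_url` at the server.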
Productionizing the Server
Running that command directly in a terminal isn’t production. You need process management, networking, and more.
1. Process Management (Systemd)
Let’s use systemd to manage the vLLM server. Create a service file at /etc/systemd/system/vllm-server.service:
[Unit]
Description=vLLM OpenAI API Server
After=network.target
[Service]
User=your_user
Group=your_group
WorkingDirectory=/path/to/your/vllm/project
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
--model lmsys/vicuna-7b-v1.5 \
--host 127.0.0.1 \
--port 8000 \
--tensor-parallel-size 2 \
--max-num-seqs 1024 \
--gpu-memory-utilization 0.9 \
--log-level info
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
- `User` and `Group`: Run the process as a non-root user.
- `WorkingDirectory`: Where your virtual environment or project files reside.
- `ExecStart`: The command itself. Note `--host 127.0.0.1` for internal access, assuming a reverse proxy will handle external traffic.
- `--log-level info`: Essential for debugging. vLLM logs are written to `journald` here.
- `Restart=on-failure`: Ensures the service restarts if it crashes.
Enable and start it:
sudo systemctl daemon-reload
sudo systemctl enable vllm-server.service
sudo systemctl start vllm-server.service
sudo systemctl status vllm-server.service
journalctl -u vllm-server.service -f
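Large models can take minutes to load, so a deploy script shouldn’t assume the service is ready the moment `systemctl start` returns. A small readiness probe, assuming the `/health` endpoint that recent vLLM versions expose (swap in `/v1/models` if yours lacks it):

```python
# Poll the server until it answers HTTP 200 or we give up. The `probe`
# argument exists so the loop can be exercised without a live server.
import time
import urllib.request
import urllib.error

def wait_until_healthy(url: str, attempts: int = 30, delay: float = 2.0,
                       probe=None) -> bool:
    """Return True once `url` answers 200, False after `attempts` failures."""
    def default_probe(u):
        try:
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or default_probe
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False

# wait_until_healthy("http://127.0.0.1:8000/health")
```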
2. Reverse Proxy (Nginx)
To expose the service securely and manage traffic, use a reverse proxy like Nginx.
# /etc/nginx/sites-available/vllm
server {
listen 80;
server_name your_domain.com;
location / {
proxy_pass http://127.0.0.1:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Streaming responses are sent as Server-Sent Events; disable
# buffering so tokens are flushed to the client as they are generated
proxy_http_version 1.1;
proxy_buffering off;
proxy_read_timeout 300s;
}
}
Enable the site and test Nginx config:
sudo ln -s /etc/nginx/sites-available/vllm /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx
For HTTPS, you’ll want to set up SSL certificates (e.g., with Certbot).
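When a client sets `"stream": true`, tokens arrive through the proxy as Server-Sent Events: `data: {...}` lines carrying incremental deltas. A minimal parser for those chunks (a sketch; the field names follow the OpenAI streaming format that vLLM mirrors):

```python
# Extract content deltas from an OpenAI-style SSE stream.
import json

def parse_sse_chunks(lines):
    """Yield content deltas from 'data: {...}' lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):]
        if body == "[DONE]":  # sentinel that ends the stream
            return
        delta = json.loads(body)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Example chunks as they would arrive over the wire:
stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse_chunks(stream)))  # -> Hello
```

This is also why the Nginx block above disables `proxy_buffering`: with buffering on, Nginx would hold chunks back and the stream would arrive in bursts.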
3. GPU Configuration and Drivers
Ensure your NVIDIA drivers are installed and up-to-date. nvidia-smi is your best friend. vLLM relies heavily on CUDA.
nvidia-smi
Check the driver version and CUDA compatibility. If you’re using Docker, ensure your Docker installation is configured to use the NVIDIA Container Toolkit.
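For scripted checks, `nvidia-smi`’s CSV query mode is easier to parse than its default table; `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` prints one `used, total` pair (in MiB) per GPU. A parsing sketch with illustrative, made-up values:

```python
# Parse per-GPU memory figures from nvidia-smi's CSV query output.
def parse_gpu_memory(csv_output: str):
    """Return a list of (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in csv_output.strip().splitlines():
        used, total = (int(v.strip()) for v in line.split(","))
        rows.append((used, total))
    return rows

# Illustrative output for a hypothetical 2-GPU host:
sample = "10240, 24576\n9800, 24576"
for i, (used, total) in enumerate(parse_gpu_memory(sample)):
    print(f"GPU {i}: {used}/{total} MiB ({used / total:.0%})")

# On a real host you would feed in:
# subprocess.check_output(
#     ["nvidia-smi", "--query-gpu=memory.used,memory.total",
#      "--format=csv,noheader,nounits"], text=True)
```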
Security Considerations
- Authentication/Authorization: The OpenAI-compatible server has no built-in authentication beyond an optional static API key (`--api-key` in newer versions). For anything more, implement auth at the reverse proxy layer (e.g., Nginx with `auth_request` to an auth service) or an API gateway. Never expose an unauthenticated LLM endpoint directly.
- Rate Limiting: Implement rate limiting in Nginx or your API gateway to prevent abuse and denial-of-service attacks.
- Network Segmentation: Run the vLLM server in a private network if possible, only exposing it via the reverse proxy.
- Input Sanitization: While vLLM is not a web application in the traditional sense, be mindful of prompt injection vulnerabilities if your application builds prompts from user input. The model itself can be exploited.
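If you do terminate auth in a thin Python shim in front of vLLM rather than in Nginx, compare presented API keys in constant time to avoid timing side channels. A sketch (the key store is hypothetical):

```python
# Constant-time API key check using hmac.compare_digest, which does not
# short-circuit on the first mismatching character.
import hmac

VALID_KEYS = {"team-a": "sk-example-key"}  # hypothetical key store

def is_authorized(presented_key: str) -> bool:
    """Return True if the presented key matches any known key."""
    return any(
        hmac.compare_digest(presented_key, secret)
        for secret in VALID_KEYS.values()
    )

print(is_authorized("sk-example-key"))  # True
print(is_authorized("wrong"))           # False
```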
Monitoring and Observability
- System Metrics: Monitor CPU, memory, disk I/O, and network traffic on the server. `htop`, `vmstat`, `iostat`, and `netstat` are basic tools.
- GPU Metrics: This is critical. `nvidia-smi` provides real-time GPU utilization, memory usage, and temperature. For historical data, consider `nvtop` or Prometheus exporters like `dcgm-exporter`. Key metrics to watch:
  - GPU Utilization: Should be high during inference.
  - Memory Used: Should stay safely below the budget set by `--gpu-memory-utilization` to avoid OOM errors.
  - PCIe Throughput: Indicates how fast data is moving between CPU and GPU, which matters for tensor parallelism.
- vLLM Logs: As configured with systemd, logs go to `journalctl`. Parse these for errors, warnings, and performance insights.
- Application Metrics:
  - Request Latency: How long does it take from request to response?
  - Throughput: Requests per second (RPS).
  - Error Rate: HTTP 5xx errors from the vLLM server.
  - Token Generation Speed: Tokens per second.
  - Queue Depth: Number of requests waiting.
You can scrape Prometheus metrics from vLLM itself (recent versions expose a `/metrics` endpoint on the API server) or parse Nginx access logs.
A common pattern is to use Prometheus for metrics collection and Grafana for visualization. You’d typically set up a node_exporter for system metrics and dcgm-exporter for GPU metrics.
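The latency and token-speed metrics above are easy to derive client-side from the `usage` field that vLLM’s OpenAI-compatible responses include. A small aggregation sketch (field names follow the OpenAI format; the sample numbers are made up):

```python
# Aggregate per-request stats into throughput and tail-latency figures.
def summarize(requests):
    """requests: list of dicts with completion_tokens and latency_s."""
    total_tokens = sum(r["completion_tokens"] for r in requests)
    total_time = sum(r["latency_s"] for r in requests)
    lat = sorted(r["latency_s"] for r in requests)
    # nearest-rank style p95 index over the sorted latencies
    idx = min(len(lat) - 1, int(0.95 * (len(lat) - 1) + 0.5))
    return {
        "tokens_per_sec": total_tokens / total_time,
        "p95_latency_s": lat[idx],
    }

sample = [
    {"completion_tokens": 100, "latency_s": 2.0},
    {"completion_tokens": 300, "latency_s": 3.0},
]
print(summarize(sample))  # tokens_per_sec = 400 / 5.0 = 80.0
```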
Advanced Configurations
- Quantization: For larger models, consider quantization (e.g., AWQ, GPTQ) to reduce VRAM usage and potentially increase speed, often with a small accuracy trade-off. You’d load a quantized model like `--model TheBloke/Llama-2-7B-Chat-AWQ`.
- KV Cache Optimization: `--max-num-seqs` directly impacts KV cache pressure. If you see errors indicating the KV cache is full, reduce `--max-num-seqs`, shorten `--max-model-len`, or raise `--gpu-memory-utilization` to give the cache more headroom; a smaller model also helps.
- Model Parallelism (`--tensor-parallel-size`): For models larger than a single GPU, this is essential. For very large models, you might even need pipeline parallelism, which vLLM also supports but is more complex to configure.
- Continuous Batching: vLLM’s core feature. It dynamically batches incoming requests to maximize GPU utilization. The `--max-num-seqs` parameter caps how many sequences can be active in the batch.
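A rough KV-cache sizing estimate helps reason about the `--max-num-seqs` / `--gpu-memory-utilization` trade-off: per token, the cache stores one key and one value vector per layer, so the cost is `2 × layers × kv_heads × head_dim × bytes_per_element`. A sketch with assumed Llama-2-7B-like dimensions in fp16:

```python
# Back-of-the-envelope KV cache sizing (architecture numbers are assumed
# illustrative values, not read from any particular checkpoint).
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache consumed by one token of context."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB per token in fp16

# If, say, 10 GiB of VRAM is left over for the cache after weights,
# the token budget shared by all concurrent sequences is:
budget_tokens = (10 * 1024**3) // per_token
print(budget_tokens)  # 20480 tokens across all active sequences
```

This is why long contexts and high `--max-num-seqs` compete for the same memory: the budget is tokens across all active sequences, not per sequence.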
The Next Step
After you’ve got your vLLM server running, secured, and monitored, the next logical step is to integrate it into a larger application architecture, perhaps involving a vector database for RAG or a complex orchestration layer for agentic behavior.