vLLM can run significantly faster than Hugging Face Transformers on a single GPU, and the gains come less from raw compute than from how it manages memory and schedules requests.
Let’s get vLLM installed and running on your GPU. We’ll start with the essentials and then touch on some common pitfalls.
First, ensure you have a compatible NVIDIA GPU with sufficient VRAM for the models you intend to run. For a 7B-parameter model like Llama 2 7B or Mistral 7B in float16, 16GB of VRAM is a reasonable starting point; a 13B model needs roughly 26GB for the weights alone, so it won’t fit in 16GB without quantization. You’ll also need CUDA and cuDNN installed. The easiest way to manage this is often through the NVIDIA Container Toolkit or a pre-built Docker image.
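A quick back-of-envelope calculation makes the VRAM numbers above concrete. This is a minimal sketch: it estimates only the weight footprint (2 bytes per parameter in float16) and ignores the KV cache, activations, and framework overhead, which all add on top.

```python
def weight_vram_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough VRAM needed just for the model weights (float16 = 2 bytes/param)."""
    return n_params * bytes_per_param / 2**30

# A 7B model in float16 needs ~13 GiB for weights alone, before the KV cache;
# a 13B model needs ~24 GiB and clearly exceeds a 16 GB card.
print(f"7B  fp16: {weight_vram_gib(7e9):.1f} GiB")
print(f"13B fp16: {weight_vram_gib(13e9):.1f} GiB")
```

This is why a 16 GB card is comfortable for 7B models but requires quantization (or a bigger GPU) for 13B.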
The core installation is simple. Open your terminal and run:
pip install vllm
This installs the Python package along with a pinned PyTorch dependency. If you need a specific PyTorch build (for example, to match your CUDA version), install it first:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install vllm
Replace cu118 with your CUDA version (e.g., cu117, cu121).
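If you’re unsure which tag to use, the mapping from a CUDA version string (as reported by nvidia-smi or nvcc) to the pip index tag is mechanical. The helper below is purely illustrative; cuda_wheel_tag is a hypothetical name, not part of any library.

```python
def cuda_wheel_tag(cuda_version: str) -> str:
    """Map a CUDA version string (e.g. '11.8') to the tag used in the
    PyTorch download index URL (e.g. 'cu118')."""
    major, minor, *_ = cuda_version.split(".")
    return f"cu{major}{minor}"

print(cuda_wheel_tag("11.8"))  # cu118
print(cuda_wheel_tag("12.1"))  # cu121
```

Note that nvidia-smi reports the maximum CUDA version your driver supports, which may be newer than the toolkit you have installed; either way, pick the closest tag the PyTorch index actually offers.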
Now, let’s run a basic inference. We’ll use the vllm.LLM class.
from vllm import LLM, SamplingParams
# Model to use. Mistral-7B is a good starting point.
model = "mistralai/Mistral-7B-Instruct-v0.1"
# Sampling parameters for controlling generation.
# We'll use greedy decoding here for simplicity.
sampling_params = SamplingParams(temperature=0.0, top_p=1.0)
# Initialize the LLM. This will download the model if not cached.
# Setting tensor_parallel_size=1 means using a single GPU.
llm = LLM(model=model, tensor_parallel_size=1)
# Prompts to generate text from.
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
# Generate text.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
When LLM(model=model, tensor_parallel_size=1) is called, vLLM orchestrates several key operations. It first determines the model architecture and loads the weights. Crucially, it then pre-allocates GPU memory for the KV cache. The KV cache stores the key and value states of previous tokens so vLLM can avoid recomputing them during generation, and it is managed in fixed-size blocks rather than as one contiguous region per sequence (more on this below). This is a primary reason for vLLM’s speed. If you encounter an out-of-memory error here, your GPU doesn’t have enough VRAM for the model weights plus the KV cache. The tensor_parallel_size parameter controls how many GPUs are used for model parallelism; for a single GPU, 1 is correct.
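You can estimate the KV cache footprint yourself. Each token stores a key and a value vector per layer, sized by the number of KV heads and the head dimension. The sketch below plugs in numbers matching the published Mistral-7B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, float16); treat them as illustrative, and substitute your own model’s config.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # 2 tensors (K and V) per layer, per token, each kv_heads * head_dim wide
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Mistral-7B-like config: 32 layers, 8 KV heads, head_dim 128, float16
per_token = kv_bytes_per_token(32, 8, 128)
print(per_token // 1024, "KiB per token")                          # 128 KiB
print(per_token * 4096 // 2**20, "MiB for a 4096-token context")   # 512 MiB
```

At roughly half a GiB per full-length sequence, it’s easy to see why the KV cache, not just the weights, drives out-of-memory errors under concurrent load.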
If you’re building a service, you’ll likely want to use the OpenAI-compatible server.
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-7B-Instruct-v0.1 \
--port 8000 \
--tensor-parallel-size 1
This command starts a web server that exposes an API endpoint very similar to OpenAI’s. You can then interact with it using standard HTTP requests or OpenAI client libraries. The --port specifies the network port, and --tensor-parallel-size again indicates single-GPU usage.
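Here’s a minimal client sketch using only the standard library. The request body follows the OpenAI completions API that vLLM’s server mimics; the complete helper and the localhost URL are illustrative, not part of any library.

```python
import json
from urllib import request

# Body for the /v1/completions endpoint; field names follow the
# OpenAI completions API that vLLM's server mimics.
payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": "The capital of France is",
    "max_tokens": 32,
    "temperature": 0.0,
}

def complete(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST a completion request to a running vLLM server and parse the JSON reply."""
    req = request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# With the server from above running:
#   print(complete(payload)["choices"][0]["text"])
```

Because the interface mirrors OpenAI’s, you can also point the official OpenAI client libraries at your server by overriding their base URL.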
One common issue is insufficient VRAM, especially with larger models. If LLM(model=model) fails with a CUDA out-of-memory error, you have a few options. The most direct is to use a smaller model. If you must use the larger one, you can reduce max_model_len during LLM initialization, which shrinks the KV cache footprint at the cost of limiting the maximum sequence length: llm = LLM(model=model, max_model_len=2048). If other processes share the GPU, you can also lower gpu_memory_utilization (which defaults to 0.9) so vLLM claims less of the total VRAM.
Another phase to understand is prompt processing (prefill). vLLM uses PagedAttention, a novel attention algorithm that manages the KV cache in fixed-size pages, much as operating systems manage virtual memory. This significantly reduces memory fragmentation and waste compared to allocating one contiguous cache per sequence. When you send a long prompt, vLLM processes its tokens in parallel during prefill and writes their KV states into the cache. If this phase is slow, it’s usually because the prompt is simply long. For very long contexts, consider techniques like summarization or prompt engineering to reduce input length.
If torch.cuda.is_available() returns False, or model loading fails with RuntimeError: Torch not compiled with CUDA enabled, your PyTorch installation is a CPU-only build or doesn’t match your CUDA setup. Ensure your torch installation matches your CUDA version; reinstalling PyTorch with the correct --index-url as shown earlier is the usual fix.
A more subtle performance issue can arise from the dtype of the model weights. vLLM defaults to float16 for efficiency. If you encounter numerical instability or unexpected output, you might try bfloat16 if your GPU supports it (e.g., Ampere architecture and newer), or even float32 (though this significantly increases VRAM usage). You can specify this during LLM initialization: llm = LLM(model=model, dtype="bfloat16").
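The instability risk with float16 comes from its narrow range: the largest finite float16 value is 65504, and anything bigger overflows to infinity, whereas bfloat16 keeps float32’s 8 exponent bits and reaches about 3.4e38. You can see the float16 limit with nothing but the standard library, since struct supports the half-precision "e" format:

```python
import struct

# 65504.0 is the largest finite float16 value, so packing it succeeds.
print(struct.pack("e", 65504.0))

# 70000.0 exceeds the float16 range and cannot be packed.
try:
    struct.pack("e", 70000.0)
except OverflowError as exc:
    print("float16 overflow:", exc)

# bfloat16 trades mantissa bits for float32's exponent range, so values
# like 70000.0 that overflow float16 remain representable in bfloat16.
```

Activations that spike past 65504 are exactly the kind of thing that produces NaNs or garbage output in float16 but not in bfloat16.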
Finally, if your model is hosted on Hugging Face Hub and you get OSError: [Errno 2] No such file or directory, it usually means the model weights are not found locally and vLLM couldn’t download them. Ensure the model identifier is correct and that you have network connectivity; for gated models such as Llama 2, authenticate first with huggingface-cli login. You can also pre-download models using Hugging Face’s transformers library or by cloning the repository.
The next hurdle you’ll likely face is optimizing inference throughput for multiple concurrent requests, which involves understanding vLLM’s request batching and queueing mechanisms.