Serving large language models efficiently is a surprisingly complex dance between hardware, software, and the model’s own architecture.

Let’s see vLLM in action. Imagine we have a Llama 2 7B model. Here’s how we’d load and serve it:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --max-num-seqs 1024 \
    --gpu-memory-utilization 0.9

This command kicks off an OpenAI-compatible API server. The --model flag points to our Llama 2 model. --tensor-parallel-size 2 tells vLLM to split the model weights across two GPUs, which is crucial for models that don’t fit on a single card. --dtype bfloat16 uses a lower-precision floating-point format, saving memory and speeding up computation with minimal accuracy loss. --max-num-seqs 1024 sets the maximum number of concurrent requests (sequences) the server can handle, and --gpu-memory-utilization 0.9 instructs vLLM to use 90% of available GPU memory, leaving a small buffer for system overhead.
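To get a feel for what these flags imply, here is a back-of-envelope sketch (a hypothetical helper, not part of vLLM) of how much VRAM per GPU is left for the KV cache once the weight shard and the utilization cap are accounted for. The 24 GB card and the ~14 GB weight figure for Llama 2 7B in bfloat16 are illustrative assumptions.

```python
# Back-of-envelope helper (not part of vLLM): estimate how much VRAM the
# flags above leave for the KV cache and activations on each GPU.

def kv_cache_budget_gb(gpu_vram_gb: float,
                       gpu_memory_utilization: float,
                       weights_gb_per_gpu: float) -> float:
    """VRAM vLLM may claim, minus this GPU's shard of the model weights."""
    usable = gpu_vram_gb * gpu_memory_utilization
    return usable - weights_gb_per_gpu

# Llama 2 7B in bfloat16 is roughly 14 GB of weights; with
# --tensor-parallel-size 2 each GPU holds about half of that.
# On a 24 GB card at --gpu-memory-utilization 0.9:
budget = kv_cache_budget_gb(24, 0.9, 14 / 2)
print(f"~{budget:.1f} GB per GPU left for KV cache and activations")
```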

Once running, you can send requests like this:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": "What is the capital of France?",
        "max_tokens": 50,
        "temperature": 0.7
    }'

The server will respond with the model’s generated text.
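From Python, the same call boils down to posting a JSON payload and reading the completion text back. The sketch below builds the payload from the curl example and parses a response; the parsing assumes the OpenAI-style completions schema, and the sample response string is illustrative, not real server output.

```python
import json

# Sketch of a Python client for the server above. build_payload mirrors the
# curl call; parse_completion assumes an OpenAI-style completions response.

def build_payload(prompt: str, max_tokens: int = 50,
                  temperature: float = 0.7) -> str:
    return json.dumps({
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    })

def parse_completion(response_body: str) -> str:
    """Pull the generated text out of an OpenAI-style completions response."""
    return json.loads(response_body)["choices"][0]["text"]

# Abridged example of the response shape and how it is parsed:
sample = '{"choices": [{"text": " The capital of France is Paris."}]}'
print(parse_completion(sample))
```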

vLLM tackles the central bottleneck of LLM serving by introducing a novel memory management system called PagedAttention. Traditional implementations reserve a contiguous memory region for each sequence’s key-value (KV) cache, leading to significant fragmentation as sequences vary in length. PagedAttention, inspired by virtual memory and paging in operating systems, divides the KV cache into fixed-size blocks. This allows for non-contiguous allocation, dramatically reducing memory waste and enabling higher throughput. It also allows requests that share a common prefix to share KV cache blocks, a technique known as prefix caching, further boosting efficiency.
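The block-table idea can be made concrete with a toy simulation. The sketch below is illustrative only (the class, names, and sizes are not vLLM internals): each sequence holds a list of physical block IDs, a new block is grabbed only when the previous one fills up, and finished sequences return their blocks to a shared free pool.

```python
# Toy sketch of the PagedAttention idea: the KV cache is split into
# fixed-size blocks, and a per-sequence block table maps logical positions
# to physical blocks. Illustrative only; not vLLM's actual implementation.

BLOCK_SIZE = 16  # tokens per KV block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Cache one more token; grab a fresh block only at block boundaries."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full (or none exists yet)
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]

cache = PagedKVCache(num_blocks=8)
for _ in range(20):  # a 20-token sequence spans ceil(20/16) = 2 blocks
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]))  # 2 blocks, no contiguous reservation
```

Because allocation happens one block at a time, no sequence ever reserves memory for tokens it has not yet generated, which is exactly where contiguous pre-allocation wastes space.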

The core problem vLLM solves is the dramatic underutilization of GPU resources when serving LLMs. Naive implementations struggle with the memory overhead of the KV cache, which grows with sequence length; fragmentation and static batching, where the whole batch waits for its slowest sequence, limit the number of concurrent requests a server can handle, leading to low GPU utilization and high latency. vLLM’s PagedAttention, along with its continuous batching mechanism, directly addresses these issues by optimizing memory allocation and request scheduling.
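Continuous batching is easy to see in a small scheduler simulation. The function below is a simplified sketch of the scheduling policy, not vLLM’s scheduler: at every decoding step, finished sequences leave the batch immediately and queued requests take their slots, so short requests never hold up long ones.

```python
from collections import deque

# Illustrative sketch of continuous batching: instead of waiting for an
# entire batch to finish, sequences join and leave the batch every step.

def continuous_batching(requests, max_num_seqs):
    """requests: list of (id, tokens_to_generate). Returns finish order."""
    waiting = deque(requests)
    running = {}   # id -> tokens still to generate
    finished = []
    while waiting or running:
        # Admit queued sequences up to the batch limit (per-step scheduling).
        while waiting and len(running) < max_num_seqs:
            rid, n = waiting.popleft()
            running[rid] = n
        # One forward pass: every running sequence emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:  # done -> its slot frees up immediately
                del running[rid]
                finished.append(rid)
    return finished

# Short requests finish early and free slots for queued work:
print(continuous_batching([("a", 5), ("b", 1), ("c", 3), ("d", 2)], max_num_seqs=2))
```

With static batching, "b" would sit in a finished state until "a" completed; here it exits after one step and "c" is admitted in its place.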

The --tensor-parallel-size parameter is your primary lever for distributing model weights across multiple GPUs. If your model’s weights alone exceed the VRAM of a single GPU, this is non-negotiable. For instance, a 70B parameter model in FP16 (2 bytes per parameter) requires roughly 140GB of VRAM just for weights. Splitting this across two A100 80GB GPUs (160GB total) is necessary. The --pipeline-parallel-size (not shown in the example) is another distribution strategy, splitting layers of the model across GPUs. Tensor parallelism splits individual layers, while pipeline parallelism splits the sequence of layers. vLLM primarily focuses on tensor parallelism for its efficiency gains.
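The arithmetic behind that 70B example is simple enough to script. This is a rough sketch of weight memory only; a real deployment also needs headroom for the KV cache, activations, and CUDA overhead, so fitting the weights is necessary but not sufficient.

```python
# Rough weight-memory math for tensor parallelism (the 70B example above).
# Weights only: KV cache, activations, and CUDA overhead are ignored here.

def weights_gb(num_params_b: float, bytes_per_param: float) -> float:
    """Memory for model weights, in GB (params in billions)."""
    return num_params_b * bytes_per_param

def fits(num_params_b, bytes_per_param, gpu_vram_gb, tensor_parallel_size):
    """Does each GPU's weight shard fit in its VRAM?"""
    per_gpu = weights_gb(num_params_b, bytes_per_param) / tensor_parallel_size
    return per_gpu < gpu_vram_gb

print(weights_gb(70, 2))   # 140.0 GB of FP16 weights
print(fits(70, 2, 80, 1))  # False: does not fit on one A100 80GB
print(fits(70, 2, 80, 2))  # True: ~70 GB per GPU across two A100s
```

Note that ~70 GB of weights on an 80 GB card leaves only ~10 GB for everything else, which is why such deployments often use a higher degree of parallelism or quantization in practice.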

The --max-num-seqs parameter directly controls the batch size. A higher value allows vLLM to pack more requests into a single forward pass, amortizing the computational cost of attention across more requests and improving throughput. However, setting this too high can lead to out-of-memory errors if the total KV cache size for all active sequences exceeds available GPU memory. The --gpu-memory-utilization parameter is your safeguard here, ensuring that vLLM doesn’t overcommit VRAM. Finding the sweet spot between --max-num-seqs and --gpu-memory-utilization is key to maximizing throughput without crashing.
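Finding that sweet spot starts with estimating how much KV cache one sequence consumes. The sketch below uses the standard per-token formula (2 tensors, K and V, per layer per KV head) with Llama 2 7B’s published dimensions; the 10 GB budget and 2048-token length are illustrative assumptions.

```python
# Back-of-envelope KV cache sizing. Per-token bytes =
#   2 (K and V) * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_concurrent_seqs(kv_budget_gb, seq_len, per_token_bytes):
    """How many full-length sequences fit in a given KV cache budget."""
    return int(kv_budget_gb * 1e9 // (seq_len * per_token_bytes))

# Llama 2 7B: 32 layers, 32 KV heads, head_dim 128, bf16 (2 bytes)
per_tok = kv_bytes_per_token(32, 32, 128)
print(per_tok)                                 # 524288 bytes: ~0.5 MB per token
print(max_concurrent_seqs(10, 2048, per_tok))  # seqs a 10 GB budget supports
```

At ~0.5 MB per token, a single 2048-token sequence needs about 1 GB of KV cache, which is why --max-num-seqs set far above what memory allows simply cannot be realized.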

When using models with specialized attention mechanisms, like grouped-query attention (GQA) or multi-query attention (MQA), vLLM’s PagedAttention is particularly effective. These architectures reduce the number of key and value heads compared to standard multi-head attention, thereby shrinking the KV cache. PagedAttention’s efficient memory management gives these smaller caches even greater headroom, enabling longer sequences or larger batches for such optimized models.
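The savings follow directly from the per-token formula, since only the KV head count changes. The dimensions below are illustrative (Llama-2-7B-like layers and head size; the 8-group GQA configuration mirrors what Llama 2 70B uses).

```python
# KV cache shrinkage from GQA/MQA: only the number of KV heads changes.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

mha = kv_bytes_per_token(32, 32, 128)  # standard multi-head attention
gqa = kv_bytes_per_token(32, 8, 128)   # grouped-query: 8 KV head groups
mqa = kv_bytes_per_token(32, 1, 128)   # multi-query: a single KV head

print(mha // gqa, mha // mqa)  # 4x and 32x smaller KV cache per token
```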

The ability to serve models with different quantization levels (e.g., INT8, INT4) is also a significant factor in performance. While vLLM’s core optimizations focus on memory management and batching, the underlying model’s precision impacts both speed and memory. Lower precision reduces the VRAM footprint and can accelerate computation, but it can also introduce accuracy degradation. vLLM supports various quantization formats, often through integration with libraries like AutoGPTQ or AWQ, allowing you to balance these trade-offs based on your application’s requirements.
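The VRAM side of that trade-off is easy to quantify. This sketch counts weight bytes only, ignoring quantization metadata (scales, zero points) and kernel overheads, which add a small amount on top in practice.

```python
# Weight footprint at different precisions (weights only; quantization
# metadata and runtime overheads are ignored in this sketch).

BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(num_params_b: float, precision: str) -> float:
    """Weight memory in GB for a model with num_params_b billion parameters."""
    return num_params_b * BYTES_PER_PARAM[precision]

for p in ("bf16", "int8", "int4"):
    print(f"7B @ {p}: {weight_gb(7, p):.1f} GB")  # 14.0, 7.0, 3.5 GB
```

An INT4 7B model fits comfortably on a consumer GPU where the bf16 version would crowd out the KV cache, which is often the deciding factor rather than raw speed.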

The next step is optimizing for latency, which often involves exploring techniques like speculative decoding.

Want structured learning?

Take the full vLLM course →