The most surprising thing about vLLM’s OpenAI-compatible API is that it’s not just compatible: for the same model weights, vLLM often delivers far higher raw throughput than a conventional serving stack such as a plain Hugging Face Transformers server.

Let’s see it in action. Assuming you have vLLM installed and a model like meta-llama/Llama-2-7b-chat-hf downloaded, you can launch a server with a single command:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --host 0.0.0.0

Now, from another terminal, you can interact with it using curl as if it were OpenAI:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 50
    }'

The response will look like this:

{
    "id": "chatcmpl-xxxxxxxxxxxxxxxxxxxxxxxxx",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " The capital of France is Paris."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 22,
        "completion_tokens": 7,
        "total_tokens": 29
    }
}

This compatibility means you can often drop vLLM in as a replacement for the OpenAI API with minimal code changes, especially if you’re using libraries like LangChain or LlamaIndex that already speak the OpenAI protocol.
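Because the server speaks the OpenAI wire format, the curl request above can be reproduced from Python with nothing but the standard library. This is a sketch: the URL assumes the server from earlier is running on localhost:8000, and build_chat_request is just an illustrative helper name.

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # local vLLM server

def build_chat_request(model: str, user_msg: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build the same chat-completions request the curl example sends."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("meta-llama/Llama-2-7b-chat-hf", "What is the capital of France?")
# urllib.request.urlopen(req) would return the JSON response shown above
# (this last step needs the server actually running)
```

In practice you would more likely point the official openai client library at http://localhost:8000/v1, but the payload it sends is exactly this one.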

The core problem vLLM solves is inefficient memory management and request scheduling in traditional LLM inference. Standard implementations reserve a contiguous slab of GPU memory for each request’s KV cache, sized for the maximum possible sequence length, and then process requests one by one or in small static batches. This leaves GPU VRAM badly underutilized, especially when requests have highly variable sequence lengths.

vLLM’s breakthrough is PagedAttention. Think of it like virtual memory management in operating systems. Instead of allocating contiguous blocks of memory for attention key-value (KV) caches, PagedAttention breaks these caches into fixed-size "pages" or "blocks." These pages can be scattered across GPU memory. When a request needs more KV cache space, vLLM can allocate new pages dynamically and link them together. This allows multiple requests to share the same GPU memory much more efficiently, reducing fragmentation and enabling higher batch sizes.
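The bookkeeping behind this can be sketched in a few lines. This is a toy illustration of the page-table idea, not vLLM’s actual implementation: BlockManager, grow, and release are invented names, though the block size of 16 tokens matches vLLM’s default.

```python
import math

BLOCK_SIZE = 16  # tokens per block; 16 is vLLM's default block size

class BlockManager:
    """Toy sketch: KV cache carved into fixed-size physical blocks, with a
    per-sequence "block table" mapping logical positions to scattered blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block IDs not in use
        self.tables = {}                     # seq_id -> list of physical block IDs

    def grow(self, seq_id: str, seq_len: int) -> None:
        """Allocate lazily: a new block only when a block boundary is crossed."""
        table = self.tables.setdefault(seq_id, [])
        while len(table) < math.ceil(seq_len / BLOCK_SIZE):
            table.append(self.free.pop())    # blocks need not be contiguous

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))

mgr = BlockManager(num_blocks=8)
mgr.grow("req-a", seq_len=17)  # 17 tokens -> needs 2 blocks
mgr.grow("req-b", seq_len=5)   # 5 tokens  -> needs 1 block
mgr.release("req-a")           # its blocks immediately become available to others
```

The point of the sketch: memory is wasted only inside a sequence’s last, partially filled block, instead of inside a huge contiguous allocation sized for the worst case.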

When you launch the server, vLLM pre-allocates a large chunk of GPU memory for its KV cache. It then runs a continuous-batching scheduler: at every generation step it can admit new requests and retire finished ones, rather than waiting for a whole static batch to drain. As requests generate tokens, their KV cache entries are managed by the PagedAttention mechanism, so even requests with wildly different sequence lengths can be packed together far more tightly than traditional methods allow. The OpenAI-compatible API is just a thin layer on top of this optimized inference engine, translating OpenAI’s JSON payloads into internal vLLM requests and vLLM’s outputs back into the OpenAI response format.

The key levers you control are the command-line arguments at server launch. --model is obvious, but --tensor-parallel-size is critical for splitting larger models across multiple GPUs. --max-model-len caps the maximum sequence length the model will handle, which directly bounds per-request KV cache size. --gpu-memory-utilization is a crucial knob: set it too low and you waste GPU memory that could hold more KV cache blocks; set it too high and you risk out-of-memory errors. A common starting point for a 7B model on a 24GB GPU is --gpu-memory-utilization 0.9, which is also vLLM’s default.
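Putting those flags together, a hypothetical launch of a 13B chat model split across two GPUs with a capped context might look like this (the flags are real vLLM options; the specific values are illustrative, not a tuned recommendation):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9 \
    --port 8000
```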

The attention mechanism itself, while abstracted away by PagedAttention, is what makes LLMs so computationally intensive. Each token needs to attend to all previous tokens to compute its representation. The KV cache stores the intermediate computations (keys and values) for each token in the sequence, so the model doesn’t have to recompute them. PagedAttention’s magic is in how it manages this cache space without wasting it on padding or fragmentation, allowing vLLM to serve more concurrent users or handle longer contexts with the same hardware.

If you notice that the max_tokens parameter in an API request seems to be ignored, it’s usually because the model has a fixed maximum context window that vLLM respects, and the prompt tokens count against it. Llama 2 models, for example, have a 4096-token context window. If you request max_tokens: 5000, generation stops once the model hits that limit, not after 5000 new tokens, and you’ll see finish_reason: "length" in the response, indicating it hit the maximum length.
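The arithmetic of that clamp is simple enough to write down. This is a sketch of the budget calculation, assuming the server caps generation at the context window rather than rejecting the request; effective_max_new_tokens is an illustrative helper, not a vLLM function.

```python
def effective_max_new_tokens(requested: int, prompt_tokens: int,
                             context_window: int) -> int:
    """How many new tokens can actually be generated before the
    context window (prompt + completion) is exhausted."""
    return min(requested, max(context_window - prompt_tokens, 0))

# A 4096-token Llama 2 context with a 22-token prompt can't honor max_tokens=5000:
print(effective_max_new_tokens(5000, 22, 4096))  # 4074
# A modest request fits with room to spare:
print(effective_max_new_tokens(50, 22, 4096))    # 50
```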

Want structured learning?

Take the full vLLM course →