vLLM can process requests much faster than naive sequential inference because it runs inference in batches: multiple requests are grouped together and processed by the model simultaneously, in the same forward pass.

Let’s see this in action. Imagine we have a vLLM server up and running, and we want to send it a few prompts to get some completions.

# Assuming your vLLM server is running on localhost:8000. The /generate
# endpoint below is the one exposed by vLLM's simple demo server
# (vllm.entrypoints.api_server); the OpenAI-compatible server exposes
# /v1/completions instead.
curl http://localhost:8000/generate -X POST -d '{
    "prompt": "The capital of France is",
    "n": 1,
    "use_beam_search": false,
    "temperature": 0.7,
    "max_tokens": 10
}' -H 'Content-Type: application/json'

curl http://localhost:8000/generate -X POST -d '{
    "prompt": "The largest planet in our solar system is",
    "n": 1,
    "use_beam_search": false,
    "temperature": 0.7,
    "max_tokens": 10
}' -H 'Content-Type: application/json'

curl http://localhost:8000/generate -X POST -d '{
    "prompt": "To be or not to be, that is",
    "n": 1,
    "use_beam_search": false,
    "temperature": 0.7,
    "max_tokens": 10
}' -H 'Content-Type: application/json'

When these requests arrive at the vLLM server, they don’t get processed one by one. Instead, vLLM’s scheduler runs in a continuous loop: on each iteration it assembles a batch from whatever requests are waiting (subject to configured limits), rather than waiting for a fixed delay or a fixed batch size. The LLM then processes this entire batch in a single forward pass. This is a massive win because the expensive parts of the LLM computation (like loading weights for matrix multiplications) are amortized across all requests in the batch, dramatically reducing the per-request overhead.
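To get an intuition for the amortization, here is a back-of-the-envelope cost model. The numbers are invented for illustration, not measured vLLM figures: assume each forward pass has a large fixed cost (kernel launches, reading the weights from GPU memory) plus a small marginal cost per sequence in the batch.

```python
# Toy cost model: fixed cost per forward pass vs. marginal cost per
# sequence. All numbers are hypothetical, chosen only to illustrate
# how batching amortizes the fixed cost.

FIXED_COST_MS = 50.0    # hypothetical fixed cost of one forward pass
PER_SEQ_COST_MS = 2.0   # hypothetical marginal cost per sequence

def sequential_time_ms(n_requests: int) -> float:
    """Each request gets its own forward pass."""
    return n_requests * (FIXED_COST_MS + PER_SEQ_COST_MS)

def batched_time_ms(n_requests: int) -> float:
    """All requests share a single forward pass."""
    return FIXED_COST_MS + n_requests * PER_SEQ_COST_MS

for n in (1, 8, 32):
    print(f"{n:2d} requests: sequential {sequential_time_ms(n):7.1f} ms, "
          f"batched {batched_time_ms(n):6.1f} ms")
```

With these made-up numbers, 32 sequential requests cost 32 × 52 ms, while one batched pass costs 50 + 32 × 2 ms; the fixed cost is paid once instead of 32 times.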

The core problem vLLM’s batching solves is the inefficiency of processing single, independent requests for large language models. Traditional inference engines often process requests sequentially. For a single request, the model performs a forward pass. If another request arrives, it waits for the first one to finish, then performs another forward pass. This is incredibly wasteful, especially on powerful hardware like GPUs, because the computational units are idle for a significant portion of the time between requests. Large language models are also very memory-intensive. Keeping the model weights in GPU memory is crucial for speed, but managing this memory efficiently for a dynamic stream of requests is complex.

vLLM tackles this by introducing two key innovations: PagedAttention and continuous batching. PagedAttention is a memory management system inspired by virtual memory and paging in operating systems. Instead of allocating a contiguous block of memory for each sequence’s KV cache (which stores intermediate attention computations), PagedAttention divides the KV cache into fixed-size blocks. These blocks can be stored non-contiguously in GPU memory and even shared across sequences, for example when they have a common prompt prefix. This allows for much more efficient memory utilization, reducing fragmentation and enabling larger batch sizes. Continuous batching builds on this by allowing new requests to be added to a running batch and old requests to be removed as they complete, without waiting for the entire batch to finish. This keeps the GPU busy processing requests as much as possible.
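The block-table idea behind PagedAttention can be sketched in a few lines. This is a deliberately simplified toy (the class names, the block size of 16, and the free-list allocator are all illustrative; vLLM’s real implementation differs): each sequence keeps a table mapping logical block positions to physical block IDs, and a new physical block is allocated only when the previous one fills up.

```python
# Toy sketch of PagedAttention-style block allocation. Names and the
# block size are illustrative, not vLLM internals.

BLOCK_SIZE = 16  # tokens stored per KV-cache block

class BlockAllocator:
    """Hands out physical block IDs from a free list."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block ID
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only every BLOCK_SIZE tokens;
        # the blocks need not be contiguous in GPU memory.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):        # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(seq.block_table)     # three (possibly non-contiguous) block IDs
```

Because memory is reserved one small block at a time rather than as one worst-case contiguous region per sequence, far less capacity sits idle, which is what enables the larger batch sizes.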

The exact levers you control are primarily related to how vLLM manages these batches and memory. The max_num_batched_tokens parameter in the vLLM engine configuration is crucial. It caps the total number of tokens, summed across all sequences, that can be processed in a single scheduler iteration. Requests that would push the batch over this budget stay in the waiting queue until space frees up. Another important setting is max_num_seqs, which limits the total number of sequences that can be processed concurrently. Tuning these parameters, alongside gpu_memory_utilization, which sets the fraction of GPU memory vLLM may use (primarily for the paged KV cache), directly impacts throughput and latency.

When vLLM is processing requests, it doesn’t just wait for a fixed number of requests to arrive before creating a batch. Instead, it uses a dynamic approach where it can add incoming requests to a currently processing batch if there is enough KV cache memory available, even if that batch isn’t "full" in terms of the number of sequences. This means that a request might join a batch that has already started processing its current step, significantly reducing the latency for that request compared to waiting for a completely new batch to form. This ability to interleave requests into ongoing batches is a core part of why vLLM achieves such high throughput.
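The benefit of interleaving can be seen in a toy simulation. This is not vLLM code; it is a minimal model in which, at every decode step, finished sequences leave the batch and waiting requests join if there is room, instead of the batch draining completely first. The request names and token counts are invented:

```python
# Toy simulation of continuous batching: sequences leave as they finish
# and waiting requests join mid-stream. Numbers are invented.

from collections import deque

MAX_SEQS = 2
waiting = deque([("A", 3), ("B", 2), ("C", 2)])  # (name, tokens to generate)
running: list[list] = []                         # [name, tokens_remaining]

steps = 0
while waiting or running:
    # Admit waiting requests into the running batch whenever a slot frees
    # up, without waiting for the whole batch to drain.
    while waiting and len(running) < MAX_SEQS:
        name, tokens = waiting.popleft()
        running.append([name, tokens])
    # One decode step: every running sequence generates one token.
    for seq in running:
        seq[1] -= 1
    running = [seq for seq in running if seq[1] > 0]
    steps += 1

print(steps)  # 4 steps; a static batch that drains fully would take 5
```

Here C slips into B's freed slot while A is still generating, finishing the workload in 4 decode steps; a static scheme (run {A, B} to completion, then C) would need 5. On real workloads with highly variable output lengths, this gap is what drives vLLM's throughput advantage.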

The next hurdle you’ll likely encounter is managing the tail latency of requests within a dynamically batched system, especially when dealing with highly variable request lengths and generation requirements.
