The most surprising thing about vLLM’s disaggregation of prefill and decode is how fundamentally different the two computational problems are: treating them as one leads to significant performance bottlenecks.

Let’s watch it in action. Imagine we’re serving a model, say lmsys/vicuna-7b-v1.5. We’ve loaded it with vLLM, and now requests are coming in. A request might look like this:

{
  "prompt": "The quick brown fox jumps over the lazy dog.",
  "max_tokens": 50
}

When this request arrives, vLLM doesn’t just blindly process it. It first enters the "prefill" phase, where the model processes the entire input prompt in a single parallel pass. Think of it like reading a whole paragraph at once. For our prompt, which tokenizes to roughly a dozen tokens depending on the tokenizer, vLLM computes the hidden states for all of them simultaneously. This is highly parallelizable not because the tokens are independent (causal attention still lets every token attend to all earlier ones) but because those projections and attention scores can be fused into a few large, batched matrix multiplications. The output of this phase is a set of KV cache entries for every token in the prompt.
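A toy sketch of what prefill produces and how much attention work it batches (plain Python for illustration, not vLLM internals; every name here is invented):

```python
def fake_kv(token):
    # Stand-in for the real per-layer key/value projections.
    return (f"K({token})", f"V({token})")

def prefill(prompt_tokens):
    # One pass over the whole prompt: in a real engine this is a handful of
    # batched matmuls; here we just map over the tokens.
    kv_cache = [fake_kv(t) for t in prompt_tokens]
    # Causal attention still scores each token against all earlier ones,
    # n*(n+1)/2 pairs in total, but they are computed together, not per step.
    n = len(prompt_tokens)
    attention_pairs = n * (n + 1) // 2
    return kv_cache, attention_pairs

prompt = "The quick brown fox jumps over the lazy dog .".split()
cache, pairs = prefill(prompt)
print(len(cache), pairs)  # 10 KV entries, 55 causal score pairs
```

The point is the shape of the work: one KV entry per prompt token, all produced at once, with the quadratic attention cost paid in a single batched pass.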

Once the prefill is done, the model enters the "decode" phase. This is where the model generates new tokens, one by one. For our example, the first generated token might be "The". Now, for the next token generation, the model needs to consider the prompt and the first generated token. It doesn’t re-process the prompt; it uses the KV cache generated during prefill and only computes the hidden state for the new token. This is inherently sequential: you can’t know the 10th generated token without knowing the 9th.
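A toy decode loop (again illustrative Python, not vLLM code) makes the sequential structure visible: each step projects only the newest token, appends one KV entry, and still attends to the full cached history:

```python
def fake_kv(token):
    # Stand-in for the key/value projections of a single new token.
    return (f"K({token})", f"V({token})")

def decode_step(kv_cache, new_token):
    # Only the new token is projected; the prompt's entries come from the
    # cache built at prefill time and are never recomputed.
    kv_cache.append(fake_kv(new_token))
    # The new token attends to everything cached so far, so the number of
    # attention scores this step equals the current sequence length.
    return len(kv_cache)

kv_cache = [fake_kv(t) for t in "The quick brown fox".split()]  # from prefill
scores_per_step = [decode_step(kv_cache, t) for t in ["The", "dog", "ran"]]
print(scores_per_step)  # [5, 6, 7]: each step sees one more cached token
```

Note the asymmetry with prefill: the per-step compute is tiny, but the steps cannot be parallelized with each other.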

Here’s how vLLM manages this internally. It uses a PagedAttention mechanism, which is key to its efficiency. Instead of allocating a fixed contiguous block of memory for the KV cache for each sequence, PagedAttention divides the KV cache into fixed-size "blocks." These blocks are then managed like pages in a virtual memory system. When a sequence grows (i.e., during the decode phase), vLLM can allocate a new block and link it to the existing ones. This avoids the fragmentation issues common in other systems where large contiguous blocks might be needed, leading to wasted memory.
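A minimal sketch of the block-table idea, using vLLM’s default block size of 16 tokens (the class and method names here are invented for illustration):

```python
BLOCK_SIZE = 16  # vLLM's default KV cache block size

class ToyBlockManager:
    """Maps each sequence's logical blocks to physical blocks in a shared pool."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, new_seq_len):
        table = self.block_tables.setdefault(seq_id, [])
        # A fresh physical block is needed only when the sequence crosses a
        # block boundary; otherwise the new token fits in the last block.
        blocks_needed = -(-new_seq_len // BLOCK_SIZE)  # ceiling division
        while len(table) < blocks_needed:
            table.append(self.free.pop())

mgr = ToyBlockManager(num_physical_blocks=64)
for seq_len in range(1, 40):       # the sequence grows one token at a time
    mgr.append_token("req-0", seq_len)
print(mgr.block_tables["req-0"])   # 3 physical blocks cover 39 tokens
```

Because blocks need not be contiguous, a growing sequence only ever wastes the unfilled tail of its last block, at most 15 token slots here.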

The disaggregation means vLLM can optimize these two distinct phases differently. Prefill can leverage massive parallelism across sequences and tokens within a prompt. Decode, being sequential, is optimized for efficient KV cache lookups and minimal computation per step. vLLM’s scheduler is designed to interleave these operations. While one request is in its sequential decode phase, vLLM can be performing parallel prefill for several other incoming requests. This interleaving is crucial for achieving high throughput, especially with diverse request lengths and generation demands.
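The interleaving can be sketched as a toy step loop (illustrative only; vLLM’s real scheduler adds token budgets, preemption, and chunked prefill):

```python
from collections import deque

def run_steps(requests, max_batch=4):
    """Toy continuous batching: each engine step mixes decodes for running
    sequences with prefills for newly admitted ones.
    `requests` is a list of (name, tokens_to_generate) pairs."""
    waiting, running, log = deque(requests), [], []
    while waiting or running:
        batch = [f"decode:{name}" for name, _ in running]
        while waiting and len(batch) < max_batch:
            name, todo = waiting.popleft()
            batch.append(f"prefill:{name}")   # prefill also yields token #1
            running.append((name, todo))
        log.append(batch)
        running = [(n, left - 1) for n, left in running if left > 1]
    return log

steps = run_steps([("A", 3), ("B", 2)])
for s in steps:
    print(s)
# Step 1 prefills A and B together; later steps decode whatever is still running.
```

Even in this toy, no step sits idle waiting for a sequential decode to finish: new prefills backfill the batch.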

Consider the memory footprint. During prefill, the KV cache grows for the entire prompt. During decode, it grows one token at a time. vLLM’s PagedAttention is critical here because it allows sharing of KV cache blocks between different sequences that might be in different stages of their lifecycle (some prefilling, some decoding). This dynamic allocation and deallocation of memory blocks, managed by vLLM’s internal memory manager, ensures that memory is used efficiently, preventing the common problem of memory exhaustion that plagues less sophisticated KV cache management strategies.
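Block sharing can be illustrated with a toy refcount (hypothetical Python; vLLM’s actual manager also handles copy-on-write when shared sequences diverge):

```python
class ToyBlockPool:
    """Refcounted physical blocks: sequences sharing a prompt prefix can
    point at the same KV blocks instead of duplicating them."""

    def __init__(self):
        self.refcount = {}

    def acquire(self, block_id):
        self.refcount[block_id] = self.refcount.get(block_id, 0) + 1

    def release(self, block_id):
        self.refcount[block_id] -= 1
        return self.refcount[block_id] == 0  # True once the block is free

pool = ToyBlockPool()
pool.acquire(7)          # sequence A maps the shared prefix block
pool.acquire(7)          # sequence B maps the same physical block
print(pool.release(7))   # False: A finished, but B still needs the block
print(pool.release(7))   # True: B finished, the block returns to the pool
```

A block is reclaimed only when the last sequence referencing it finishes, which is what lets prefilling and decoding sequences safely share memory.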

The actual levers you control are primarily around model loading and sampling parameters. When you load a model with vLLM, you can specify gpu_memory_utilization which influences how much memory is reserved for the KV cache. During inference, parameters like temperature, top_p, and max_tokens dictate the generation process, indirectly affecting how long the decode phase will run for a given request. The prompt length, however, is the primary driver of the prefill phase’s computational cost and KV cache size.
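To see how gpu_memory_utilization translates into cache capacity, here is back-of-the-envelope arithmetic for a Llama-7B-family model like vicuna-7b in fp16 (the 24 GB GPU and the 0.90 setting are illustrative assumptions, and real overheads such as activations are ignored):

```python
layers, hidden, dtype_bytes = 32, 4096, 2          # Llama-7B shape, fp16
kv_per_token = 2 * layers * hidden * dtype_bytes   # keys + values, all layers
print(kv_per_token)                                # 524288 bytes, ~0.5 MB/token

weights = 7e9 * dtype_bytes                        # ~14 GB of fp16 weights
budget = 24e9 * 0.90 - weights                     # what the KV cache may use
print(int(budget // kv_per_token))                 # ~14,500 cacheable tokens
```

Half a megabyte per token is why prompt length dominates memory pressure: a single 4k-token prompt claims about 2 GB of cache before decode even begins.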

What most people miss is how the attention computation is organized within PagedAttention. For a given token, the attention mechanism computes scores against all previous tokens. During prefill, all these "previous tokens" are part of the input prompt, and the computation is optimized for batching. During decode, the "previous tokens" include the entire prompt plus everything generated so far. The new token must still attend to that full history every step, but vLLM never re-computes the keys and values for it: PagedAttention fetches them from the KV cache, mapping block indices directly to cache entries so the lookup is extremely fast.
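The lookup itself is simple index arithmetic, sketched here (illustrative Python; the block table contents are made up):

```python
BLOCK_SIZE = 16                # tokens per KV cache block
block_table = [3, 7, 1]        # one sequence's logical -> physical mapping

def locate(token_pos):
    # O(1): no scan over the history is needed to find a past token's K/V.
    return block_table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

print(locate(0))    # (3, 0): the first prompt token, block 3, offset 0
print(locate(20))   # (7, 4): token 20 falls in the second logical block
```

The attention kernel walks the block table instead of a contiguous buffer, which is the whole trick: the cache can be scattered across the GPU yet still addressed in constant time.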

The next challenge you’ll likely encounter is managing the trade-off between latency and throughput when dealing with a high volume of very short prompts versus a smaller volume of very long prompts.
