Chunked Prefill is a technique in vLLM that breaks the initial prompt processing (prefill) into smaller chunks. This lets vLLM interleave prompt processing with new-token generation, boosting throughput and improving Time To First Token (TTFT) and inter-token latency under load, since long prompts no longer monopolize the GPU.
Let’s see this in action. Imagine we have a long prompt, say 10,000 tokens, and we’re using a model like Llama 3 70B. Without Chunked Prefill, vLLM would first process all 10,000 tokens before generating a single output token. This can take a considerable amount of time and memory, especially if you have multiple such requests.
Here’s how it works: vLLM divides the prompt into smaller, manageable chunks. For instance, if the prompt is 10,000 tokens and we set the chunk size to 1024, the prompt is processed in 10 chunks. Between chunks, vLLM can schedule decode steps for other sequences, so ongoing generation never stalls behind a long prefill; the chunked request itself starts generating once its final chunk completes. This interleaving is key. Instead of a long, sequential "wait for the prompt to finish, then generate," it becomes "process a bit of prompt, generate a bit of output, process more prompt, generate more output."
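The chunk arithmetic above is easy to sketch. This is a toy illustration of splitting a prompt into prefill chunks, not vLLM's scheduler; `split_into_chunks` is our own helper name:

```python
import math

def split_into_chunks(prompt_len: int, chunk_size: int) -> list[int]:
    """Return the token count of each prefill chunk, in order."""
    n_chunks = math.ceil(prompt_len / chunk_size)
    chunks = [chunk_size] * (n_chunks - 1)
    # The last chunk holds whatever remains, so it may be partial.
    chunks.append(prompt_len - chunk_size * (n_chunks - 1))
    return chunks

chunks = split_into_chunks(10_000, 1024)
print(len(chunks))   # 10 chunks
print(chunks[-1])    # 784 tokens in the final, partial chunk
```

Note that 10,000 tokens at a chunk size of 1024 gives nine full chunks plus a 784-token remainder, so "10 chunks" is exact, not approximate.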
The primary levers you have to tune Chunked Prefill are the chunk size — set via max_num_batched_tokens in vLLM — and max_num_seqs.
max_num_batched_tokens: This parameter caps the number of tokens processed in a single scheduler step, and therefore the size of each prefill chunk. A smaller value means more frequent interleaving of prompt processing and generation, which is generally good for latency, as decode steps run more often. However, very small values introduce overhead from more frequent kernel launches and lower GPU utilization, potentially hurting overall throughput. A power of 2 such as 1024 or 2048 is a common starting point.
max_num_seqs: This parameter limits the maximum number of sequences (requests) that can be processed concurrently. When Chunked Prefill is enabled, vLLM can handle more sequences because the memory pressure from processing a full long prompt is distributed over time. Increasing max_num_seqs allows you to serve more users or requests simultaneously, directly impacting throughput. You’ll need to balance this with your available GPU memory.
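The interplay between these two levers can be modeled with a simplified sketch of how a per-step token budget might be split. vLLM's chunked-prefill scheduler prioritizes decode tokens and spends the leftover budget on a prefill chunk; the function below is our own simplification of that idea, not vLLM's actual scheduling code:

```python
def plan_step(num_decoding_seqs: int, pending_prefill_tokens: int,
              max_num_batched_tokens: int) -> tuple[int, int]:
    """Split one scheduler step's token budget between decode and prefill.

    Decode sequences are served first (one token each); whatever budget
    remains becomes this step's prefill chunk.
    """
    decode_tokens = min(num_decoding_seqs, max_num_batched_tokens)
    prefill_tokens = min(pending_prefill_tokens,
                         max_num_batched_tokens - decode_tokens)
    return decode_tokens, prefill_tokens

# 100 sequences decoding, a 10,000-token prompt waiting, budget of 1024:
print(plan_step(100, 10_000, 1024))  # (100, 924)
```

With this policy, a long prompt trickles in at up to 924 tokens per step while all 100 in-flight sequences keep producing output every step, which is exactly the interleaving described above.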
Consider a scenario where you have many concurrent requests, each with a moderately long prompt (e.g., 2,000 tokens). Without Chunked Prefill, your GPU might be saturated by the prompt-processing phase of just a few requests. With Chunked Prefill, an appropriately tuned chunk size (say, 512), and a higher max_num_seqs (e.g., 1024), vLLM can start streaming output for these requests much earlier. The GPU is not idle waiting for prompts; it alternates between processing prompt chunks and generating output tokens for multiple requests. Peak activation memory per scheduler step is also bounded by the chunk size rather than the full prompt length, allowing more requests to coexist.
What makes Chunked Prefill work is the asymmetry in attention cost: prefill attention is quadratic in prompt length (every prompt token attends to every earlier token in a single pass), while each decode step attends one new token against the cached keys and values, a linear per-step cost. By breaking the prompt into chunks, vLLM performs attention for only one chunk of query tokens at a time, against the KV cache accumulated so far, which bounds the peak compute and activation memory of any single step. Because the KV cache carries state between chunks, the final result is identical to processing the entire prompt at once, but with the benefit of early output for other requests. This is akin to a producer-consumer model where prompt processing is the producer and token generation is the consumer, with vLLM's scheduler orchestrating the flow efficiently.
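The reduction in peak per-step attention work can be made concrete. The sketch below (our own illustration; real kernels are more involved) compares the largest attention score matrix any single step must materialize, with and without chunking. Each chunk's queries attend to all tokens processed so far, so the per-step matrix is at most chunk_size x prompt_len instead of prompt_len x prompt_len:

```python
def peak_attention_scores(prompt_len, chunk_size=None):
    """Largest attention score matrix (queries x keys) in any single step."""
    if chunk_size is None:            # monolithic prefill: L x L scores at once
        return prompt_len * prompt_len
    peak = 0
    done = 0
    while done < prompt_len:          # each chunk attends to itself + all earlier tokens
        step = min(chunk_size, prompt_len - done)
        done += step
        peak = max(peak, step * done)
    return peak

print(peak_attention_scores(10_000))        # 100,000,000
print(peak_attention_scores(10_000, 1024))  # 9,437,184 (~10x smaller peak)
```

The total attention work across all chunks is unchanged; only the peak per step shrinks, which is why throughput is preserved while memory pressure drops.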
When you pass --enable-chunked-prefill in your vLLM engine configuration, vLLM switches to this mode, with --max-num-batched-tokens acting as the chunk-size budget. For example, when launching the OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--enable-chunked-prefill \
--max-num-batched-tokens 1024 \
--max-num-seqs 2048 \
--port 8000
Here, we’ve set max-num-batched-tokens to 1024 and max-num-seqs to 2048. This configuration tells vLLM to break prompts into chunks of at most 1024 tokens and to allow up to 2048 sequences in flight at once. The effect is that even with very long prompts, output tokens keep flowing for all in-flight requests, and you’ll be able to handle a significantly higher number of simultaneous requests before hitting GPU memory limits or throughput bottlenecks.
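The same setup can be expressed through vLLM's offline Python API. This is a configuration sketch, not a tested recipe: it assumes you have the vllm package installed, a GPU (or GPUs) large enough for the model, and access to the gated Llama 3 weights on Hugging Face.

```python
from vllm import LLM, SamplingParams

# Mirrors the server flags above; argument names follow the CLI flags.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=1024,  # per-step token budget, i.e. the prefill chunk size
    max_num_seqs=2048,
)

outputs = llm.generate(
    ["A very long prompt..."],
    SamplingParams(max_tokens=64),
)
```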
A subtle but critical aspect often overlooked is how the chunk size interacts with the model’s KV cache management. While a smaller chunk size reduces the peak memory of prompt processing, the total KV cache occupied by a sequence still grows with its total length (prompt + generated tokens). By interleaving prompt processing and generation phases, however, vLLM can pack sequences into the KV cache more effectively: a request that would otherwise have sat queued or been preempted while a monolithic prefill held the GPU can now make progress, effectively increasing the number of sequences that can be served.
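To see why total sequence length, not chunk size, determines KV cache occupancy, it helps to count blocks. vLLM allocates the KV cache in fixed-size blocks (16 tokens by default); the back-of-the-envelope helpers below are our own, and they ignore per-layer and per-head dimensions, which scale every sequence equally:

```python
import math

BLOCK_SIZE = 16  # vLLM's default KV-cache block size, in tokens

def kv_blocks_needed(prompt_len: int, generated_len: int) -> int:
    """Blocks one sequence occupies once prompt + output are cached."""
    return math.ceil((prompt_len + generated_len) / BLOCK_SIZE)

def max_concurrent_seqs(total_blocks: int, prompt_len: int, generated_len: int) -> int:
    """How many such sequences fit in a fixed block pool."""
    return total_blocks // kv_blocks_needed(prompt_len, generated_len)

print(kv_blocks_needed(2000, 256))             # 141 blocks per sequence
print(max_concurrent_seqs(50_000, 2000, 256))  # 354 sequences in a 50k-block pool
```

Chunked prefill does not shrink these numbers; it changes *when* the blocks fill up, letting the scheduler admit new sequences between chunks instead of blocking on one long prefill.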
The next frontier is understanding how dynamic batching interacts with Chunked Prefill for even finer-grained optimization.