Continuous batching allows vLLM to achieve much higher throughput by not waiting for all requests in a batch to finish before admitting new ones.
Here’s how it works. Imagine a few requests arriving at the vLLM server:
{"prompt": "What is the capital of France?", "request_id": 1}
{"prompt": "Tell me a short story about a dragon.", "request_id": 2}
{"prompt": "Explain quantum entanglement in simple terms.", "request_id": 3}
Traditionally, a system would take these three requests, form a batch, send it to the GPU, and wait for all three to complete. If request 2 takes a long time, requests 1 and 3 are stuck waiting, even though the GPU could have started working on them.
vLLM’s continuous batching changes this. When request 1 finishes, vLLM immediately takes the next available request (say, request 4) and adds it to the GPU for the next forward pass, even if requests 2 and 3 are still running.
{"prompt": "What are the main components of a computer?", "request_id": 4}
This dynamic scheduling is the core of continuous batching. Instead of fixed batches, vLLM treats requests as a continuous stream, always trying to keep the GPU as busy as possible by filling in the gaps.
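The throughput difference can be made concrete with a toy simulation. The sketch below is purely illustrative (the function names, batch size, and step counts are invented, not vLLM internals): it counts how many GPU steps fixed batches need versus a scheduler that refills free slots the moment a request finishes.

```python
# Toy simulation contrasting static batching with continuous batching.
# Each request needs a fixed number of decode steps; the "GPU" can run
# up to `batch_size` requests per step. Illustrative only.

from collections import deque

def static_batching_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Refill free slots from the queue as soon as a request finishes."""
    queue = deque(lengths)
    running = []
    steps = 0
    while queue or running:
        # Fill empty slots with waiting requests before the next step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Decrement remaining work; requests that just finished leave.
        running = [r - 1 for r in running if r > 1]
    return steps

# Request 2 is a long story; requests 1, 3, and 4 are short answers.
lengths = [5, 40, 8, 6]
print(static_batching_steps(lengths, batch_size=3))      # 46 = 40 + 6
print(continuous_batching_steps(lengths, batch_size=3))  # 40
```

With static batching the short fourth request pays for a whole extra batch; with continuous batching it slips into a slot freed mid-run, so total time is bounded by the longest request.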
The benefit is a significant increase in GPU utilization and, consequently, throughput. You can serve many more users concurrently because the GPU isn’t left idle waiting on straggler requests.
Here’s a simplified look at how vLLM manages this internally, using its PagedAttention mechanism. Imagine GPU memory as a set of physical pages. PagedAttention allows vLLM to store each request’s KV cache (the per-token key and value tensors that attention reuses at every decoding step) in non-contiguous memory blocks.
When a new request comes in, vLLM allocates just enough memory pages for its current KV cache. As the request generates more tokens, more pages are allocated. Crucially, when a request finishes, its pages are freed up immediately, available for new requests. This granular memory management is what enables the dynamic addition and removal of requests from the processing pipeline.
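A minimal sketch of this allocate-on-demand, free-on-finish behavior follows. The BlockAllocator class, block size, and pool size are invented for illustration; vLLM’s real block manager is considerably more involved.

```python
# Toy block-level KV-cache allocator in the spirit of PagedAttention.
# Numbers and names are illustrative, not vLLM internals.

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # request_id -> [block ids]

    def allocate(self, request_id, num_tokens, block_size=16):
        """Grab just enough blocks for a request's current tokens."""
        needed = -(-num_tokens // block_size)  # ceiling division
        if needed > len(self.free):
            return False                       # no room: request must queue
        self.tables[request_id] = [self.free.pop() for _ in range(needed)]
        return True

    def free_request(self, request_id):
        """A finished request returns its blocks to the pool immediately."""
        self.free.extend(self.tables.pop(request_id))

alloc = BlockAllocator(num_blocks=8)
alloc.allocate(1, num_tokens=40)          # needs 3 blocks of 16 tokens
print(alloc.allocate(2, num_tokens=100))  # needs 7, only 5 free -> False
alloc.free_request(1)                     # blocks return instantly
print(alloc.allocate(2, num_tokens=100))  # now succeeds -> True
```

The point of the sketch: because blocks are freed the instant a request completes, a queued request can be admitted mid-run rather than waiting for a whole batch boundary.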
Consider the vLLM Python API. You can bound how much work the scheduler packs into each forward pass with the max_num_batched_tokens parameter (its companion, max_num_seqs, caps how many sequences run concurrently).
from vllm import LLM, SamplingParams

# Cap the scheduler's per-step token budget at 1024 tokens.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", max_num_batched_tokens=1024)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)

# generate() returns once all prompts have completed; internally the
# engine schedules them with continuous batching.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
In this example, max_num_batched_tokens=1024 caps the number of tokens the scheduler packs into a single forward pass, counting both prompt (prefill) tokens and the one decode token each running request contributes. When a step’s budget is exhausted, remaining incoming requests stay queued until a later step has room for their prefill. (Total KV-cache capacity is governed separately, by how much GPU memory vLLM reserves for the cache.) This parameter is your primary lever for balancing throughput and latency: a larger budget packs more work into each step, raising throughput, but requests that are already decoding share each step with more prefill work, which can raise their per-token latency.
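One way to picture a per-step token budget is the toy scheduler below, which admits waiting prompts in arrival order until the budget for one forward pass is spent. The schedule_step function and its accounting are assumptions made for illustration, not vLLM’s actual scheduler:

```python
# Toy per-step token budget in the spirit of max_num_batched_tokens.
# Illustrative only: real scheduling also handles chunking, preemption, etc.

def schedule_step(waiting_prompts, running, budget=1024):
    """Pick work for one forward pass under a token budget.

    Each running request contributes one decode token; a newly admitted
    prompt contributes all of its prompt tokens (its prefill).
    """
    tokens = len(running)              # one decode token per running request
    admitted = []
    for prompt_len in list(waiting_prompts):
        if tokens + prompt_len > budget:
            break                      # over budget: this and later arrivals wait
        tokens += prompt_len
        admitted.append(prompt_len)
        waiting_prompts.remove(prompt_len)
    return admitted, tokens

waiting = [900, 300, 50]               # prompt lengths, in arrival order
running = ["req-a", "req-b"]           # 2 decode tokens this step
admitted, used = schedule_step(waiting, running, budget=1024)
print(admitted, used)                  # [900] 902
```

Note that the loop stops at the first prompt that doesn’t fit rather than skipping ahead, preserving arrival order.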
The KV cache is managed with an idea borrowed from virtual memory. Each request’s KV cache is divided into fixed-size logical blocks, and a per-request block table maps each logical block to a physical block in GPU memory. This indirection lets a request’s KV cache be scattered across memory while still appearing contiguous to the request, maximizing utilization. During attention, the PagedAttention kernel follows the block table to fetch the physical blocks holding the request’s keys and values, performs the computation, and writes new entries back through the same mapping.
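The address translation itself is simple enough to sketch. Everything below (the block size, the table contents, and the locate_token helper) is invented for illustration:

```python
# Toy logical-to-physical translation for one request's KV cache,
# analogous to a block table mapping logical blocks onto scattered
# physical blocks. Values are invented for the example.

BLOCK_SIZE = 16  # tokens per block

def locate_token(block_table, token_idx):
    """Map a token's logical position to (physical block, slot offset)."""
    logical_block = token_idx // BLOCK_SIZE
    offset = token_idx % BLOCK_SIZE
    return block_table[logical_block], offset

# This request's three logical blocks live in non-contiguous physical blocks.
block_table = [7, 2, 12]
print(locate_token(block_table, 0))    # (7, 0)
print(locate_token(block_table, 20))   # (2, 4)
print(locate_token(block_table, 47))   # (12, 15)
```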
A key detail often overlooked is how vLLM schedules requests within the continuous batch. While the primary goal is to fill the GPU, the scheduler also decides which requests run in the next step. By default the waiting queue is served in "first-come, first-served" (FCFS) order, which by itself prevents queued requests from starving: no later arrival can jump ahead of an earlier one. When KV-cache memory runs short, the scheduler can also preempt running requests, either recomputing their KV cache later or swapping it to CPU memory, and the most recently admitted requests are preempted first, so that older requests keep making progress.
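The interplay of FCFS admission and preempt-newest-first eviction can be sketched as follows. The helper functions and the one-block-per-request cost are simplifications invented for this example:

```python
# Toy FCFS admission with newest-first preemption when blocks run out.
# Illustrative only; real schedulers track much more state.

from collections import deque

def admit_fcfs(waiting, running, free_blocks, cost=1):
    """Admit waiting requests in arrival order while blocks remain."""
    while waiting and free_blocks >= cost:
        running.append(waiting.popleft())
        free_blocks -= cost
    return free_blocks

def preempt_for(blocks_needed, running, waiting, free_blocks):
    """Evict the newest running requests until enough blocks are free;
    evicted requests return to the front of the waiting queue."""
    while free_blocks < blocks_needed and running:
        victim = running.pop()         # last admitted is preempted first
        waiting.appendleft(victim)
        free_blocks += 1
    return free_blocks

waiting = deque(["r1", "r2", "r3"])
running = []
free = admit_fcfs(waiting, running, free_blocks=2)
print(running, free)       # ['r1', 'r2'] 0
free = preempt_for(1, running, waiting, free)
print(running, list(waiting))  # ['r1'] ['r2', 'r3']
```

Evicting the newest request (r2) rather than the oldest (r1) keeps the earliest arrival making progress, matching the fairness intent of FCFS.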
The next step in optimizing vLLM performance after mastering continuous batching is to explore model parallelism and tensor parallelism for distributing very large models across multiple GPUs.