The size and management of the KV cache, not raw GPU memory capacity, is the primary bottleneck for vLLM's throughput.
Let’s see vLLM in action. Imagine we’re serving a small language model, gpt2, and we want to handle a good number of concurrent requests.
from vllm import LLM, SamplingParams
# Sample prompts on which to generate.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Set sampling parameters for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)
# Create an LLM instance.
# By default, vLLM will try to use as much GPU memory as possible for the KV cache.
llm = LLM(model="gpt2")
# Generate text.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
When you run this, vLLM automatically allocates a significant chunk of your GPU memory to store the Key-Value (KV) cache. This cache is crucial because it holds the attention keys and values already computed for each token in a sequence. Instead of recomputing these values for every new token, vLLM can simply look them up, dramatically speeding up inference.
The problem is that this KV cache grows linearly with both the sequence length and the number of concurrent requests. For a model with n_layers layers and n_heads attention heads, each with head dimension d_k, the KV cache size per token is roughly 2 * n_layers * n_heads * d_k * bytes_per_element, where the factor of 2 accounts for storing both keys and values. If your sequence length is 2048 and you have 32 concurrent requests, even with a small model, this can quickly consume gigabytes of VRAM.
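Plugging in GPT-2's actual dimensions (12 layers, 12 heads, d_k = 64) makes this growth concrete. The fp16 element size here is an assumption for illustration:

```python
# Back-of-the-envelope KV cache sizing for GPT-2 (12 layers, 12 heads,
# d_k = 64), assuming fp16 storage (2 bytes per element). The factor of
# 2 accounts for storing both keys and values.
n_layers, n_heads, d_k = 12, 12, 64
bytes_per_elem = 2  # fp16

kv_bytes_per_token = 2 * n_layers * n_heads * d_k * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 36 KiB

seq_len, n_requests = 2048, 32
total = kv_bytes_per_token * seq_len * n_requests
print(f"Total for 32 x 2048-token sequences: {total / 1024**3:.2f} GiB")  # 2.25 GiB
```

Even for GPT-2, one of the smallest models you might serve, this modest workload already claims over 2 GiB of VRAM for the cache alone; a 70B-parameter model multiplies every term in that product.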
vLLM’s PagedAttention algorithm is designed to manage this KV cache efficiently, much like how an operating system manages virtual memory. It divides the KV cache into fixed-size blocks, allowing for flexible allocation and deallocation. However, the total amount of KV cache that can be allocated is still bounded by the available GPU memory. When you hit this limit, you can’t serve more requests, or worse, your entire process might crash with an Out-of-Memory (OOM) error.
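The OS analogy can be made concrete with a toy sketch: fixed-size blocks live in a shared pool, and each sequence holds a block table mapping its logical token positions to physical blocks. This is an illustrative simplification, not vLLM's actual implementation:

```python
# Toy sketch of PagedAttention-style block management (illustrative,
# not vLLM's real code): the KV cache is carved into fixed-size blocks,
# and each sequence's block table maps logical positions to physical
# blocks, much like an OS page table.
BLOCK_SIZE = 16  # tokens per block (vLLM's default block size is also 16)

class BlockManager:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: int, position: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # crossed a block boundary: need a fresh block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: no free blocks")
            table.append(self.free_blocks.pop())

    def free_sequence(self, seq_id: int) -> None:
        # A finished sequence returns all of its blocks to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = BlockManager(num_blocks=4)
for pos in range(40):  # 40 tokens -> ceil(40 / 16) = 3 blocks
    mgr.append_token(seq_id=0, position=pos)
print(len(mgr.block_tables[0]), "blocks used,", len(mgr.free_blocks), "free")
```

The payoff of this design is that memory is only committed as sequences actually grow, and a finished request's blocks are immediately reusable by waiting requests, instead of each sequence pre-reserving a worst-case contiguous region.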
The key to optimizing throughput isn’t just about having more GPU memory, but about configuring how much of that memory is dedicated to the KV cache. vLLM’s gpu_memory_utilization parameter controls the fraction of GPU memory that vLLM is allowed to use. By default, it’s set to 0.90 (90%), meaning vLLM will attempt to use up to 90% of your GPU’s VRAM for its operations, primarily the KV cache.
You can explicitly control this. If you find your system is OOMing, or if you want to reserve some VRAM for other processes (like data preprocessing or model saving), you can lower this value. For instance, to ensure vLLM only uses 70% of your GPU memory, you would initialize it like this:
llm = LLM(model="gpt2", gpu_memory_utilization=0.70)
This tells vLLM to be more conservative. It will allocate less memory for the KV cache, which means you'll be able to serve fewer concurrent requests, or shorter sequences, before exhausting the KV cache capacity for that configuration. However, it increases the likelihood that your inference server remains stable and responsive, especially under heavy load. Conversely, if you have ample VRAM and want to maximize the number of concurrent requests or support longer sequences, you might increase this value slightly, but be cautious not to starve other processes sharing the GPU.
The trickiest part is that the optimal gpu_memory_utilization isn’t a static number; it depends heavily on the model size, the typical sequence lengths of your prompts and generated outputs, and the number of concurrent requests you aim to serve. For smaller models like GPT-2, you might be able to get away with higher utilization, while larger models like Llama 2 70B will demand a much larger KV cache, forcing you to reduce gpu_memory_utilization or increase the physical GPU memory.
There’s a delicate balance: increasing gpu_memory_utilization allows for a larger KV cache, which in turn enables serving more concurrent requests or longer sequences up to the KV cache’s capacity. However, if this capacity is exceeded, you’ll get an OOM. Reducing gpu_memory_utilization reserves memory, making OOMs less likely but potentially capping the number of concurrent requests or sequence lengths you can handle.
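This trade-off can be sketched as rough capacity planning: the KV cache budget is approximately total VRAM times gpu_memory_utilization, minus the model weights, and dividing by the per-sequence KV size bounds how many requests can run concurrently. The numbers below (24 GiB GPU, fp16 GPT-2) are illustrative assumptions, and real deployments also pay for activations and fragmentation:

```python
# Rough capacity estimate (a sketch, not vLLM's internal accounting):
# VRAM * utilization minus the model weights approximates the KV cache
# budget; dividing by per-sequence KV size bounds concurrent sequences.
def max_concurrent_seqs(vram_gib: float, utilization: float,
                        weights_gib: float, kv_bytes_per_token: int,
                        seq_len: int) -> int:
    budget_gib = vram_gib * utilization - weights_gib  # left for the KV cache
    per_seq_gib = kv_bytes_per_token * seq_len / 1024**3
    return max(0, int(budget_gib / per_seq_gib))

# GPT-2 example: ~36 KiB of fp16 KV cache per token, ~0.25 GiB of weights,
# on an assumed 24 GiB GPU at the default 0.90 utilization.
print(max_concurrent_seqs(vram_gib=24, utilization=0.90,
                          weights_gib=0.25, kv_bytes_per_token=36864,
                          seq_len=2048))
```

Re-running the estimate with a 70B model's weights and per-token KV size shows why large models force either a lower gpu_memory_utilization, shorter sequence limits, or more physical VRAM.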
The next hurdle you’ll likely encounter is managing prompt processing throughput, especially with very long prompts.