vLLM is doing something pretty wild with your LLM requests, and it’s not just a simple queue. It’s actively kicking lower-priority requests out of the way to make room for higher-priority ones.
Here’s a simplified look at how it works under the hood with a couple of concurrent requests, one high-priority and one low-priority.
from vllm import LLM, SamplingParams

# Initialize vLLM with priority scheduling enabled
# (the default scheduling policy is first-come, first-served)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", scheduling_policy="priority")

# Define sampling parameters (priority is not part of SamplingParams)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Prompts
prompts = [
    "What is the capital of France?",
    "Tell me a very long and elaborate story about a dragon who loves to knit.",
]

# Per-request priorities: in vLLM, a LOWER value means HIGHER priority.
# The interactive question gets priority 0; the long story gets priority 10.
priorities = [0, 10]

# Submit both requests in a single call so the scheduler sees them together.
# With multiple requests pending, the priority policy decides which runs
# first and which running sequence to preempt if memory becomes scarce.
outputs = llm.generate(prompts, sampling_params, priority=priorities)

for output in outputs:
    print(output.outputs[0].text)
In this example, both prompts reach the scheduler together. If vLLM were mid-way through the long dragon story (priority 10) when the capital-of-France request (priority 0) arrived, the scheduler could preempt the story, finish the higher-priority request, and then resume the story. This behavior is controlled by the per-request priority argument to generate(), not by SamplingParams, and it only takes effect when the engine is started with scheduling_policy="priority". A lower numerical value signifies a higher priority; the default is 0.
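The same policy applies on the serving path. A minimal sketch, assuming the OpenAI-compatible vllm serve entry point; whether clients can attach a per-request priority over HTTP depends on your vLLM version, so for offline use the priority argument to generate() is the safer interface:

```shell
# Start the OpenAI-compatible server with priority scheduling
# (the default scheduling policy is first-come, first-served).
vllm serve meta-llama/Llama-2-7b-chat-hf --scheduling-policy priority
```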
The core problem vLLM’s priority scheduling solves is resource contention in a high-throughput LLM serving environment. When you have multiple requests hitting your LLM server simultaneously, especially with models that are computationally intensive, you can’t guarantee that every request will get processed instantly. If you have a mix of interactive, time-sensitive requests (like a chatbot responding to a user) and background, batch-like tasks (like generating a report), you don’t want the long-running batch job to block the interactive one.
Internally, vLLM uses a sophisticated scheduler built around continuous batching. Instead of waiting for a whole batch of requests to finish before starting the next, it continuously adds new requests and removes completed ones, keeping the GPU saturated. The priority system is layered on top of this. When the scheduler decides which sequences to run next, it consults their priorities; if a higher-priority request arrives and there are insufficient resources (such as GPU memory) to start it immediately, the scheduler can preempt (pause) a lower-priority sequence that is currently being processed.
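The ordering rule at the heart of this can be sketched as a toy model, with no vLLM internals involved: keep the waiting queue as a heap keyed on (priority, arrival order), where, as in vLLM's priority policy, a lower value is served sooner. The Request class below is purely illustrative.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

_arrival = count()  # monotonically increasing arrival order

@dataclass(order=True)
class Request:
    # Lower priority value is served first; ties break by arrival order,
    # mirroring a (priority, arrival_time) sort of the waiting queue.
    priority: int
    arrival: int = field(default_factory=lambda: next(_arrival))
    prompt: str = field(default="", compare=False)

waiting: list[Request] = []
heapq.heappush(waiting, Request(priority=10, prompt="Tell me a long dragon story."))
heapq.heappush(waiting, Request(priority=0, prompt="What is the capital of France?"))

# The interactive question jumps ahead despite arriving later.
first = heapq.heappop(waiting)
print(first.prompt)  # What is the capital of France?
```

Ties on priority fall back to arrival order, so equal-priority traffic still behaves first-come, first-served.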
The priority argument to generate() (or to add_request() on the engine API) is your primary lever. It's an integer you can set per request, and lower values mean higher priority; it only has an effect when the engine runs with scheduling_policy="priority", since the default policy is first-come, first-served. On each scheduling step, the waiting queue is ordered by (priority, arrival time). When a higher-priority request arrives and the system is at capacity, vLLM can preempt the lowest-priority running sequences and reclaim their KV-cache blocks, freeing space for the new request. If every running sequence has higher or equal priority, the new request simply waits in the queue.
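That admission-and-eviction decision can be illustrated with a small sketch; admit() and the request dicts are hypothetical, not vLLM's data structures. The policy shown is: a new request may preempt strictly less important running sequences (those with a higher priority value) until enough KV blocks are free.

```python
# Toy model of priority-based preemption under KV-cache memory pressure.
# None of these names are vLLM internals; this only illustrates the policy.

def admit(new_req, running, free_blocks):
    """Try to admit new_req, preempting lowest-priority running requests.

    Each request is a dict: {"id", "priority", "blocks"}, where a lower
    priority value means more important (as in vLLM's priority policy).
    Returns (admitted, preempted_ids, free_blocks).
    """
    preempted = []
    # Consider victims from least to most important (highest value first).
    victims = sorted(running, key=lambda r: r["priority"], reverse=True)
    while free_blocks < new_req["blocks"] and victims:
        victim = victims[0]
        if victim["priority"] <= new_req["priority"]:
            break  # only preempt strictly less important requests
        victims.pop(0)
        running.remove(victim)
        free_blocks += victim["blocks"]  # KV blocks are reclaimed
        preempted.append(victim["id"])
    if free_blocks >= new_req["blocks"]:
        running.append(new_req)
        free_blocks -= new_req["blocks"]
        return True, preempted, free_blocks
    return False, preempted, free_blocks

running = [{"id": "story", "priority": 10, "blocks": 6}]
ok, evicted, free = admit({"id": "capital", "priority": 0, "blocks": 4},
                          running, free_blocks=2)
print(ok, evicted)  # True ['story']
```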
The most surprising thing most people don't realize is that preemption isn't just about stopping a job; vLLM also has to decide what to do with the interrupted sequence's state, and it offers two preemption modes. In the default recomputation mode, the preempted sequence's KV cache is discarded, and the sequence's tokens are recomputed in a single batch when it is rescheduled; this sounds wasteful, but recomputation often has lower overhead than moving data around. In swap mode, the KV blocks are instead copied out of GPU memory into CPU RAM and copied back when the sequence resumes, letting it continue from where it left off. Either way, this state management is what makes preemption efficient rather than merely disruptive.
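The difference between the two modes can be caricatured in a few lines; gpu_kv, cpu_kv, and the helpers are illustrative stand-ins, not vLLM internals.

```python
# Toy illustration of the two preemption modes: RECOMPUTE drops the KV
# cache and replays the tokens later; SWAP parks the blocks in CPU memory.

gpu_kv = {"story": [b"kv0", b"kv1"]}   # stand-in for GPU KV blocks
cpu_kv = {}                             # swap space in host RAM

def preempt(seq_id, mode="recompute"):
    if mode == "swap":
        cpu_kv[seq_id] = gpu_kv.pop(seq_id)   # state preserved off-GPU
    else:
        gpu_kv.pop(seq_id)                    # state discarded; must recompute

def resume(seq_id):
    if seq_id in cpu_kv:
        gpu_kv[seq_id] = cpu_kv.pop(seq_id)   # cheap restore from CPU RAM
        return "swapped back in"
    return "recomputing KV cache from the prompt"

preempt("story", mode="swap")
print(resume("story"))  # swapped back in
```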
As you delve deeper into optimizing vLLM performance, understanding how to tune these priorities and monitor their impact on latency and throughput for different request types will be your next key challenge.