vLLM’s P99 latency dashboards are often misunderstood because they don’t just measure request processing time; they also include the time spent waiting for a KV cache slot.
Let’s see vLLM in action. Imagine we have a simple FastAPI app serving a vLLM model:
from fastapi import FastAPI
from vllm import LLM, SamplingParams
import time

app = FastAPI()

# Load the model. This can be slow!
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

@app.post("/generate")
def generate(prompt: str):
    # Note: a sync handler, because llm.generate() blocks.
    # FastAPI will run it in a thread pool instead of stalling the event loop.
    start_time = time.time()
    # The actual inference call
    outputs = llm.generate(prompt, sampling_params)
    end_time = time.time()
    return {"text": outputs[0].outputs[0].text, "latency_ms": (end_time - start_time) * 1000}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
When you send requests to this endpoint, vLLM’s internal metrics capture a lot more than just the GPU computation. The "P99 Latency" you see in your monitoring dashboard is the 99th percentile of the total time from when vLLM receives a request to when it returns a response. This includes several stages:
- Request Queuing: The request arrives at vLLM and is placed in an internal queue.
- KV Cache Allocation: vLLM needs to allocate space in its Key-Value (KV) cache for the prompt and generated tokens. This is a critical bottleneck, especially with long contexts or many concurrent requests. If no slots are available, the request waits here.
- Prompt Processing: The initial prompt is processed by the model to generate the first set of output tokens. This involves significant GPU computation.
- Token Generation (Sampling): Subsequent tokens are generated one by one, with each generation step being relatively fast but dependent on the previous one.
- Response Formatting: The generated text is formatted and sent back to the client.
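To build intuition for how these stages add up at the tail, here is a stdlib-only sketch. The stage names and all timings are invented for illustration, not vLLM internals; the point is that a rare, large cache wait dominates the P99 even when the median is healthy:

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

random.seed(0)
latencies_ms = []
for _ in range(1000):
    queue_wait = random.expovariate(1 / 5)   # usually a few ms...
    cache_wait = 0.0
    if random.random() < 0.02:               # ...but 2% of requests block on the KV cache
        cache_wait = random.uniform(500, 2000)
    prefill = random.uniform(40, 80)         # prompt processing on the GPU
    decode = 100 * random.uniform(2, 4)      # 100 tokens * per-token decode time
    latencies_ms.append(queue_wait + cache_wait + prefill + decode)

print(f"P50: {percentile(latencies_ms, 50):.0f} ms")
print(f"P99: {percentile(latencies_ms, 99):.0f} ms")  # dominated by the rare cache waits
```

Only 2% of simulated requests ever touch the cache-wait path, yet they are exactly the requests the 99th percentile picks out.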
The KV cache is a shared resource that stores intermediate attention states for each active sequence. vLLM's PagedAttention manages this memory in fixed-size blocks that do not need to be contiguous, which largely eliminates the fragmentation that plagued earlier serving engines; the remaining constraint is sheer capacity. When the cache is full, new requests wait (and running requests can even be preempted) until blocks are freed by sequences that finish. This waiting time, often referred to as "cache wait time," is a major contributor to high P99 latencies, even when the GPU itself is not saturated.
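A toy model makes the mechanism concrete. This sketch assumes nothing about vLLM's actual implementation; it just shows block-granular accounting, where a request that cannot get enough blocks is queued until another request releases memory:

```python
from collections import deque

BLOCK_TOKENS = 16   # tokens per KV cache block (vLLM's default block size)

class ToyBlockManager:
    """Minimal sketch of PagedAttention-style block accounting (not vLLM's code)."""

    def __init__(self, total_blocks):
        self.free_blocks = total_blocks
        self.waiting = deque()   # requests blocked on allocation: (request_id, blocks_needed)

    def blocks_needed(self, num_tokens):
        return -(-num_tokens // BLOCK_TOKENS)   # ceiling division

    def allocate(self, request_id, num_tokens):
        need = self.blocks_needed(num_tokens)
        if need > self.free_blocks:
            self.waiting.append((request_id, need))   # this queue IS "cache wait time"
            return False
        self.free_blocks -= need
        return True

    def release(self, num_tokens):
        self.free_blocks += self.blocks_needed(num_tokens)
        # Admit waiters now that memory is back.
        while self.waiting and self.waiting[0][1] <= self.free_blocks:
            _, need = self.waiting.popleft()
            self.free_blocks -= need

mgr = ToyBlockManager(total_blocks=8)   # tiny cache: room for 128 tokens
print(mgr.allocate("a", 100))   # True: takes 7 of 8 blocks
print(mgr.allocate("b", 40))    # False: needs 3 blocks, only 1 free -> queued
mgr.release(100)                # "a" finishes; "b" is admitted
print(mgr.free_blocks)          # 5
```

Note that request "b" is blocked even though the GPU did no work on its behalf: its wall-clock latency grows while it sits in the waiting queue.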
To effectively monitor and optimize vLLM performance, you need to look beyond just GPU utilization. Prometheus is the standard tool here: recent vLLM versions expose metrics such as vllm:e2e_request_latency_seconds, vllm:num_requests_waiting, and vllm:gpu_cache_usage_perc on a /metrics endpoint (exact names vary by version, so check your deployment's scrape output). You should also instrument your application to track request ingress and egress times separately to isolate vLLM's internal processing.
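As a sketch of what acting on those metrics looks like: the parser below handles generic Prometheus text format, and the sample scrape uses metric names that recent vLLM versions expose (the values are invented; verify the names against your version's /metrics output):

```python
def parse_prometheus(text):
    """Parse Prometheus text exposition into {metric_name: value}, ignoring labels."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]   # strip the {label="..."} block
        metrics[name] = float(value)
    return metrics

# Sample of what a vLLM /metrics scrape can look like (values invented):
sample = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
vllm:num_requests_waiting{model_name="llama"} 12.0
vllm:num_requests_running{model_name="llama"} 4.0
vllm:gpu_cache_usage_perc{model_name="llama"} 0.97
"""

m = parse_prometheus(sample)
if m.get("vllm:gpu_cache_usage_perc", 0) > 0.9 and m.get("vllm:num_requests_waiting", 0) > 0:
    print("KV cache is nearly full and requests are queueing -- expect cache wait time")
```

The combination to alert on is exactly this one: a nearly full cache together with a nonzero waiting queue means requests are accruing latency before the GPU ever sees them.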
The most surprising thing about vLLM's performance is how much latency can be attributed to KV cache contention rather than raw GPU compute power. A common mistake is to assume that if GPU utilization is low, latency must be low. However, if your KV cache is at capacity and requests are waiting for blocks to be freed, the GPU can sit underutilized for significant periods while the scheduler holds work back. This is particularly true for models with large KV caches (e.g., Llama 2 70B) or when serving many concurrent requests with varying sequence lengths. You might have a powerful GPU, but if cache capacity leads to frequent blocking, your P99 latency will suffer.
To get a more granular view, you can enable detailed logging in vLLM or use its internal profiling tools. Look for metrics related to "queueing time" or "waiting for KV cache."
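If you are scraping logs rather than Prometheus, the same signal can be pulled from vLLM's periodic engine-stats log line. The line below mimics the shape of that log output, but the exact wording varies across versions, so treat the regex as a template to adapt rather than a guaranteed match:

```python
import re

# Shaped like the periodic engine-stats line vLLM logs by default
# (exact format varies across versions -- adapt the regex to yours):
line = ("Avg prompt throughput: 10.5 tokens/s, Avg generation throughput: 120.3 tokens/s, "
        "Running: 4 reqs, Swapped: 0 reqs, Pending: 12 reqs, "
        "GPU KV cache usage: 97.2%, CPU KV cache usage: 0.0%")

match = re.search(r"Pending: (\d+) reqs.*GPU KV cache usage: ([\d.]+)%", line)
if match:
    pending, cache_pct = int(match.group(1)), float(match.group(2))
    if cache_pct > 90 and pending > 0:
        print(f"{pending} requests queued with cache at {cache_pct}% -- expect cache wait time")
```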
Understanding the interplay between KV cache management and GPU execution is key to diagnosing and mitigating high P99 latencies in vLLM deployments.
The next challenge you’ll encounter is optimizing the KV cache itself, potentially through techniques like quantization or dynamic batching strategies.