vLLM is a surprisingly efficient inference engine, but its internal workings can feel like a black box, especially when it comes to monitoring.
Let’s say you’ve got vLLM running and you want to see what’s going on under the hood. You’ve probably seen mentions of "health checks" and "Prometheus metrics," but what do they actually do and how do you use them?
First off, vLLM exposes a /health endpoint. It’s a simple liveness probe: it confirms the server is up and the engine’s background loop is still responsive.
curl http://localhost:8000/health
If it returns HTTP 200 (with an empty body), the engine is alive, but it doesn’t tell you how well it’s running. The real insight comes from Prometheus. vLLM exposes a /metrics endpoint that Prometheus can scrape, giving you granular visibility into its performance.
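Before wiring up Prometheus, it’s handy to have deployment scripts block until the server is actually answering. Here’s a minimal readiness poller using only the standard library; the URL, timeout, and polling interval are illustrative choices, not vLLM defaults:

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url="http://localhost:8000/health",
                    timeout_s=60.0, interval_s=2.0):
    """Poll the /health endpoint until it answers HTTP 200 or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet; keep polling
        time.sleep(interval_s)
    return False
```

A Kubernetes livenessProbe pointed at /health accomplishes the same thing declaratively.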
Here’s a common launch command for vLLM’s OpenAI-compatible server. Note that the /metrics endpoint is served on the same port as the API itself, so no extra flag is needed:
python -m vllm.entrypoints.openai.api_server \
    --model facebook/opt-125m \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name opt-125m \
    --disable-log-requests
Now, if you curl http://localhost:8000/metrics, you’ll see a stream of data. It’s a lot, but let’s focus on the crucial bits. (Exact metric names vary slightly across vLLM versions; the ones below are from recent releases.)
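You can pull an individual series out of that stream without any dependencies. The helper below is a rough sketch of parsing the Prometheus text format by hand (for anything serious, use the prometheus_client library’s parser instead); the sample scrape is illustrative:

```python
def parse_metric(text, name):
    """Map each label-set string to its float value for one metric name."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blanks
        if line.startswith(name + "{") or line.startswith(name + " "):
            series, _, value = line.rpartition(" ")
            out[series[len(name):]] = float(value)
    return out


# Illustrative scrape output, not a verbatim capture from vLLM.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="opt-125m"} 3.0
"""
```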
The core of vLLM’s efficiency is its PagedAttention mechanism. This is what allows it to manage KV cache memory very effectively, avoiding fragmentation and enabling higher throughput. Prometheus metrics give you visibility into this.
The vllm:gpu_cache_usage_perc gauge is your window into PagedAttention. It reports the fraction of GPU KV-cache blocks currently in use, as a value between 0.0 and 1.0.
vllm:gpu_cache_usage_perc{model_name="opt-125m"} 0.42
This directly reflects how full the block pool holding key and value states for all active requests is. A consistently high value isn’t necessarily bad; it means the cache is being put to work. But as it approaches 1.0, vLLM runs out of free blocks and starts preempting or queueing requests, and you’ll see performance degrade.
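A simple way to act on that gauge is a threshold check in your monitoring glue, assuming you scrape the KV-cache utilization metric described above. This is a sketch; the 0.80 and 0.95 thresholds are illustrative, not vLLM defaults:

```python
def kv_cache_status(usage, warn=0.80, crit=0.95):
    """Classify KV-cache pressure from the 0.0-1.0 gauge value."""
    if usage >= crit:
        return "critical"  # expect preemption and request queueing soon
    if usage >= warn:
        return "warning"
    return "ok"
```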
Then there’s vllm:request_success_total. This is a counter that increments for every request vLLM finishes, labelled by how it finished.
vllm:request_success_total{finished_reason="stop",model_name="opt-125m"} 1500
vllm:request_success_total{finished_reason="abort",model_name="opt-125m"} 5
The finished_reason label tells you why each request ended: stop means the model finished naturally, length means it hit its token limit, and abort means the request was cancelled or failed. Monitoring the abort count is critical for identifying issues before they become widespread.
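From those counters, an abort (or error) rate is a single division. Keep in mind that Prometheus counters are cumulative, so in production you’d compute this over a rate() window in PromQL; the helper below just shows the arithmetic with the sample counts above:

```python
def error_rate(finished_ok, finished_err):
    """Fraction of finished requests that ended in error or abort."""
    total = finished_ok + finished_err
    return finished_err / total if total else 0.0


rate = error_rate(1500, 5)  # roughly 0.0033, i.e. about 0.33%
```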
The vllm:e2e_request_latency_seconds metric (a histogram) is your best friend for understanding user experience.
vllm:e2e_request_latency_seconds_bucket{le="0.1",model_name="opt-125m"} 1400
vllm:e2e_request_latency_seconds_bucket{le="0.5",model_name="opt-125m"} 1480
vllm:e2e_request_latency_seconds_bucket{le="+Inf",model_name="opt-125m"} 1500
vllm:e2e_request_latency_seconds_count{model_name="opt-125m"} 1500
vllm:e2e_request_latency_seconds_sum{model_name="opt-125m"} 750.0
This tells you how long requests are taking. You can calculate percentiles (e.g., P95, P99) from the buckets to understand your worst-case performance. If your P95 latency starts creeping up, it’s a strong signal that your inference server is struggling to keep up with the load.
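If you want the same percentile math outside of PromQL, you can reproduce histogram_quantile’s linear interpolation over the cumulative buckets. Using the sample buckets above (1400 requests under 0.1s, 1480 under 0.5s, 1500 total), this sketch estimates P95:

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative Prometheus histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with (float("inf"), total_count). Mirrors PromQL's linear
    interpolation within the bucket containing the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # rank falls in the open-ended bucket
            if count == prev_count:
                return le
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le


buckets = [(0.1, 1400), (0.5, 1480), (float("inf"), 1500)]
p95 = histogram_quantile(0.95, buckets)  # 0.225s for this data
```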
Now, what about the actual health check? The /health endpoint is deliberately lightweight: it verifies that the server is up and the engine loop is responsive, and returns HTTP 200 when it is. It does not inspect queue depth or latency, and there is no internal threshold that flips it to a degraded status.
For queue pressure, watch the vllm:num_requests_waiting gauge, which counts requests sitting in the scheduler queue. If it keeps climbing while vllm:num_requests_running stays pinned at its ceiling, the server is saturated even though /health still reports healthy.
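If you want a readiness signal that does account for backlog, you can combine the liveness check with a queue-depth ceiling yourself. This is user-side glue, not anything vLLM does internally, and the threshold of 50 waiting requests is an arbitrary example:

```python
def is_ready(health_ok, waiting, max_waiting=50):
    """Ready = alive AND not badly backlogged.

    health_ok: result of the /health check.
    waiting:   scraped value of vllm:num_requests_waiting.
    """
    return bool(health_ok) and waiting <= max_waiting
```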
The most counterintuitive aspect of vLLM’s performance metrics is how they relate to batching. You might expect that larger batches always mean higher throughput, but vLLM’s PagedAttention and continuous batching strategy let it sustain high throughput even with small or varying batch sizes. Watch the vllm:num_requests_running gauge over time: it fluctuates from one scheduler step to the next because vLLM admits new requests and retires finished ones at every iteration, forming batches on the fly rather than waiting for a fixed-size batch to fill. That dynamic admission is a key differentiator from older, static-batching inference engines.
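You can see that behavior by sampling the running-request gauge over time and summarizing the spread. The sample values here are made up for illustration:

```python
def batch_stats(samples):
    """Min/mean/max of sampled vllm:num_requests_running readings."""
    return {"min": min(samples),
            "max": max(samples),
            "mean": sum(samples) / len(samples)}


# One illustrative gauge reading per Prometheus scrape interval.
readings = [1, 7, 12, 4, 9]
stats = batch_stats(readings)  # wide spread = dynamic batch formation
```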
If the /health endpoint starts returning errors (or stops responding entirely), it’s almost always due to one of two things: either the model failed to load completely (check your logs for CUDA errors or OOMs during initialization), or the engine loop has stalled under load to the point that the server can’t even answer health checks promptly. In the latter case, the vllm:num_requests_waiting gauge would be extremely high.
The next thing you’ll likely want to dig into is how to bound context length and shape batching behavior more explicitly using the --max-model-len, --max-num-seqs, and --max-num-batched-tokens parameters.