The most surprising thing about vLLM load balancing is that it’s not just about spreading requests evenly; it’s about predicting which replica can finish a request the fastest, given its current state.

Let’s see this in action. Imagine we have two identical replicas of a large language model, say, llama-2-7b, running on two separate GPUs. We’re using vLLM’s OpenAI-compatible server.

# Replica 1 (pinned to GPU 0 so the two replicas don't contend for the same device)
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --served-model-name llama-2-7b-replica-1

# Replica 2 (pinned to GPU 1)
CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 \
    --port 8001 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --served-model-name llama-2-7b-replica-2

Now we need a load balancer to sit in front of these. vLLM doesn’t ship a dedicated load-balancer process, but external load balancers can build on the state its servers expose. For demonstration, let’s simulate requests going to these replicas. A common setup is an Nginx or HAProxy instance configured to forward requests to http://<replica1_ip>:8000 and http://<replica2_ip>:8001.
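
As a concrete baseline, here is a minimal Nginx reverse-proxy config for that setup. This is a plain round-robin sketch, before any vLLM-aware logic; the upstream addresses are placeholders for your replica hosts:

```nginx
# Round-robin across the two replicas; adding "least_conn;" inside the
# upstream block is a slightly smarter built-in alternative.
upstream vllm_replicas {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
}

server {
    listen 80;
    location / {
        proxy_pass http://vllm_replicas;
        proxy_read_timeout 300s;  # long generations need generous timeouts
    }
}
```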

When a request comes in, say, "What is the capital of France?", the load balancer needs to decide which replica gets it. If it were a simple round-robin, it would just alternate. But vLLM’s intelligent routing looks deeper. It considers factors like:

  • Current GPU utilization: How much VRAM is being used by ongoing requests?
  • Number of active requests: How many sequences are currently being processed by each replica?
  • Prompt length and expected output length: Longer prompts and expected outputs require more KV cache and computation.
  • Batching efficiency: Can the new request be batched with existing ones on a replica to maximize GPU throughput?
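
The factors above can be combined into a toy scoring function. The field names, weights, and `ReplicaState` snapshot below are illustrative assumptions, not part of vLLM’s API; they are just one way a balancer could rank replicas:

```python
from dataclasses import dataclass

@dataclass
class ReplicaState:
    # Hypothetical snapshot of a replica's load, e.g. scraped from its metrics.
    num_running: int        # sequences currently generating
    num_waiting: int        # sequences queued, not yet started
    kv_cache_usage: float   # fraction of KV cache blocks in use (0.0-1.0)

def score(state: ReplicaState, prompt_tokens: int, max_new_tokens: int) -> float:
    """Lower is better: a rough proxy for time-to-completion of a new request."""
    # Queued work delays the new request most, running work less so,
    # and a nearly full KV cache risks preemption and recomputation.
    return (
        2.0 * state.num_waiting
        + 1.0 * state.num_running
        + 5.0 * state.kv_cache_usage
        + 0.001 * (prompt_tokens + max_new_tokens)  # tie-breaker on request size
    )

def pick_replica(states: dict[str, ReplicaState],
                 prompt_tokens: int, max_new_tokens: int) -> str:
    """Return the name of the replica with the lowest predicted cost."""
    return min(states, key=lambda n: score(states[n], prompt_tokens, max_new_tokens))
```

The weights here are arbitrary; in practice you would tune them against observed latency.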

vLLM’s continuous batching engine is the heart of this. It doesn’t wait for a whole batch of requests to be assembled before running the model. Instead, at every step it pulls waiting requests into the in-flight batch whenever they can be processed efficiently. This means the "state" of a replica is dynamic: it’s not just how many requests it holds, but how far along they are and what their resource demands are.
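
The continuous-batching idea can be sketched in a few lines of Python. This is a simplified simulation of the token-level admission behavior, not vLLM’s actual engine code:

```python
from collections import deque

def continuous_batching(requests: dict, max_batch: int = 4) -> list:
    """Simulate token-level batching: each request is a number of decode steps.

    Returns request names in completion order. New requests join the running
    batch as soon as a slot frees up -- no waiting for the batch to drain.
    """
    waiting = deque(requests.items())   # (name, steps_remaining)
    running: dict = {}
    finished = []
    while waiting or running:
        # Admit new requests into any free slots (the "continuous" part).
        while waiting and len(running) < max_batch:
            name, steps = waiting.popleft()
            running[name] = steps
        # Perform one decode step for every running sequence.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
                finished.append(name)
    return finished

# "b" (1 step) finishes first, freeing a slot that "c" fills immediately,
# while "a" keeps decoding in the same batch the whole time.
print(continuous_batching({"a": 3, "b": 1, "c": 2}, max_batch=2))  # ['b', 'a', 'c']
```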

The load balancer, if it’s vLLM-aware (e.g., a custom controller or a smart proxy), would query each replica’s API endpoint (vLLM servers expose /health and Prometheus-format /metrics endpoints that can be used for this) to get this state. It then uses a scoring mechanism to pick the replica predicted to have the lowest completion time for the new request. That might mean sending a new request to a replica that already has more requests, if those requests are short or nearly finished, so the new request can be processed quickly without significant contention.
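
A poller along those lines could scrape the /metrics text output. The metric names used here (e.g. vllm:num_requests_running) follow vLLM’s published Prometheus metrics but may vary by version, and the load formula is an assumption for illustration:

```python
import urllib.request

def parse_metric(metrics_text: str, metric_name: str):
    """Pull one gauge value out of Prometheus-format /metrics output."""
    for line in metrics_text.splitlines():
        # Lines look like: metric_name{label="value"} 3.0
        if line.startswith(metric_name) and not line.startswith("#"):
            return float(line.rsplit(" ", 1)[-1])
    return None  # metric absent from this scrape

def fetch_metric(base_url: str, metric_name: str):
    """Scrape one metric from a replica's /metrics endpoint."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=2) as resp:
        return parse_metric(resp.read().decode(), metric_name)

def replica_load(base_url: str) -> float:
    """Fold running and queued request counts into a single load number."""
    running = fetch_metric(base_url, "vllm:num_requests_running") or 0.0
    waiting = fetch_metric(base_url, "vllm:num_requests_waiting") or 0.0
    return running + 2.0 * waiting  # queued work delays a new request the most
```

Routing then reduces to sending the request to the base URL with the smallest `replica_load`.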

Let’s say Replica 1 has two short requests finishing soon, and Replica 2 has one long request. Even though Replica 2 has fewer requests, the long one is consuming significant KV cache. A vLLM-aware load balancer might choose Replica 1 because the new, short request can be appended to the existing batch and finish very quickly.
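
That intuition can be put into numbers with a back-of-the-envelope model where a replica’s cost is the total decode steps it still owes its in-flight requests. All figures below are hypothetical:

```python
def remaining_work(steps_left_per_request: list) -> int:
    """Total decode steps a replica still owes its current requests."""
    return sum(steps_left_per_request)

replica_1 = [10, 15]   # two short requests, almost done
replica_2 = [800]      # one long request holding significant KV cache

states = {"replica-1": replica_1, "replica-2": replica_2}
best = min(states, key=lambda name: remaining_work(states[name]))
print(best)  # replica-1 wins despite having more in-flight requests
```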

The core idea is to minimize tail latency, the latency experienced by the slowest requests. By routing intelligently, the system aims to keep all requests moving as fast as possible, rather than just ensuring good average throughput.

The specific levers you control are primarily the configuration of the individual vLLM servers. The --gpu-memory-utilization flag is crucial: set it too high and you risk out-of-memory errors; set it too low and you underutilize your hardware. --tensor-parallel-size dictates how many GPUs a single model replica spans. For load balancing across replicas, the key is ensuring each replica has sufficient resources and then using an external mechanism (or a vLLM controller) to dynamically assess replica states and route traffic.
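
To see why --gpu-memory-utilization matters, you can roughly size the KV cache budget from llama-2-7b’s architecture (32 layers, hidden size 4096, fp16, no grouped-query attention). The GPU size and weight footprint below are example figures, and real overheads (activations, CUDA context) make the true number smaller:

```python
def kv_cache_tokens(gpu_mem_gib: float, gpu_memory_utilization: float,
                    model_weights_gib: float) -> int:
    """Rough count of tokens the KV cache can hold for llama-2-7b in fp16."""
    # Per-token KV bytes = 2 (K and V) * 32 layers * 4096 hidden * 2 bytes (fp16)
    bytes_per_token = 2 * 32 * 4096 * 2          # = 524_288, i.e. 0.5 MiB/token
    budget_gib = gpu_mem_gib * gpu_memory_utilization - model_weights_gib
    return int(budget_gib * 1024**3 // bytes_per_token)

# A 24 GiB GPU at utilization 0.9, with ~13 GiB of fp16 weights, leaves room
# for roughly 17k cached tokens shared across all active sequences.
print(kv_cache_tokens(24, 0.9, 13.0))
```

This is why a replica whose long-running request is eating KV cache can be a worse destination than a busier replica with short requests.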

The critical component that enables this intelligent routing is vLLM’s scheduler (the Scheduler class in vllm.core.scheduler). It manages incoming requests, their states (e.g., prompt processing vs. generation), and how they are batched together for efficient GPU execution. By default it schedules first-come, first-served and preempts sequences under KV cache pressure; the signals this scheduling produces, such as queue depth and cache usage, are what a vLLM-aware load balancer ideally leverages.

What most people don’t realize is that the "best" replica isn’t always the one with the fewest requests. A replica with more requests might be better if those requests are very close to completion, allowing a new request to be added to the current, highly optimized batch and finish faster than if it were sent to a less utilized but less "warm" replica.

The next step in optimizing this setup is to explore custom load balancing strategies that can directly query vLLM’s internal state for more granular decision-making, or to investigate vLLM’s experimental controller features.

Want structured learning?

Take the full vLLM course →