vLLM’s multi-node serving is how you stop thinking about fitting your massive LLM into a single GPU’s memory and start thinking about fitting it across an entire cluster of machines.

Let’s see it in action. Imagine we have two nodes, node1 and node2, each with two A100 GPUs. We want to serve a 70B parameter model.

vLLM doesn’t launch a separate server process per node. Instead, you first form a Ray cluster that spans both machines, then start a single API server on one node; vLLM uses Ray to place a model worker on every GPU it needs, wherever those GPUs live.

On node1, start the Ray head:

ray start --head --port=6379

On node2, join the cluster by pointing at node1’s address:

ray start --address='<node1-ip>:6379'

Running ray status on either node should now show all four GPUs. Then, back on node1, launch the server:

python -m vllm.entrypoints.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --worker-use-ray

The --worker-use-ray flag tells vLLM to run its model workers as Ray actors rather than in-process, which is what allows them to be scheduled onto node2’s GPUs; it connects to the existing Ray cluster automatically. (Newer vLLM releases express the same thing as --distributed-executor-backend ray.)

Notice the --tensor-parallel-size 4. This tells vLLM to split the model’s weights across 4 GPUs in total. Since we have 2 GPUs on node1 and 2 on node2, Ray places two workers on each machine, and the model’s parameters end up distributed across both. There is no per-node worker command to run: the single server process orchestrates all four workers.

The problem this solves is simple: LLMs are huge. A 70B parameter model needs roughly 140 GB of VRAM for its weights alone in fp16, before counting the KV cache, far exceeding what any single GPU offers. Even if the weights could fit, the computational demands of inference would be immense. Multi-node serving lets you pool the memory and compute resources of multiple machines.
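The arithmetic is easy to sketch. Assuming fp16 weights (2 bytes per parameter), a back-of-the-envelope estimate looks like this:

```python
# Back-of-the-envelope VRAM estimate for serving a 70B model in fp16.
# Rough numbers only: real usage adds KV cache, activations, and overhead.

params = 70e9            # 70B parameters
bytes_per_param = 2      # fp16
weight_gb = params * bytes_per_param / 1e9

tensor_parallel_size = 4
per_gpu_gb = weight_gb / tensor_parallel_size

print(f"total weights:  {weight_gb:.0f} GB")   # 140 GB
print(f"per GPU (tp=4): {per_gpu_gb:.0f} GB")  # 35 GB
```

At 35 GB of weights per GPU, four 80 GB A100s leave comfortable headroom for the KV cache; a single GPU clearly cannot hold the model at all.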

Internally, vLLM uses Ray for distributed orchestration. When you specify tensor-parallel-size, vLLM partitions the model’s weight matrices. Each worker (which may be running on a different node) holds a slice of these weights and computes on its assigned portion; the partial results are then communicated and aggregated across workers at each layer to produce the final output. All of this is transparent to the API client, which simply sends requests to the single server endpoint.
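The partitioning idea can be shown with a toy column-parallel matmul (pure Python, no vLLM, and vastly simplified relative to what vLLM actually does): each “worker” holds a contiguous slice of a weight matrix’s output columns, computes its partial result, and the slices are concatenated, reproducing the single-device result.

```python
# Toy sketch of column-parallel tensor parallelism.

def matmul(x, W):
    # x: vector of length d_in; W: d_in x d_out matrix (list of rows)
    d_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]

def split_columns(W, n):
    # Split W's output columns into n contiguous shards, one per worker.
    step = len(W[0]) // n
    return [[row[k * step:(k + 1) * step] for row in W] for k in range(n)]

x = [1.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(W, n=2)            # two "GPUs", each with half the columns
partials = [matmul(x, Wk) for Wk in shards]
out = [v for p in partials for v in p]    # "all-gather": concatenate the slices

assert out == matmul(x, W)                # identical to the single-device result
```

The concatenation step stands in for the collective communication (all-gather/all-reduce) that real tensor-parallel implementations perform over NCCL.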

The key levers you control are --tensor-parallel-size and --pipeline-parallel-size. Tensor parallelism splits individual layers across GPUs, reducing memory per GPU and parallelizing compute within each layer. Pipeline parallelism splits the stack of layers into sequential stages, with different GPUs handling different stages. The product tensor_parallel_size × pipeline_parallel_size must equal the total number of GPUs you use across all nodes. Within a node with fast interconnect, tensor parallelism is usually the primary method; across nodes, where bandwidth is lower, it is often better to keep tensor parallelism within each node and use pipeline parallelism between nodes.
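Because the two flags compose multiplicatively, our 2-node × 2-GPU cluster admits more than one valid layout (a quick illustrative check, not vLLM API):

```python
# world size = tensor_parallel_size * pipeline_parallel_size must equal
# the total GPU count. Two valid layouts for a 2-node x 2-GPU cluster:
total_gpus = 4
layouts = [
    {"tensor_parallel_size": 4, "pipeline_parallel_size": 1},  # TP across nodes
    {"tensor_parallel_size": 2, "pipeline_parallel_size": 2},  # TP within a node,
]                                                              # PP between nodes

for cfg in layouts:
    world = cfg["tensor_parallel_size"] * cfg["pipeline_parallel_size"]
    assert world == total_gpus
```

The second layout keeps the bandwidth-hungry tensor-parallel traffic on each node’s fast local interconnect and sends only activations over the network between pipeline stages.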

When the requested tensor-parallel-size exceeds the number of GPUs on a single node, vLLM’s Ray backend schedules workers onto other nodes in the cluster and handles the communication between them. The server process on node1 remains the single point of contact for incoming requests and dispatches work to all of the workers.
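From the client’s side, nothing about the multi-node layout is visible: you POST to the head node as if it were a single-GPU server. A minimal request might look like the following (field names per vLLM’s simple demo api_server, which exposes a /generate endpoint; they may vary by version, and the OpenAI-compatible server uses different routes; `node1` is a placeholder hostname):

```python
import json
import urllib.request

# Sampling parameters are passed alongside the prompt in the JSON body.
payload = {"prompt": "The capital of France is", "max_tokens": 16, "temperature": 0.0}

req = urllib.request.Request(
    "http://node1:8000/generate",        # the head node; workers are never addressed directly
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment against a live server
```

The request lands on node1, but the forward pass it triggers runs on all four GPUs across both machines.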

A common pitfall is miscounting. If you have 4 GPUs across two nodes but set --tensor-parallel-size 2, vLLM will load the model onto just two GPUs, which for a 70B model likely means an out-of-memory error rather than a graceful fallback. Conversely, a tensor-parallel-size that doesn’t evenly divide the model’s attention heads fails at startup. Always make sure tensor_parallel_size × pipeline_parallel_size matches the number of GPUs you actually intend to use.
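These constraints are easy to check on paper before launch (illustrative assertions only; vLLM performs its own validation at startup and raises if they fail):

```python
# Pre-launch sanity checks for the 2-node x 2-GPU Llama-2-70B example.
num_attention_heads = 64      # Llama-2-70B's query-head count
total_gpus = 4                # 2 nodes x 2 GPUs
tensor_parallel_size = 4
pipeline_parallel_size = 1

# Every GPU you pay for should be part of the layout.
assert tensor_parallel_size * pipeline_parallel_size == total_gpus

# Attention heads must shard evenly across the tensor-parallel group.
assert num_attention_heads % tensor_parallel_size == 0
```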

The next step after mastering multi-node serving is often exploring advanced scheduling and load balancing strategies within a larger Ray cluster or integrating with Kubernetes for more robust deployment.

Want structured learning?

Take the full vLLM course →