Pipeline parallelism in vLLM is a technique that lets you serve large language models that wouldn’t fit into a single GPU’s memory by splitting the model’s layers across multiple GPUs.
Let’s see this in action. Imagine we have a massive model and two GPUs. Instead of loading the whole model onto one, we’ll put the first half of the layers on GPU 0 and the second half on GPU 1.
from vllm import LLM, SamplingParams

# Assume model_name is a very large model that won't fit on one GPU
model_name = "meta-llama/Llama-2-70b-chat-hf"

# Split the model's layers across two GPUs with pipeline parallelism.
# vLLM determines the layer partition automatically.
llm = LLM(model=model_name, pipeline_parallel_size=2)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)

# Run inference
outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
This setup allows vLLM to distribute the computational load and memory requirements. When a request comes in, the first GPU processes its layers and passes the intermediate activations to the second GPU, which then continues the computation. This is a form of model parallelism where the model itself is partitioned.
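The handoff described above can be sketched as a toy two-stage pipeline in plain Python. No GPUs here, and none of these names are vLLM APIs; the stage functions simply stand in for the real layer computations:

```python
# Toy "model": 4 layers, each just adds its index to the activation.
def make_layer(i):
    return lambda x: x + i

layers = [make_layer(i) for i in range(4)]

# Pipeline split: stage 0 holds layers 0-1, stage 1 holds layers 2-3.
stage0, stage1 = layers[:2], layers[2:]

def run_stage(stage_layers, activations):
    for layer in stage_layers:
        activations = layer(activations)
    return activations

# A request runs through stage 0, then its intermediate activations
# are "sent" to stage 1, which finishes the forward pass.
intermediate = run_stage(stage0, 10)      # GPU 0's half of the model
result = run_stage(stage1, intermediate)  # GPU 1's half

# The split model produces the same result as the unsplit one.
assert result == run_stage(layers, 10)
```

The key point is that only the intermediate activations cross the GPU boundary; the weights of each half never leave their device.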
The core problem this solves is the memory and compute limitations of a single GPU. Modern LLMs, especially those with tens or hundreds of billions of parameters, far exceed the VRAM of even the most powerful consumer or enterprise GPUs. By splitting the model, each GPU only needs to hold a fraction of the model’s weights and perform a fraction of the computations for each layer.
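A rough back-of-the-envelope calculation makes the memory pressure concrete (weights only, fp16; KV cache and activation memory would come on top):

```python
params = 70e9            # a 70B-parameter model
bytes_per_param = 2      # fp16 / bf16

weights_gb = params * bytes_per_param / 1e9
print(weights_gb)        # 140.0 GB -- beyond any single GPU's VRAM

num_gpus = 2
per_gpu_gb = weights_gb / num_gpus
print(per_gpu_gb)        # 70.0 GB of weights per GPU when split in half
```

Even split across two GPUs, the weights alone nearly fill an 80 GB accelerator, which is why larger models need more pipeline stages or additional forms of parallelism.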
Internally, vLLM runs one worker per pipeline stage, each responsible for a contiguous subset of the model's layers, and it handles the activation transfers between stages for you. The pipeline_parallel_size parameter is the key setting here: when it is greater than 1, vLLM partitions the model's layers across that many GPUs, choosing the split points based on the model architecture and the number of stages.
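The default partition is essentially an even, contiguous split. A sketch of that logic (partition_layers is a hypothetical helper for illustration, not a vLLM API):

```python
def partition_layers(num_layers, num_stages):
    """Assign a contiguous block of layers to each pipeline stage,
    spreading any remainder over the earliest stages."""
    base, extra = divmod(num_layers, num_stages)
    sizes = [base + (1 if s < extra else 0) for s in range(num_stages)]
    bounds, start = [], 0
    for size in sizes:
        bounds.append((start, start + size))
        start += size
    return bounds

# e.g. an 80-layer model across 2 GPUs:
print(partition_layers(80, 2))   # [(0, 40), (40, 80)]
# 10 layers across 3 GPUs -- the remainder goes to the first stage:
print(partition_layers(10, 3))   # [(0, 4), (4, 7), (7, 10)]
```

Contiguity matters: because each stage owns an unbroken run of layers, activations cross a GPU boundary exactly once per stage transition.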
The main lever you control is the pipeline_parallel_size argument at LLM initialization: set it to the number of GPUs you want the layers distributed across. You don't assign layers to specific GPUs manually; vLLM handles the partitioning. Note that tensor_parallel_size is a separate axis of parallelism: it shards each layer's weights across GPUs rather than splitting whole layers between them, and the two can be combined.
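As a configuration sketch, here is how the two axes might be combined on an 8-GPU node: each of 2 pipeline stages is itself sharded across 4 GPUs with tensor parallelism, for 2 × 4 = 8 GPUs in total (requires the matching hardware to actually run):

```python
from vllm import LLM

# 2 pipeline stages x 4-way tensor parallelism = 8 GPUs total.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    pipeline_parallel_size=2,
    tensor_parallel_size=4,
)
```

The product of the two sizes must match the number of GPUs you dedicate to the model.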
The efficiency of pipeline parallelism depends heavily on balancing the computational load across stages. If one GPU holds more layers, or more computationally expensive layers, than the others, it becomes the bottleneck: the whole pipeline runs at its pace while the other GPUs sit partly idle. vLLM aims for balanced partitions by default, but for custom or highly irregular architectures further tuning may be needed, either by adjusting the layer partition or by leaning more on tensor parallelism, which shards each layer's weights across GPUs instead of splitting the layer stack.
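The bottleneck effect is easy to see numerically. In steady state, a full pipeline completes one batch per cycle of the slowest stage, so any imbalance is pure idle time on the faster GPUs (a toy model with per-stage times in milliseconds):

```python
def steady_state_time_per_batch(stage_times_ms):
    # Once the pipeline is full, a new batch finishes every
    # max(stage_times) ms, no matter how fast the other stages are.
    return max(stage_times_ms)

balanced = [50, 50]   # layers split evenly between two stages
skewed   = [30, 70]   # stage 1 got the heavier layers

print(steady_state_time_per_batch(balanced))  # 50 ms per batch
print(steady_state_time_per_batch(skewed))    # 70 ms per batch:
# the 30 ms stage idles 40 ms of every cycle, and throughput
# drops even though total work (100 ms) is identical.
```

This is why a 50/50 split by compute, not just by layer count, is the target.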
The next step after mastering pipeline parallelism is exploring techniques like tensor parallelism or optimizing inference with methods like quantization.