Tensor parallelism splits a single large model layer across multiple GPUs, allowing you to run models that wouldn’t fit on a single GPU or to speed up inference by distributing computation.

Let’s see vLLM’s tensor parallelism in action with a simple example. Imagine we have a model with a large linear layer. In a single-GPU setup, this layer might be nn.Linear(1024, 4096). With tensor parallelism, we can split this layer. For instance, on two GPUs, we might split the output dimension:

GPU 0: nn.Linear(1024, 2048)
GPU 1: nn.Linear(1024, 2048)

The input x (shape [batch_size, 1024]) is sent to both GPUs, and each GPU computes its half of the output:

GPU 0: out_0 = x @ W_0 (shape [batch_size, 2048])
GPU 1: out_1 = x @ W_1 (shape [batch_size, 2048])

Then, the outputs are concatenated along the output dimension: out = torch.cat([out_0, out_1], dim=-1), resulting in [batch_size, 4096], the same shape as if it were on a single GPU.
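This equivalence is easy to check on a single device. Below is a minimal PyTorch sketch that simulates the two GPU shards as plain tensors on one device (the shapes and names are illustrative, not vLLM internals):

```python
import torch

torch.manual_seed(0)
batch_size = 4

# Full weight for a 1024 -> 4096 projection, stored as [in, out] as in the text.
W = torch.randn(1024, 4096)
x = torch.randn(batch_size, 1024)

# Column-wise split: each simulated "GPU" holds half of the output columns.
W_0, W_1 = W[:, :2048], W[:, 2048:]

out_full = x @ W                            # single-device result, [4, 4096]
out_0 = x @ W_0                             # shard 0, [4, 2048]
out_1 = x @ W_1                             # shard 1, [4, 2048]
out_tp = torch.cat([out_0, out_1], dim=-1)  # "all-gather" of the shards

print(torch.allclose(out_full, out_tp, atol=1e-4))  # prints True
```

Because each output column depends only on its own weight column, splitting along the output dimension requires no communication during the matmul itself, only the concatenation at the end.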

The magic happens when you configure vLLM to use multiple GPUs for tensor parallelism. You typically do this by setting the tensor_parallel_size (or tp) in your vLLM engine configuration.

from vllm import LLM, SamplingParams

# vLLM spawns and manages its own worker processes, so on a single node
# a plain `python script.py` is enough; no torchrun-style launcher is needed.
# For the OpenAI-compatible server, the equivalent is a command-line flag:
#   python -m vllm.entrypoints.openai.api_server --tensor-parallel-size 2

# In a Python script, pass tensor_parallel_size to the LLM constructor:
try:
    llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)
    print("LLM initialized with tensor_parallel_size=2")

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)

    prompts = [
        "What is the capital of France?",
        "Write a short story about a robot learning to love.",
    ]

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure at least two GPUs are visible (e.g., via CUDA_VISIBLE_DEVICES)")
    print("and that the node is configured for multi-GPU execution.")

When you run this code, vLLM internally partitions the model’s weights and computations for the layers that benefit from tensor parallelism (typically the large linear layers and the attention mechanism). Setting tensor_parallel_size=2 tells vLLM to divide the model’s parameters and computations across 2 GPUs: each GPU holds a shard of the weights and performs a subset of the calculations for a given layer. For instance, a weight matrix W of shape [in_features, out_features] is split column-wise, with GPU 0 holding W_0 and GPU 1 holding W_1, each of shape [in_features, out_features/tp_size]. The forward pass then uses collective communication to combine the partial results: in the Megatron-style scheme vLLM follows, a column-parallel layer is paired with a subsequent row-parallel layer, so a single all-reduce after the pair reconstructs the full output before it is passed to the next layer.
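Attention parallelizes especially cleanly because each head is computed independently. The following single-device sketch (hypothetical shapes, with identity Q/K/V projections to keep it short) simulates sharding the heads across two ranks and concatenating along the head dimension:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tp_size, batch, seq, n_heads, d_head = 2, 2, 8, 4, 16

def attention(q, k, v):
    # q, k, v: [batch, heads, seq, d_head]; scaled dot-product per head
    scores = q @ k.transpose(-1, -2) / d_head ** 0.5
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(batch, seq, n_heads * d_head)
# Identity projections for brevity: just reshape into per-head tensors.
qkv = x.view(batch, seq, n_heads, d_head).transpose(1, 2)  # [b, h, s, d]

ref = attention(qkv, qkv, qkv)  # all heads on one device

# Tensor parallelism: each simulated rank owns n_heads / tp_size heads.
shards = []
for rank in range(tp_size):
    heads = slice(rank * n_heads // tp_size, (rank + 1) * n_heads // tp_size)
    local = qkv[:, heads]
    shards.append(attention(local, local, local))

out = torch.cat(shards, dim=1)  # gather along the head dimension
print(torch.allclose(ref, out, atol=1e-4))  # prints True
```

Since the softmax for one head never looks at another head's scores, each rank can run its heads start to finish with no mid-layer communication.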

The core problem tensor parallelism solves is fitting very large models into the collective memory of multiple GPUs. If a model’s parameters alone exceed the VRAM of a single GPU, it’s impossible to load. Tensor parallelism breaks these parameters down. It also speeds up inference because the computation for a single layer is distributed. Instead of one GPU doing all the matrix multiplications, multiple GPUs do smaller matrix multiplications concurrently. The communication overhead between GPUs for these partial results is the trade-off, and vLLM’s optimized communication primitives (like fused operations and efficient collective calls) aim to minimize this.

The exact levers you control are primarily tensor_parallel_size and pipeline_parallel_size (if you’re also doing pipeline parallelism, which splits layers sequentially across GPUs). For tensor parallelism, the key is selecting a tensor_parallel_size that matches the number of GPUs you want to use for splitting individual layers. This is often the same as the total number of GPUs if you’re not using pipeline parallelism, or a divisor of the total if you are. vLLM automatically identifies which layers can be parallelized.
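The constraints these levers must satisfy can be sketched with a small sanity-check helper. This is a hypothetical function for illustration, not part of vLLM's API; it encodes two rules mentioned above: tp × pp must equal the GPU count, and sharded dimensions (such as attention heads) must divide evenly across tensor-parallel ranks:

```python
def check_parallel_config(num_gpus, tensor_parallel_size,
                          pipeline_parallel_size=1, num_attention_heads=None):
    """Hypothetical sanity check for a (tp, pp) configuration."""
    if tensor_parallel_size * pipeline_parallel_size != num_gpus:
        raise ValueError(
            f"tp ({tensor_parallel_size}) * pp ({pipeline_parallel_size}) "
            f"must equal the number of GPUs ({num_gpus})"
        )
    if num_attention_heads is not None and num_attention_heads % tensor_parallel_size:
        raise ValueError(
            f"attention heads ({num_attention_heads}) must divide evenly "
            f"across tensor-parallel ranks ({tensor_parallel_size})"
        )
    return True

# Llama-2-7B has 32 attention heads, so tp=2 on a 2-GPU node is valid.
print(check_parallel_config(num_gpus=2, tensor_parallel_size=2,
                            num_attention_heads=32))  # prints True
```

A config like tp=2 on a 4-GPU node only works if the remaining factor is consumed by pipeline parallelism (pp=2); otherwise the helper raises.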

When it comes to distributing large models, the surprising part is how pervasive tensor parallelism is within the architecture. It is not reserved for the final output layers: the feed-forward networks (FFNs) inside every transformer block, which typically contain the largest linear layers in the model, are prime candidates. vLLM’s implementation automatically detects and parallelizes these massive components, so even the computational bottlenecks within each transformer layer are distributed, maximizing throughput. This automatic parallelization of FFNs is a significant contributor to vLLM’s performance gains over simpler parallelism strategies.
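A sharded FFN can be sketched on a single device as well. In the Megatron-style split, the up projection is column-parallel, the activation is applied locally on each shard, and the down projection is row-parallel, so one simulated all-reduce (here a plain sum) combines the partial outputs. Shapes are illustrative:

```python
import torch

torch.manual_seed(0)
tp_size, batch, d_model, d_ff = 2, 4, 64, 256

x = torch.randn(batch, d_model)
W1 = torch.randn(d_model, d_ff)  # up projection, split column-wise
W2 = torch.randn(d_ff, d_model)  # down projection, split row-wise

# Reference: the unsharded FFN on one device.
ref = torch.relu(x @ W1) @ W2

partials = []
for rank in range(tp_size):
    cols = slice(rank * d_ff // tp_size, (rank + 1) * d_ff // tp_size)
    h = torch.relu(x @ W1[:, cols])    # column-parallel + local activation
    partials.append(h @ W2[cols, :])   # row-parallel: local partial output

out = sum(partials)  # the single "all-reduce" for the whole FFN block
print(torch.allclose(ref, out, atol=1e-4))  # prints True
```

The elementwise activation needs no communication because it only touches values a rank already holds, which is why the entire FFN costs just one all-reduce.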

The next concept you’ll likely encounter is pipeline parallelism, which further distributes models by assigning sequential layers or blocks of layers to different GPUs.
