Model parallelism with Triton lets you run models too big for a single GPU by splitting them across multiple devices, but it’s not just about distributing layers; it’s about orchestrating the computation so that, together, the devices behave like one larger GPU.

Let’s look at a concrete example. Imagine a large transformer model. When you use model parallelism, you’re not just copying the model weights. You’re actually breaking down the model’s layers or specific operations and assigning them to different GPUs. For instance, GPU 0 might handle the first half of the attention mechanism, while GPU 1 handles the second half, and then GPU 2 takes on the feed-forward network. The tricky part is managing the data flow between these GPUs. Activations computed on GPU 0 need to be sent to GPU 1 for the next stage, and then those results are sent to GPU 2. This communication overhead is the primary bottleneck and the reason why simply splitting a model doesn’t automatically make it faster.

Consider a scenario where we have a model that, when loaded entirely onto a single GPU, would exceed its VRAM. We can use model parallelism to distribute this model across two A100 GPUs.

Here’s a simplified conceptual setup using Triton’s Python API:

import triton
import triton.language as tl

# Assume 'model_layer_1' and 'model_layer_2' are functions
# representing parts of your model that can be run independently
# and potentially on different devices.

# This is a highly simplified representation. In reality, you'd be
# dealing with weight matrices, activation tensors, and specific
# kernel implementations.

@triton.jit
def model_layer_1_kernel(
    x_ptr, y_ptr, weight_ptr,
    stride_x, stride_y, stride_w,
    N, BLOCK_SIZE: tl.constexpr
):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N  # guard against out-of-bounds accesses in the last block
    # Load input activations
    x = tl.load(x_ptr + offsets * stride_x, mask=mask)
    # Load weights
    w = tl.load(weight_ptr + offsets * stride_w, mask=mask)
    # Perform the layer's computation (elementwise here for brevity;
    # a real layer would be a matmul or attention kernel)
    y = x * w
    # Store output activations
    tl.store(y_ptr + offsets * stride_y, y, mask=mask)

@triton.jit
def model_layer_2_kernel(
    x_ptr, y_ptr, weight_ptr,
    stride_x, stride_y, stride_w,
    N, BLOCK_SIZE: tl.constexpr
):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N  # same out-of-bounds guard as layer 1
    x = tl.load(x_ptr + offsets * stride_x, mask=mask)
    w = tl.load(weight_ptr + offsets * stride_w, mask=mask)
    y = x * w  # simplified elementwise stand-in for the real layer op
    tl.store(y_ptr + offsets * stride_y, y, mask=mask)

# --- In Python (host side; requires two CUDA GPUs) ---
# Triton itself does not expose a `triton.launch(..., device=...)` API.
# A kernel runs on the GPU that the current CUDA device context (and
# the input tensors) point at, so device placement is handled on the
# host, e.g. with PyTorch.

import torch

N = 4096
BLOCK_SIZE = 1024
grid = (triton.cdiv(N, BLOCK_SIZE),)

# Stage 1: weights and activations live on GPU 0
x0 = torch.randn(N, device="cuda:0")
w1 = torch.randn(N, device="cuda:0")
y0 = torch.empty_like(x0)
with torch.cuda.device(0):
    model_layer_1_kernel[grid](x0, y0, w1, 1, 1, 1, N, BLOCK_SIZE=BLOCK_SIZE)

# Move the stage-1 activations to GPU 1. Under the hood this is an
# asynchronous device-to-device copy (`cudaMemcpyAsync` with
# `cudaMemcpyDeviceToDevice`) issued on the current stream.
x1 = y0.to("cuda:1", non_blocking=True)

# Stage 2: weights live on GPU 1
w2 = torch.randn(N, device="cuda:1")
y1 = torch.empty_like(x1)
with torch.cuda.device(1):
    model_layer_2_kernel[grid](x1, y1, w2, 1, 1, 1, N, BLOCK_SIZE=BLOCK_SIZE)

# The final output y1 now lives on GPU 1.

The core problem model parallelism solves is memory capacity. When a model’s parameters and intermediate activations exceed the VRAM of a single GPU, you have no choice but to distribute them. Triton’s fine-grained kernel control, combined with host-side device placement, lets you implement this distribution efficiently, minimizing the communication overhead that plagues naive approaches.
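To see why distribution becomes mandatory, a back-of-envelope calculation helps. The sketch below uses illustrative parameter counts (not figures from this text) to check whether a model’s weights alone fit in an 80 GB A100; real usage also includes activations, optimizer state, and framework overhead:

```python
# Back-of-envelope VRAM check: do the weights alone fit on one 80 GB A100?
def param_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for weights alone, in GiB (fp16/bf16 -> 2 bytes each)."""
    return n_params * bytes_per_param / 2**30

A100_GIB = 80
for n in (7e9, 13e9, 70e9):
    need = param_gib(n)
    verdict = "fits" if need < A100_GIB else "needs model parallelism"
    print(f"{n / 1e9:.0f}B params -> {need:.0f} GiB ({verdict})")
```

At bf16 precision, a 70B-parameter model already needs roughly 130 GiB for weights alone, so it must be split across at least two 80 GB devices before a single token is processed.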

Internally, this kind of multi-GPU execution relies on the host runtime’s ability to manage multiple CUDA devices and streams; Triton kernels themselves are single-device, and the orchestration happens on the host. When you define a model that spans multiple GPUs, that runtime is responsible for:

  1. Kernel Placement: Ensuring that the correct kernels (your defined Triton functions) are launched on the designated GPUs.
  2. Data Transfer: Orchestrating the asynchronous transfer of activation tensors between GPUs. This is critical: if GPU 0 finishes its stage and GPU 1 must sit idle until the transfer completes, you lose much of the benefit of parallelism. The goal is to overlap computation and communication as much as possible.
  3. Synchronization: Making sure that operations on one GPU are complete before the next GPU tries to consume its results, and that the overall forward/backward pass completes correctly.
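The synchronization pattern in step 3 can be modeled without any GPU at all: a producer stage must signal completion before the consumer stage reads its results. The toy below is a conceptual sketch, not Triton API; Python threads and a queue stand in for two GPUs connected by CUDA streams and events:

```python
import threading
import queue

# Toy model of cross-GPU synchronization: "GPU 0" produces activations
# and "GPU 1" consumes them only after each one is signaled as ready.
# The queue's blocking get() stands in for waiting on a CUDA event.
activations = queue.Queue()

def gpu0_stage(xs):
    for x in xs:
        y = x * 2            # stand-in for the layer-1 kernel
        activations.put(y)   # stand-in for the device-to-device copy
    activations.put(None)    # sentinel: no more micro-batches

results = []

def gpu1_stage():
    while (y := activations.get()) is not None:
        results.append(y + 1)  # stand-in for the layer-2 kernel

t0 = threading.Thread(target=gpu0_stage, args=([1, 2, 3],))
t1 = threading.Thread(target=gpu1_stage)
t0.start(); t1.start()
t0.join(); t1.join()
print(results)  # [3, 5, 7]
```

The essential property is that stage 2 never reads an activation before stage 1 has finished producing it, yet the two stages still run concurrently on different micro-batches.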

The exact levers you control are primarily at the kernel definition and launch level. You define kernels that operate on specific tensor slices or perform specific model stages. Then you assign these kernels and their associated data to particular GPUs, for example by restricting visible devices with the CUDA_VISIBLE_DEVICES environment variable or by selecting the active device from the host framework before each launch.
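CUDA_VISIBLE_DEVICES is a standard CUDA runtime feature: it restricts which physical GPUs a process can see and renumbers them starting from cuda:0. The sketch below only demonstrates the variable being passed to a child process (the GPU indices are hypothetical); in a real run the child would be your model-parallel script:

```python
import os
import subprocess
import sys

# Pin a child process to (hypothetical) physical GPUs 2 and 3.
# Inside that process, they are renumbered as cuda:0 and cuda:1.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="2,3")
out = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env, capture_output=True, text=True,
).stdout.strip()
print(out)  # 2,3
```

This is often the simplest placement lever, because the code inside the process can keep using logical indices 0 and 1 regardless of which physical GPUs the cluster scheduler assigned.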

The surprising part is how much of the communication cost can be hidden. It’s not just about saying "run this on GPU 0 and that on GPU 1." Because kernel launches and device-to-device copies are asynchronous with respect to the host, you can use separate CUDA streams to schedule data transfers and kernel launches in a way that maximizes GPU utilization and minimizes idle time. For instance, while GPU 1 is busy processing its assigned stage of the model, the transfer from GPU 0 for the next batch of data can already be in flight, and the next stage’s weights can be prefetched.
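This overlap has a simple host-side shape: double buffering. While the compute stage works on batch i, a background worker already performs the "transfer" of batch i+1. The sketch below is a toy with made-up functions, using a thread pool as a stand-in for a second CUDA stream:

```python
from concurrent.futures import ThreadPoolExecutor

# Double buffering: overlap the transfer of batch i+1 with the
# computation on batch i, mirroring cudaMemcpyAsync + kernel overlap.
def transfer(batch):      # stand-in for the GPU0 -> GPU1 copy
    return [x * 10 for x in batch]

def compute(batch):       # stand-in for the GPU1 kernel
    return sum(batch)

batches = [[1, 2], [3, 4], [5, 6]]
outputs = []
with ThreadPoolExecutor(max_workers=1) as pool:
    next_fut = pool.submit(transfer, batches[0])   # start transfer of batch 0
    for i in range(len(batches)):
        ready = next_fut.result()                  # wait for batch i's transfer
        if i + 1 < len(batches):
            next_fut = pool.submit(transfer, batches[i + 1])  # prefetch i+1
        outputs.append(compute(ready))             # compute overlaps prefetch
print(outputs)  # [30, 70, 110]
```

The payoff is that, in steady state, each transfer costs nothing extra as long as it finishes before the current batch’s compute does.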

The next concept you’ll run into is pipeline parallelism, which is another strategy for handling large models but focuses on a different kind of parallelism: executing different stages of the model concurrently across GPUs for different micro-batches.
