The most surprising thing about multi-GPU tensor parallelism in the TensorRT ecosystem is that it doesn’t simply replicate tensors across GPUs; it splits the operations performed on those tensors.
Imagine you have a massive neural network layer, like a large matrix multiplication, that’s too big to fit or compute efficiently on a single GPU. Tensor parallelism breaks this single operation into smaller chunks that can be processed concurrently on multiple GPUs. For instance, a large W matrix in Y = XW can be split column-wise across GPUs. If W is split into W1 and W2 on GPU1 and GPU2 respectively, then GPU1 computes Y1 = XW1 and GPU2 computes Y2 = XW2. The final result Y is then the concatenation of Y1 and Y2. This allows you to effectively increase the size of the model you can run by leveraging the combined computational power and memory of multiple GPUs.
Here’s a simplified example of how you might set up the builder configuration with TensorRT’s C++ API. Keep in mind that a core TensorRT engine is built for, and executes on, a single GPU; the multi-GPU coordination discussed below happens above the engine level.
#include <NvInfer.h>
#include <vector>
// Assume 'builder' is an initialized nvinfer1::IBuilder and 'network' is an nvinfer1::INetworkDefinition
// Create a builder config
nvinfer1::IBuilderConfig* builderConfig = builder->createBuilderConfig();
// Cap the scratch memory TensorRT may use during optimization.
// (setMaxBatchSize belonged to IBuilder, not IBuilderConfig, and has been
// removed in recent releases; networks are now explicit-batch.)
builderConfig->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30);
// Enable reduced-precision tactics. These are precision flags, not
// multi-GPU flags, but they matter for parallel deployments, as discussed below.
builderConfig->setFlag(nvinfer1::BuilderFlag::kTF32);
builderConfig->setFlag(nvinfer1::BuilderFlag::kFP16);
There is no builder flag that requests tensor parallelism. A core TensorRT engine targets exactly one GPU: the CUDA device that is current when the builder and, later, the runtime are created. Likewise, trtexec has no option that splits a single engine across devices; its --device=N flag only selects which GPU to build and run on. To serve one model with multiple GPUs you therefore either deploy an independent engine per device and coordinate them yourself (which is data parallelism, not tensor parallelism), or use TensorRT-LLM, which implements tensor parallelism explicitly: it builds one engine per rank, each containing that rank's shard of the weights plus the NCCL communication plugins needed to exchange partial results.
The split, in other words, is not discovered by an optimizer pass at build time; it is encoded in the network definition itself. In TensorRT-LLM, large linear layers are declared column-parallel or row-parallel, and each rank's builder sees only its slice of the weight matrix. If you are assembling a network by hand with the C++ API, the equivalent is giving each rank an IMatrixMultiplyLayer whose weights are the pre-sharded slice (the older addFullyConnected API is deprecated). A large fully connected layer is the canonical candidate: split its weight matrix column-wise, let each GPU compute its partial product, and gather the outputs.
Each rank then builds its engine for its own (single) GPU:

nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *builderConfig);

After building, you can inspect the plan (for example with IEngineInspector). You won't see "tensor parallelism" as a layer type; you will see the sharded matrix multiplications and, in a TensorRT-LLM engine, the communication plugins placed between them. The sharding strategy (how many ranks, and which layers are split along which axis) is fixed when the engines are built, and is chosen to minimize communication overhead while maximizing concurrent computation.
Consider the communication pattern for a column-wise split of W into [W1 | W2] across two GPUs: GPU 1 computes Y1 = XW1, GPU 2 computes Y2 = XW2, and the full Y is the column-wise concatenation [Y1 | Y2]. Materializing Y on every GPU requires an all-gather, a collective inter-GPU exchange. This is where BuilderFlag::kFP16 earns its keep: FP16 halves the size of both the weight shards and the activations relative to FP32, shrinking the memory footprint on each GPU and the volume of data moved in the collective.
On the command line, trtexec exposes precision flags such as --fp16 and a --device=N option to choose the target GPU, but no flag that shards one engine across devices. In the TensorRT-LLM toolchain, by contrast, the tensor-parallel degree is an explicit build-time parameter (a tp_size setting in its conversion and build scripts), and one engine is produced per rank.
Wrapping up the build on each rank:

nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *builderConfig);
delete builderConfig; // destroy() is deprecated in favor of plain delete

A serialized engine is later deserialized and executed through a runtime on the same kind of device it was built for:

nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(serializedEngine, engineSize);
nvinfer1::IExecutionContext* context = engine->createExecutionContext();
In a tensor-parallel deployment there is one engine, one runtime, and one execution context per GPU, launched together. The key point is that the *layer* is split: each GPU holds only its shard of the weights, so no full copy of the large tensor exists anywhere. The partial computations run concurrently on separate CUDA streams, and the collective exchanges between GPUs are handled by NCCL (NVIDIA Collective Communications Library).

Which layers are worth splitting? Typically the very large matrix multiplications: big fully connected layers and the projection layers of transformer blocks, where the computation saved per GPU clearly outweighs the communication added. If a model's weights do not fit in a single GPU's memory, tensor parallelism is essential; it lets you scale the *width* (the number of features or channels per layer) of the network beyond single-GPU limits. The runtime orchestrates the necessary collectives, such as an all-gather for a column-wise split or an all-reduce for a row-wise split, to combine the partial results from each GPU. The benefit is the ability to run larger models at higher throughput; the trade-off is communication overhead, which the build-time sharding strategy aims to minimize.
The counterintuitive part, then, is that the parallelism is applied at the operation level rather than by replicating the model and processing inputs independently on each GPU: a single large computation, like a massive matrix multiplication, is decomposed into smaller, interdependent pieces that execute concurrently and are stitched back together by collective operations.