TensorRT’s layer fusion is less about combining layers and more about aggressively optimizing the computational graph to eliminate overhead and maximize hardware utilization.

Let’s see this in action. Imagine a simple sequence of operations: a convolution followed by a bias addition and then a ReLU activation. In a naive implementation, these would be separate kernel launches, each with its own overhead.

# Hypothetical PyTorch code representing the operations
import torch

conv_layer = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3,
                             stride=1, padding=1, bias=False)  # bias applied separately below
bias_param = torch.nn.Parameter(torch.randn(16))
relu_activation = torch.nn.ReLU()

# ... forward pass ...
input_tensor = torch.randn(1, 3, 32, 32)  # example NCHW input
output_conv = conv_layer(input_tensor)
output_bias = output_conv + bias_param.view(1, -1, 1, 1)  # broadcast bias over H and W
output_relu = relu_activation(output_bias)

When TensorRT builds an optimized engine, it looks at this sequence and realizes that the output of the convolution is directly fed into the bias addition, and that output is directly fed into the ReLU. It can then fuse these operations.

Here’s how TensorRT might represent this internally after fusion (conceptual, not actual code):

// Conceptual TensorRT fused kernel (illustrative pseudocode, not actual TensorRT source)
void fused_conv_bias_relu_kernel(const float* input, float* output,
                                 const float* weights, const float* bias,
                                 int num_channels, int height, int width) {
    // Iterate over output channels and spatial positions
    for (int c = 0; c < num_channels; ++c) {
        for (int h = 0; h < height; ++h) {
            for (int w = 0; w < width; ++w) {
                // Perform convolution for this output element
                float conv_result = 0.0f;
                // ... (convolution logic over input and weights) ...

                // Add the per-channel bias
                float bias_added = conv_result + bias[c];

                // Apply ReLU and write the final value -- no intermediate tensors
                output[(c * height + h) * width + w] = fmaxf(0.0f, bias_added);
            }
        }
    }
}

The goal is to produce a single, highly optimized kernel that performs all three operations in one pass. This eliminates the intermediate memory writes and reads, as well as the overhead of launching multiple kernels. The entire computation for a single output element stays in GPU registers, leading to significant performance gains.
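To make the equivalence concrete, here is a small NumPy sketch (shapes and names are illustrative; `conv_out` stands in for a precomputed convolution result). Computing bias-add and ReLU as separate full-tensor passes gives exactly the same values as computing both together per element, which is what the fused kernel does in registers:

```python
import numpy as np

rng = np.random.default_rng(0)
conv_out = rng.standard_normal((16, 8, 8)).astype(np.float32)  # stand-in conv output (C, H, W)
bias = rng.standard_normal(16).astype(np.float32)

# Unfused: each step materializes a full intermediate tensor in memory
biased = conv_out + bias[:, None, None]        # pass 1 over memory
separate = np.maximum(biased, 0.0)             # pass 2 over memory

# Fused: per element, bias-add and ReLU happen together -- no intermediate tensor
fused = np.empty_like(conv_out)
for c in range(conv_out.shape[0]):
    fused[c] = np.maximum(conv_out[c] + bias[c], 0.0)

assert np.allclose(separate, fused)
```

The fused kernel changes where the arithmetic happens (registers instead of global memory round-trips), not what is computed.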

The core problem TensorRT’s layer fusion solves is the overhead associated with the deep learning framework’s execution model. Frameworks like PyTorch or TensorFlow often represent networks as a directed acyclic graph (DAG) of operations. When you run inference, the framework iterates through this DAG, launching kernels for each node. This involves:

  1. Kernel Launch Overhead: Each torch.nn.Conv2d or torch.nn.ReLU call translates to a kernel launch command. There’s a fixed cost associated with scheduling and launching a CUDA kernel.
  2. Memory Bandwidth: Intermediate results between operations need to be written to and read from global memory. For a sequence like Conv -> Bias -> ReLU, the output of the convolution is written, then read back, then the bias-added result is written, then read back for ReLU. This consumes valuable memory bandwidth.
  3. Compute Underutilization: If individual kernels are small or don’t fully saturate the GPU’s compute units, the GPU might sit idle waiting for the next kernel.
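The memory-traffic cost in point 2 is easy to quantify. A back-of-the-envelope sketch (hypothetical tensor shape) counts the bytes of intermediate traffic the unfused sequence pays and the fused kernel avoids:

```python
# Intermediate memory traffic for Conv -> Bias -> ReLU on a hypothetical activation
n, c, h, w = 1, 16, 224, 224          # NCHW shape of the conv output
bytes_per_elem = 4                     # FP32
tensor_bytes = n * c * h * w * bytes_per_elem

# Unfused: conv writes its output, bias-add reads+writes it, ReLU reads+writes it
unfused_traffic = tensor_bytes * 5     # 1 write + (1 read + 1 write) * 2
# Fused: one write of the final result; everything else stays in registers
fused_traffic = tensor_bytes * 1

print(unfused_traffic // 2**20, "MiB vs", fused_traffic // 2**20, "MiB")  # 15 MiB vs 3 MiB
```

A 5x reduction in activation traffic for this toy shape; the ratio grows with every extra element-wise op in the chain.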

TensorRT’s IBuilderConfig and INetworkDefinition are your primary levers. When you define your network using INetworkDefinition, TensorRT analyzes the dependencies and, by default, attempts to fuse compatible layers. You can influence this with builder flags such as trt.BuilderFlag.FP16 or trt.BuilderFlag.INT8 (BuilderFlag::kFP16 / kINT8 in the C++ API) to enable mixed precision or quantization, which often opens up more fusion opportunities: operation pairs that won’t fuse in FP32 can often fuse in lower precisions. The workspace size limit (set_memory_pool_limit with the WORKSPACE pool in recent releases, max_workspace_size in older ones) also indirectly affects fusion, since a larger workspace lets TensorRT explore more complex optimization strategies, including more aggressive fusion.
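As a sketch of those levers in TensorRT’s Python API (the tensorrt package; flag and memory-pool names are as in recent TensorRT releases — verify against your installed version):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Enable FP16 kernels: lower precision often unlocks additional fusion patterns
config.set_flag(trt.BuilderFlag.FP16)

# Give the builder scratch space to explore more tactics (1 GiB here)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# ... populate `network` (e.g. via an ONNX parser), then:
# engine_bytes = builder.build_serialized_network(network, config)
```

The fusion itself is not directly configurable; these settings change which fusions become legal and which tactics the builder can afford to try.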

The fusion process isn’t arbitrary. TensorRT applies a set of rules based on the CUDA architecture and the specific operations. For instance, element-wise operations (like addition, multiplication, or activation functions) are prime candidates for fusion with preceding operations (like convolutions or matrix multiplications) if their inputs and outputs align perfectly. This is because they can be easily incorporated into the existing data path of the larger operation. Layer normalization, batch normalization, and element-wise operations are common targets for fusion into preceding convolutional or fully connected layers.
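Batch-norm folding, mentioned above, is a good worked example of why these fusions are possible: in inference mode, batch norm is just a per-channel affine transform, so it can be absorbed into the preceding convolution’s weights and bias. A NumPy sketch using a 1x1 convolution (which reduces to a per-pixel matmul over channels; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
cin, cout = 3, 4
W = rng.standard_normal((cout, cin))                      # 1x1 conv weights
b = rng.standard_normal(cout)                             # conv bias
gamma, beta = rng.standard_normal(cout), rng.standard_normal(cout)
mean, var, eps = rng.standard_normal(cout), rng.random(cout) + 0.1, 1e-5

x = rng.standard_normal((2, cin, 5, 5))

def conv1x1(W, b, x):
    # 1x1 convolution == per-pixel matmul over the channel dimension
    return np.einsum('oi,nihw->nohw', W, x) + b[None, :, None, None]

# Unfused: conv followed by batch norm (inference mode)
y = conv1x1(W, b, x)
std = np.sqrt(var + eps)
bn = (gamma[None, :, None, None] * (y - mean[None, :, None, None])
      / std[None, :, None, None] + beta[None, :, None, None])

# Fused: fold the BN affine transform into the conv weights and bias
W_f = W * (gamma / std)[:, None]
b_f = (b - mean) * gamma / std + beta
fused = conv1x1(W_f, b_f, x)

assert np.allclose(bn, fused)
```

Because the two layers collapse algebraically, the fused version costs exactly one convolution and nothing else.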

A key aspect of TensorRT’s fusion is its ability to create new kernels specifically tailored to the fused sequence. It doesn’t just "glue" existing kernels together; it selects or generates code that performs the fused operations as efficiently as possible, often using specialized CUDA intrinsics and optimized memory access patterns. This is why the list of operations in a framework doesn’t tell you the full story of what TensorRT will execute. Inspecting the built engine (for example, with Polygraphy’s `polygraphy inspect model` on the engine file with layer display enabled) can reveal the fused layers; the original ONNX graph still shows the unfused operations, because fusion happens at engine-build time.

What most people don’t realize is that fusing a bias add into an FP16 convolution can be subtly lossy: intermediate values are rounded to half precision, and the bias values themselves may not be exactly representable in FP16. TensorRT is designed to manage these precision differences by carefully ordering operations and choosing appropriate kernel implementations (for example, accumulating in higher precision) to minimize accuracy degradation. The fusion maintains mathematical equivalence as closely as the target precision allows.
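The FP16 effect described above can be demonstrated directly: carrying the bias add through half precision rounds each intermediate to FP16, so the result drifts slightly from an FP32 reference (a sketch with illustrative values):

```python
import numpy as np

rng = np.random.default_rng(2)
conv_out = rng.standard_normal(10_000).astype(np.float32)
bias = np.float32(0.1)   # 0.1 is not exactly representable in FP16 (or FP32)

# Reference: bias add + ReLU entirely in FP32
ref = np.maximum(conv_out + bias, 0.0)

# Same computation with FP16 storage for inputs and intermediates
half = np.maximum(conv_out.astype(np.float16) + np.float16(bias), np.float16(0.0))

err = np.abs(ref - half.astype(np.float32)).max()
print(f"max abs error: {err:.2e}")   # small but nonzero rounding error
assert 0 < err < 1e-2
```

The error is tiny per element, but it is exactly the kind of difference TensorRT’s precision-aware kernel selection is meant to keep bounded.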

The next step after understanding layer fusion is exploring how TensorRT performs kernel auto-tuning to select the most efficient kernel implementation for a given layer (fused or not) on your target hardware.

Want structured learning?

Take the full TensorRT course →