torch2trt is a PyTorch extension that converts PyTorch models into TensorRT engines, allowing for optimized inference on NVIDIA GPUs.

Let’s see it in action. Imagine you have a trained PyTorch model, say a simple convolutional neural network for image classification. You’ve just finished training and want to deploy it for faster inference.

import torch
import torch.nn as nn
import tensorrt as trt
from torch2trt import torch2trt

# Define a simple PyTorch model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc = nn.Linear(16 * 16 * 16, 10) # Assuming input image size 3x32x32

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Instantiate the model; torch2trt requires the model and the example
# inputs to live on the GPU
model = SimpleCNN().eval().cuda()

# Create dummy input data:
# batch size of 1, 3 color channels, 32x32 image dimensions
dummy_input = torch.randn(1, 3, 32, 32).cuda()

# Convert the PyTorch model to a TensorRT engine
# `input_names`/`output_names` label the engine's input and output bindings
# `fp16_mode=True` enables FP16 kernels for further speedup on supported GPUs
trt_model = torch2trt(model, [dummy_input], fp16_mode=True,
                      input_names=["input_tensor"], output_names=["output_tensor"])

# `trt_model` is a TRTModule: an nn.Module that wraps the TensorRT engine
# and can be called just like the original model
# Save its state dict (which contains the serialized engine) for later use:
# torch.save(trt_model.state_dict(), "simple_cnn_trt.pth")

print("Model successfully converted to a TensorRT engine.")

# Inference: call the TRTModule exactly as you would the original model
output = trt_model(dummy_input)

# Sanity-check the conversion against the original PyTorch model
output_ref = model(dummy_input)
print(torch.max(torch.abs(output - output_ref)))

# To reload the saved engine later, without repeating the conversion:
# from torch2trt import TRTModule
# trt_model = TRTModule()
# trt_model.load_state_dict(torch.load("simple_cnn_trt.pth"))

The core problem torch2trt solves is bridging the gap between the flexibility of PyTorch’s dynamic graph and the performance demands of deploying models on hardware accelerators. PyTorch, while amazing for research and development, often incurs overhead due to its eager execution and Python interpreter. TensorRT, on the other hand, is a highly optimized inference optimizer and runtime for NVIDIA GPUs. It performs graph optimizations like layer fusion, kernel auto-tuning, and precision calibration, resulting in significant speedups and reduced latency. torch2trt automates much of this conversion process.

Internally, torch2trt works by tracing the execution of your PyTorch model with a given input. This tracing builds a computational graph that represents the model’s operations. torch2trt then translates this PyTorch graph into a TensorRT graph. This translation involves mapping PyTorch operations (like nn.Conv2d, nn.ReLU, nn.Linear) to their corresponding TensorRT counterparts. During this process, it can also perform optimizations like:
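At the heart of this translation is a registry of converter functions (torch2trt registers them with its @tensorrt_converter decorator) that map each intercepted PyTorch call to TensorRT layer-builder calls. The sketch below is a toy, pure-Python version of that registry pattern; the op names and returned strings are placeholders for illustration, not real TensorRT API calls.

```python
# Toy sketch of the converter-registry pattern torch2trt uses internally.
# The real library maps intercepted torch calls to TensorRT network-builder
# calls; here we just map op names to placeholder strings.

CONVERTERS = {}

def converter(op_name):
    """Register a conversion function for a given PyTorch op name."""
    def decorator(fn):
        CONVERTERS[op_name] = fn
        return fn
    return decorator

@converter("nn.Conv2d")
def convert_conv2d(layer_params):
    # In torch2trt this would build a TensorRT convolution layer
    return f"trt.IConvolutionLayer({layer_params})"

@converter("nn.ReLU")
def convert_relu(layer_params):
    # In torch2trt this would build a TensorRT activation layer
    return "trt.IActivationLayer(RELU)"

def translate(traced_ops):
    """Map each traced PyTorch op to its TensorRT counterpart."""
    return [CONVERTERS[op](params) for op, params in traced_ops]

trace = [("nn.Conv2d", "3->16, k=3"), ("nn.ReLU", "")]
print(translate(trace))
```

The real converters receive the traced module, its arguments, and the partially built TensorRT network, but the lookup-and-dispatch shape is the same.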

  • Layer Fusion: Combining multiple layers (e.g., convolution, bias addition, and ReLU activation) into a single, more efficient kernel.
  • Kernel Auto-tuning: Selecting the best-performing CUDA kernels for the target GPU architecture.
  • Precision Calibration: Quantizing floating-point weights and activations to lower precision (like FP16 or INT8) for faster computation and reduced memory footprint, while minimizing accuracy loss.
  • Memory Optimization: Efficiently managing memory allocations for inputs, outputs, and intermediate tensors.
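
To make the fusion idea concrete, here is a scalar sketch of one well-known algebraic fold: absorbing a batch-norm into the preceding convolution's weight and bias, a close relative of the conv + bias + activation fusions TensorRT performs at the kernel level. The numbers are invented for illustration; a real fold operates per output channel.

```python
import math

# Scalar "1x1 convolution" keeps the fusion algebra easy to check by hand.

def conv(x, w, b):
    return w * x + b

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold(w, b, gamma, beta, mean, var, eps=1e-5):
    """Return fused (w', b') so conv(x, w', b') == batchnorm(conv(x, w, b))."""
    s = gamma / math.sqrt(var + eps)
    return w * s, (b - mean) * s + beta

w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.05, 0.9
w_f, b_f = fold(w, b, gamma, beta, mean, var)

x = 2.0
fused = conv(x, w_f, b_f)
unfused = batchnorm(conv(x, w, b), gamma, beta, mean, var)
assert abs(fused - unfused) < 1e-9  # two layers collapsed into one
```

After the fold, the batch-norm disappears entirely: one multiply-add replaces two layers, which is exactly the kind of win layer fusion delivers at scale.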

The levers you control are primarily through the torch2trt function arguments:

  • model: Your PyTorch nn.Module instance.
  • inputs: A list of dummy input tensors that define the shape and data type of your model’s inputs. This is crucial for tracing the graph.
  • fp16_mode: A boolean to enable FP16 (half-precision) inference. This can nearly double throughput on compatible GPUs.
  • int8_mode: A boolean to enable INT8 (8-bit integer) inference. This offers the highest performance gains but requires a calibration dataset to determine optimal quantization parameters.
  • max_workspace_size: The maximum GPU memory (in bytes) that TensorRT may use for intermediate computations during engine building. A larger workspace can sometimes enable more aggressive optimizations.
  • input_names and output_names: String identifiers for your model’s inputs and outputs, which are useful when working with the TensorRT engine directly.
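
To build intuition for what int8_mode's calibration step has to compute, here is a deliberately simplified, pure-Python sketch of max-abs calibration: pick a per-tensor scale from sample activations, then round-trip values through the signed 8-bit range. TensorRT's actual calibrators (entropy calibration, for example) are more sophisticated, and the activation values below are invented.

```python
# Max-abs INT8 calibration sketch: choose a scale so that observed
# activations map into [-127, 127], then check the round-trip error.

def calibrate_scale(samples):
    """Map the largest observed magnitude onto the int8 extreme."""
    return max(abs(v) for v in samples) / 127.0

def quantize(x, scale):
    q = round(x / scale)
    return max(-127, min(127, q))  # clamp into the int8 range

def dequantize(q, scale):
    return q * scale

acts = [0.03, -1.9, 0.77, 2.54, -0.41]  # pretend calibration activations
scale = calibrate_scale(acts)
for v in acts:
    err = abs(dequantize(quantize(v, scale), scale) - v)
    # worst case is half a quantization step (plus float rounding slop)
    assert err <= scale / 2 + 1e-9
```

This is why INT8 needs a representative calibration dataset: a poorly chosen scale either clips large activations or wastes the 8-bit range on values that never occur.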

One aspect that often surprises people is how sensitive the tracing process is to the exact input shape and type. If your model has branches that are conditionally executed based on input dimensions, or if it uses dynamic shapes in ways that aren’t fully captured by a single trace, torch2trt might not be able to convert the entire graph. In such cases, you might need to manually specify input shapes or use TensorRT’s explicit batch dimension feature, or even resort to a hybrid approach where only parts of the model are converted. The torch2trt function attempts to infer shapes and types, but providing a representative dummy_input is key. If your model dynamically changes output shapes based on input, torch2trt might struggle to determine the output tensor dimensions for the TensorRT engine without additional hints or a fixed output shape assumption.
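
The control-flow pitfall is easy to reproduce without a GPU or even torch. The toy tracer below records the operations a function executes for one example input and then replays them blindly; a data-dependent branch traced with a positive input yields a trace that is simply wrong for negative inputs, which is essentially what happens when a single torch2trt trace meets dynamic control flow.

```python
# Toy illustration of why one trace cannot capture data-dependent branches:
# the trace records only the ops executed for the example input.

def model(x, record=None):
    if x > 0:                       # data-dependent branch
        ops = [("mul", 2.0)]
    else:
        ops = [("add", 10.0)]
    if record is not None:
        record.extend(ops)          # "trace" the ops actually executed
    y = x
    for op, arg in ops:
        y = y * arg if op == "mul" else y + arg
    return y

def replay(trace, x):
    """Re-execute a recorded trace on a new input, ignoring all branches."""
    y = x
    for op, arg in trace:
        y = y * arg if op == "mul" else y + arg
    return y

trace = []
model(3.0, record=trace)                    # traced with a positive input
assert replay(trace, 3.0) == model(3.0)     # matches on the traced path
assert replay(trace, -1.0) != model(-1.0)   # wrong on the untaken branch
```

A traced engine is faithful only to the code path the dummy input exercised, which is why the representative input matters so much.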

Because the returned trt_model is a TRTModule (a regular nn.Module wrapping the engine), you can call it directly for inference. For deployment, serialize it with torch.save(trt_model.state_dict(), ...) and restore it into a fresh TRTModule with load_state_dict; this skips the conversion step at startup and makes for more robust deployment scenarios. Alternatively, the serialized engine inside can be loaded with the TensorRT Python API directly.
