The most surprising thing about TensorRT is that it doesn’t fundamentally change your model’s architecture; it optimizes the execution of that architecture for NVIDIA GPUs.

Let’s see this in action. Imagine you have a trained PyTorch model for image classification. You’ve exported it to ONNX format.

import torch
import torchvision.models as models

# Load a pre-trained model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # 'pretrained=True' in older torchvision
model.eval()

# Create dummy input
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(model, dummy_input, "resnet18.onnx",
                  input_names=["input"], output_names=["output"], verbose=False)

Now, you want to run this ONNX model with TensorRT for faster inference. You’ll typically use the NVIDIA NGC container for this.

First, pull the TensorRT NGC container. Let’s say you want a recent release, for example the 23.05 tag (NGC tags follow a year.month scheme):

docker pull nvcr.io/nvidia/tensorrt:23.05-py3

This command fetches a pre-built Docker image from NVIDIA’s container registry. This image contains TensorRT itself, along with necessary CUDA and cuDNN libraries, Python, and other dependencies pre-installed and configured to work optimally together. It’s essentially a ready-to-go environment for GPU-accelerated deep learning inference.

Next, you’ll run this container, mounting your ONNX model and any necessary scripts.

docker run --gpus all -it --rm \
  -v /path/to/your/models:/models \
  -v /path/to/your/scripts:/scripts \
  nvcr.io/nvidia/tensorrt:23.05-py3 bash
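Before converting anything, it’s worth a quick sanity check from inside the container that the GPU is actually visible and which TensorRT build the image ships. A minimal check might look like:

```shell
# Confirm the GPU is visible inside the container
nvidia-smi

# Print the TensorRT version bundled with this image
python3 -c "import tensorrt; print(tensorrt.__version__)"
```

If `nvidia-smi` fails here, the `--gpus all` flag (and the NVIDIA Container Toolkit on the host) is the first thing to check.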

Inside the container, you’d then use TensorRT’s tools to convert your ONNX model to a TensorRT engine. This conversion process is where the magic happens. TensorRT analyzes your ONNX graph and performs several optimizations:

  1. Layer and Tensor Fusion: It combines multiple layers or operations into a single, more efficient kernel. For example, a convolution followed by a bias add and an activation function might be fused into one GPU kernel.
  2. Kernel Auto-Tuning: For specific layers, TensorRT searches its library for the most efficient CUDA kernel implementation based on your GPU architecture and the layer’s parameters.
  3. Precision Calibration: It can convert FP32 (32-bit floating point) models to FP16 or INT8 with minimal accuracy loss, significantly speeding up computation and reducing memory bandwidth.
  4. Layer Elimination: It removes redundant or unnecessary operations.
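To get an intuition for point 3, you can reproduce FP16 rounding behavior with nothing but the Python standard library: the struct module supports the IEEE 754 half-precision format code `"e"`. This is a toy illustration of the precision trade-off, not TensorRT code:

```python
import struct

def fp16_round_trip(x: float) -> float:
    """Round a Python float (double precision) through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

# Exactly representable values survive unchanged; most others pick up
# a small rounding error, which is why TensorRT validates accuracy
for value in [1.0, 0.1, 3.14159, 65504.0]:
    half = fp16_round_trip(value)
    print(f"{value!r} -> fp16 {half!r} (abs error {abs(half - value):.3g})")
```

INT8 narrows the representable range far more aggressively, which is why it requires a calibration step with representative input data rather than a simple cast.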

The conversion command might look something like this, using trtexec, a command-line tool included in the container:

trtexec --onnx=/models/resnet18.onnx --saveEngine=/models/resnet18.engine --fp16

Here, --onnx points to your input ONNX file, --saveEngine specifies the output TensorRT engine file, and --fp16 instructs TensorRT to optimize for FP16 precision. The resulting .engine file is a highly optimized, GPU-specific representation of your model.
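trtexec also doubles as a benchmarking tool: once the engine is built, you can time it directly with the `--loadEngine` flag. The path below assumes the volume mount from the earlier `docker run` command:

```shell
# Load the prebuilt engine and benchmark it; trtexec reports
# average latency and throughput after a warm-up period
trtexec --loadEngine=/models/resnet18.engine
```

This is a convenient way to compare FP32, FP16, and INT8 builds of the same model before writing any application code.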

You can then load and run this .engine file within your inference application, also often running inside the same container for maximum compatibility.

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

# Load the TensorRT engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("/models/resnet18.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()

# Prepare input and output buffers
input_binding_idx = engine.get_binding_index("input")   # Adjust if your ONNX input name differs
output_binding_idx = engine.get_binding_index("output") # Adjust if your ONNX output name differs

input_shape = engine.get_binding_shape(input_binding_idx)
output_shape = engine.get_binding_shape(output_binding_idx)

# Note: even with --fp16, the engine's input and output bindings stay
# FP32 by default, so the host buffers use np.float32
input_h = cuda.pagelocked_empty(tuple(input_shape), dtype=np.float32)
output_h = cuda.pagelocked_empty(tuple(output_shape), dtype=np.float32)

# Fill input_h with your actual input data
# ...

# Allocate device buffers and copy the input to the GPU
input_d = cuda.mem_alloc(input_h.nbytes)
output_d = cuda.mem_alloc(output_h.nbytes)
cuda.memcpy_htod(input_d, input_h)

# Run inference
bindings = [int(input_d), int(output_d)]
stream = cuda.Stream()
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
stream.synchronize()

# Transfer output data back to the host
cuda.memcpy_dtoh(output_h, output_d)
# ... process output_h

The key is that the .engine file is not portable across different GPU architectures or even different TensorRT versions. It’s a compiled artifact specifically for the hardware and software stack it was built on.
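Because of this, it’s common to encode the target GPU and TensorRT version into the engine filename so a mismatched artifact is never loaded by accident. A minimal sketch (the naming scheme and helper name here are my own convention, not anything TensorRT prescribes):

```python
def engine_filename(model_name: str, gpu_arch: str, trt_version: str) -> str:
    """Build an engine filename that records the hardware/software it targets."""
    return f"{model_name}_{gpu_arch}_trt{trt_version}.engine"

# In a real application you would query these values at runtime,
# e.g. via torch.cuda.get_device_capability() and tensorrt.__version__
print(engine_filename("resnet18", "sm86", "8.6.1"))
```

At load time, rebuilding the filename from the current environment and checking it exists gives you a cheap guard against deploying an engine built for different hardware.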

One subtle but crucial aspect of TensorRT optimization is how it handles dynamic shapes. While trtexec often uses static shapes for initial engine generation, applications can leverage TensorRT’s ability to handle varying input dimensions at runtime. This is achieved by creating an engine with a range of allowed input shapes and then configuring the execution context with the specific shape for each inference call. This allows a single engine to serve requests with different batch sizes or image resolutions without needing to re-optimize or rebuild.
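With trtexec, that shape range is declared at build time using the `--minShapes`, `--optShapes`, and `--maxShapes` flags. The tensor name `input` below assumes you named your ONNX input accordingly at export time:

```shell
# Build an engine that accepts batch sizes 1 through 16,
# with kernels tuned for the "optimal" batch size of 8
trtexec --onnx=/models/resnet18.onnx \
        --saveEngine=/models/resnet18_dynamic.engine \
        --minShapes=input:1x3x224x224 \
        --optShapes=input:8x3x224x224 \
        --maxShapes=input:16x3x224x224 \
        --fp16
```

Note that this only works if the ONNX export marked the batch dimension as dynamic (the `dynamic_axes` argument to `torch.onnx.export`); at runtime, the application then sets the concrete input shape on the execution context before each inference call.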

When you eventually move to a new GPU architecture or a significantly updated TensorRT version, you’ll need to rebuild your .engine files.

Want structured learning?

Take the full TensorRT course →