TensorRT and ONNX Runtime are both powerful tools for accelerating deep learning inference on GPUs, but they approach optimization from fundamentally different angles, leading to distinct performance characteristics.

Let’s see TensorRT in action. Imagine you have a trained PyTorch model for image classification. Before TensorRT can work its magic, you need to convert your PyTorch model to a format it understands, typically ONNX.

import torch
import torchvision.models as models

# Load a pre-trained model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # `pretrained=True` is deprecated
model.eval()

# Create dummy input
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(model,
                  dummy_input,
                  "resnet18.onnx",
                  verbose=False,
                  input_names=['input'],
                  output_names=['output'])

Now, you have resnet18.onnx. This is where TensorRT comes in. You’d use the TensorRT Python API to build an optimized engine from this ONNX file.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("resnet18.onnx", "rb") as model_file:
    if not parser.parse(model_file.read()):
        print("Failed to parse ONNX file")
        for error in range(parser.num_errors):
            print(parser.get_error(error))
        exit(1)

# Configure builder
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 28) # 256 MiB

# Build and serialize the engine
# (build_engine was deprecated in TensorRT 8 and later removed;
#  build_serialized_network is the supported path)
serialized_engine = builder.build_serialized_network(network, config)

# Save the engine
with open("resnet18.trt", "wb") as f:
    f.write(serialized_engine)

This resnet18.trt file is a highly optimized, hardware-specific inference engine. When you run inference with it, you’ll typically see significantly lower latency and higher throughput compared to running the ONNX model directly.

The core problem TensorRT solves is taking a generic, framework-agnostic model representation (like ONNX) and transforming it into a highly specialized, low-level executable tailored for a specific NVIDIA GPU architecture. It performs aggressive graph optimizations, kernel fusion, precision calibration (FP16, INT8), and memory layout optimizations. ONNX Runtime, on the other hand, is a more general-purpose inference engine. It can execute ONNX models directly, leveraging various execution providers (like CUDA, TensorRT, OpenVINO, DirectML) to delegate computation to the most suitable hardware. While ONNX Runtime can use TensorRT as an execution provider, its primary goal is broad compatibility and ease of use across diverse hardware. TensorRT’s strength lies in its deep, hardware-specific optimization process that happens before runtime.

The surprising thing about TensorRT is that it doesn’t just "run" your model; it rebuilds it. The builder.build_engine() step is where the magic happens. TensorRT analyzes the ONNX graph, identifies redundant operations, fuses compatible layers (e.g., convolution, bias addition, and ReLU into a single kernel), and selects the most efficient CUDA kernels for your specific GPU. It can also perform quantization (e.g., converting FP32 weights to INT8) to further reduce model size and increase speed, often with minimal accuracy loss, but this requires a calibration step.

When you load a TensorRT engine, you’re not loading a generic model interpreter. You’re loading a pre-compiled, highly specialized binary designed to execute your exact model on your exact hardware with minimal overhead. This is why it often achieves state-of-the-art performance. The config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 28) line, for instance, tells TensorRT how much temporary memory it can use during the build phase to explore different optimization strategies. A larger workspace can sometimes lead to better optimizations but increases build time.
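The shift notation in that line is just a power-of-two byte count, which is worth spelling out since getting it wrong by a factor of 1024 is easy:

```python
# 1 << n is 2**n, so workspace limits are powers of two in bytes
workspace = 1 << 28
print(workspace)                   # 268435456 bytes
print(workspace // (1024 * 1024))  # 256 MiB
assert workspace == 256 * 1024 * 1024

# a 1 GiB workspace would be 1 << 30
assert (1 << 30) == 1024 * 1024 * 1024
```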

The next hurdle you’ll likely face is managing TensorRT engines across GPU architectures and software versions: an engine is tied to the GPU it was built on and, in general, to the exact TensorRT version that built it, so it must be rebuilt for each deployment target.
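One common mitigation is to encode the build environment into the engine's filename so that a mismatched engine is never loaded by accident. A minimal sketch, using a hypothetical `engine_filename` helper and naming convention:

```python
def engine_filename(model: str, trt_version: str, gpu_name: str) -> str:
    """Hypothetical convention: one engine file per (TensorRT version, GPU) pair."""
    gpu = gpu_name.lower().replace(" ", "-")
    return f"{model}_trt{trt_version}_{gpu}.trt"

print(engine_filename("resnet18", "8.6.1", "NVIDIA A100"))
# → resnet18_trt8.6.1_nvidia-a100.trt
```

At load time you would reconstruct the expected filename from the running environment and rebuild the engine if it does not exist, rather than failing on a deserialization error.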
