TensorRT doesn’t actually install like a typical library; it’s more of a sophisticated compilation toolkit that leverages existing CUDA and cuDNN installations.

Let’s see TensorRT in action with a simple Python example. Imagine we have a pre-trained PyTorch model for image classification. We want to optimize it for faster inference on an NVIDIA GPU using TensorRT.

First, we need to convert our PyTorch model to the ONNX format.

import torch
import torchvision.models as models

# Load a pre-trained model (the weights argument replaces the deprecated pretrained=True)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.eval()

# Create a dummy input tensor
dummy_input = torch.randn(1, 3, 224, 224, device='cuda')

# Export the model to ONNX
torch.onnx.export(model,
                  dummy_input,
                  "resnet18.onnx",
                  export_params=True,
                  opset_version=11,
                  do_constant_folding=True,
                  input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={'input': {0: 'batch_size'},
                                'output': {0: 'batch_size'}})
print("Model exported to resnet18.onnx")

Now, we use the TensorRT trtexec command-line tool to build an optimized TensorRT engine from this ONNX file. trtexec is a powerful utility that not only converts ONNX to a TensorRT engine (.plan file) but also benchmarks its performance.

trtexec --onnx=resnet18.onnx --saveEngine=resnet18.plan --fp16 --shapes=input:1x3x224x224

Here’s what’s happening:

  • --onnx=resnet18.onnx: Specifies the input ONNX model.
  • --saveEngine=resnet18.plan: Tells trtexec to save the optimized TensorRT engine to this file.
  • --fp16: Allows TensorRT to use FP16 (half-precision) kernels wherever they are faster, which significantly speeds up computation and reduces memory usage on GPUs with FP16 support.
  • --shapes=input:1x3x224x224: Pins the dynamic batch axis we exported to a concrete shape for this build. Because the ONNX model uses an explicit (dynamic) batch dimension, trtexec needs a shape to optimize for; engines built for a specific shape often outperform fully dynamic ones, at the cost of flexibility.

Once the .plan file is generated, we can load it into a TensorRT runtime for inference.

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # Initializes a CUDA context on import
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class HostDeviceMem:
    """Pairs a page-locked host buffer with its device allocation."""
    def __init__(self, host, device, shape):
        self.host = host
        self.device = device
        self.shape = shape

def allocate_buffers(engine, batch_size=1):
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        shape = tuple(engine.get_binding_shape(binding))
        # Replace the dynamic batch dimension (-1) with the actual batch size
        shape = tuple(batch_size if dim == -1 else dim for dim in shape)
        size = trt.volume(shape)
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate page-locked host memory and a matching device buffer
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        mem = HostDeviceMem(host_mem, device_mem, shape)
        if engine.binding_is_input(binding):
            inputs.append(mem)
        else:
            outputs.append(mem)
    return inputs, outputs, bindings, stream

# Load the TensorRT engine
with open("resnet18.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Create an execution context
context = engine.create_execution_context()

# Prepare input and output buffers
inputs, outputs, bindings, stream = allocate_buffers(engine)

# Because the engine has a dynamic batch axis, set the input shape first
context.set_binding_shape(0, inputs[0].shape)

# Fill the input with dummy data for demonstration
# (in practice this would be a preprocessed image)
inputs[0].host[:] = np.random.rand(*inputs[0].shape).astype(np.float32).ravel()

# Transfer input data to the GPU
for inp in inputs:
    cuda.memcpy_htod_async(inp.device, inp.host, stream)

# Run inference
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)

# Transfer output data back to the host
for out in outputs:
    cuda.memcpy_dtoh_async(out.host, out.device, stream)

# Wait for all queued GPU work to finish
stream.synchronize()

# outputs[0].host now holds the flattened class logits
print("Inference complete. Output shape:", outputs[0].shape)

The core problem TensorRT solves is bridging the gap between a trained neural network model and efficient, low-latency inference on NVIDIA GPUs. It achieves this by performing a series of optimizations that a standard deep learning framework might not do, or might not do as aggressively. These include layer fusion, kernel auto-tuning, and precision calibration.
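TensorRT’s fused kernels are hand-tuned CUDA, but the arithmetic behind one common fusion, folding batch normalization into the preceding convolution, can be sketched in plain NumPy. This is an illustrative toy (a 1x1 convolution treated as a matrix multiply), not TensorRT’s implementation:

```python
import numpy as np

# Toy 1x1 "convolution" on flattened features: y = x @ W + b
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)   # batch of 4, 8 channels
W = rng.standard_normal((8, 16)).astype(np.float32)  # conv weights
b = rng.standard_normal(16).astype(np.float32)       # conv bias

# BatchNorm parameters (scale, shift, running statistics)
gamma = rng.standard_normal(16).astype(np.float32)
beta = rng.standard_normal(16).astype(np.float32)
mean = rng.standard_normal(16).astype(np.float32)
var = rng.random(16).astype(np.float32) + 0.5
eps = 1e-5

# Unfused: two separate passes over the data (conv, then batchnorm)
conv_out = x @ W + b
unfused = gamma * (conv_out - mean) / np.sqrt(var + eps) + beta

# Fused: fold the BatchNorm scale/shift into the conv weights once,
# so inference is a single matmul + add (one pass, one memory round-trip)
scale = gamma / np.sqrt(var + eps)
W_fused = W * scale                   # broadcast over output channels
b_fused = (b - mean) * scale + beta
fused = x @ W_fused + b_fused

assert np.allclose(unfused, fused, atol=1e-5)
print("Fused and unfused outputs match")
```

The point is that the fused form computes the identical function with one kernel launch and one trip through memory instead of two, which is where the latency win comes from.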

TensorRT’s "mental model" revolves around an engine. You don’t just link against a TensorRT library and call functions; you create an engine that is specific to your model, your target GPU, and your chosen precision (FP32, FP16, INT8). This engine is a highly optimized, serialized artifact that can be loaded and executed very quickly. The process of creating this engine is called building.

The primary levers you control are:

  1. Input Model Format: ONNX is the most common intermediate format; older TensorRT releases also supported UFF (for TensorFlow 1.x) and Caffe parsers, both since deprecated in favor of ONNX. Custom operators can be supplied via plugins.
  2. Target GPU Architecture: TensorRT selects optimized kernels based on the compute capability of your GPU.
  3. Precision: FP32, FP16 (mixed precision), and INT8 (quantized inference). FP16 is a great balance of performance and accuracy. INT8 requires a calibration step to determine quantization ranges, which can be complex but offers the highest throughput.
  4. Batch Size: Optimizing for a fixed batch size usually yields better performance than dynamic batching, though dynamic batching offers more flexibility.
  5. TensorRT Version: Newer versions often bring performance improvements and support for new operators.
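The precision lever is easy to build intuition for with NumPy alone: FP16 keeps roughly three significant decimal digits, which is usually plenty for inference activations. This only illustrates the numeric format; TensorRT’s actual FP16 kernels run on tensor-core hardware:

```python
import numpy as np

# FP16 keeps ~3 significant decimal digits: 1.0001 is not representable
print(np.float16(1.0001))          # rounds to 1.0
print(np.finfo(np.float16).eps)    # ~0.000977, the relative step size near 1.0

# Relative error from round-tripping a tensor of activations through FP16
acts = np.random.default_rng(1).uniform(0.5, 2.0, 10_000).astype(np.float32)
roundtrip = acts.astype(np.float16).astype(np.float32)
rel_err = np.abs(roundtrip - acts) / acts
print("max relative error:", rel_err.max())  # bounded by eps/2, about 5e-4
```

A relative error around 5e-4 is far below the noise floor of most classifiers’ softmax outputs, which is why FP16 usually costs little or no accuracy.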

The most surprising thing about TensorRT is how it can sometimes outperform the original framework’s inference by essentially "hardcoding" many operations and memory movements specific to your model and hardware. It’s less of a dynamic execution engine and more of a static, highly specialized compiler for inference.

The next concept you’ll likely encounter is optimizing for INT8 precision, which involves a calibration process.
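TensorRT’s INT8 calibrators feed sample batches through the network to choose per-tensor quantization ranges. The core idea behind the simplest (min-max style) calibration, picking a scale from observed data and paying a bounded quantization error for it, can be sketched with NumPy; this is a conceptual toy, not TensorRT’s calibrator API:

```python
import numpy as np

rng = np.random.default_rng(42)
# "Calibration" data: activations the tensor is expected to see at runtime
calib = rng.standard_normal(100_000).astype(np.float32)

# Max calibration: map the largest observed magnitude to the INT8 limit 127
scale = np.abs(calib).max() / 127.0

def quantize(x, scale):
    # Symmetric INT8 quantization: round to the nearest integer step
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Round-trip the calibration data through INT8 and measure the damage
x_hat = dequantize(quantize(calib, scale), scale)
err = np.abs(x_hat - calib).max()
print(f"scale = {scale:.4f}, max quantization error = {err:.4f}")  # at most scale / 2
```

Real calibrators (e.g. TensorRT’s entropy calibrator) are smarter about outliers: clipping the range below the observed maximum shrinks the step size for the bulk of the values at the cost of saturating a few extremes, and the calibration process searches for the best trade-off.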
