The most surprising thing about TensorRT on Jetson is that it’s not just about making your neural nets faster; it’s about making them run at all on hardware that’s a tiny fraction of the power of a desktop GPU.

Let’s see it in action. Imagine you’ve trained a YOLOv5 model on your workstation and now want to run it on a Jetson Nano for real-time object detection.

First, you’ll need to convert your PyTorch .pt model to ONNX, a more universal intermediate representation.

python export.py --weights yolov5s.pt --include onnx

This will spit out yolov5s.onnx. But this ONNX file is still a generic, framework-neutral graph – nothing about it is tuned for the Jetson’s limited resources. This is where TensorRT’s core magic happens: building an engine.

You’ll use the trtexec command-line tool, which is part of the TensorRT installation on your Jetson. It takes your ONNX model and a set of optimization flags to produce a .plan file – the TensorRT engine.

trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s.plan --fp16 --workspace=1024

Let’s break down those flags:

  • --onnx=yolov5s.onnx: This is your input model.
  • --saveEngine=yolov5s.plan: This is the output engine file. TensorRT will serialize all its optimizations into this binary file.
  • --fp16: This is crucial for edge devices. It tells TensorRT to use 16-bit floating-point precision instead of the default 32-bit. This halves the weight memory footprint and dramatically speeds up computation on the Jetson family’s GPUs – Maxwell on the Nano, Volta on Xavier, Ampere on Orin – all of which have fast FP16 paths. You’ll see a significant jump in throughput and a reduction in memory usage.
  • --workspace=1024: This caps the scratch GPU memory (in MiB) that TensorRT’s builder may use while timing candidate kernels and for per-layer temporary storage. Finding the right balance here is key: too little, and the build may fail or skip the fastest tactics; too much, and you leave less memory for the rest of your application. (Recent TensorRT releases deprecate this flag in favor of --memPoolSize=workspace:1024.)
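To get a feel for what --fp16 buys you, here’s a small NumPy sketch – illustrative only, since TensorRT performs the cast internally – showing that half precision halves a weight tensor’s memory footprint at the cost of a small rounding error:

```python
import numpy as np

# Simulate a layer's weights in the default FP32 precision
rng = np.random.default_rng(0)
w32 = rng.standard_normal(100_000).astype(np.float32)

# Cast to FP16, as TensorRT does internally when --fp16 is set
w16 = w32.astype(np.float16)

print(w32.nbytes, w16.nbytes)  # the FP16 copy takes exactly half the bytes
max_err = float(np.abs(w32 - w16.astype(np.float32)).max())
print(f"max rounding error: {max_err:.5f}")  # small for values near unit scale
```

In a real network the accuracy impact is usually negligible for inference, which is why FP16 is the default first step before attempting INT8.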

Once trtexec finishes, you’ll have yolov5s.plan. You can then load this engine in your C++ or Python application using the TensorRT runtime API.

import pycuda.driver as cuda
import tensorrt as trt
import numpy as np

# Initialize CUDA and TensorRT
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
cuda.init()
device = cuda.Device(0)
ctx = device.make_context()  # remember to ctx.pop() when you're done
runtime = trt.Runtime(TRT_LOGGER)

# Load the engine; deserialize_cuda_engine returns None on failure
with open("yolov5s.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
assert engine is not None, "engine deserialization failed"
context = engine.create_execution_context()

# ... (rest of your inference code: allocate buffers, copy input, execute, copy output)
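The elided buffer-management step might look like the following sketch. It assumes the TensorRT 8.x binding-index API (newer releases use tensor-name APIs instead) and a variable `preprocessed` holding your NCHW float input – both are assumptions, so adapt them to your engine and TensorRT version:

```python
# Hedged sketch: continues from the engine/context created above.
import pycuda.driver as cuda
import tensorrt as trt
import numpy as np

host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    shape = engine.get_binding_shape(i)
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = cuda.pagelocked_empty(trt.volume(shape), dtype)  # pinned host memory
    dev = cuda.mem_alloc(host.nbytes)                       # device buffer
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

# Copy preprocessed input in, execute, copy output back
host_bufs[0][:] = preprocessed.ravel()  # 'preprocessed': your NCHW float array
cuda.memcpy_htod(dev_bufs[0], host_bufs[0])
context.execute_v2(bindings)
cuda.memcpy_dtoh(host_bufs[-1], dev_bufs[-1])
detections = host_bufs[-1]  # raw YOLO output; still needs decoding and NMS

ctx.pop()  # release the CUDA context created earlier
```

Pinned (page-locked) host memory matters here: it lets the DMA engine copy to and from the GPU without an extra staging step, which is noticeable on the Nano’s shared-memory architecture.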

The entire workflow is about transforming a general-purpose neural network graph into a highly optimized, hardware-specific execution plan. TensorRT performs several optimization passes:

  1. Layer and Tensor Fusion: It merges layers that can be computed together into single kernels, reducing kernel launch overhead and memory bandwidth usage. For example, a convolution followed by a bias addition and an activation function like ReLU might be fused into one operation.
  2. Kernel Auto-Tuning: It selects the most efficient CUDA kernels for the target hardware and the specific layer parameters.
  3. Precision Calibration: For INT8 inference (though we used FP16 here), it calibrates the model to determine the optimal quantization ranges. This is critical for maximizing speed while minimizing accuracy loss.
  4. Layer Optimization: It replaces certain layers with more efficient implementations or kernel combinations that are faster on the target architecture.
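The precision-calibration idea in step 3 can be illustrated with a toy symmetric quantizer. This is not TensorRT’s actual calibrator – the real one builds histograms of activations on calibration data and picks ranges via, e.g., entropy minimization – but the principle is the same:

```python
import numpy as np

def quantize_int8(x, amax):
    """Symmetric INT8 quantization: map [-amax, amax] onto [-127, 127]."""
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Pretend these are activations observed while running calibration data
rng = np.random.default_rng(0)
acts = rng.standard_normal(10_000).astype(np.float32)

# "Calibration" here just picks the max absolute value as the dynamic range
amax = float(np.abs(acts).max())
q, scale = quantize_int8(acts, amax)
deq = q.astype(np.float32) * scale  # dequantize to measure the error

mean_err = float(np.abs(acts - deq).mean())
print(f"amax={amax:.3f}, scale={scale:.5f}, mean abs error={mean_err:.5f}")
```

Choosing amax is the whole game: too large and the 127 integer levels are spread thin (coarse quantization everywhere); too small and outliers get clipped. TensorRT’s calibrators automate that trade-off per tensor.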

The key to understanding TensorRT is realizing that it doesn’t just run your existing model; it rebuilds it for the specific Jetson GPU. This rebuilding process involves making low-level CUDA kernel choices and memory layout optimizations that are invisible to frameworks like PyTorch or TensorFlow when running on a desktop. The trtexec tool is your window into this optimization process, allowing you to experiment with different precision modes (FP32, FP16, INT8) and workspace sizes to find the best trade-off between speed, accuracy, and memory consumption for your specific application.

When you run trtexec, you’ll notice it prints a block of performance metrics. Pay close attention to the latency statistics (min/mean/max) and the "Throughput" figure – these are your primary indicators of optimization success. If latency is high or throughput low, it’s often a sign that the FP16 conversion didn’t go as smoothly as hoped, or that fusion and kernel selection didn’t yield the expected gains. You might then need to investigate specific layers or consider INT8 quantization if accuracy permits.
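For single-stream inference, the two headline numbers are tied together: throughput in inferences per second is roughly batch size divided by mean latency. A quick sanity check (the 12.5 ms figure is hypothetical, not a measured Nano result):

```python
batch_size = 1
mean_latency_ms = 12.5  # hypothetical value read off a trtexec run
throughput_qps = batch_size / (mean_latency_ms / 1000.0)
print(throughput_qps)  # 80.0 inferences per second
```

If trtexec reports throughput well above this back-of-the-envelope number, it is overlapping multiple inferences via CUDA streams – good news for pipelined workloads, but not what a single synchronous request will see.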

After optimizing your model with TensorRT and getting it running, the next hurdle you’ll likely face is managing the inference pipeline efficiently, especially when dealing with multiple streams or complex pre/post-processing steps that can become the new bottleneck.

Want structured learning?

Take the full TensorRT course →