trtexec is your go-to command-line utility for squeezing every last drop of performance out of TensorRT. It’s not just about running a model; it’s about understanding its latency, throughput, and memory footprint under various conditions, letting you tune it for your specific hardware and application.

Let’s see trtexec in action. Imagine you have a PyTorch-exported ONNX model for image classification, resnet50.onnx. You want to benchmark it on your NVIDIA T4 GPU.

trtexec --onnx=resnet50.onnx --saveEngine=resnet50.trt --fp16 --iterations=1000 --duration=10 --percentile=99

This command does a lot. It takes resnet50.onnx, builds a TensorRT engine, saves it to resnet50.trt, and, crucially, uses FP16 precision (--fp16) for potentially faster inference. It runs at least 1000 measurement iterations (--iterations=1000) and for at least 10 seconds (--duration=10), then reports statistics including the 99th percentile of latency (--percentile=99).
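To make the percentile statistic concrete, here is a small self-contained sketch using the nearest-rank method (trtexec's exact computation may differ):

```python
# Illustration of the p99 statistic trtexec reports: the 99th-percentile
# latency is the value that 99% of samples fall at or below.
# Nearest-rank method; trtexec's exact interpolation may differ.
def percentile(samples, pct):
    ordered = sorted(samples)
    k = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[k]

# 989 fast runs and 11 slow outliers: the average looks fine,
# but p99 exposes the tail.
latencies_ms = [0.5] * 989 + [1.1] * 11
print(percentile(latencies_ms, 99))  # → 1.1
```

This is why p99 matters: a handful of slow outliers barely moves the average but dominates the tail that your latency-sensitive users actually experience.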

The output will look something like this:

...
[03/15/2024-10:30:00] INFO: [TRT] TensorRT version: 8.6.1
[03/15/2024-10:30:00] INFO: [TRT] Building engine from onnx file resnet50.onnx
[03/15/2024-10:30:05] INFO: [TRT] Detected 1 input and 1 output optimization profiles.
[03/15/2024-10:30:05] INFO: [TRT] Engine built in 5.123 seconds.
...
[03/15/2024-10:30:15] INFO: [TRT] === Performance Report ===
[03/15/2024-10:30:15] INFO: [TRT] Throughput: 1500.50 qps
[03/15/2024-10:30:15] INFO: [TRT] Latency (ms): avg = 0.666, min = 0.500, max = 1.200, median = 0.600, percentile(99%) = 1.100
[03/15/2024-10:30:15] INFO: [TRT] Enqueue Time: 0.500 ms
[03/15/2024-10:30:15] INFO: [TRT] GPU Compute Time: 0.666 ms
[03/15/2024-10:30:15] INFO: [TRT] Memory: 1200 MiB
...

This tells you your engine processed 1500 images per second with an average latency of 0.666 milliseconds. The 99th percentile latency of 1.100 ms means that 99% of your inferences completed within that time, which is critical for real-time applications where occasional slow inferences can be unacceptable. It also reports the GPU memory consumed by the engine.
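As a quick back-of-the-envelope check on how these numbers relate (this is arithmetic, not trtexec output): with one inference in flight at a time, throughput is roughly the reciprocal of average latency.

```python
# Back-of-the-envelope check: throughput ≈ 1 / average latency when
# inferences run one at a time (overlapping streams can push it higher).
avg_latency_s = 0.666e-3              # 0.666 ms, from the report above
throughput = 1.0 / avg_latency_s
print(round(throughput, 1))           # → 1501.5 inferences/second
```

When reported throughput is much higher than 1/latency, it usually means batching or multiple streams are overlapping work on the GPU.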

The core problem trtexec solves is the gap between a model defined in a framework (like PyTorch or TensorFlow) and its optimized execution on NVIDIA hardware. Frameworks are flexible but often not specialized for high-performance inference. TensorRT, on the other hand, takes a trained model and applies numerous optimizations:

  • Layer and Tensor Fusion: Combines multiple operations into a single kernel, reducing kernel launch overhead and memory access.
  • Kernel Auto-Tuning: Selects the most efficient CUDA kernels for your specific GPU architecture and input shapes.
  • Precision Calibration: Allows inference in lower precision (FP16, INT8) with minimal accuracy loss, drastically improving speed and reducing memory.
  • Dynamic Tensor Memory: Optimizes memory allocation by only reserving memory for tensors that are active during execution.

trtexec exposes these capabilities. You control the optimization process through its flags. Key levers include:

  • --onnx, --uff, --deploy/--model: Specifies the input model format (ONNX, UFF, or Caffe). The UFF and Caffe paths are deprecated in recent TensorRT releases, so prefer ONNX.
  • --saveEngine: Saves the optimized TensorRT engine.
  • --loadEngine: Loads a pre-built engine for faster subsequent runs.
  • --fp16, --int8: Enables lower-precision inference. For INT8, you’ll also need calibration data, typically supplied as a calibration cache file (--calib=<file>).
  • --batch=<N>: Sets a static batch size for implicit-batch engines (for explicit-batch models such as ONNX exports, use the shape flags below instead).
  • --minShapes, --optShapes, --maxShapes: Crucial for dynamic batching and variable input sizes. For example, --optShapes='input_tensor:1x3x224x224' tells TensorRT to optimize for a batch size of 1, while --minShapes and --maxShapes define the range for dynamic batching.
  • --threads: Enables multithreading, so engines can be driven from independent host threads.
  • --streams=<N>: Runs inference with N CUDA streams for concurrent kernel execution.
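One way to keep these flags manageable is to assemble the invocation in a script. A sketch; the file names, the tensor name input_tensor, and the particular flag combination are placeholders, not a prescription:

```python
import subprocess

# Hypothetical invocation combining the levers above; file names and the
# tensor name "input_tensor" are placeholders for your own model.
cmd = [
    "trtexec",
    "--onnx=resnet50.onnx",
    "--saveEngine=resnet50.trt",
    "--fp16",
    "--minShapes=input_tensor:1x3x224x224",
    "--optShapes=input_tensor:8x3x224x224",
    "--maxShapes=input_tensor:32x3x224x224",
    "--streams=2",
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # run on a machine with TensorRT installed
```

Scripting the invocation makes it easy to sweep over batch sizes or precisions and log the results of each run.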

When you specify --int8, TensorRT needs a calibration step (FP16 requires no calibration): it analyzes the activation ranges of your model to determine per-tensor quantization scales, which is essential for maintaining accuracy. If you omit calibration data for INT8, trtexec falls back to placeholder dynamic ranges; that is fine for measuring performance, but the resulting engine will produce inaccurate outputs in a real application.
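The core idea can be sketched as symmetric INT8 quantization: derive a scale from the calibrated activation range, then map values onto [-127, 127]. This toy version omits everything that makes TensorRT's actual calibrators (e.g. entropy calibration) work well:

```python
# Toy symmetric INT8 quantization: calibrated_max plays the role of the
# per-tensor dynamic range TensorRT derives during calibration.
def quantize_int8(activations, calibrated_max):
    scale = calibrated_max / 127.0        # map [-max, max] onto [-127, 127]
    return [max(-127, min(127, round(a / scale))) for a in activations]

acts = [0.0, 0.5, -1.0, 2.0, 3.5]         # 3.5 lies outside the calibrated range
print(quantize_int8(acts, calibrated_max=2.0))  # → [0, 32, -64, 127, 127]
```

Note how the out-of-range activation saturates at 127: choosing the range is the whole game, since too small a range clips outliers and too large a range wastes the already-scarce 8-bit resolution.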

The trtexec tool isn’t just for profiling; it’s also about building the final optimized engine. The --saveEngine flag produces a .trt file. This file contains the optimized graph, kernel selections, and memory layouts specific to your target GPU and the precision you chose. Loading this engine with trtexec --loadEngine=resnet50.trt or programmatically via the TensorRT API will bypass the costly engine building phase, allowing for near-instantaneous inference setup.

A common pitfall is not defining dynamic shapes correctly. If your model needs to handle variable input resolutions (e.g., different image sizes), you must use --minShapes, --optShapes, and --maxShapes. Failing to do so will result in an engine that only supports a single, static input size and errors out when presented with different dimensions. For instance, --optShapes='image:1x3x224x224' --minShapes='image:1x3x128x128' --maxShapes='image:1x3x512x512' will create an engine that is tuned for 224x224 inputs but accepts any spatial dimensions within the specified bounds. Note that the batch dimension here is pinned at 1, since minShapes and maxShapes both use batch 1; widen that dimension as well if you need dynamic batching.
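Conceptually, TensorRT accepts a runtime shape only if every dimension falls within the profile's bounds. A toy version of that check, using the hypothetical bounds from the example above:

```python
# Toy version of the bounds check TensorRT applies when you set an input
# shape on an execution context: each dim must lie within [min, max].
MIN_SHAPE = (1, 3, 128, 128)
MAX_SHAPE = (1, 3, 512, 512)

def shape_in_profile(shape, lo=MIN_SHAPE, hi=MAX_SHAPE):
    return len(shape) == len(lo) and all(
        l <= s <= h for s, l, h in zip(shape, lo, hi)
    )

print(shape_in_profile((1, 3, 256, 256)))  # → True
print(shape_in_profile((1, 3, 64, 64)))    # → False: below minShapes
print(shape_in_profile((4, 3, 224, 224)))  # → False: batch pinned at 1
```

The last case is the one that surprises people: a profile that never varies the batch dimension silently forbids dynamic batching, even though the spatial dimensions are flexible.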

After you’ve successfully profiled and optimized your model with trtexec, the next logical step is integrating this .trt engine into your application.

Want structured learning?

Take the full TensorRT course →