Converting an ONNX model to a TensorRT engine unlocks significant performance gains for deep learning inference.
Let’s see it in action. Imagine you have an ONNX model for image classification, say resnet50.onnx. You want to convert it to a TensorRT engine that’s optimized for your specific hardware, perhaps an NVIDIA T4 GPU.
```python
import tensorrt as trt
import pycuda.driver as cuda   # not used for building; needed later for inference
import pycuda.autoinit         # creates a CUDA context as a side effect
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def build_engine(onnx_file_path, engine_file_path):
    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(EXPLICIT_BATCH) as network, \
         builder.create_builder_config() as config:
        # Parse the ONNX model
        with trt.OnnxParser(network, TRT_LOGGER) as parser:
            if not parser.parse_from_file(onnx_file_path):
                print("Failed to parse ONNX file")
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                return None
        # Configure the builder
        config.set_flag(trt.BuilderFlag.FP16)  # use FP16 for faster inference
        config.max_workspace_size = 1 << 20   # 1 MiB; increase (e.g. 1 << 30) for real models
        # Build the engine
        print("Building TensorRT engine...")
        engine = builder.build_engine(network, config)
        if engine:
            print("Engine built successfully.")
            # Serialize the engine to disk
            with open(engine_file_path, "wb") as f:
                f.write(engine.serialize())
            print(f"Engine saved to {engine_file_path}")
            return engine
        else:
            print("Engine build failed.")
            return None
```
```python
# Example usage
onnx_model = "resnet50.onnx"
engine_file = "resnet50.trt"

# In a real scenario you'd use your actual trained ONNX model, e.g. one
# downloaded from the ONNX Model Zoo. For demonstration, create a tiny
# stand-in model (a Flatten followed by a Gemm) with the same I/O shapes.
try:
    import onnx
    from onnx import TensorProto, helper

    input_tensor = helper.make_tensor_value_info('input', TensorProto.FLOAT, [1, 3, 224, 224])
    output_tensor = helper.make_tensor_value_info('output', TensorProto.FLOAT, [1, 1000])
    W = np.random.randn(1000, 3 * 224 * 224).astype(np.float32)
    weight_tensor = helper.make_tensor('weight', TensorProto.FLOAT,
                                       [1000, 3 * 224 * 224], W.flatten().tolist())
    # Gemm requires 2-D inputs, so flatten the NCHW input first.
    flatten = helper.make_node('Flatten', ['input'], ['flat'], axis=1)
    gemm = helper.make_node('Gemm', ['flat', 'weight'], ['output'],
                            alpha=1.0, beta=0.0, transB=1)  # transB=1: weight is [out, in]
    graph_def = helper.make_graph([flatten, gemm], 'gemm_graph',
                                  [input_tensor], [output_tensor], [weight_tensor])
    model_def = helper.make_model(graph_def, producer_name='onnx-example')
    onnx.save(model_def, onnx_model)
    print(f"Created dummy ONNX model: {onnx_model}")
except ImportError:
    print("ONNX library not found. Please install it (`pip install onnx`).")
    print(f"Skipping dummy ONNX model creation. Please ensure '{onnx_model}' exists.")
except Exception as e:
    print(f"Error creating dummy ONNX model: {e}")
    print(f"Skipping dummy ONNX model creation. Please ensure '{onnx_model}' exists.")

# Now build the engine
try:
    build_engine(onnx_model, engine_file)
except FileNotFoundError:
    print(f"Error: ONNX model file '{onnx_model}' not found. Please ensure it exists.")
except Exception as e:
    print(f"An error occurred during engine building: {e}")
```
This script takes an ONNX model, parses it, configures TensorRT’s builder (enabling FP16 precision and setting a workspace size), builds the optimized engine, and then serializes it to a file (resnet50.trt). This serialized engine is what you’ll load for inference.
The core problem this solves is the disconnect between a general-purpose model format like ONNX and the highly hardware-specific optimizations that deliver peak inference performance. ONNX describes the what (the layers, the operations), but TensorRT figures out the how for a particular NVIDIA GPU. It performs numerous optimizations:
- Layer and Tensor Fusion: TensorRT can combine multiple operations (like convolution, bias addition, and ReLU) into a single, highly optimized kernel. This reduces kernel launch overhead and memory bandwidth usage.
- Kernel Auto-Tuning: For each layer, TensorRT times candidate implementations from its library of optimized kernels and selects the fastest one for your specific GPU architecture and tensor shapes, covering operations like convolution, GEMM (matrix multiplication), and pooling.
- Precision Optimization: With the appropriate flags, it runs layers in lower precision (FP16, or INT8 after a calibration step) wherever accuracy allows, dramatically speeding up computation and reducing memory footprint.
- Memory Optimization: TensorRT plans activation memory so that buffers are reused across layers whose lifetimes don't overlap, minimizing the engine's runtime memory footprint.
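The payoff of fusion is easiest to see conceptually: the fused form computes exactly the same result as the separate ops while skipping the intermediate tensors. The NumPy sketch below is purely illustrative (TensorRT performs this at the CUDA-kernel level, not in Python), but it shows the equivalence that makes fusion safe:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16)).astype(np.float32)   # activations
w = rng.standard_normal((16, 16)).astype(np.float32)  # weights
b = rng.standard_normal(16).astype(np.float32)        # bias

# Unfused: three passes, each materializing an intermediate tensor
# (in a GPU setting, each would be a separate kernel launch plus
# round trips through global memory).
t1 = x @ w                 # matmul
t2 = t1 + b                # bias add
unfused = np.maximum(t2, 0.0)  # ReLU

# "Fused": one expression; a fused kernel computes the same values in a
# single pass without writing t1/t2 out to memory.
fused = np.maximum(x @ w + b, 0.0)

assert np.allclose(unfused, fused)
```

Because the fused computation is bit-for-bit expressible as the composition of the original ops, TensorRT can substitute the single kernel without changing the network's semantics.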
The builder.create_network(EXPLICIT_BATCH) call is crucial. It tells TensorRT that the batch dimension is explicit in the network definition, which is the modern standard for ONNX and required for many TensorRT features, including dynamic shapes. Without it, the network would be created in the legacy implicit-batch mode, which the ONNX parser does not support.
The config.set_flag(trt.BuilderFlag.FP16) is a common optimization. FP16 (half-precision floating-point) offers about twice the throughput and half the memory usage of FP32 (single-precision) on modern NVIDIA GPUs, often with minimal accuracy degradation. If your model is sensitive to precision, you might omit this flag or explore INT8 quantization, which requires a calibration step.
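The "half the memory, bounded precision loss" trade-off behind the FP16 flag can be demonstrated with NumPy alone. Near 1.0, adjacent float16 values are 2^-10 ≈ 0.001 apart (10-bit mantissa), so fine-grained differences simply round away:

```python
import numpy as np

# A float16 element occupies 2 bytes vs. 4 for float32,
# so a tensor's memory footprint halves.
a32 = np.zeros((1000, 1000), dtype=np.float32)
a16 = a32.astype(np.float16)
print(a32.nbytes, a16.nbytes)  # 4000000 2000000

# The cost: roughly 3 decimal digits of precision.
# 1.0001 is closer to 1.0 than to the next representable
# float16 value (~1.000977), so it rounds to exactly 1.0.
print(np.float16(1.0001) == np.float16(1.0))  # True
print(np.finfo(np.float16).eps)               # 0.000977 (2**-10)
```

For most vision and language models this rounding is lost in the noise of training, which is why FP16 usually costs little accuracy; models with poorly scaled activations are the exception.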
config.max_workspace_size caps the scratch GPU memory TensorRT may hand to layer implementations. During the build, a larger workspace lets the auto-tuner consider more candidate kernels, often yielding a better-optimized engine; too small a limit can exclude the fastest tactics entirely. The value 1 << 20 means 2^20 bytes, which is only 1 MiB; a more typical value for real models is 1 << 30 (1 GiB). Note that in TensorRT 8.4 and later, max_workspace_size is deprecated in favor of config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, ...).
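If the bit-shift notation is unfamiliar: shifting 1 left by n bits gives 2^n, so these expressions are just byte counts.

```python
# The shift notation used for workspace sizes is plain byte arithmetic:
KiB = 1 << 10   # 1,024 bytes
MiB = 1 << 20   # 1,048,576 bytes
GiB = 1 << 30   # 1,073,741,824 bytes

print(MiB == 2 ** 20)  # True
print(GiB // MiB)      # 1024
```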
The engine.serialize() method converts the built engine into a byte stream, which is then saved to a file. A serialized engine is specific to both the GPU model and the TensorRT version it was built with, and can be loaded efficiently by the TensorRT runtime without needing to rebuild.
One thing most people don’t realize is that which tensors you mark as network outputs (using network.mark_output()) and which precision flags you set can significantly influence the fusion and optimization strategies TensorRT employs. If you mark an intermediate tensor as an output during the build process, TensorRT must materialize that tensor, which can force it to forgo fusions that would otherwise have eliminated it. When you parse an ONNX model as in the script above, the graph’s declared outputs are marked for you, leaving TensorRT free to fuse everything in between as aggressively as it can.
Once you have the .trt engine file, the next step is to load it into a TensorRT runtime for actual inference, which involves managing CUDA contexts and memory buffers.
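As a preview of that step, here is a minimal loading sketch. It assumes the resnet50.trt engine and the [1, 3, 224, 224] → [1, 1000] shapes from the example above, uses the same TensorRT-7/8-era Python API as the build script (execute_async_v2 is superseded in later versions), and requires an NVIDIA GPU with TensorRT and PyCUDA installed, so treat it as illustrative rather than drop-in:

```python
import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize the engine built earlier.
with open("resnet50.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

with engine.create_execution_context() as context:
    # Host and device buffers for one input and one output.
    h_input = np.random.randn(1, 3, 224, 224).astype(np.float32)
    h_output = np.empty((1, 1000), dtype=np.float32)
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)

    # Copy in, execute, copy out — all on one CUDA stream.
    stream = cuda.Stream()
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async_v2([int(d_input), int(d_output)], stream.handle)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()

    print("Top-1 class index:", int(h_output.argmax()))
```

The bindings list passed to execute_async_v2 must follow the engine's binding order (inputs first here, since the network has one input and one output); a production wrapper would query the engine for binding names, shapes, and dtypes rather than hard-coding them.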