TensorRT can make your YOLO object detection models run blazing fast, but it’s not magic. It’s a compiler that optimizes your model for NVIDIA GPUs by fusing operations, quantizing weights, and selecting the best CUDA kernels.
Before optimizing anything, here's baseline YOLOv5 inference in plain PyTorch — the reference point that a TensorRT engine typically speeds up by roughly 3x:
import torch
import cv2
import numpy as np
from models.experimental import attempt_load
from utils.general import non_max_suppression, scale_coords
from utils.augmentations import letterbox

# Load YOLOv5 model
weights = 'yolov5s.pt'
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = attempt_load(weights, map_location=device)
stride = int(model.stride.max())

# Prepare input image (cv2.imread loads BGR)
img_path = 'data/images/zidane.jpg'
img = cv2.imread(img_path)
h0, w0 = img.shape[:2]

# Letterbox resize to the model's input size
img_resized, ratio, pad = letterbox(img, 640, stride=stride)

# HWC BGR -> CHW RGB, normalize, and convert to a batched tensor
img_tensor = img_resized.transpose((2, 0, 1))[::-1]  # BGR to RGB, to 3xHxW
img_tensor = np.ascontiguousarray(img_tensor)
img_tensor = torch.from_numpy(img_tensor).to(device).float()
img_tensor /= 255.0
if img_tensor.ndimension() == 3:
    img_tensor = img_tensor.unsqueeze(0)

# Inference
with torch.no_grad():
    pred = model(img_tensor)[0]

# Apply NMS
conf_thres = 0.25
iou_thres = 0.45
pred = non_max_suppression(pred, conf_thres, iou_thres, classes=None, agnostic=False)

# Process detections
for det in pred:
    if len(det):
        # Rescale boxes from the letterboxed size back to the original image
        det[:, :4] = scale_coords(img_tensor.shape[2:], det[:, :4], (h0, w0)).round()
        for *xyxy, conf, cls in det:
            label = f'{model.names[int(cls)]} {conf:.2f}'
            # Draw bounding box (simplified for brevity)
            print(f"Detected: {label} at {xyxy}")
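Any speedup figure depends on your GPU, batch size, and precision, so it pays to measure latency yourself rather than trust a headline number. A minimal timing sketch — the warm-up count and the tiny stand-in model are arbitrary illustration choices, not part of YOLOv5:

```python
import time
import torch

def benchmark(model, x, warmup=10, iters=100):
    """Return average forward-pass latency in milliseconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):           # warm-up lets cuDNN pick algorithms
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()      # GPU kernels launch asynchronously
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000

# Stand-in model for illustration; swap in the loaded YOLOv5 model and a real input
net = torch.nn.Conv2d(3, 16, 3).to('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(1, 3, 64, 64, device=next(net.parameters()).device)
print(f"{benchmark(net, x):.2f} ms/iter")
```

Run the same function against both the PyTorch model and the TensorRT engine wrapper to get an apples-to-apples comparison on your hardware.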
The core idea is to convert your trained PyTorch or TensorFlow model into a TensorRT engine. This engine is a highly optimized, GPU-specific binary that bypasses many of the generic operations in deep learning frameworks. TensorRT does this by:
- Graph Optimization: It analyzes your model’s computation graph and applies optimizations like layer and tensor fusion. For example, it can fuse a convolution, bias addition, and activation function into a single, highly efficient CUDA kernel.
- Kernel Auto-Tuning: It selects the best CUDA kernels for your specific GPU architecture and the tensor dimensions involved in your model. This means it’s not just running generic code; it’s running code tailored for your hardware.
- Precision Calibration: It supports various precision modes, including FP32, FP16, and INT8. FP16 and INT8 can significantly boost performance and reduce memory usage with minimal accuracy loss, provided you calibrate the model correctly.
To use TensorRT with YOLO, you typically follow these steps:
- Export the Model: Convert your trained YOLO model (e.g., a PyTorch .pt checkpoint) into an intermediate format like ONNX.
- Build the TensorRT Engine: Use the trtexec command-line tool or the TensorRT Python API to build an engine from the ONNX file. This is where the optimization happens.
- Inference with the Engine: Load the TensorRT engine and perform inference using the TensorRT runtime.
Here’s a simplified trtexec command for building an engine:
trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s.engine --fp16 --workspace=1024
- --onnx=yolov5s.onnx: Specifies the input ONNX model.
- --saveEngine=yolov5s.engine: Names the output TensorRT engine file.
- --fp16: Enables FP16 precision for faster inference and reduced memory.
- --workspace=1024: Allocates 1024 MB of GPU memory for TensorRT to use during the build process for kernel selection.
The surprising thing is that TensorRT doesn’t just pick the fastest kernels; it measures them. During the build phase, it runs micro-benchmarks on your specific GPU for different kernel implementations and tensor shapes, then selects the absolute fastest one for your model.
Once you have the .engine file, you’d load and run it using the TensorRT Python API. The inference loop would look something like this:
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates and manages a CUDA context
import numpy as np

# Initialize TensorRT logger and deserialize the engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with trt.Runtime(TRT_LOGGER) as runtime:
    with open("yolov5s.engine", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
stream = cuda.Stream()

# Allocate page-locked host memory and device memory for each binding
bindings, host_bufs, dev_bufs = [], [], []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    size = trt.volume(engine.get_binding_shape(i))
    host_buf = cuda.pagelocked_empty(size, dtype)
    dev_buf = cuda.mem_alloc(host_buf.nbytes)
    host_bufs.append(host_buf)
    dev_bufs.append(dev_buf)
    bindings.append(int(dev_buf))

# Load image data into the input buffer (simplified: random data here;
# in practice reuse the same letterbox preprocessing as above)
input_shape = engine.get_binding_shape(0)
input_data = np.random.rand(*input_shape).astype(np.float32)
np.copyto(host_bufs[0], input_data.ravel())

# Run inference: copy input to device, execute, copy output back
cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_bufs[1], dev_bufs[1], stream)
stream.synchronize()
# host_bufs[1] now holds the raw predictions; apply NMS as in the PyTorch example
The most subtle but critical aspect of TensorRT optimization, especially for INT8, is calibration. Without proper calibration data, INT8 quantization can drastically reduce accuracy. TensorRT uses a representative dataset to determine the activation ranges of different layers, allowing it to map FP32 values to INT8 ranges accurately. This process involves running inference on the calibration data and feeding the observed min/max activation values back to TensorRT during engine building.
The next hurdle you’ll face is managing different YOLO versions and their specific export requirements, as each might have slight variations in their ONNX output.