TensorRT on Windows can feel like a tangled mess of dependencies, but once you untangle the Visual Studio and CUDA configuration, it clicks into place.
Let’s see TensorRT in action with a simple inference example. Imagine we have a trained ONNX model for image classification.
```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initializes the CUDA driver and creates a context
import numpy as np

# --- Configuration ---
onnx_model_path = "resnet18.onnx"  # Replace with your ONNX model path
output_engine_path = "resnet18.engine"
input_shape = (1, 3, 224, 224)  # Example for a typical image model
output_classes = 1000  # Example for ImageNet

# --- Initialize TensorRT ---
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
parser = trt.OnnxParser(network, TRT_LOGGER)

# --- Load ONNX Model ---
with open(onnx_model_path, "rb") as model:
    if not parser.parse(model.read()):
        print("Failed to parse ONNX file")
        for error in range(parser.num_errors):
            print(parser.get_error(error))
        exit(1)

# --- Configure Builder ---
config.max_workspace_size = 1 << 30  # 1 GB workspace (TensorRT 8.x API)
# If you have multiple GPUs, select one before importing pycuda.autoinit,
# e.g. via the CUDA_VISIBLE_DEVICES environment variable.

# --- Build Engine ---
print("Building TensorRT engine...")
engine = builder.build_engine(network, config)
if not engine:
    print("Engine building failed.")
    exit(1)
print("Engine built successfully.")

# --- Save Engine ---
with open(output_engine_path, "wb") as f:
    f.write(engine.serialize())
print(f"Engine saved to {output_engine_path}")

# --- Inference ---
print("Running inference...")
context = engine.create_execution_context()

# Prepare input data (dummy data for demonstration)
input_data = np.random.rand(*input_shape).astype(np.float32)
input_binding_idx = engine.get_binding_index("input")    # Replace with your input tensor name
output_binding_idx = engine.get_binding_index("output")  # Replace with your output tensor name

# Allocate device memory for inputs and outputs
output_shape = context.get_binding_shape(output_binding_idx)
output_dtype = trt.nptype(engine.get_binding_dtype(output_binding_idx))
output_data = np.empty(output_shape, dtype=output_dtype)
d_input = cuda.mem_alloc(input_data.nbytes)
d_output = cuda.mem_alloc(output_data.nbytes)
bindings = [int(d_input), int(d_output)]

# Transfer input data to device
cuda.memcpy_htod(d_input, input_data)

# Run inference
context.execute_v2(bindings=bindings)

# Transfer output data from device
cuda.memcpy_dtoh(output_data, d_output)
print("Inference complete. Output shape:", output_data.shape)

# Example: Get the top predicted class
# predicted_class = np.argmax(output_data)
# print("Predicted class:", predicted_class)
```
The core problem TensorRT solves is making deep learning inference blazing fast on NVIDIA GPUs. It does this by taking a trained model (like one from TensorFlow, PyTorch, or ONNX) and optimizing it for a specific GPU architecture. This optimization involves several steps: layer fusion (combining multiple operations into one kernel), precision calibration (using FP16 or INT8 instead of FP32 where possible without significant accuracy loss), and kernel auto-tuning to find the most efficient CUDA kernels for your hardware. Think of it as a highly specialized compiler for neural networks on GPUs.
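To get a feel for why reduced precision is usually harmless, here is a toy illustration (plain NumPy, not TensorRT): round-tripping FP32 activations through FP16 changes them only slightly relative to their overall scale.

```python
import numpy as np

# Toy illustration: casting FP32 activations to FP16 and back changes them
# only slightly, which is why reduced-precision inference often preserves
# model accuracy.
rng = np.random.default_rng(0)
activations = rng.standard_normal(10_000).astype(np.float32)

roundtrip = activations.astype(np.float16).astype(np.float32)
rel_error = np.abs(roundtrip - activations).max() / np.abs(activations).max()
print(f"max FP16 round-trip error (relative to peak magnitude): {rel_error:.6f}")
```

The error is on the order of FP16's ~3 decimal digits of precision, far below the noise floor of most classifiers.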
The mental model for TensorRT involves three main phases: Building, Deploying, and Inferencing.

- Building: This is where you take your trained model and convert it into a TensorRT engine. You'll use the `tensorrt.Builder` to create a `NetworkDefinition` from your model (e.g., by parsing an ONNX file), configure the builder (setting memory limits, precision modes), and finally, build the `Engine`. The engine is a serialized, optimized representation of your model, tailored for your specific GPU.
- Deploying: The serialized engine file (`.engine`) is what you distribute. It's hardware-specific but not tied to a particular framework.
- Inferencing: On the target machine, you load the engine, create an `ExecutionContext`, allocate device memory for your input and output tensors, copy your input data to the GPU, run the inference, and then copy the results back. This is the part you see in the Python code above.
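On the deployment side, loading a serialized engine takes only a few lines: no ONNX parsing or builder configuration happens on the target machine. A minimal sketch, assuming the `resnet18.engine` file produced earlier:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize a previously built engine from disk.
runtime = trt.Runtime(TRT_LOGGER)
with open("resnet18.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# The execution context holds per-inference state (bindings, dynamic shapes).
context = engine.create_execution_context()
```

Because the builder is never invoked here, the deployment machine needs only the TensorRT runtime, not the full build toolchain.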
The key levers you control are:

- Precision: `config.set_flag(trt.BuilderFlag.FP16)` or `trt.BuilderFlag.INT8`. This dramatically impacts performance and memory usage.
- Workspace size: `config.max_workspace_size = 1 << 30`. This is the maximum GPU memory TensorRT can use for its internal optimizations and kernel selection. Too small, and the build may fail or settle for slower kernels; too large, and it can crowd out the memory the rest of your application needs.
- Batch size: While an engine is often built for a fixed batch size, newer TensorRT versions support dynamic shapes: you specify a range of batch sizes during engine building via an optimization profile.
- Target GPU: TensorRT engines are optimized for a specific NVIDIA GPU architecture. An engine built on one generation of GPU may not run, or may run suboptimally, on another.
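In code, the precision and batch-size levers are builder-config settings. A sketch, assuming a fresh `builder` and `config` as in the example above and an ONNX input tensor named "input":

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()

# Precision: allow TensorRT to use FP16 kernels where they are faster.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# Batch size: an optimization profile declares the min/opt/max shapes the
# engine must support, enabling dynamic batch sizes at inference time.
profile = builder.create_optimization_profile()
profile.set_shape("input",                  # tensor name from the ONNX model
                  min=(1, 3, 224, 224),
                  opt=(8, 3, 224, 224),
                  max=(32, 3, 224, 224))
config.add_optimization_profile(profile)
```

TensorRT tunes kernels for the `opt` shape, so pick the batch size you expect to serve most often.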
The most surprising thing about TensorRT’s optimization process, especially when dealing with INT8 precision, is that it doesn’t just blindly quantize weights. It performs a "calibration" step where it feeds representative data through the network to determine the actual activation ranges. This allows it to choose quantization parameters that minimize accuracy loss, making INT8 inference often viable for production without noticeable degradation, provided the calibration dataset is representative of your inference data.
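The effect of calibration is easy to demonstrate with a toy sketch (plain NumPy, not TensorRT's actual entropy calibrator): choosing the clipping range from representative data, rather than from the raw maximum, spends the 256 INT8 levels where the activations actually live.

```python
import numpy as np

# Toy sketch of the idea behind INT8 calibration: choose a clipping range
# from representative activations, then map it onto the INT8 levels.
rng = np.random.default_rng(1)
# Representative activations: mostly small values plus a few large outliers.
acts = np.concatenate([rng.standard_normal(9_990),
                       rng.uniform(20, 30, 10)]).astype(np.float32)

def int8_roundtrip(x, clip):
    """Quantize to INT8 with the given clipping range, then dequantize."""
    scale = clip / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

# Naive: clip at the absolute max, so the outliers dominate the scale.
naive = int8_roundtrip(acts, np.abs(acts).max())
# Calibrated: clip at the 99.9th percentile of |x|, sacrificing outliers
# to represent the bulk of the distribution far more precisely.
calibrated = int8_roundtrip(acts, np.percentile(np.abs(acts), 99.9))

bulk = np.abs(acts) < 5  # error on the "typical" activations
naive_mse = float(np.mean((naive[bulk] - acts[bulk]) ** 2))
calib_mse = float(np.mean((calibrated[bulk] - acts[bulk]) ** 2))
print(f"naive MSE: {naive_mse:.6f}  calibrated MSE: {calib_mse:.6f}")
```

The calibrated quantizer has a far lower error on the typical activations, which is exactly the trade-off TensorRT's calibrator makes with a more principled (entropy-based) criterion.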
The next step you’ll likely encounter is integrating this engine into a more complex application, perhaps a C++ application or a web service, and handling real-time data streams for inference.