Building and running inference engines with TensorRT’s Python API is less about writing Python code and more about orchestrating a complex C++ compilation and runtime environment from Python.

Let’s see TensorRT in action. Imagine we have a simple ONNX model for a feed-forward network.

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

# Create a logger
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Load the ONNX model
model_path = "simple_model.onnx"
with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open(model_path, 'rb') as model:
        if not parser.parse(model.read()):
            print("ERROR: Failed to parse the ONNX file.")
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            exit()

    # Build the engine. With an explicit-batch network, the batch size comes
    # from the input shape, so no max_batch_size setting is needed.
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30) # 1GB workspace

    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        print("ERROR: Failed to build the TensorRT engine.")
        exit()

    # Save the engine
    with open("simple_engine.trt", "wb") as f:
        f.write(serialized_engine)

print("Engine built and saved successfully.")

# --- Now, let's run inference ---

# Load the engine
with open("simple_engine.trt", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())

# Create execution context
context = engine.create_execution_context()

# Define input and output shapes (assuming the model has one input and one output)
input_shape = (1, 10) # Batch size 1, 10 features
output_shape = (1, 5) # Batch size 1, 5 classes

# Allocate host and device buffers
host_input = cuda.pagelocked_empty(trt.volume(input_shape), dtype=np.float32)
host_output = cuda.pagelocked_empty(trt.volume(output_shape), dtype=np.float32)
device_input = cuda.mem_alloc(host_input.nbytes)
device_output = cuda.mem_alloc(host_output.nbytes)

# Create a stream for kernel execution
stream = cuda.Stream()

# Prepare input data (dummy data for demonstration). Copy into the pinned
# buffer rather than rebinding the name, so the page-locked memory is used.
np.copyto(host_input, np.random.rand(*input_shape).astype(np.float32).ravel())

# Transfer input data to the device
cuda.memcpy_htod_async(device_input, host_input, stream)

# Execute inference
context.execute_async_v2(bindings=[int(device_input), int(device_output)], stream_handle=stream.handle)

# Transfer output data from the device to host
cuda.memcpy_dtoh_async(host_output, device_output, stream)

# Synchronize the stream
stream.synchronize()

print("Inference complete. Output:", host_output)

The core problem TensorRT solves is optimizing deep learning models for NVIDIA GPUs, achieving significantly higher throughput and lower latency than generic frameworks. It does this by performing several key transformations:

  1. Graph Optimizations: It flattens the computation graph, fuses layers (e.g., convolution, bias addition, and ReLU into a single kernel), and removes redundant operations.
  2. Kernel Auto-Tuning: For supported layers, TensorRT selects the most efficient CUDA kernels based on the target GPU architecture and the specific tensor dimensions.
  3. Precision Calibration: It can quantize models from FP32 to FP16 or INT8, drastically reducing memory footprint and increasing speed with minimal accuracy loss. This requires a calibration step for INT8.
  4. Engine Serialization: The optimized graph and kernels are compiled into a single, portable "engine" file that can be loaded and run quickly on a target GPU.
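To make the fusion idea concrete, here is a minimal NumPy sketch (not TensorRT code) showing that a linear op, bias addition, and ReLU computed as three separate passes produce the same result as a single fused expression. This is the equivalence a fused kernel exploits: one pass over the data, no intermediate tensors written back to memory.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 10)).astype(np.float32)   # input
W = rng.standard_normal((10, 5)).astype(np.float32)   # weights
b = rng.standard_normal(5).astype(np.float32)         # bias

# Unfused: three separate passes, two intermediate tensors materialized
y1 = x @ W               # linear / conv-style op
y2 = y1 + b              # bias addition
y3 = np.maximum(y2, 0)   # ReLU

# "Fused": one pass, no intermediates
y_fused = np.maximum(x @ W + b, 0)

assert np.allclose(y3, y_fused)
```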

The tensorrt Python API acts as a wrapper around these C++ libraries. You define your network using trt.NetworkDefinition, parse it from formats like ONNX or UFF (though UFF is deprecated and ONNX is far more common now), configure the build process with trt.BuilderConfig, and then build_serialized_network compiles it into an engine. This engine is then loaded by trt.Runtime and executed through a trt.IExecutionContext.

The pycuda library is essential because TensorRT heavily relies on CUDA for its computations. You’ll use pycuda for managing CUDA memory (allocating device buffers, copying data between host and device) and for launching kernels asynchronously via CUDA streams.

The bindings argument in context.execute_async_v2 is a list of device pointers, one per input and output tensor, in the engine's binding order (you can look up an index by name with engine.get_binding_index). trt.volume(shape) is a utility that returns the total number of elements in a tensor of the given shape.
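trt.volume is just a product over dimensions; a pure-Python equivalent (a hypothetical stand-in, not TensorRT's implementation) makes the buffer-sizing arithmetic used above explicit:

```python
import math

def volume(shape):
    """Total number of elements in a tensor of the given shape,
    i.e. what trt.volume computes when sizing host/device buffers."""
    return math.prod(shape)

# For the shapes above: 10 float32 values in (40 bytes), 5 out (20 bytes).
assert volume((1, 10)) == 10
assert volume((1, 5)) == 5
```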

The most surprising thing about TensorRT is how it handles dynamic shapes. If you build the engine with one or more input dimensions marked as dynamic (set to -1 in the network) and attach an optimization profile that bounds them, a single compiled engine can serve requests with varying batch sizes or input dimensions without recompilation. Before each inference you fix the concrete shape by calling context.set_binding_shape(binding_index, shape) ahead of execute_async_v2, and you can create multiple execution contexts from one engine to run different shapes concurrently.
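A sketch of that setup, assuming an input tensor named "input" whose batch dimension was declared dynamic at network-definition time (this fragment slots into the build and inference code above and needs a GPU to actually run):

```python
# At build time: bound the dynamic batch dimension with an optimization profile
profile = builder.create_optimization_profile()
profile.set_shape("input",
                  min=(1, 10),    # smallest shape the engine must support
                  opt=(8, 10),    # shape to tune kernel selection for
                  max=(32, 10))   # largest shape the engine must support
config.add_optimization_profile(profile)

# At run time: fix the concrete shape before launching inference
context.set_binding_shape(0, (4, 10))   # a batch of 4 for this request
assert context.all_binding_shapes_specified
context.execute_async_v2(bindings=[int(device_input), int(device_output)],
                         stream_handle=stream.handle)
```

Note that device buffers must be sized for the max profile shape if you intend to reuse them across requests.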

The next hurdle you’ll face is enabling reduced precision for maximum performance gains: FP16 usually requires only setting a builder flag, while INT8 requires an additional calibration step, and both demand care with models that are sensitive to numerical precision.
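As a preview, INT8 calibration means handing the builder config a calibrator object that feeds representative input batches. Below is a hedged skeleton (the batch-feeding logic and buffer handling are placeholders you would replace with real calibration data; it requires a GPU and the surrounding setup above):

```python
class MyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, calibration_batches, device_buffer):
        super().__init__()
        self.batches = iter(calibration_batches)  # iterable of np.float32 arrays
        self.device_buffer = device_buffer        # pre-allocated cuda.mem_alloc

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                           # None signals calibration is done
        cuda.memcpy_htod(self.device_buffer, np.ascontiguousarray(batch))
        return [int(self.device_buffer)]

    def read_calibration_cache(self):
        return None                               # no cached scales yet

    def write_calibration_cache(self, cache):
        pass                                      # could persist scales to disk

# Wired into the build, alongside the FP16 flag for comparison:
# config.set_flag(trt.BuilderFlag.FP16)
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = MyCalibrator(batches, device_input)
```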

Want structured learning?

Take the full TensorRT course →