Serializing and loading TensorRT engines is how you save a compiled model for later use, avoiding the costly recompilation step.

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

# Initialize TensorRT logger
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Create a builder
builder = trt.Builder(TRT_LOGGER)

# Create a network
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Create a parser
parser = trt.OnnxParser(network, TRT_LOGGER)

# Load the ONNX model
with open("my_model.onnx", "rb") as model:
    if not parser.parse(model.read()):
        print("ERROR: Failed to parse the ONNX file.")
        for error in range(parser.num_errors):
            print(parser.get_error(error))
        exit()

# Configure the builder
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30) # 1GB workspace

# Build the engine
print("Building the TensorRT engine...")
engine = builder.build_engine(network, config)
if engine is None:
    raise RuntimeError("Failed to build the engine; check the logger output.")
print("Engine built successfully.")

# Serialize the engine
serialized_engine = engine.serialize()

# Save the serialized engine to a file
with open("my_engine.trt", "wb") as f:
    f.write(serialized_engine)
print("Engine serialized and saved to my_engine.trt")

# --- Loading the engine ---

# Create a runtime
runtime = trt.Runtime(TRT_LOGGER)

# Load the serialized engine from the file
with open("my_engine.trt", "rb") as f:
    serialized_engine_from_file = f.read()

# Deserialize the engine
loaded_engine = runtime.deserialize_cuda_engine(serialized_engine_from_file)
if loaded_engine is None:
    raise RuntimeError("Failed to deserialize the engine (GPU or TensorRT version mismatch?).")
print("Engine deserialized successfully.")

# Create an execution context
context = loaded_engine.create_execution_context()

# Example inference (assuming input shape is (1, 3, 224, 224))
input_shape = (1, 3, 224, 224)
output_shape = (1, 1000) # Assuming 1000 classes

# Allocate host and device buffers
input_host = cuda.pagelocked_empty(trt.volume(input_shape), dtype=np.float32)
output_host = cuda.pagelocked_empty(trt.volume(output_shape), dtype=np.float32)
input_device = cuda.mem_alloc(input_host.nbytes)
output_device = cuda.mem_alloc(output_host.nbytes)

# Create a stream in which to run inference
stream = cuda.Stream()

# Prepare input data (dummy data for example); copy into the pinned buffer
# rather than rebinding the name, so the page-locked memory is actually used
np.copyto(input_host, np.random.rand(*input_shape).astype(np.float32).ravel())

# Transfer input data to the GPU asynchronously on the stream
cuda.memcpy_htod_async(input_device, input_host, stream)

# Run inference
context.execute_async_v2(bindings=[int(input_device), int(output_device)], stream_handle=stream.handle)

# Transfer predictions back from the GPU, then wait for the stream to finish
cuda.memcpy_dtoh_async(output_host, output_device, stream)
stream.synchronize()

# Process output_host (a flat pinned buffer; reshape to the logical output shape)
print("Inference complete. Output shape:", output_host.reshape(output_shape).shape)

The core problem this solves is the significant time cost of compiling an ONNX (or other format) model into a TensorRT engine. Compiling involves optimizing the model’s graph for a specific GPU architecture, selecting the best kernels, and quantizing weights if applicable. This can take minutes or even hours for large models. Serializing allows you to perform this compilation once and then reuse the optimized engine repeatedly, drastically speeding up deployment.

When you serialize an engine, you’re essentially taking the compiled, GPU-specific representation of your model and saving it as a byte stream. This byte stream contains all the information TensorRT needs to reconstruct the optimized execution plan, including layer configurations, kernel choices, and memory layouts. Loading (deserializing) this byte stream on a compatible GPU on a different machine (or the same machine later) bypasses the entire build process. Serialization flattens the trt.ICudaEngine object into a contiguous block of memory, which the TensorRT runtime can later deserialize back into a trt.ICudaEngine.

The process of building an engine involves several steps: parsing the input model (e.g., ONNX), defining network layers (if not fully inferred from the parsed model), configuring builder settings (like optimization profiles, memory limits, and precision modes), and finally calling builder.build_engine(). The build_engine function is where the heavy lifting happens. It explores various optimization strategies and kernel implementations to find the most efficient execution plan for the target hardware. Once built, engine.serialize() converts this optimized plan into a portable byte array.
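Note that in TensorRT 8.x and later, build_engine is deprecated in favor of build_serialized_network, which performs the build and serialization in a single step and returns the engine bytes directly. A minimal sketch, assuming a populated network and a config like the ones in the listing above:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# ... populate `network` here (e.g. via trt.OnnxParser) before building ...

# build_serialized_network returns an IHostMemory holding the engine bytes,
# so there is no separate engine.serialize() step.
serialized = builder.build_serialized_network(network, config)
if serialized is None:
    raise RuntimeError("Engine build failed; check the logger output.")

with open("my_engine.trt", "wb") as f:
    f.write(serialized)
```

The rest of the workflow (deserializing with trt.Runtime) is unchanged.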

Deserialization is the reverse: runtime.deserialize_cuda_engine() takes that byte array and reconstructs the trt.ICudaEngine object. This loaded_engine is then used to create a trt.IExecutionContext, which is what you use to actually run inference. The IExecutionContext holds the state for a particular inference run, including device memory bindings.
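Rather than hard-coding buffer shapes as in the example above, the deserialized engine can be queried for its bindings. A sketch using the TensorRT 8.x binding API (deprecated in newer releases in favor of named I/O tensors), assuming the loaded_engine from the listing:

```python
# Enumerate the engine's bindings to size host/device buffers generically.
for i in range(loaded_engine.num_bindings):
    name = loaded_engine.get_binding_name(i)
    shape = loaded_engine.get_binding_shape(i)
    dtype = trt.nptype(loaded_engine.get_binding_dtype(i))
    kind = "input" if loaded_engine.binding_is_input(i) else "output"
    print(f"{kind} binding {i}: {name}, shape={shape}, dtype={dtype}")
```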

The trt.Logger is crucial for debugging and understanding the build process. When building, it will output warnings and errors about kernel selection, layer fusion, and potential optimizations. When deserializing, it’s usually quieter unless there’s a fundamental incompatibility. The trt.BuilderConfig is where you set performance-critical parameters. config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30) sets the maximum amount of GPU memory TensorRT can use for its internal workspace during the build phase. A larger workspace can sometimes lead to better optimizations but increases build time and memory requirements.
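Precision modes are set on the same config object. For example, a sketch that allows FP16 kernels where the hardware supports them, assuming the builder and config from the listing above:

```python
# Let TensorRT choose FP16 kernels where they are faster and supported.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# Workspace limit as in the listing above: 1 GiB of build-time scratch memory.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
```

FP16 (and INT8) are permissions, not commands: TensorRT still falls back to FP32 for layers where reduced precision is slower or unsupported.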

The most surprising thing about serialization is that a serialized engine is not tied to the exact build environment, but it is tied to the target GPU architecture and the TensorRT version. You can serialize an engine on one machine and deserialize it on another, as long as both machines have a compatible GPU (e.g., both Turing-generation NVIDIA GPUs) and a compatible TensorRT version. An engine built for a V100 GPU will not work on an A100, and an engine built with TensorRT 8.x might have compatibility issues with TensorRT 9.x if certain internal kernel structures change significantly. This is why specifying builder.max_batch_size (if not using explicit batch) or ensuring explicit batch is correctly handled during parsing is important, as it influences the engine’s structure.
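Because a deserialization failure on an incompatible machine can be opaque, it is common to save a small metadata sidecar next to the engine and check it before loading. This is a hypothetical convention, not a TensorRT feature; in a real deployment the recorded values would come from trt.__version__ and pycuda.driver.Device(0).name():

```python
def make_engine_metadata(trt_version, gpu_name):
    # Hypothetical sidecar metadata recorded at build time.
    return {"tensorrt_version": trt_version, "gpu_name": gpu_name}

def check_engine_metadata(meta, current_trt_version, current_gpu_name):
    """Return a list of compatibility problems; empty means it looks safe to load."""
    problems = []
    # Engines are generally only portable across the same major.minor TensorRT release.
    if meta["tensorrt_version"].split(".")[:2] != current_trt_version.split(".")[:2]:
        problems.append("TensorRT version mismatch")
    # Engines are tied to the GPU they were built for.
    if meta["gpu_name"] != current_gpu_name:
        problems.append("GPU mismatch")
    return problems

meta = make_engine_metadata("8.6.1", "NVIDIA A100")
print(check_engine_metadata(meta, "8.6.0", "NVIDIA A100"))  # -> []
print(check_engine_metadata(meta, "9.0.1", "Tesla V100"))
```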

The next step after mastering serialization is understanding how to manage multiple optimization profiles within a single engine, especially for models with dynamic input shapes.
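As a preview, dynamic shapes are handled by attaching one or more optimization profiles to the config before building. A sketch assuming the builder and config from the listing above and an ONNX input named "input" with a dynamic batch dimension:

```python
# One profile per supported shape range: (min, opt, max) for each dynamic input.
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
config.add_optimization_profile(profile)
```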

Want structured learning?

Take the full TensorRT course →