TensorRT’s dynamic shapes let you build a single engine that can handle a range of input dimensions, but setting them up is more about defining what you want to optimize for than just enabling flexibility.
Here’s a common scenario: you’ve built a TensorRT engine for an object detection model, and you want it to be able to process images of varying sizes without recompiling the engine.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << (int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
def build_engine():
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(EXPLICIT_BATCH)
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB workspace

    # Define a dynamic input: -1 marks batch, height, and width as dynamic
    input_tensor = network.add_input(name="input_tensor", dtype=trt.float32, shape=(-1, 3, -1, -1))

    # Add layers to your network. In a real scenario these would be your actual
    # model layers (e.g. parsed from ONNX); a single convolution plus ReLU is
    # used here for demonstration.
    num_output_maps = 16
    kernel_shape = (3, 3)
    # Convolution weights have shape (out_channels, in_channels, kH, kW)
    kernel = np.random.rand(num_output_maps, 3, *kernel_shape).astype(np.float32)
    bias = np.random.rand(num_output_maps).astype(np.float32)

    conv_layer = network.add_convolution_nd(
        input=input_tensor,
        num_output_maps=num_output_maps,
        kernel_shape=kernel_shape,
        kernel=trt.Weights(kernel),
        bias=trt.Weights(bias),
    )
    conv_layer.padding_nd = (1, 1)
    conv_layer.stride_nd = (1, 1)
    conv_layer.name = "example_conv"

    # ReLU activation, marked as the network output
    relu_layer = network.add_activation(conv_layer.get_output(0), trt.ActivationType.RELU)
    output_tensor = relu_layer.get_output(0)
    output_tensor.name = "output_tensor"
    network.mark_output(output_tensor)

    # Create an optimization profile. Every dynamic dimension needs concrete
    # min/opt/max values; here batch and channels are fixed and only the
    # spatial dimensions vary.
    profile = builder.create_optimization_profile()
    profile.set_shape("input_tensor",
                      min=(1, 3, 224, 224),  # smallest shape the engine must accept
                      opt=(1, 3, 256, 256),  # shape TensorRT tunes kernels for
                      max=(1, 3, 384, 384))  # largest shape the engine must accept
    config.add_optimization_profile(profile)

    # Build, then deserialize the engine
    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Engine build failed; check the TensorRT log for details.")
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(serialized_engine)
    return engine
# Build the engine with dynamic input shapes
engine = build_engine()
# --- Inference with dynamic shapes ---
context = engine.create_execution_context()
stream = cuda.Stream()

# Example input data with different spatial shapes.
# Both must lie within the profile's min/max bounds (224x224 to 384x384).
input_data_1 = np.random.rand(1, 3, 256, 320).astype(np.float32)  # H=256, W=320
input_data_2 = np.random.rand(1, 3, 300, 384).astype(np.float32)  # H=300, W=384

def infer(input_data):
    # Set the exact input shape for this run. This is crucial for dynamic
    # shapes and must happen *before* querying the output shape or executing:
    # until then, the output shape still contains -1 placeholders.
    input_idx = engine.get_binding_index("input_tensor")
    output_idx = engine.get_binding_index("output_tensor")
    context.set_binding_shape(input_idx, input_data.shape)
    output_shape = tuple(context.get_binding_shape(output_idx))

    # Allocate GPU memory now that both shapes are known
    d_input = cuda.mem_alloc(input_data.nbytes)
    d_output = cuda.mem_alloc(int(np.prod(output_shape)) * np.dtype(np.float32).itemsize)

    # Transfer input to the GPU, execute, and copy the result back
    cuda.memcpy_htod_async(d_input, input_data, stream)
    context.execute_async_v2(bindings=[int(d_input), int(d_output)],
                             stream_handle=stream.handle)
    h_output = np.empty(output_shape, dtype=np.float32)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()

    # Clean up
    d_input.free()
    d_output.free()
    return h_output

h_output_1 = infer(input_data_1)
print(f"Output shape 1: {h_output_1.shape}")  # Depends on the kernel, stride, and padding
h_output_2 = infer(input_data_2)
print(f"Output shape 2: {h_output_2.shape}")
The core idea behind TensorRT’s dynamic shapes is that you define a range of possible input dimensions during engine building, and then at inference time, you tell the TensorRT context the exact dimensions for that specific inference request. This allows a single engine to be flexible. You’re not just enabling a feature; you’re guiding TensorRT’s optimizer.
When you create an OptimizationProfile, you specify the minimum, optimal, and maximum dimensions for each dynamic input tensor. TensorRT uses this information to generate kernel code that can handle any dimension within that range. The "optimal" shape is particularly important: it’s the shape that TensorRT will try to optimize for most aggressively. If your typical input size is close to the optimal shape, you’ll generally see better performance.
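Before handing a shape to the context, it is worth verifying that it actually falls within the profile's bounds, since TensorRT rejects shapes outside the min/max range. A small helper for that check (hypothetical, not part of the TensorRT API) needs nothing but plain Python:

```python
def shape_in_profile(shape, min_shape, max_shape):
    """Return True if every dimension of `shape` lies within [min, max]."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= s <= hi for s, lo, hi in zip(shape, min_shape, max_shape))

# Using the profile bounds from the example above:
MIN, MAX = (1, 3, 224, 224), (1, 3, 384, 384)
print(shape_in_profile((1, 3, 256, 320), MIN, MAX))  # True: within bounds
print(shape_in_profile((1, 3, 300, 400), MIN, MAX))  # False: width 400 > max 384
```

A shape that fails this check, such as a 400-pixel-wide image against a 384-pixel max, would make `set_binding_shape` report an error rather than run with degraded performance.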
The set_shape method on the OptimizationProfile is where you define these ranges. For an input tensor named "input_tensor", you'd call profile.set_shape("input_tensor", min=..., opt=..., max=...). The dimensions in each tuple correspond to (batch_size, channels, height, width). Note that -1 placeholders belong only in the network's input definition (add_input); the profile itself must supply concrete values for every dynamic dimension, including batch. For a dynamic batch size, you'd mark the batch dimension as -1 in the input shape and then give the profile a real range, say min batch 1 and max batch 32. For dynamic spatial dimensions, it's common to fix the batch size to 1 (or your maximum expected batch) and let the height and width vary.
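If one range of shapes is too broad to optimize well, you can register several optimization profiles, one per resolution bucket, and pick the right profile index at runtime before setting the input shape. A sketch of that selection logic in plain Python, with hypothetical bucket bounds:

```python
# Hypothetical (min, max) bounds for three profiles, in (N, C, H, W) order
PROFILES = [
    ((1, 3, 224, 224), (1, 3, 320, 320)),    # profile 0: small images
    ((1, 3, 320, 320), (1, 3, 512, 512)),    # profile 1: medium images
    ((1, 3, 512, 512), (1, 3, 1024, 1024)),  # profile 2: large images
]

def select_profile(shape):
    """Return the index of the first profile whose bounds contain `shape`."""
    for idx, (lo, hi) in enumerate(PROFILES):
        if all(l <= s <= h for s, l, h in zip(shape, lo, hi)):
            return idx
    raise ValueError(f"No profile covers shape {shape}")

print(select_profile((1, 3, 300, 256)))  # 0
print(select_profile((1, 3, 480, 480)))  # 1
```

In TensorRT you would then activate the chosen profile on the execution context (e.g. via context.set_optimization_profile_async) before calling set_binding_shape.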
During inference, before calling execute_async_v2, you must set the input's concrete shape on the execution context (via context.set_binding_shape with the input's binding index). This tells TensorRT the specific dimensions for the current run. If you omit this, TensorRT does not silently fall back to the "optimal" shape; execution fails with an error, because every dynamic dimension must be resolved before the engine can run.
The output tensor's shape is also determined dynamically, from the input shape and the network's operations. Once the input shape is set, you can query it with context.get_binding_shape(output_binding_index); before that, the output shape still contains -1 placeholders. This is why output memory is typically allocated after set_binding_shape but before execute_async_v2.
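For the example convolution (3x3 kernel, stride 1, padding 1), the spatial output size follows the standard convolution arithmetic, which is essentially what TensorRT computes when it propagates the input shape through the network:

```python
def conv2d_output_hw(h, w, kernel=3, stride=1, padding=1):
    """Standard convolution output-size formula: floor((in + 2p - k) / s) + 1."""
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    return out_h, out_w

# With k=3, s=1, p=1 the spatial size is preserved (a "same" convolution):
print(conv2d_output_hw(256, 320))  # (256, 320)
print(conv2d_output_hw(300, 384))  # (300, 384)
```

So for the example network, the output shape for an input of (1, 3, H, W) is (1, 16, H, W): the channel count comes from the convolution's output maps, and the spatial size is unchanged.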
The "optimal" shape specified in the OptimizationProfile isn’t just a suggestion; it’s a target for TensorRT’s kernel auto-tuner. If your actual input shapes frequently deviate from the optimal one, you might not achieve the performance gains you expect. Choosing a good optimal shape, often based on the most common input resolution your application will encounter, is key.
If you find that performance is inconsistent across different dynamic shapes, revisit your OptimizationProfile. Ensure the opt shape is representative of your typical workload, and that the min/max bounds encompass all expected input sizes without being excessively large, which can lead to increased engine size and slower kernel selection.