TensorRT can deploy semantic segmentation models, but the real magic is how it aggressively optimizes them for inference speed on NVIDIA GPUs, often outperforming the original framework’s inference.
Let’s look at a typical semantic segmentation pipeline using TensorRT. We’ll assume you’ve trained a model in a framework like PyTorch or TensorFlow.
```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import cv2

# --- Configuration ---
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
model_path = "your_model.onnx"  # Or your serialized TensorRT engine
num_classes = 21  # Adjust to your model (e.g., 21 for Pascal VOC)
input_shape = (1, 3, 512, 512)  # Batch, Channels, Height, Width
output_shape = (1, num_classes, 512, 512)  # Batch, Classes, Height, Width
image_path = "test_image.jpg"
output_image_path = "segmented_image.png"
class_colors = np.random.randint(0, 256, size=(num_classes, 3), dtype=np.uint8)  # Example colors

# --- 1. Build or Load TensorRT Engine ---
def build_engine(model_path, input_shape):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB workspace

    # Parse the ONNX model
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(model_path, "rb") as model:
        if not parser.parse(model.read()):
            print("Failed to parse ONNX file")
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return None

    # Pin the input shape; the output shape is inferred from the network
    input_tensor = network.get_input(0)
    input_tensor.shape = input_shape

    # Build the engine (TensorRT 8.x API)
    return builder.build_engine(network, config)

# If you already have a serialized engine, load it directly:
# with open("your_engine.trt", "rb") as f:
#     runtime = trt.Runtime(TRT_LOGGER)
#     engine = runtime.deserialize_cuda_engine(f.read())

# --- 2. Preprocess Input Image ---
def preprocess_image(image_path, input_shape):
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB
    img = cv2.resize(img, (input_shape[3], input_shape[2]))  # Resize to model input W x H
    img = img.astype(np.float32)
    img = img.transpose((2, 0, 1))  # HWC to CHW
    img = np.expand_dims(img, axis=0)  # Add batch dimension
    img = img / 255.0  # Normalize to [0, 1] (adjust if your model expects different normalization)
    return np.ascontiguousarray(img)

# --- 3. Inference ---
def do_inference(context, bindings, inputs, outputs, stream):
    # Transfer input data to the GPU on the same stream as execution
    cuda.memcpy_htod_async(inputs[0]['device'], inputs[0]['host'], stream)
    # Execute inference
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    # Transfer predictions back to the CPU
    cuda.memcpy_dtoh_async(outputs[0]['host'], outputs[0]['device'], stream)
    stream.synchronize()
    return outputs[0]['host']

# --- 4. Postprocess Output ---
def postprocess_segmentation(output_data, output_shape, class_colors, original_image_shape):
    # Pick the class with the highest score for each pixel
    output_data = output_data.reshape(output_shape)
    segmentation_mask = np.argmax(output_data, axis=1)  # Shape: (batch, H, W)
    segmentation_mask = segmentation_mask.squeeze(0)  # Remove batch dimension
    # Resize the mask back to the original image dimensions
    segmentation_mask_resized = cv2.resize(segmentation_mask.astype(np.uint8),
                                           (original_image_shape[1], original_image_shape[0]),
                                           interpolation=cv2.INTER_NEAREST)  # Nearest neighbor preserves class IDs
    # Create a colored segmentation image
    colored_segmentation = np.zeros((original_image_shape[0], original_image_shape[1], 3), dtype=np.uint8)
    for class_id, color in enumerate(class_colors):
        colored_segmentation[segmentation_mask_resized == class_id] = color
    return colored_segmentation

# --- Main Execution ---
if __name__ == "__main__":
    # Build or load the engine
    engine = build_engine(model_path, input_shape)
    if not engine:
        exit(1)

    # Create execution context
    context = engine.create_execution_context()

    # Allocate host and device buffers
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        # Explicit-batch engines already include the batch dim in the binding shape
        size = trt.volume(engine.get_binding_shape(binding))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # The bindings list holds device pointers in binding order
        bindings.append(int(device_mem))
        if engine.binding_is_input(binding):
            inputs.append({'host': host_mem, 'device': device_mem})
        else:
            outputs.append({'host': host_mem, 'device': device_mem})

    # Load and preprocess the image
    original_image = cv2.imread(image_path)
    preprocessed_image = preprocess_image(image_path, input_shape)
    np.copyto(inputs[0]['host'], preprocessed_image.ravel())  # Fill the pinned input buffer

    # Perform inference
    inference_output = do_inference(context, bindings, inputs, outputs, stream)

    # Postprocess: the mask is resized back to the original image shape
    segmented_image = postprocess_segmentation(inference_output, output_shape,
                                               class_colors, original_image.shape)

    # Overlay the segmentation on the original image (optional)
    alpha = 0.5
    overlay = cv2.addWeighted(original_image, 1 - alpha, segmented_image, alpha, 0)
    cv2.imwrite(output_image_path, overlay)
    print(f"Segmentation saved to {output_image_path}")
```
The core idea is to take a trained model (often exported to ONNX), convert it into a highly optimized TensorRT engine, and then run inference using that engine. TensorRT performs several key optimizations:
- Layer and Tensor Fusion: It merges multiple layers (like convolution, bias addition, and activation) into a single kernel. This reduces memory bandwidth usage and kernel launch overhead.
- Precision Calibration: It can quantize models from FP32 to FP16 or INT8. For INT8, it requires a calibration step where representative data is passed through the network to determine optimal quantization scales. This significantly boosts performance with minimal accuracy loss.
- Kernel Auto-Tuning: TensorRT selects the most efficient CUDA kernels for your specific GPU architecture and input shapes.
- Dynamic Tensor Memory: It reuses memory for tensors that are no longer needed, reducing overall memory footprint.
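To make the precision options concrete, here is a minimal sketch of how FP16 and INT8 are enabled through the builder config, assuming the TensorRT 8.x Python API; `MyEntropyCalibrator` is a hypothetical calibrator class you would implement, not part of TensorRT.

```python
import tensorrt as trt

# Sketch (TensorRT 8.x API): opting into reduced precision at build time.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# FP16 needs no calibration data; enable it when the GPU supports it natively.
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# INT8 additionally requires a calibrator fed with representative images.
# `MyEntropyCalibrator` is a hypothetical IInt8EntropyCalibrator2 subclass:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = MyEntropyCalibrator(calibration_images)
```

FP16 is usually the first thing to try: it often halves latency on Tensor Core GPUs with negligible accuracy impact, while INT8 trades extra calibration effort for further speedup.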
The Mental Model:
- Training: You train your segmentation model (e.g., U-Net, DeepLab) in a framework like PyTorch or TensorFlow.
- Export to ONNX: Convert your trained model to the Open Neural Network Exchange (ONNX) format. This is a common intermediate representation.
- TensorRT Engine Building: Use the TensorRT `Builder` API to parse the ONNX file and create an optimized engine. This step is where most of the heavy lifting happens. You specify optimization profiles (e.g., input shapes, batch sizes) and hardware configuration.
- Inference Runtime: Load the built engine using the TensorRT `Runtime`.
- Execution Context: Create an `ExecutionContext` from the engine. This context manages the state of a single inference run.
- Memory Management: Allocate host (CPU) and device (GPU) memory buffers for input and output tensors. TensorRT's `bindings` array maps these buffers.
- Inference: Transfer input data to the GPU, execute the engine using the context, and transfer results back.
- Post-processing: Convert the raw output (typically logits or probability maps) into a usable segmentation mask, often involving `argmax` and resizing.
The most surprising thing about TensorRT deployment is how far its optimizations go beyond simple graph pruning or kernel selection. It fundamentally restructures the computation, often fusing operations in ways that are not exposed or easily achievable in the original training framework, leading to dramatic speedups. For example, a convolution followed by a bias add and a ReLU might be fused into a single, highly optimized CUDA kernel that performs all three operations in one pass, avoiding intermediate memory writes and reads.
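A toy NumPy illustration of what that fusion computes (not how TensorRT's kernel works internally): the unfused path materializes two intermediate tensors, while the fused expression evaluates relu(conv(x) + bias) with no intermediates to write back to memory. A 1x1 convolution is used here so the conv reduces to a tensor contraction.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4, 4)).astype(np.float32)   # input: C, H, W
w = rng.standard_normal((8, 3)).astype(np.float32)      # 1x1 conv: out_ch, in_ch
b = rng.standard_normal((8, 1, 1)).astype(np.float32)   # per-channel bias

# Unfused: three separate passes, two materialized intermediates.
conv_out = np.tensordot(w, x, axes=([1], [0]))          # (8, 4, 4)
biased = conv_out + b
unfused = np.maximum(biased, 0.0)

# "Fused": one expression; a real fused kernel keeps everything in registers.
fused = np.maximum(np.tensordot(w, x, axes=([1], [0])) + b, 0.0)

assert np.allclose(unfused, fused)
```

The numerical result is identical; the win comes entirely from eliminating the round trips to GPU memory between the three operations.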
When you’re debugging performance or looking to squeeze out every last millisecond, remember that TensorRT’s `BuilderConfig` allows fine-grained control over memory pool limits (especially `WORKSPACE`), precision modes (FP32, FP16, INT8), and even specific kernel selection strategies. The `set_memory_pool_limit` call for `WORKSPACE` is crucial for large models or complex layers, as it dictates how much temporary memory TensorRT can use for complex operations.
The next step after mastering basic deployment is understanding how to handle dynamic input shapes, which is critical for variable-sized images or batching strategies, often managed through `IOptimizationProfile`.
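As a taste of that, here is a minimal sketch of registering an optimization profile at build time, again assuming the TensorRT 8.x Python API; the tensor name "input" and the chosen min/opt/max resolutions are placeholder values that must match your ONNX export.

```python
import tensorrt as trt

# Sketch (TensorRT 8.x API): a profile covering a range of input resolutions.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()

profile = builder.create_optimization_profile()
# "input" is the assumed ONNX input tensor name; the dynamic axes must be
# marked as dynamic (e.g., -1) in the exported ONNX model.
profile.set_shape("input",
                  min=(1, 3, 256, 256),    # smallest shape you will ever feed
                  opt=(1, 3, 512, 512),    # shape TensorRT tunes kernels for
                  max=(1, 3, 1024, 1024))  # largest shape you will ever feed
config.add_optimization_profile(profile)

# At inference time, set the concrete shape on the context before executing:
# context.set_binding_shape(0, (1, 3, 720, 1280))
```

TensorRT optimizes most aggressively for the `opt` shape, so choose it to match your most common input size.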