Triton Inference Server can be configured to use TensorRT as a backend for accelerating model inference.

Let’s look at a typical setup. Suppose you have a TensorFlow model you want to deploy. You’d first export it to ONNX (for example with tf2onnx), then build a TensorRT engine from the ONNX file using trtexec or the TensorRT Python API, serializing the optimized model into a .plan file.

trtexec --onnx=/path/to/your/model.onnx --saveEngine=/path/to/your/model.plan --fp16 --shapes=input_tensor:1x3x224x224

This command takes your ONNX model, optimizes it for your specific GPU (in this case, using FP16 precision and a fixed input shape), and saves the optimized engine.

Once you have the .plan file, you configure Triton to use it. This is done through config.pbtxt files.
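Triton discovers models through a model repository directory. For the engine above, the expected layout looks like this (`models/` is whatever path you pass to `--model-repository`; the numbered subdirectory is the version directory):

```
models/
└── my_trt_model/
    ├── config.pbtxt
    └── 1/
        └── model.plan
```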

name: "my_trt_model"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "input_tensor"
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output_tensor"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

Here, platform: "tensorrt_plan" tells Triton to load the TensorRT backend. The input and output definitions must match the engine’s tensor names, data types, and shapes. max_batch_size caps the largest batch Triton will send to the model; setting it above 0 is also what enables Triton’s batching support. The instance_group specifies how many copies of the model to run, and on which GPUs.

Triton loads this configuration and, when a request comes in for "my_trt_model", it dispatches it to the TensorRT backend. The TensorRT engine then performs the inference on the GPU.
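To make the request side concrete, here is a sketch of the JSON payload a client would POST to Triton’s HTTP endpoint for this model, following the KServe v2 inference protocol Triton implements. The endpoint path and default port are assumptions about a standard Triton deployment, and the data values are placeholders, not real image pixels:

```python
import json

# Sketch of the KServe v2 inference payload Triton's HTTP endpoint accepts.
# Assumed endpoint: POST http://localhost:8000/v2/models/my_trt_model/infer
batch, c, h, w = 1, 3, 224, 224
payload = {
    "inputs": [
        {
            "name": "input_tensor",        # must match config.pbtxt
            "shape": [batch, c, h, w],     # batch dim IS included on the wire
            "datatype": "FP16",            # must match the engine's input type
            "data": [0.0] * (batch * c * h * w),  # placeholder values
        }
    ],
    "outputs": [{"name": "output_tensor"}],
}
body = json.dumps(payload)  # send as the request body with any HTTP client
```

Any HTTP client can send this body; in practice most people use the official tritonclient library, which builds the same payload for you.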

The real magic happens in how TensorRT fuses operations. Unlike standard frameworks that execute each layer sequentially, TensorRT analyzes the entire graph and can combine multiple layers into a single, highly optimized kernel. For example, a convolution followed by an element-wise addition and a ReLU activation can often be fused into a single GPU kernel, eliminating intermediate memory copies and significantly speeding up computation.
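A toy sketch of the arithmetic involved (plain Python standing in for GPU kernels) shows what fusion preserves and what it saves: the fused version computes the identical function but makes one pass over the data with no intermediate buffer, which is where the speedup on real hardware comes from:

```python
# Illustrative only: "conv output + bias, then ReLU", unfused vs. fused.

def relu(v):
    return [max(x, 0.0) for x in v]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def unfused(conv_out, bias):
    tmp = add(conv_out, bias)   # intermediate result written out ("memory copy")
    return relu(tmp)            # read back for a second pass

def fused(conv_out, bias):
    # One pass, no intermediate: what a fused add+ReLU kernel does on-GPU.
    return [max(x + y, 0.0) for x, y in zip(conv_out, bias)]

assert fused([0.5, -1.2, 3.0], [0.1, 0.2, -3.5]) == \
       unfused([0.5, -1.2, 3.0], [0.1, 0.2, -3.5])
```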

The trtexec tool itself is incredibly powerful. Beyond building engines, it can benchmark your model and report per-layer timings (--dumpProfile) to help you find bottlenecks. You can experiment with different precision modes (--fp16, --int8) and different input shapes (--shapes) to find the optimal engine for your hardware. Keep in mind that an engine is optimized for the GPU it was built on, so build on the same hardware you deploy to.

trtexec --loadEngine=/path/to/your/model.plan --warmUp=10 --iterations=100 --duration=5

This command loads an already-built engine (no --onnx needed) and benchmarks it: after the warm-up period (--warmUp, in milliseconds), trtexec runs timed inference for the requested iterations and duration and prints summary statistics, including throughput in inferences per second and latency percentiles.

When setting up TensorRT models in Triton, a common point of confusion is matching the data_type in the config.pbtxt to the engine’s actual input and output tensor types. Note that building with --fp16 changes internal compute precision but does not necessarily make the engine’s I/O tensors FP16. If the engine’s input is FP16 but you specify TYPE_FP32 in the config (or the reverse), you’ll see strange results or outright errors; trtexec’s build log lists the input and output types, so check it and always ensure consistency.

Another subtle but important aspect is the dims in the config.pbtxt. When max_batch_size is greater than 0, these define the shape of a single input excluding the batch dimension; the batch dimension is handled by Triton itself. If your model expects [batch_size, 3, 224, 224], your config.pbtxt should have dims: [ 3, 224, 224 ].
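The relationship is simple enough to state as a one-line sketch (values here match the example config; they are illustrative):

```python
# config.pbtxt dims exclude the batch dimension; the tensor a client
# actually sends has the batch dimension prepended, so the on-the-wire
# shape is [batch_size] + dims.
config_dims = [3, 224, 224]   # as written in config.pbtxt
batch_size = 8                # must be <= max_batch_size

full_shape = [batch_size] + config_dims   # shape of the request tensor
```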

The primary benefit of using TensorRT with Triton is achieving state-of-the-art inference performance on NVIDIA GPUs. TensorRT’s optimizations, combined with Triton’s efficient request handling and batching, can lead to dramatic improvements in throughput and reductions in latency, especially for complex deep learning models.

One aspect that often catches people off guard is how TensorRT handles dynamic batching and variable input shapes. An engine built with a fixed --shapes value accepts only that exact shape; to serve variable batch sizes, the ONNX model needs a dynamic batch dimension and the engine must be built with an optimization profile (trtexec’s --minShapes, --optShapes, and --maxShapes). On the Triton side, you enable dynamic_batching in the config.pbtxt and optionally set max_queue_delay_microseconds. This allows Triton to accumulate requests and send them to the TensorRT backend in batches of varying sizes, further optimizing throughput.
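Assuming the ONNX model has a dynamic batch dimension, the two pieces might look like this (the shape range, preferred batch sizes, and queue delay are illustrative values, not recommendations):

```
trtexec --onnx=/path/to/your/model.onnx \
        --saveEngine=/path/to/your/model.plan \
        --minShapes=input_tensor:1x3x224x224 \
        --optShapes=input_tensor:8x3x224x224 \
        --maxShapes=input_tensor:16x3x224x224
```

And in config.pbtxt:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```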

The next hurdle you’ll likely encounter is managing multiple TensorRT engines for different models or different versions of the same model within a single Triton instance.
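The model repository handles much of this natively: each model is its own directory, and numbered subdirectories are versions. A sketch of a two-version layout (the version_policy shown is an assumption; by default Triton serves only the latest version):

```
models/
└── my_trt_model/
    ├── config.pbtxt
    ├── 1/
    │   └── model.plan
    └── 2/
        └── model.plan
```

Adding version_policy: { latest { num_versions: 2 }} to the config.pbtxt would serve both versions; clients can then request a specific version or take the default.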
