Triton can serve engines that have been optimized by TensorRT, which is NVIDIA’s SDK for high-performance deep learning inference.

Let’s see what this looks like in practice. Imagine we have a trained PyTorch model for image classification. We first optimize it using TensorRT to get a .plan file, which is essentially a serialized TensorRT engine.

import torch
from torch2trt import torch2trt

# Assume 'model' is your trained PyTorch model, on the GPU and in eval mode:
# model = model.cuda().eval()
# Assume 'input_tensor' is a dummy input with the correct shape and dtype, e.g.:
# input_tensor = torch.randn(1, 3, 224, 224).cuda()

# Convert the PyTorch model to a TensorRT engine.
# This step can take a while depending on the model size and hardware.
model_trt = torch2trt(model, [input_tensor], fp16_mode=True, max_batch_size=1)

# torch2trt returns a TRTModule wrapper; the underlying TensorRT engine
# is exposed as model_trt.engine. Serialize that engine to a .plan file:
with open("model.plan", "wb") as f:
    f.write(model_trt.engine.serialize())

Now, we want Triton to serve this model.plan. We’ll create a model repository for Triton. This repository is a directory structure that Triton expects.

model_repository/
  my_model/
    1/
      model.plan
    config.pbtxt
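The layout can be created with a few lines of Python. This is a minimal sketch; the repository path and the location of model.plan are assumptions for illustration:

```python
import shutil
from pathlib import Path

# Hypothetical paths for illustration; adjust to your environment.
repo = Path("model_repository")
version_dir = repo / "my_model" / "1"
version_dir.mkdir(parents=True, exist_ok=True)

# Place the serialized engine in the version directory
# (assumes model.plan was saved in the current directory).
if Path("model.plan").exists():
    shutil.copy("model.plan", version_dir / "model.plan")

# config.pbtxt lives next to the version directories, not inside them.
(repo / "my_model" / "config.pbtxt").touch()
```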

The model.plan file is our serialized TensorRT engine. The config.pbtxt file tells Triton about the model.

name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [3, 224, 224]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [1000]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]

This config.pbtxt specifies:

  • name: The name of our model.
  • platform: Crucially, this is "tensorrt_plan", telling Triton to use its TensorRT backend.
  • input and output: Defines the expected input and output tensor names, data types, and shapes. The names input__0 and output__0 are the convention used by torch2trt (other conversion tools use their own naming schemes) and must exactly match the tensor names baked into the engine.
  • instance_group: Configures how many model instances to run and on which GPUs.

With this setup, we can launch Triton with the model repository:

tritonserver --model-repository=/path/to/model_repository

Triton will load my_model, recognize it as a TensorRT plan, and make it available for inference requests.
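Once the server is up, clients can hit Triton's HTTP/REST endpoint, which implements the KServe v2 inference protocol. The official tritonclient package is the usual choice, but the request itself is plain JSON, so a minimal standard-library sketch is enough to show the shape of a request (the server URL and the all-zeros payload are assumptions for illustration):

```python
import json
import urllib.request

def build_infer_request(model_name, input_name, shape, data):
    """Build the URL path and JSON body for a KServe v2 inference request."""
    path = f"/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,
                "shape": shape,
                "datatype": "FP32",
                "data": data,  # flattened, row-major values
            }
        ]
    }
    return path, json.dumps(body).encode("utf-8")

path, body = build_infer_request(
    "my_model", "input__0", [1, 3, 224, 224], [0.0] * (3 * 224 * 224)
)

# Sending it requires a running Triton server (default HTTP port 8000):
# req = urllib.request.Request(
#     "http://localhost:8000" + path, data=body,
#     headers={"Content-Type": "application/json"})
# response = json.load(urllib.request.urlopen(req))
# The scores come back under response["outputs"].
```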

The core problem TensorRT solves is taking a high-level framework model (like PyTorch, TensorFlow, ONNX) and transforming it into a highly optimized, low-level kernel execution plan tailored for NVIDIA GPUs. This involves techniques like layer and tensor fusion, kernel auto-tuning, and precision calibration. Triton then acts as the inference serving layer, efficiently managing these optimized engines, handling requests, batching, and multi-model deployment.

When you convert a model to a TensorRT engine, TensorRT analyzes the model’s computational graph and applies numerous optimizations. It can fuse layers that are computationally adjacent (e.g., a convolution followed by a bias add and an activation function) into a single, highly optimized kernel. It also selects the best-performing kernels for each operation on your specific GPU architecture and can perform precision reduction (e.g., FP32 to FP16) with minimal accuracy loss, significantly boosting throughput and reducing latency. Triton’s TensorRT backend directly loads this serialized engine, bypassing the need for an intermediate framework runtime and leveraging these deep optimizations.
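The payoff of fusion is easy to see even in a toy setting: three elementwise ops applied separately make three passes over the data, each materializing an intermediate buffer, while the fused version makes one pass and allocates nothing in between. The sketch below uses scalar ops to mimic the conv → bias → activation pattern; it is purely illustrative, not how TensorRT implements fusion:

```python
def scale_bias_relu_unfused(xs, scale, bias):
    """Three separate passes over the data, two intermediate buffers."""
    scaled = [x * scale for x in xs]      # pass 1: "conv" stand-in
    biased = [x + bias for x in scaled]   # pass 2: bias add
    return [max(x, 0.0) for x in biased]  # pass 3: ReLU

def scale_bias_relu_fused(xs, scale, bias):
    """Same math in a single pass, no intermediate buffers."""
    return [max(x * scale + bias, 0.0) for x in xs]

xs = [-1.0, 0.5, 2.0]
assert scale_bias_relu_unfused(xs, 2.0, 1.0) == scale_bias_relu_fused(xs, 2.0, 1.0)
```

On a GPU the analogous win is larger still, since each unfused pass is a separate kernel launch with its own global-memory round trip.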

The surprising efficiency comes from TensorRT’s ability to compile a static execution graph that is deeply optimized for the target hardware. Unlike dynamic execution graphs that might involve Python overhead or framework-specific dispatching for each inference call, a TensorRT engine is a pre-compiled, highly specialized sequence of GPU kernel launches. Triton’s role is to expose this compiled artifact in a scalable and manageable way, handling network communication, request batching, and multi-GPU deployment without re-interpreting the model graph itself.

The config.pbtxt’s input and output dims are critical, and they interact with batching. When max_batch_size is greater than 0, Triton treats the batch dimension as implicit, so dims lists only the per-sample shape ([3, 224, 224] in our example); a -1 inside dims denotes a dynamic non-batch dimension, and you would only write the batch dimension explicitly (e.g., dims: [ -1, 3, 224, 224 ]) in the max_batch_size: 0 case with an explicit-batch engine. Setting max_batch_size above 1 allows Triton to group multiple incoming requests into a single larger batch for the TensorRT engine, further improving GPU utilization and throughput, but it requires the engine to have been built to accept batches of that size, and automatic request grouping additionally requires enabling Triton's dynamic batcher in the model config.
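As a sketch, a config enabling dynamic batching might look like the following. The batch size of 8 and the queue delay are illustrative values, and they assume the engine was rebuilt to accept batches of up to 8:

```
name: "my_model"
platform: "tensorrt_plan"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

The max_queue_delay_microseconds knob trades a small amount of latency for the chance to accumulate a fuller batch before launching the engine.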

The next step would be to explore Triton’s dynamic batching capabilities for TensorRT models.
