Triton Inference Server can serve ONNX models, but getting it right means understanding how Triton interprets the ONNX graph and how to tune it for peak performance.

Let’s walk through serving an ONNX model with Triton. Imagine we have a simple ONNX model, model.onnx, that takes a float32 tensor named input of shape [1, 3, 224, 224] and produces a float32 tensor named output of shape [1, 1000].

First, we need a config.pbtxt file for Triton. This tells Triton about our model.

name: "my_onnx_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

Notice platform: "onnxruntime_onnx". This is crucial: it tells Triton to use the ONNX Runtime backend. The input and output definitions must match the tensor names and data types in the ONNX model. Because max_batch_size is greater than zero, the dims fields omit the leading batch dimension; Triton supplies it, grouping incoming requests into batches for more efficient processing.

Now, we place the files into the directory structure Triton expects. Note that config.pbtxt lives at the model's root, while each numbered subdirectory holds one version of the model file:

models/
  my_onnx_model/
    config.pbtxt
    1/
      model.onnx
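With the layout in place, Triton can be launched against it, for example via the NGC container (the `<xx.yy>` image tag is a placeholder for a current release; pick one that matches your environment):

```shell
# Launch Triton from the NGC container, mounting the model repository.
# <xx.yy> is a placeholder release tag, e.g. a recent monthly build.
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$PWD/models:/models" \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models
```

Ports 8000, 8001, and 8002 are the defaults for HTTP, gRPC, and metrics respectively.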

Triton is started with the --model-repository=models flag. Once it is running, we can send inference requests over its HTTP endpoint. Here is a simple Python client using requests and numpy:

import numpy as np
import requests
import json

# Generate dummy input data
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Prepare the request payload
payload = {
    "inputs": [
        {
            "name": "input",
            "shape": list(input_data.shape),
            "datatype": "FP32",
            "data": input_data.tolist()
        }
    ]
}

# Send the request to Triton (assuming Triton is running on localhost:8000)
response = requests.post("http://localhost:8000/v2/models/my_onnx_model/infer", json=payload)
result = response.json()

print(json.dumps(result, indent=2))

This sends a single inference request over Triton's KServe v2 HTTP API. Triton hands the request to ONNX Runtime for execution, and the JSON response contains the output tensor as a flattened list of values along with its shape and datatype.
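Because each output in a KServe v2 response carries flattened data plus its shape, you typically reassemble it into a numpy array before use. A minimal sketch, using an illustrative payload rather than real model output:

```python
import numpy as np

# Shape of a KServe v2 inference response (values here are illustrative,
# not real model output).
result = {
    "model_name": "my_onnx_model",
    "outputs": [
        {
            "name": "output",
            "datatype": "FP32",
            "shape": [1, 1000],
            "data": [0.0] * 1000,  # row-major flattened values
        }
    ],
}

# Rebuild the tensor from the flattened data and the reported shape.
out = result["outputs"][0]
scores = np.array(out["data"], dtype=np.float32).reshape(out["shape"])

# Indices of the five largest scores, e.g. top-5 class predictions.
top5 = np.argsort(scores[0])[-5:][::-1]
print(scores.shape, top5.shape)
```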

The real power comes from understanding Triton’s inference configuration options and how they interact with ONNX Runtime.

The config.pbtxt file is where much of the optimization happens. Beyond max_batch_size, you can control dynamic batching:

name: "my_onnx_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [ ... ]
output [ ... ]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 1000
}

Here, preferred_batch_size hints to Triton about batch sizes that are efficient for this model. Triton will try to form batches of 4 or 8. max_queue_delay_microseconds specifies how long Triton will wait to collect more requests before starting inference on a partially filled batch. A lower value means lower latency but potentially smaller batch sizes.

Triton also allows you to specify an instance_group to control how many copies of your model run and on which devices:

name: "my_onnx_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [ ... ]
output [ ... ]
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]

This configuration tells Triton to run two instances of my_onnx_model, each on a separate GPU (GPU 0 and GPU 1). This is key for throughput scaling. If you only have CPUs, you’d use kind: KIND_CPU and potentially count greater than 1 for multi-core utilization.
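A CPU-only variant might look like the fragment below; the count of 4 is an arbitrary illustration, and in practice you would tune it against your core count and the model's own thread usage:

```
instance_group [
  {
    count: 4
    kind: KIND_CPU
  }
]
```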

Model optimization within ONNX Runtime itself can also be leveraged: Triton’s ONNX Runtime backend supports ONNX Runtime’s execution providers. For example, to enable the TensorRT execution provider inside ONNX Runtime (assuming a GPU and a TensorRT-capable ONNX Runtime build):

name: "my_onnx_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [ ... ]
output [ ... ]
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters {
        key: "precision_mode"
        value {
          string_value: "FP16"
        }
      }
    }]
  }
}

This routes execution through ONNX Runtime’s TensorRT execution provider, an additional optimization layer on top of plain CUDA execution. For plain CUDA, no execution_accelerators block is needed: when a model instance is placed on a GPU and the ONNX Runtime build supports CUDA, the backend uses the CUDA execution provider automatically. Often this is determined by the environment Triton runs in and how ONNX Runtime was built; the configuration primarily selects among the execution providers that are available.

When ONNX Runtime loads an ONNX graph, it first analyzes the operations. If it finds operations that can be fused or optimized, it performs these transformations. It then selects the most appropriate execution provider (e.g., CPU, CUDA, TensorRT if available) to run the optimized graph. Triton’s role is to manage the model lifecycle, request batching, and select the correct backend platform (onnxruntime_onnx in this case), allowing ONNX Runtime to do its work.

The optimization.execution_accelerators block in config.pbtxt selects optional execution providers such as TensorRT (on GPU) or OpenVINO (on CPU); it is not how the default CPU and CUDA execution providers are configured. For ONNX Runtime, GPU acceleration is typically enabled by having a CUDA-capable ONNX Runtime build (the official Triton containers ship with one) and working CUDA drivers. Triton will then utilize the available execution providers.

A subtle but important detail is how Triton handles model versioning. If you have multiple versions of your ONNX model, you place them in numbered subdirectories like models/my_onnx_model/1/ and models/my_onnx_model/2/. By default, Triton serves only the latest (highest-numbered) version; a version_policy setting in config.pbtxt can make it load all versions or a specific set. You can then target a particular version by addressing it in the URL path, e.g. /v2/models/my_onnx_model/versions/2/infer.
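The loading behavior is controlled by version_policy in config.pbtxt. For example, to serve every version found in the repository rather than only the latest:

```
version_policy: { all { } }
```

Alternatively, version_policy: { specific { versions: [1, 2] } } pins the set of loaded versions explicitly.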

The next hurdle you’ll likely face is understanding how to monitor inference performance. You’ll want to look at metrics like request latency, throughput, and GPU utilization to identify bottlenecks.
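Triton exposes these as Prometheus metrics on its metrics port (8002 by default). Assuming a server is running locally, you can inspect them directly:

```shell
# Scrape Triton's Prometheus endpoint and filter for inference counters,
# such as nv_inference_request_success and nv_inference_request_duration_us.
curl -s http://localhost:8002/metrics | grep nv_inference
```

In production these are usually scraped by Prometheus and charted, rather than read by hand.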
