The Triton Performance Analyzer can tell you more about your model’s inference performance than you might expect, but most users only scratch the surface with basic throughput tests.

Let’s see it in action. Imagine you’ve got a model, resnet50-onnx, running on Triton. You’ve already built your model repository and have Triton running. You want to see how fast it can go, but also understand why it’s fast or slow.

Your model repository should look something like this:

model_repository/
  resnet50-onnx/
    1/
      model.onnx
    config.pbtxt

The config.pbtxt is crucial:

name: "resnet50-onnx"
platform: "onnxruntime_onnx"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

Now, let’s run a simple throughput test. You’ll need the tritonclient Python package (pip install tritonclient[http]).

import tritonclient.http as httpclient
import numpy as np
import time

# Connect to Triton
try:
    triton_client = httpclient.InferenceServerClient(url="localhost:8000")
except Exception as e:
    print(f"client creation failed: {e}")
    raise SystemExit(1)

# Model and input details
model_name = "resnet50-onnx"
input_name = "input"
input_shape = (3, 224, 224)  # Per-sample shape; the batch dimension is added below
input_dtype = np.float32
batch_size = 16  # Matches max_batch_size in config.pbtxt

# Generate dummy data (batch dimension first, since max_batch_size > 0)
dummy_input = np.random.rand(batch_size, *input_shape).astype(input_dtype)
inputs = [httpclient.InferInput(input_name, list(dummy_input.shape), "FP32")]
inputs[0].set_data_from_numpy(dummy_input)

# Warmup
for _ in range(5):
    triton_client.infer(model_name, inputs)

# Benchmarking
num_inferences = 100
start_time = time.time()
for _ in range(num_inferences):
    triton_client.infer(model_name, inputs)
end_time = time.time()

# Calculate throughput
total_time = end_time - start_time
throughput = (num_inferences * batch_size) / total_time
print(f"Throughput: {throughput:.2f} inferences/sec")
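Throughput alone hides latency variance. If you also record each request’s round-trip time inside the loop above, a small helper can summarize the distribution. A sketch; summarize_latencies is our own name, not part of tritonclient:

```python
import numpy as np

def summarize_latencies(latencies_s):
    """Return mean/p50/p95/p99 latency in milliseconds from per-request timings."""
    lat_ms = np.asarray(latencies_s, dtype=np.float64) * 1000.0
    return {
        "mean_ms": float(lat_ms.mean()),
        "p50_ms": float(np.percentile(lat_ms, 50)),
        "p95_ms": float(np.percentile(lat_ms, 95)),
        "p99_ms": float(np.percentile(lat_ms, 99)),
    }

# Example with synthetic timings (seconds); in the benchmark loop you would
# instead append the time.time() delta around each triton_client.infer() call.
stats = summarize_latencies([0.010, 0.012, 0.011, 0.050, 0.013])
print(stats)
```

Even five minutes with percentiles usually tells you more than a single average: a healthy p50 with a bad p99 points at queuing or contention rather than a uniformly slow model.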

This gives you a raw number. But what if you want to go deeper? Triton’s Performance Analyzer (the perf_analyzer binary, shipped in the Triton SDK container) is built for this. It generates load with its own client and collects detailed metrics.

To use the Analyzer, you’ll run it from the command line against an already-running Triton server (it does not load the model repository itself). Here’s a basic example:

perf_analyzer -m resnet50-onnx \
              -u localhost:8000 \
              -b 16 \
              --concurrency-range 1:4 \
              -f analyzer_results.csv

By default, perf_analyzer sends random data, which is fine for many vision models. The --input-data flag lets you supply real inputs instead, typically a JSON file describing the tensors for each request; use it when your model’s performance depends on input content. The Analyzer cycles through this data when constructing inference requests.

The Analyzer steps through the concurrency levels given by --concurrency-range, starting low and increasing, measuring latency, throughput, and the server-side timing breakdown at each step. It prints a summary per level and writes the results to the CSV file given by -f.

The real power comes from understanding the metrics the Analyzer collects. Beyond simple throughput, it breaks each request’s latency into components:

  • Request latency: the end-to-end time from when the client sends a request until it receives the response.
  • Server compute: the time Triton spends actually running the model on the server (reported as compute input, compute infer, and compute output).
  • Server queue: the time a request spends waiting in Triton’s scheduler queue before a model instance picks it up.
  • Client send/receive: the time the client spends serializing the request and deserializing the response, plus network transfer.

By comparing these, you can pinpoint bottlenecks. High queue time means requests are waiting for a free model instance: the server is saturated, and adding instances (via instance_group) or enabling dynamic batching may help. High compute time points to the model itself being slow or the hardware being overloaded. High client send/receive time suggests network issues or inefficient data handling on the client.
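As a toy version of that triage, here is a helper that flags the dominant component of a latency breakdown. The component names mirror the categories above, but the function itself is illustrative glue, not part of perf_analyzer:

```python
def dominant_component(breakdown_us):
    """Given a dict of latency components in microseconds, return the largest
    one and its share of the total. Purely illustrative; the keys are ours."""
    name, value = max(breakdown_us.items(), key=lambda kv: kv[1])
    total = sum(breakdown_us.values())
    share = value / total if total else 0.0
    return name, share

# A request that spends most of its time queued suggests adding model
# instances or enabling dynamic batching on the server.
name, share = dominant_component({"queue": 4200, "compute": 1300, "network": 500})
print(f"bottleneck: {name} ({share:.0%} of measured latency)")
```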

The Analyzer also works for tuning dynamic batching and for benchmarking model ensembles. Dynamic batching is configured on the server rather than in perf_analyzer: add a dynamic_batching block to config.pbtxt, then sweep concurrency and watch how throughput scales as Triton forms larger batches under load.
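If you do enable dynamic batching, a minimal block added to the config.pbtxt above might look like this (the particular batch sizes and delay are illustrative starting points, not recommendations):

dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}

The max_queue_delay_microseconds value trades a little latency for the chance to form larger batches; sweep it alongside concurrency to find the balance your workload tolerates.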

A crucial, often overlooked, detail is the format of the file passed to --input-data. Each entry in its data array describes the inputs for one request, keyed by input name, with the tensor flattened into a content list. For our model it looks something like this (content truncated):

{
  "data": [
    {
      "input": {
        "content": [0.42, 0.17, ...],
        "shape": [3, 224, 224]
      }
    }
  ]
}

If your model has multiple inputs, each entry in data lists them all.
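Writing 150,528 floats by hand is impractical, so in practice you generate the file. A sketch; the helper name and file name are ours, and the JSON layout follows perf_analyzer’s documented input-data format:

```python
import json

import numpy as np

def write_input_data(path, input_name="input", shape=(3, 224, 224), num_requests=4):
    """Write a perf_analyzer-style --input-data JSON file with random FP32 data."""
    data = []
    for _ in range(num_requests):
        sample = np.random.rand(*shape).astype(np.float32)
        data.append({
            input_name: {
                # perf_analyzer expects the tensor flattened into a 1-D list
                "content": sample.flatten().tolist(),
                "shape": list(shape),
            }
        })
    with open(path, "w") as f:
        json.dump({"data": data}, f)

write_input_data("input_data.json", num_requests=2)
```

For real-data benchmarks, swap the np.random.rand call for preprocessed tensors from your actual dataset.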

For GPU-level detail, perf_analyzer’s own metrics are not enough. The usual approach is to profile the server process with NVIDIA Nsight Systems, for example launching it as nsys profile tritonserver --model-repository=... while perf_analyzer drives load from another terminal. The resulting trace, opened in the Nsight Systems GUI, shows GPU utilization, kernel execution times, and memory transfers, revealing bottlenecks that latency metrics alone can’t. You’ll need nsys in your PATH on the server machine and the necessary permissions.

After running the Analyzer, the CSV file passed with -f contains one row per concurrency level with throughput and the latency breakdown. Comparing rows shows where time goes as load increases, detail that the averaged console summary can hide.
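A quick way to eyeball the sweep is to load that CSV with the standard library. A sketch; the column names below ("Concurrency", "Inferences/Second", "Server Queue") are assumptions based on typical perf_analyzer output, so check the header row of your own file. The synthetic two-row file stands in for real results:

```python
import csv

def load_sweep(path):
    """Read a perf_analyzer results CSV into one dict per concurrency level."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Stand-in for real perf_analyzer output; column names are assumptions here.
with open("analyzer_results.csv", "w", newline="") as f:
    f.write("Concurrency,Inferences/Second,Server Queue\n")
    f.write("1,210.4,95\n")
    f.write("4,630.2,1410\n")

for row in load_sweep("analyzer_results.csv"):
    print(f"concurrency={row['Concurrency']} "
          f"throughput={row['Inferences/Second']} queue_us={row['Server Queue']}")
```

Watching the queue column grow faster than throughput as concurrency rises is the classic sign that you have hit the serving capacity of the current instance configuration.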

The next logical step after mastering performance analysis is understanding how to optimize your model for Triton, which often involves quantization or kernel fusion.

Want structured learning?

Take the full Triton course →