The perf_analyzer is a powerful tool for benchmarking inference performance in Triton Inference Server, but its output can be surprisingly opaque if you don’t understand the underlying mechanics of how it measures throughput and latency.

Let’s see it in action.

Imagine we have a simple TensorFlow model for image classification (resnet50_graphdef) deployed in Triton. We want to benchmark its throughput and latency with different batch sizes.

First, we need to start Triton with our model repository. For this example, let’s assume our model repository is at /models.

tritonserver --model-repository=/models
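Triton expects a specific repository layout: one directory per model, a numeric version subdirectory holding the model file, and a config.pbtxt alongside it. A minimal sketch of that layout, built in a temp directory (the placeholder model file and the trimmed config are assumptions for this example):

```shell
# Sketch of the layout --model-repository expects (built here in a temp dir).
# The config fields mirror the model configuration perf_analyzer prints later.
repo=$(mktemp -d)
mkdir -p "$repo/resnet50_graphdef/1"                # "1" is the model version directory
touch "$repo/resnet50_graphdef/1/model.graphdef"    # placeholder for the real GraphDef
cat > "$repo/resnet50_graphdef/config.pbtxt" <<'EOF'
name: "resnet50_graphdef"
platform: "tensorflow_graphdef"
max_batch_size: 128
EOF
ls "$repo/resnet50_graphdef"
```

Pointing tritonserver at such a directory is all the setup the benchmark below needs.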

Now, we can run perf_analyzer. We'll start with a batch size of 1 and a 20-second measurement window (the -p flag takes milliseconds).

perf_analyzer -m resnet50_graphdef -b 1 -p 20000

The output will look something like this (abridged; the exact format varies by perf_analyzer version):

[01/23/2024 10:00:00.123456] INFO: perf_analyzer.cc(194): Using model: resnet50_graphdef
[01/23/2024 10:00:00.123456] INFO: perf_analyzer.cc(210): Using model configuration:
  name: "resnet50_graphdef"
  platform: "tensorflow_graphdef"
  max_batch_size: 128
  input [
    {
      name: "input_1"
      data_type: TYPE_FP32
      dims: [ 3, 224, 224 ]
    }
  ]
  output [
    {
      name: "predictions/Softmax"
      data_type: TYPE_FP32
      dims: [ 1000 ]
    }
  ]

... (other INFO messages) ...

================================================================
  Instance Kind          Count      Mode    Batch Size   QPS
----------------------------------------------------------------
  CPU                    1          Async   1            150.56
================================================================

================================================================
  Instance Kind          Count      Mode    Batch Size   Latency (ms)
----------------------------------------------------------------
  CPU                    1          Async   1            6.64
================================================================

================================================================
  Instance Kind          Count      Mode    Batch Size   p95 Latency (ms)
----------------------------------------------------------------
  CPU                    1          Async   1            7.89
================================================================

This output shows us the Queries Per Second (QPS) and various latency metrics (average and 95th percentile) for a single CPU instance, running in asynchronous mode, with a batch size of 1.

The core problem perf_analyzer solves is providing a realistic measure of inference performance under load, accounting for the overheads of request scheduling, data serialization/deserialization, and model execution. It simulates a client sending requests to Triton and measures how many requests can be processed per second (throughput) and how long each request takes (latency).

Here’s how it works internally:

  1. Client Simulation: perf_analyzer acts as a client. It generates input data according to the model’s input tensor specifications.
  2. Request Generation: It sends these inputs to Triton as inference requests. The -b (batch size) flag determines how many individual inputs are grouped into a single inference request.
  3. Asynchronous vs. Synchronous: By default, perf_analyzer sends synchronous requests. With -a (or --async), it sends a request and immediately sends the next without waiting for the first to complete, which is crucial for achieving high throughput because it keeps the server busy. Synchronous mode (--sync) waits for each request to complete before sending the next, which is generally less performant but useful for measuring the latency of a single, isolated request.
  4. Measurement:
    • Throughput (QPS): perf_analyzer counts the requests Triton successfully processes within each measurement window (-p or --measurement-interval, in milliseconds). QPS is calculated as (successful requests) / (window duration in seconds).
    • Latency: For each completed request, perf_analyzer records the time from when the request was sent to when its response was received. It then reports the average latency and the 95th percentile latency. The 95th percentile latency is the value below which 95% of all recorded latencies fall, giving a better sense of the typical user experience than just the average.
  5. Concurrency: The --concurrency-range flag is vital. It controls the number of concurrent requests perf_analyzer attempts to keep in flight to Triton, and can also sweep a start:end:step range of concurrency levels. A higher concurrency level can saturate the server, revealing its true maximum throughput.
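The two measurements in step 4 reduce to simple arithmetic. A toy illustration with made-up numbers (the request count and the latency samples below are invented for this example, not real benchmark data):

```shell
# Throughput: completed requests divided by the measurement window.
requests=3012; window_s=20
qps=$(awk "BEGIN { printf \"%.1f\", $requests / $window_s }")
echo "QPS: $qps"

# p95 latency: sort the per-request latencies (ms) ascending and take the
# sample at the 95th-percentile rank.
samples="6.1 6.4 6.5 6.6 6.7 6.9 7.0 7.2 7.5 7.9 8.2 8.8 9.0 9.5 9.9 10.2 11.0 12.0 13.5 15.0"
p95=$(printf '%s\n' $samples | sort -n | awk '{ a[NR] = $1 } END { r = int(0.95 * NR); if (r < 1) r = 1; print a[r] }')
echo "p95 latency: ${p95} ms"
```

Note how one slow outlier (the 15.0 ms sample) barely moves the average but is exactly what the 95th percentile is designed to surface.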

Let’s try a higher batch size and concurrency.

perf_analyzer -m resnet50_graphdef -b 32 --concurrency-range 8 -p 20000

This command will:

  • Use batch size 32 for each request.
  • Keep 8 concurrent requests in flight to Triton.
  • Measure over 20-second windows.

The output will show a significant increase in QPS, as Triton can process more data per request and keep its internal execution units more consistently utilized with multiple concurrent requests. Latency might also increase due to the higher load.
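Rather than running one configuration at a time, it's common to sweep both dimensions and compare. A minimal wrapper sketch (it only echoes the commands so it runs without a live server; drop the echo to actually execute them, and the grid values are assumptions):

```shell
# Sweep a grid of batch sizes and concurrency levels.
# echo keeps this sketch runnable without a Triton server.
for bs in 1 8 32; do
  for conc in 1 4 8; do
    echo "perf_analyzer -m resnet50_graphdef -b ${bs} --concurrency-range ${conc} -p 20000"
  done
done
```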

The most surprising aspect of perf_analyzer’s output is how sensitive it is to the interaction between batch size, concurrency, and the underlying model’s computational profile. For instance, you might observe that increasing the batch size beyond a certain point doesn’t improve QPS and can even degrade latency. This is because the model’s execution time (its "compute-bound" nature) becomes the bottleneck, and the overhead of packing more data into a batch outweighs the benefits of parallel processing within the model itself. Similarly, having too few concurrent requests leaves Triton idle, while too many can lead to resource contention and increased latency.
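perf_analyzer can also write its measurements to a CSV file with -f, which makes this kind of comparison scriptable. A sketch of picking the best configuration from a sweep, using a hand-written file in a simplified shape (the column layout and every number below are invented for illustration, not perf_analyzer's exact CSV format):

```shell
# Fabricated sweep results in a simplified CSV shape (illustration only).
cat > /tmp/sweep_results.csv <<'EOF'
Concurrency,Inferences/Second,p95 latency (ms)
1,150.6,7.89
4,520.3,9.12
8,610.8,14.50
EOF
# Highest-throughput row: numeric sort on column 2, descending.
best=$(tail -n +2 /tmp/sweep_results.csv | sort -t, -k2,2 -rn | head -n 1)
echo "Best: $best"
```

In a real sweep you would weigh the throughput column against the latency column: here the jump from concurrency 4 to 8 buys ~17% more throughput at a ~59% p95 latency cost, which may or may not be acceptable for your SLO.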

The next step after understanding raw throughput and latency is to explore how different hardware accelerators (like GPUs) and Triton’s various execution optimization features, such as model ensemble execution or custom backends, impact these metrics.

Want structured learning?

Take the full Triton course →