Triton dynamically batches requests not because it’s trying to be clever, but because it’s fundamentally impossible to keep a GPU fully utilized with single, small inference requests.

Here’s a quick peek at Triton’s dynamic batching in action. Imagine you’ve got a TensorFlow model for image classification. If you were serving it through Triton’s Python backend, model.py would look something like this:

import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] holds the config.pbtxt contents as a JSON string.
        self.model_config = json.loads(args["model_config"])
        self.max_batch_size = self.model_config["max_batch_size"]
        # ... other model loading ...

    def execute(self, requests):
        # Triton's dynamic batcher may hand us several requests at once.
        # A real implementation would run inference on all of them together.
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "input_image")
            input_data = input_tensor.as_numpy()
            # ... perform inference (e.g. TensorFlow ops) here ...
            output_data = (input_data * 2).astype(np.float32)  # dummy operation
            output_tensor = pb_utils.Tensor("output_probabilities", output_data)
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses

And your config.pbtxt would look something like this (shown for a TensorFlow SavedModel; a Python-backend model would set backend: "python" instead of platform):

name: "my_image_classifier"
platform: "tensorflow_savedmodel"
max_batch_size: 8 # Allow Triton to batch up to 8 requests
input [
  {
    name: "input_image"
    data_type: TYPE_FP32
    dims: [ 224, 224, 3 ]
  }
]
output [
  {
    name: "output_probabilities"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ] # Try to form batches of 4 or 8
  max_queue_delay_microseconds: 10000 # Wait up to 10ms for a batch
}

The core problem dynamic batching solves is GPU underutilization. A single inference request, especially for smaller models or smaller input sizes, often doesn’t provide enough parallel work to saturate a GPU. The GPU ends up spending most of its time idle, waiting for the next request. Dynamic batching allows Triton to group multiple incoming requests together into a single larger batch that can fully utilize the GPU. This significantly boosts throughput by amortizing the overhead of launching a kernel across more work.
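The throughput gain from amortization is easy to see with back-of-the-envelope numbers. The costs below are invented for illustration, not measurements of any real GPU:

```python
# Illustrative cost model: each GPU execution pays a fixed overhead
# (kernel launch, scheduling), plus a per-item compute cost that
# parallelizes well on the GPU. These numbers are made up.
LAUNCH_OVERHEAD_US = 200   # fixed cost per execution
PER_ITEM_COST_US = 50      # marginal cost of one more item in the batch

def throughput(batch_size):
    """Items processed per second under this toy cost model."""
    batch_time_us = LAUNCH_OVERHEAD_US + PER_ITEM_COST_US * batch_size
    return batch_size / (batch_time_us / 1_000_000)

# Batching 8 requests amortizes the fixed overhead over 8 items,
# roughly a 3.3x throughput gain under these assumed costs.
print(f"batch=1: {throughput(1):,.0f} items/s")
print(f"batch=8: {throughput(8):,.0f} items/s")
```

The exact gain depends entirely on the ratio of fixed overhead to per-item cost, which is why the benefit is largest for small models and small inputs.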

Under the hood, Triton maintains a queue for each model. When a request arrives, it’s placed into this queue. Triton then has a scheduler that looks at the queue and decides when to form a batch. It considers a few factors:

  1. max_batch_size: The absolute maximum number of requests that can be grouped. This is a hard limit imposed by the model’s definition.
  2. preferred_batch_size: A list of batch sizes Triton will try to achieve. It aims for these sizes because they are often empirically found to be good for performance. It will attempt to fill up to the first preferred size, then the second, and so on.
  3. max_queue_delay_microseconds: The maximum amount of time Triton will wait for a batch to form before sending it out, even if it hasn’t reached a preferred size. This is crucial for reducing latency. If requests are sparse, you don’t want to wait forever for a batch to fill.
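A rough sketch of how those three factors combine into a single decision (a deliberately simplified model of the behavior described above, not Triton's actual scheduler code):

```python
def choose_batch(queue_len, max_batch_size, preferred_sizes, timer_expired):
    """Decide how many queued requests to batch right now.
    Returns 0 to keep waiting. Simplified sketch of the rules above."""
    # Hard cap: never exceed the model's max_batch_size.
    available = min(queue_len, max_batch_size)
    # If a preferred size can be formed, take the largest one we can fill.
    formable = [p for p in sorted(preferred_sizes) if p <= available]
    if formable:
        return formable[-1]
    # Timer expired: ship whatever we have rather than wait longer.
    if timer_expired and available > 0:
        return available
    # Otherwise keep waiting for more requests.
    return 0

print(choose_batch(5, 8, [4, 8], timer_expired=False))   # -> 4
print(choose_batch(3, 8, [4, 8], timer_expired=True))    # -> 3
print(choose_batch(2, 8, [4, 8], timer_expired=False))   # -> 0
print(choose_batch(20, 8, [4, 8], timer_expired=False))  # -> 8
```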

Triton’s scheduler is constantly monitoring these queues. When a request arrives, it might:

  • If the queue was empty (and the model allows batching, i.e. max_batch_size > 1), the arriving request starts the delay timer.
  • As more requests arrive, it checks if a preferred_batch_size can be met. If so, it might immediately form that batch.
  • If the timer (defined by max_queue_delay_microseconds) expires, it will form whatever batch it can, up to max_batch_size, even if it’s not a preferred size.

The goal is to balance throughput (packing more requests onto the GPU) with latency (not making individual requests wait too long).
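That interplay of timer and preferred sizes can be replayed on a toy timeline. The arrival times below are invented for illustration, and the function is a sketch of the described rules, not Triton's code:

```python
def simulate(arrivals_us, preferred=(4, 8), max_delay_us=10_000):
    """Replay request arrivals and return (formed_at_us, batch_size) pairs.
    Flush when a preferred size is hit, or when the oldest queued
    request has waited max_delay_us."""
    batches, queue = [], []
    for t in sorted(arrivals_us):
        # Timer for the oldest queued request expired before this arrival?
        if queue and t > queue[0] + max_delay_us:
            batches.append((queue[0] + max_delay_us, len(queue)))
            queue = []
        queue.append(t)
        # A preferred batch size can be formed: ship it immediately.
        if len(queue) in preferred:
            batches.append((t, len(queue)))
            queue = []
    if queue:  # leftovers go out when their timer expires
        batches.append((queue[0] + max_delay_us, len(queue)))
    return batches

# Six requests: a burst of four, then two stragglers 25 ms later.
# The burst forms a preferred batch of 4 at t=300us; the stragglers
# time out and go out as a batch of 2 at t=35,000us.
print(simulate([0, 100, 200, 300, 25_000, 26_000]))  # -> [(300, 4), (35000, 2)]
```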

Tuning preferred_batch_size and max_queue_delay_microseconds is key. If your requests are very bursty and arrive in large numbers, you might set preferred_batch_size to larger values and max_queue_delay_microseconds higher to ensure you always get large, efficient batches. If your requests are more sporadic and latency is paramount, you’d use smaller preferred_batch_size values and a very low max_queue_delay_microseconds to ensure requests get processed quickly, even if the batch isn’t perfectly sized.
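To see why a long delay hurts sparse traffic, here's a toy calculation of the worst-case queueing wait under the same flush rules. The arrival pattern is an assumption chosen to make the point:

```python
def worst_wait(arrivals_us, preferred=4, max_delay_us=10_000):
    """Worst time any request spends queued before its batch forms,
    under the simplified flush rules described above."""
    waits, queue = [], []
    for t in sorted(arrivals_us):
        if queue and t > queue[0] + max_delay_us:  # oldest timed out
            flush_at = queue[0] + max_delay_us
            waits += [flush_at - q for q in queue]
            queue = []
        queue.append(t)
        if len(queue) == preferred:  # preferred size reached
            waits += [t - q for q in queue]
            queue = []
    if queue:  # leftovers flush when the oldest times out
        flush_at = queue[0] + max_delay_us
        waits += [flush_at - q for q in queue]
    return max(waits)

sparse = [0, 7_000, 14_000, 21_000]  # one request every 7 ms
# With a 10 ms delay, requests sit in the queue hoping for a batch of 4
# that never fills; with a 1 ms delay they ship almost immediately.
print(worst_wait(sparse, max_delay_us=10_000))  # -> 10000
print(worst_wait(sparse, max_delay_us=1_000))   # -> 1000
```

Under sparse traffic the long delay buys almost no batching, so it's pure added latency; that's the tradeoff the two knobs control.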

The dynamic_batching configuration block itself is what enables this behavior. Without it, Triton executes each request as it arrives: a client can still send a pre-batched input, but the server won’t combine separate requests, so you’d have to manage batching on the client side. Server-side dynamic batching significantly simplifies achieving high throughput for GPU-accelerated inference.

When you’re trying to tune dynamic batching for maximum throughput, remember that the actual batch size formed is a result of requests arriving over time and the scheduler’s decision-making process based on your configuration. You can observe the batch sizes being formed by looking at the Triton server logs, by checking Triton’s Prometheus metrics (dividing nv_inference_count by nv_inference_exec_count gives the average batch size per execution), or by instrumenting your client to see how many requests are typically grouped together. It’s not uncommon for a preferred_batch_size of [4, 8] to result in batches of 3, 5, 7, or 8 depending on the precise timing of request arrivals and the max_queue_delay_microseconds.
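A quick simulation of random arrivals against the same flush rules reproduces that spread of batch sizes. The arrival rate and seed are arbitrary, and the scheduler model is the same simplified sketch as before:

```python
import random
from collections import Counter

def batch_sizes(arrivals_us, preferred=(4, 8), max_delay_us=10_000):
    """Apply the simplified flush rules (preferred size hit, or oldest
    request timed out) to a stream of arrival times."""
    sizes, queue = [], []
    for t in sorted(arrivals_us):
        if queue and t > queue[0] + max_delay_us:
            sizes.append(len(queue))
            queue = []
        queue.append(t)
        if len(queue) in preferred:
            sizes.append(len(queue))
            queue = []
    if queue:
        sizes.append(len(queue))
    return sizes

random.seed(0)
# Poisson-style arrivals: ~1 request every 3 ms on average.
t, arrivals = 0.0, []
for _ in range(200):
    t += random.expovariate(1 / 3000)
    arrivals.append(t)

# The histogram shows a mix of full preferred batches and smaller
# timed-out batches, depending on how the arrivals happened to cluster.
print(sorted(Counter(batch_sizes(arrivals)).items()))
```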

The next thing you’ll likely run into is understanding how different data types and tensor shapes affect the memory footprint of these dynamically formed batches.

Want structured learning?

Take the full Triton course →