Triton’s max_batch_size isn’t just a cap on how many requests you can shove in at once; it declares that your model supports batching at all, and it sets the hard limit on how many individual data samples can be grouped together for a single inference execution.

Let’s see this in action. Imagine you’re running a simple model that takes a single image and predicts a class.

# model_config.pbtxt
name: "my_image_classifier"
platform: "tensorflow_savedmodel"
max_batch_size: 4
input [
  {
    name: "input_image"
    data_type: TYPE_UINT8
    dims: [ 224, 224, 3 ]
  }
]
output [
  {
    name: "output_class"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

Here, max_batch_size: 4 means Triton can group up to 4 image requests into a single batch. When you send 4 individual image requests and dynamic batching is enabled, Triton can combine them into a single tensor for your model. The input_image tensor for the model would then have dimensions [4, 224, 224, 3]. The first dimension, 4, is the batch size. If you send only 2 images, Triton batches them into a [2, 224, 224, 3] tensor. If you send 5 images, the first 4 can be batched together, and the 5th either waits briefly for more requests to form another batch or is executed as a smaller batch on its own.
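That grouping rule is easy to model. Here is a minimal pure-Python sketch of the splitting arithmetic; it is a toy illustration, not Triton’s actual scheduler:

```python
def group_into_batches(num_requests, max_batch_size):
    """Split a queue of single-sample requests into batch sizes,
    each capped at max_batch_size (a toy model of the grouping rule,
    not Triton's actual scheduler)."""
    batches = []
    while num_requests > 0:
        batch = min(num_requests, max_batch_size)
        batches.append(batch)
        num_requests -= batch
    return batches

print(group_into_batches(4, 4))  # [4]    -> one full batch
print(group_into_batches(2, 4))  # [2]    -> one partial batch
print(group_into_batches(5, 4))  # [4, 1] -> full batch, then the leftover
```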

The core problem max_batch_size solves is throughput. Many models, especially on GPUs, are far more efficient when processing data in batches. Processing 4 images one by one might take 4 seconds, but processing them as a batch of 4 might take only 1.5 seconds. max_batch_size is the prerequisite that makes this possible; the dynamic batcher, covered next, is the lever that exploits it.
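The arithmetic behind that claim, using the hypothetical timings above:

```python
# Hypothetical timings from the example in the text.
images = 4
sequential_time = 4.0   # seconds: 1 s per image, one at a time
batched_time = 1.5      # seconds: all 4 images in one batched call

seq_throughput = images / sequential_time    # 1.00 images/s
batch_throughput = images / batched_time     # ~2.67 images/s
speedup = sequential_time / batched_time     # ~2.67x

print(f"sequential: {seq_throughput:.2f} img/s")
print(f"batched:    {batch_throughput:.2f} img/s ({speedup:.2f}x speedup)")
```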

Setting max_batch_size to a value greater than 0 tells Triton that the model supports batching, but it does not by itself turn on dynamic batching; for that, you add a dynamic_batching block to the model configuration. With dynamic batching enabled, Triton collects incoming inference requests and groups them into batches of up to max_batch_size for execution. The actual batch size for any given inference call is the number of requests available at scheduling time, capped at max_batch_size.
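In current Triton releases, dynamic batching is opted into explicitly with a dynamic_batching block alongside max_batch_size. A minimal example (an empty block uses the default batching policy):

# model_config.pbtxt
name: "my_image_classifier"
platform: "tensorflow_savedmodel"
max_batch_size: 4
dynamic_batching { }
# ... input/output definitions as before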

If max_batch_size is set to 0, the model does not support Triton-managed batching at all. Each request is passed to the model exactly as it arrives, with no batch dimension added: the dims in the config then describe the complete tensor shape. This suits latency-sensitive applications where you want the absolute minimum delay for each individual request, even if it means sacrificing overall throughput.

The input and output tensors must be defined in a way that accommodates batching. If your model expects an input tensor named input_data with shape [3, 224, 224] and you set max_batch_size to 8, Triton automatically prepends the batch dimension. When requests are batched, the input_data tensor actually has shape [N, 3, 224, 224], where N is the batch size for that execution (up to 8). Your model must be designed to accept this variable batch dimension as its first dimension. This is why many training frameworks (like TensorFlow and PyTorch) let you define input shapes with None or ? for the batch dimension, signifying it can be any size.
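A tiny sketch of that shape bookkeeping (shapes only, no real tensors or Triton API calls):

```python
# Toy illustration of the batch dimension Triton prepends: N single
# requests of shape [3, 224, 224] become one tensor of [N, 3, 224, 224].

def batched_shape(per_request_shape, batch_size):
    """Prepend the batch dimension, as Triton does when max_batch_size > 0."""
    return [batch_size] + list(per_request_shape)

dims = [3, 224, 224]           # dims from model_config.pbtxt (no batch dim)
print(batched_shape(dims, 8))  # [8, 3, 224, 224]
print(batched_shape(dims, 2))  # [2, 3, 224, 224]
```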

A common point of confusion is when max_batch_size is set but the model’s input/output definitions don’t account for it. For instance, if your model strictly expects an input shape of [3, 224, 224] (fixed, with no batch dimension) and you set max_batch_size to anything other than 0, Triton will send [N, 3, 224, 224] (where N is the actual batch size), and the model will fail with a shape-mismatch error from the backend. The fix is either to export the model with a variable first dimension, or to set max_batch_size: 0 and declare the full fixed shape in dims.

The dynamic_batching block in model_config.pbtxt is where you shape those batches. Two of its most useful settings are preferred_batch_size, which hints the batch sizes your model executes most efficiently, and max_queue_delay_microseconds, which bounds how long Triton waits for more requests before executing a partial batch. For example:

# model_config.pbtxt
name: "my_image_classifier"
platform: "tensorflow_savedmodel"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]     # try to form batches of 4 or 8
  max_queue_delay_microseconds: 100  # wait at most 100 us to fill one
}
# ... other input/output definitions

With this configuration, Triton still collects requests up to max_batch_size. The scheduler tries to form a batch of one of the preferred sizes; if the queue delay expires first, it executes whatever requests it has as a smaller batch. Note that preferred_batch_size is a hint, not a guarantee: your model must still accept any batch size from 1 up to max_batch_size. If your model was exported with a fixed batch dimension (say, exactly 4), Triton will not pad batches for you; re-export the model with a variable batch dimension instead.
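That scheduling decision (fill a preferred batch size when possible, otherwise flush what is queued once the delay expires) can be sketched in a few lines. This is a toy model of the policy, not Triton’s actual scheduler:

```python
def choose_batch(queued, preferred_sizes, max_batch_size, delay_expired):
    """Pick a batch size: the largest preferred size we can fill now;
    otherwise keep waiting, unless the queue delay has expired, in which
    case flush whatever is queued. A toy model, not Triton's scheduler."""
    fillable = [p for p in sorted(preferred_sizes) if p <= queued]
    if fillable:
        return min(fillable[-1], max_batch_size)
    if delay_expired and queued > 0:
        return min(queued, max_batch_size)
    return 0  # keep waiting for more requests

print(choose_batch(9, [4, 8], 8, delay_expired=False))  # 8: full preferred batch
print(choose_batch(2, [4, 8], 8, delay_expired=False))  # 0: keep waiting
print(choose_batch(2, [4, 8], 8, delay_expired=True))   # 2: flush partial batch
```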

With max_batch_size set to 0, the dims in the config describe the complete tensor shape, and Triton passes each request through unchanged. If such a model still expects batched data, you can declare the batch dimension explicitly in dims and have the client send pre-batched tensors. For most workloads, though, max_batch_size > 0 with dynamic batching is the better pattern.
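As a sketch of that pattern, here is a hypothetical config (the model name is illustrative) for a model that always receives pre-batched input of exactly 4 images:

# model_config.pbtxt
name: "my_prebatched_model"
platform: "tensorflow_savedmodel"
max_batch_size: 0  # Triton will not batch; dims are the full shape
input [
  {
    name: "input_image"
    data_type: TYPE_UINT8
    dims: [ 4, 224, 224, 3 ]  # client must send exactly this shape
  }
]
# ... output definitions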

When max_batch_size is 0, Triton performs no batching of any kind. This minimizes latency for individual requests, but it can significantly reduce overall throughput for models that benefit from batched execution, so treat it as a deliberate tradeoff rather than a default.

The optimization settings in model_config.pbtxt also affect how Triton moves batched data. For example, gather_kernel_buffer_threshold controls when Triton uses a CUDA gather kernel to assemble batched input buffers directly on the GPU. The defaults are usually fine, but these settings are worth checking when profiling a dynamically batched model.

If you’ve correctly configured max_batch_size and your model’s input/output shapes but still see poor throughput, the next place to look is the instance_group configuration. Make sure you are running enough model instances to keep up with the batched workload, and that they are assigned to the right devices (typically GPUs). With too few instances, or with instance_group not spread across your GPUs, you will bottleneck throughput even with perfect batching.
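As a sketch (the counts and GPU IDs here are illustrative), an instance_group that runs multiple copies of the model across two GPUs looks like this:

# model_config.pbtxt
instance_group [
  {
    count: 2         # two execution instances...
    kind: KIND_GPU
    gpus: [ 0, 1 ]   # ...on each of GPU 0 and GPU 1
  }
]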

Want structured learning?

Take the full Triton course →