Triton's sequence batcher makes it practical to serve stateful models, such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, efficiently. It routes every request in a sequence to the same model instance and, with implicit state management, keeps the recurrent state on the server, eliminating the need for custom state-handling logic on the client side.

Let’s see it in action. Imagine you have a stateful LSTM model that processes text, predicting the next word based on the preceding sequence. Without the sequence batcher, each time you send a new piece of text, you’d have to manage its state yourself, sending it back and forth between your client and Triton.

Here’s a simplified client-side request without the batcher, illustrating the state management problem:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Initial request for a sequence: the client itself must track the
# sequence and carry the model's recurrent state between calls
input_data_1 = np.array([[...]], dtype=np.float32)  # First part of sequence
inputs_1 = [httpclient.InferInput("INPUT__0", input_data_1.shape, "FP32")]
inputs_1[0].set_data_from_numpy(input_data_1)
outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]

# First inference: the model returns its output and *new* state
# You'd need to capture this state and send it back next time
inference_response_1 = client.infer(
    "my_lstm_model",
    inputs_1,
    outputs=outputs,
)
output_data_1 = inference_response_1.as_numpy("OUTPUT__0")
# new_state_1 = inference_response_1.get_state() # Hypothetical state object

# Next inference: you'd have to pass the captured state back explicitly
# input_data_2 = np.array([[...]], dtype=np.float32) # Second part of sequence
# inputs_2 = [httpclient.InferInput("INPUT__0", input_data_2.shape, "FP32")]
# inputs_2[0].set_data_from_numpy(input_data_2)
# inputs_2.append(httpclient.InferInput("STATE__0", new_state_1.shape, "FP32")) # Hypothetical state input
# inputs_2[-1].set_data_from_numpy(new_state_1)

# inference_response_2 = client.infer(
#     "my_lstm_model",
#     inputs_2,
#     outputs=outputs,
# )
# output_data_2 = inference_response_2.as_numpy("OUTPUT__0")
# new_state_2 = inference_response_2.get_state() # Hypothetical

This manual state management is tedious and error-prone. The sequence batcher automates it. Triton routes every inference request carrying the same sequence ID to the same model instance, and with implicit state management it stores the state the model emits after each inference and feeds it back as input for the next request in that sequence.

Here’s how you configure a model to use the sequence batcher in its config.pbtxt:

name: "my_lstm_model"
platform: "tensorflow_savedmodel" # or tensorrt_plan, onnxruntime_onnx, etc.
max_batch_size: 8

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 128 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# This is the crucial part for sequence batching
sequence_batching {
  max_sequence_idle_microseconds: 600000000 # 10 minutes
  oldest {
    max_candidate_sequences: 8
    preferred_batch_size: [ 4, 8 ]
    max_queue_delay_microseconds: 10000 # 10ms
  }
  state [
    {
      input_name: "lstm_state_in"   # State input tensor in your model's graph
      output_name: "lstm_state_out" # State output tensor in your model's graph
      data_type: TYPE_FP32
      dims: [ 2, 1024 ] # Shape of the state tensor
      initial_state: {
        data_type: TYPE_FP32
        dims: [ 2, 1024 ]
        zero_data: true
        name: "initial_state"
      }
    }
  ]
}

The sequence_batching block is where the magic happens.

  • max_sequence_idle_microseconds: How long a sequence may sit idle before Triton considers it finished and frees its slot and state.
  • oldest: Selects the "Oldest" scheduling strategy, which batches requests from the oldest active sequences together, even though they belong to different sequences; this is key for maximizing throughput. preferred_batch_size lists the batch sizes Triton tries to form, and max_candidate_sequences caps how many sequences compete for a batch. (The alternative strategy, direct, dedicates a batch slot to each sequence.)
  • state: Declares the state tensor(s) Triton should manage for you. input_name and output_name must exactly match the state input and output tensors in your model's graph, and initial_state with zero_data: true gives each new sequence a zero-filled starting state. Triton then automatically feeds each inference's output state back in as the next inference's input state.
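
The state configuration implies that your model graph itself must expose an extra state input and a matching state output. Here is a toy, framework-free sketch of that contract (the function, the tensor names, and the update math are illustrative stand-ins, not Triton or real LSTM code):

```python
import numpy as np

HIDDEN = 1024

def lstm_model_step(input_frame, lstm_state_in):
    """Toy stand-in for one model invocation. Real gate math is omitted;
    the point is the signature implicit state management relies on:
    a state tensor in, a state tensor out, alongside the regular I/O."""
    h, c = lstm_state_in[0], lstm_state_in[1]
    # Pretend-update of cell and hidden state from the new input
    new_c = 0.9 * c
    new_c[: input_frame.size] += 0.1 * input_frame
    new_h = np.tanh(new_c)
    output = new_h[:10]                        # Matches OUTPUT__0 dims: [ 10 ]
    lstm_state_out = np.stack([new_h, new_c])  # Matches state dims: [ 2, 1024 ]
    return output, lstm_state_out

state = np.zeros((2, HIDDEN), dtype=np.float32)  # zero-initialized, as at sequence start
x = np.random.rand(128).astype(np.float32)       # Matches INPUT__0 dims: [ 128 ]
out, state = lstm_model_step(x, state)
```

Triton's job is then to hold `state` between requests of a sequence and pass it back on the next call, exactly as this loop would do locally.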

When using the sequence batcher, your client requests look much simpler. You tag each request with a sequence_id, marking the first request of a sequence with sequence_start and the last with sequence_end, but you no longer need to manage or send any state.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

sequence_id = 123
input_data_1 = np.array([[...]], dtype=np.float32)
inputs_1 = [httpclient.InferInput("INPUT__0", input_data_1.shape, "FP32")]
inputs_1[0].set_data_from_numpy(input_data_1)
outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]

# First inference for sequence 123: flag the start of a new sequence
inference_response_1 = client.infer(
    "my_lstm_model",
    inputs_1,
    outputs=outputs,
    sequence_id=sequence_id,
    sequence_start=True,
)
output_data_1 = inference_response_1.as_numpy("OUTPUT__0")
# No state management here! Triton handles it.

# Second (and final) inference for sequence 123
input_data_2 = np.array([[...]], dtype=np.float32)
inputs_2 = [httpclient.InferInput("INPUT__0", input_data_2.shape, "FP32")]
inputs_2[0].set_data_from_numpy(input_data_2)

inference_response_2 = client.infer(
    "my_lstm_model",
    inputs_2,
    outputs=outputs,
    sequence_id=sequence_id,
    sequence_end=True,  # Last request of the sequence; Triton releases its state
)
output_data_2 = inference_response_2.as_numpy("OUTPUT__0")
# Again, no manual state passing.

The core idea is that Triton's sequence batcher acts as a stateful orchestrator. It identifies incoming requests by their sequence_id, queues them, and routes every request of a sequence to the same model instance. With the Oldest strategy, it forms a batch when a preferred batch size can be filled from the active sequences, or when max_queue_delay_microseconds expires. Crucially, with implicit state management it injects the state produced by the previous inference of that sequence into the model's state input for the current inference; the model's output state is then captured and stored by Triton, ready for the next request in the sequence. The whole process is transparent to the client.
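
As a mental model, the per-sequence bookkeeping can be sketched in a few lines of plain Python (an illustration of the idea, not Triton internals; SequenceStateStore is invented for this sketch):

```python
import numpy as np

class SequenceStateStore:
    """Toy illustration of sequence-batcher bookkeeping: state is keyed by
    sequence id, zero-initialized at sequence start, fed back into the model
    on each request, and discarded when the sequence ends."""

    def __init__(self, state_shape):
        self.state_shape = state_shape
        self._states = {}

    def get_state(self, sequence_id, sequence_start=False):
        # A request that starts a sequence sees a fresh zero state.
        if sequence_start or sequence_id not in self._states:
            return np.zeros(self.state_shape, dtype=np.float32)
        return self._states[sequence_id]

    def put_state(self, sequence_id, new_state, sequence_end=False):
        # At sequence end (or after the idle timeout, in Triton's case),
        # the stored state is cleaned up.
        if sequence_end:
            self._states.pop(sequence_id, None)
        else:
            self._states[sequence_id] = new_state

store = SequenceStateStore(state_shape=(2, 1024))
s = store.get_state(123, sequence_start=True)  # fresh zero state
s = s + 1.0                                    # pretend the model updated it
store.put_state(123, s)
s2 = store.get_state(123)                      # the same state comes back
store.put_state(123, s2, sequence_end=True)    # sequence finished, state freed
```

The client never sees any of this; it only supplies the sequence id and the start/end flags.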

A common misconception is that dynamic_batching and sequence_batching can be combined in one model configuration. They cannot: each model uses exactly one scheduler, so the two blocks are mutually exclusive in config.pbtxt. That does not mean you give up batching across requests, though. The sequence batcher's Oldest strategy acts as a dynamic batcher over sequences, combining requests from different active sequences into a single batch for better GPU utilization while still delivering each sequence's requests, in order, to the same model instance with the correct state. The Direct strategy instead pins each sequence to its own batch slot.
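
For reference, the scheduling strategy is chosen inside the sequence_batching block; a minimal sketch of each (field values are illustrative):

```
# Direct: each active sequence gets a dedicated slot in the batch
sequence_batching {
  max_sequence_idle_microseconds: 600000000
  direct { }
}

# Oldest: requests from the oldest active sequences are batched together
sequence_batching {
  max_sequence_idle_microseconds: 600000000
  oldest {
    max_candidate_sequences: 8
    preferred_batch_size: [ 4, 8 ]
  }
}
```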

The next logical step after mastering stateful model serving is understanding how to optimize these stateful models for even higher throughput and lower latency using features such as model ensembles.

Want structured learning?

Take the full Triton course →