Triton can run your models in two distinct modes: stateless and stateful, and the difference isn’t just about whether your model needs to remember past inputs.
Let’s see this in action. Imagine you have a simple model that adds two numbers.
# Example stateless model (conceptually)
def predict(input1, input2):
    return input1 + input2
When you send a request to a stateless Triton model, it treats each request as a completely independent event. There’s no memory of previous requests.
# Example stateless request (conceptual)
curl -X POST localhost:8000/v2/models/add_model/versions/1/infer \
-d '{"inputs": [{"name": "input1", "shape": [1], "datatype": "INT32", "data": [5]}, {"name": "input2", "shape": [1], "datatype": "INT32", "data": [3]}]}'
The output is 8. If you send another request with [10] and [2], you get 12. No connection between the two. This is the default and most common way to run models.
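A quick sketch makes the independence concrete: a stateless model is just a pure function of its inputs, so any ordering or interleaving of requests yields the same answers.

```python
# "Stateless" means predict is a pure function: no hidden state, so the
# order of requests cannot affect any individual result.
def predict(input1, input2):
    return input1 + input2

# Same two requests, opposite order -- identical per-request results.
forward = [predict(5, 3), predict(10, 2)]
backward = [predict(10, 2), predict(5, 3)]
assert forward == [8, 12]
assert backward == [12, 8]
```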
Now, consider a stateful model. This is for scenarios where the model’s output depends on a sequence of inputs, like in a recurrent neural network (RNN) for natural language processing or a recommendation system that maintains user session context.
# Example stateful model (conceptually)
class StatefulModel:
    def __init__(self):
        self.hidden_state = None

    def predict(self, input_data):
        # Model logic that updates and uses self.hidden_state
        # ...
        output, self.hidden_state = self.process(input_data, self.hidden_state)
        return output
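To see why a server can't just keep one shared instance like this, here's a minimal sketch (a toy, not anything from Triton) of what goes wrong when two clients interleave requests against a single stateful instance:

```python
# Toy stateful model: output is the running sum of everything seen so far.
# With one shared instance, interleaved clients corrupt each other's state.
class RunningSum:
    def __init__(self):
        self.total = 0

    def predict(self, x):
        self.total += x
        return self.total

shared = RunningSum()
# User A sends 1, user B sends 100, user A sends 2 -- interleaved.
a1 = shared.predict(1)    # user A expects 1   -> gets 1
b1 = shared.predict(100)  # user B expects 100 -> gets 101 (polluted by A)
a2 = shared.predict(2)    # user A expects 3   -> gets 103 (polluted by B)
assert (a1, b1, a2) == (1, 101, 103)
```

Each client needs its own isolated slice of state, and the server needs a way to route each request to the right slice. That routing is exactly what Triton provides.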
In Triton, stateful models are managed by the sequence batcher. Related requests are grouped into a "sequence," identified by a correlation ID, and Triton guarantees that every request in a sequence is routed to the same model instance, in order, so the state that instance holds stays consistent.
# Example stateful request (conceptual) - first request of a sequence
curl -X POST localhost:8000/v2/models/rnn_model/versions/1/infer \
-d '{"parameters": {"sequence_id": 42, "sequence_start": true}, "inputs": [{"name": "sequence", "shape": [1, 10], "datatype": "INT32", "data": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]}'
# Example stateful request (conceptual) - subsequent request in the same sequence
curl -X POST localhost:8000/v2/models/rnn_model/versions/1/infer \
-d '{"parameters": {"sequence_id": 42}, "inputs": [{"name": "sequence", "shape": [1, 5], "datatype": "INT32", "data": [11, 12, 13, 14, 15]}]}'
The sequence_id parameter is crucial here. It tells Triton which sequence, and therefore which slice of model state, the request belongs to. sequence_start marks the first request of a sequence, and the final request should set "sequence_end": true so Triton can release the sequence's resources.
The core problem Triton’s stateful mode solves is managing the lifecycle of these model instances. Without it, you’d have to build your own complex system to instantiate and track model states across multiple requests, especially in a distributed environment. Triton handles the creation, destruction, and routing of these stateful model instances.
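To make the lifecycle problem concrete, here is a hypothetical sketch of the bookkeeping you'd otherwise write yourself (names like SessionManager and max_idle_s are illustrative, not Triton APIs): route each correlation ID to its own state, and reap sequences that have gone idle.

```python
import time

# Toy analogue of Triton's sequence lifecycle management: per-ID state
# routing plus idle-timeout cleanup. Not Triton code -- an illustration
# of the problem the sequence batcher solves for you.
class SessionManager:
    def __init__(self, max_idle_s=300.0):
        self.max_idle_s = max_idle_s
        self._sessions = {}  # correlation id -> (state, last_used)

    def get_state(self, session_id, initial=None):
        state, _ = self._sessions.get(session_id, (initial, None))
        return state

    def put_state(self, session_id, state):
        self._sessions[session_id] = (state, time.monotonic())

    def reap_idle(self):
        # Triton's analogue of this knob is max_sequence_idle_microseconds.
        now = time.monotonic()
        for sid in list(self._sessions):
            if now - self._sessions[sid][1] > self.max_idle_s:
                del self._sessions[sid]

manager = SessionManager()
manager.put_state(42, {"hidden": [0.1, 0.2]})
assert manager.get_state(42) == {"hidden": [0.1, 0.2]}
```

Multiply this by many model instances across many nodes and the appeal of delegating it to the serving layer is obvious.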
You configure this in your config.pbtxt file. For a stateless model, you typically don't need any special settings related to state. For a stateful model, you add a sequence_batching block:
name: "rnn_model"
platform: "pytorch_libtorch"
max_batch_size: 8
# This is the key for stateful models
sequence_batching {
  max_sequence_idle_microseconds: 5000000
  control_input [
    {
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          int32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "END"
      control [
        {
          kind: CONTROL_SEQUENCE_END
          int32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
The sequence_batching block is what tells Triton this model manages state. max_sequence_idle_microseconds specifies how long a sequence may sit idle before Triton assumes the client has abandoned it and frees its slot (here, five seconds). The control_input entries tell Triton which model input tensors should be fed the start and end signals, so the model itself knows when to reset its hidden state.
A common point of confusion is how batching interacts with state. A stateful model uses the sequence batcher instead of the dynamic batcher, but it still batches: with the Direct scheduling strategy (the default), each active sequence is pinned to its own slot in the batch, so a single batch can carry one request from each of several concurrent sequences, while every request within a given sequence keeps hitting the same model instance and slot, in order. The alternative Oldest strategy groups requests more like the dynamic batcher does, while still guaranteeing that only one request per sequence is in flight at a time. Either way, sequences stay isolated from each other; batching never mixes their state.
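The slot-pinning idea can be sketched in a few lines (an assumption-laden simplification of the Direct strategy, not Triton's actual implementation): each active sequence gets a dedicated slot, so one batch can carry one request from each of several sequences without mixing their state.

```python
# Toy sketch of Direct-style sequence batching: a fixed number of batch
# slots, each pinned to at most one active sequence (correlation ID).
class DirectSequenceBatcher:
    def __init__(self, max_batch_size):
        self.slots = [None] * max_batch_size  # slot index -> correlation id

    def assign_slot(self, correlation_id):
        """Pin a sequence to a slot; raise if all slots are occupied."""
        if correlation_id in self.slots:
            return self.slots.index(correlation_id)
        for i, occupant in enumerate(self.slots):
            if occupant is None:
                self.slots[i] = correlation_id
                return i
        raise RuntimeError("no free slot: too many concurrent sequences")

    def release_slot(self, correlation_id):
        """Free the slot when the sequence ends (or times out idle)."""
        self.slots[self.slots.index(correlation_id)] = None

batcher = DirectSequenceBatcher(max_batch_size=2)
slot_a = batcher.assign_slot(42)
slot_b = batcher.assign_slot(43)
# Two different sequences share one batch, each in its own slot.
assert slot_a != slot_b
batcher.release_slot(42)
```

Notice the failure mode this implies: if more sequences are active than there are slots, new sequences must wait (or be rejected), which is why idle-sequence timeouts matter.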
What most people don't realize is that declaring a model stateful doesn't make Triton improvise sessions for you. A request that reaches a sequence-batching model without a sequence ID isn't silently given fresh state; Triton rejects it. Likewise, Triton won't guess where a sequence begins or ends: forgetting the sequence_start flag on the first request of a new sequence is a classic source of errors. The client owns the sequence lifecycle; Triton owns the routing, the batching, and the state.
The next concept you'll likely encounter is how to manage and clean up these sequences, especially in long-running applications where abandoned sequences can quietly become a resource leak.