Triton’s multimodal serving doesn’t just run two models side-by-side; it orchestrates them as a single, cohesive unit, dynamically routing data based on its modality.
Let’s watch Triton handle a request that needs both an image and text. Imagine we have a clip-vision model and a bert model. A client sends a JSON payload like this:
{
  "inputs": [
    {
      "name": "image_input",
      "shape": [1],
      "datatype": "BYTES",
      "data": ["base64_encoded_image_bytes"]
    },
    {
      "name": "text_input",
      "shape": [1],
      "datatype": "BYTES",
      "data": ["This is a sentence about the image."]
    }
  ],
  "parameters": {
    "vision_model_name": "clip-vision",
    "language_model_name": "bert"
  }
}
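A minimal Python sketch of how a client might assemble this request. The endpoint path follows the KServe v2 inference protocol; the server address and the literal image bytes here are assumptions for illustration:

```python
import base64
import json

def build_payload(image_bytes: bytes, sentence: str) -> dict:
    """Assemble a KServe-v2-style inference request for the ensemble."""
    return {
        "inputs": [
            {
                "name": "image_input",
                "shape": [1],
                # Strings and raw bytes travel as BYTES in the v2 protocol.
                "datatype": "BYTES",
                "data": [base64.b64encode(image_bytes).decode("ascii")],
            },
            {
                "name": "text_input",
                "shape": [1],
                "datatype": "BYTES",
                "data": [sentence],
            },
        ],
        "parameters": {
            "vision_model_name": "clip-vision",
            "language_model_name": "bert",
        },
    }

payload = build_payload(b"\x89PNG...", "This is a sentence about the image.")
# The request would then be POSTed to the ensemble endpoint, e.g.
# requests.post("http://localhost:8000/v2/models/multimodal_ensemble/infer",
#               data=json.dumps(payload))   # address assumed
```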
Triton’s multimodal server, configured with a specific model repository, receives this. It doesn’t simply pass the base64 image to clip-vision and the string to bert independently. Instead, it uses an "ensemble" configuration, which defines how inputs are processed and which models are chained together.
Here’s a simplified config.pbtxt for such an ensemble:
name: "multimodal_ensemble"
platform: "ensemble"
input [
  {
    name: "image_input"
    data_type: TYPE_STRING  # Triton has no TYPE_BYTES; variable-length bytes use TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "vision_output"
    data_type: TYPE_FP32
    dims: [ 512 ]
  },
  {
    name: "language_output"
    data_type: TYPE_FP32
    dims: [ 768 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "clip-vision"
      model_version: -1
      input_map {
        key: "image_input"    # input name the clip-vision model expects
        value: "image_input"  # ensemble tensor that feeds it
      }
      output_map {
        key: "OUTPUT__0"      # assuming clip-vision's output is named this
        value: "vision_output"
      }
    },
    {
      model_name: "bert"
      model_version: -1
      input_map {
        key: "input_ids"      # assuming BERT expects 'input_ids'
        value: "text_input"
      }
      output_map {
        key: "OUTPUT__0"      # assuming BERT's main output is named this
        value: "language_output"
      }
    }
  ]
}
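For this ensemble to load, the model repository needs one directory per model. The layout below mirrors the names in the configs above; the file names and ONNX format are assumptions, and note that the ensemble’s version directory must exist even though it holds no model file:

```
model_repository/
├── multimodal_ensemble/
│   ├── config.pbtxt
│   └── 1/              # empty; an ensemble has no model file of its own
├── clip-vision/
│   ├── config.pbtxt
│   └── 1/model.onnx
└── bert/
    ├── config.pbtxt
    └── 1/model.onnx
```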
When the request arrives, Triton:
- Parses the input: it sees image_input (base64 bytes) and text_input (string array).
- Consults the ensemble config: it identifies multimodal_ensemble.
- Executes Step 1: it routes image_input to the clip-vision model. The clip-vision model decodes the base64, processes the image, and outputs a feature vector. This output is then internally mapped to vision_output.
- Executes Step 2: it routes text_input to the bert model. The bert model processes the text and outputs its own feature vector. This output is mapped to language_output.
- Combines outputs: the ensemble gathers vision_output and language_output and returns them as a single response.
The "magic" here is the ensemble_scheduling block. It defines a directed acyclic graph (DAG) of model executions. In this example the two steps have no dependency on each other, so Triton is free to run them concurrently; a step that consumes another step’s output simply waits until that tensor is ready. You can build branching and merging pipelines this way, though truly conditional execution (running a step only for some requests) is beyond what a static ensemble can express and is typically handled with the Python backend’s Business Logic Scripting (BLS) instead. The key is that Triton understands that the entire payload is for this one logical multimodal task, and it routes the parts of the payload to the correct sub-models as defined by the ensemble. The parameters in the client request are not used by Triton for model selection in this ensemble config; they are forwarded with the request, where a custom backend or the client application can inspect them. Dynamically choosing which pipeline to run per request is likewise BLS territory.
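The scheduling idea can be sketched in plain Python. This is a toy simulation, not Triton internals: a step fires once every tensor it consumes exists, so the two independent steps run first, and a fusion-style step that depends on both waits its turn.

```python
def run_ensemble(steps, tensors):
    """Toy DAG scheduler mimicking ensemble input_map/output_map resolution.
    Each step is a dict with 'name', 'inputs' (tensor names consumed),
    'outputs' (tensor names produced), and 'fn' (returns a tuple of outputs)."""
    pending = list(steps)
    order = []
    while pending:
        # A step is ready when every tensor it reads already exists.
        ready = [s for s in pending if all(t in tensors for t in s["inputs"])]
        if not ready:
            raise RuntimeError("cycle or missing input in ensemble graph")
        for step in ready:
            results = step["fn"](*(tensors[t] for t in step["inputs"]))
            tensors.update(zip(step["outputs"], results))
            order.append(step["name"])
            pending.remove(step)
    return order, tensors

# Stand-in models; the real ones would be clip-vision, bert, etc.
steps = [
    {"name": "vision", "inputs": ["image_input"],
     "outputs": ["vision_output"], "fn": lambda img: ([0.1] * 4,)},
    {"name": "language", "inputs": ["text_input"],
     "outputs": ["language_output"], "fn": lambda txt: ([0.2] * 4,)},
    {"name": "fusion", "inputs": ["vision_output", "language_output"],
     "outputs": ["final_prediction"], "fn": lambda v, l: ([sum(v) + sum(l)],)},
]
order, out = run_ensemble(steps, {"image_input": b"...", "text_input": "hi"})
```

The fusion step cannot run in the first pass because vision_output and language_output do not exist yet; it executes only after both producers finish, which is exactly the dependency behavior the DAG encodes.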
The problem this solves is moving beyond simple, single-model inference. Real-world AI often requires multiple steps: feature extraction from raw data (like images), then feeding those features into a reasoning model, or combining text embeddings with image embeddings. Triton’s ensemble feature allows you to define these complex pipelines as a single, callable endpoint. You don’t need to manage the orchestration of multiple model calls, data transformations between them, or error handling across them at the application level. Triton handles it.
The most surprising thing about Triton’s ensemble system is that it doesn’t just execute models in parallel or sequence; it allows you to define arbitrary data flow between model outputs and inputs, effectively creating custom inference graphs that can be treated as a single model from the client’s perspective. This means you can build sophisticated multimodal pipelines, fusion networks, or multi-stage reasoning systems, all exposed through a unified gRPC or HTTP API.
Here’s how you might combine the outputs of vision_output and language_output in a subsequent step of a more complex ensemble, perhaps a fusion model:
name: "fusion_ensemble"
platform: "ensemble"
input [
  {
    name: "image_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "final_prediction"
    data_type: TYPE_FP32
    dims: [ 10 ]  # example: 10 classes
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "clip-vision"  # first model
      model_version: -1
      input_map {
        key: "image_input"
        value: "image_input"
      }
      output_map {
        key: "OUTPUT__0"
        value: "vision_embedding"  # intermediate 512-dim tensor
      }
    },
    {
      model_name: "bert"  # second model
      model_version: -1
      input_map {
        key: "input_ids"
        value: "text_input"
      }
      output_map {
        key: "OUTPUT__0"
        value: "language_embedding"  # intermediate 768-dim tensor
      }
    },
    {
      model_name: "fusion_model"  # third model, consumes the previous outputs
      model_version: -1
      input_map {
        key: "vision_input_to_fusion"
        value: "vision_embedding"
      }
      input_map {
        key: "language_input_to_fusion"
        value: "language_embedding"
      }
      output_map {
        key: "OUTPUT__0"
        value: "final_prediction"
      }
    }
  ]
}
The value of an input_map entry names a tensor in the ensemble: either one of the ensemble’s declared inputs or the value of an earlier step’s output_map. This is how data flows. Here, clip-vision and bert have no dependency on each other, so Triton can run them in parallel, with both of their outputs feeding fusion_model.
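What fusion_model does internally is up to you; a common design is to concatenate the two embeddings and project the result to class logits. A minimal NumPy sketch of that idea, with a purely illustrative random weight matrix:

```python
import numpy as np

def fuse(vision_emb: np.ndarray, language_emb: np.ndarray,
         weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Concatenate a 512-dim vision embedding with a 768-dim language
    embedding and apply one linear layer to get 10 class logits."""
    fused = np.concatenate([vision_emb, language_emb])  # shape (1280,)
    return fused @ weights + bias                       # shape (10,)

rng = np.random.default_rng(0)
W = rng.standard_normal((1280, 10)) * 0.01  # illustrative, untrained weights
b = np.zeros(10)
logits = fuse(rng.standard_normal(512), rng.standard_normal(768), W, b)
```

In practice this concatenate-and-project step would be a trained model exported to one of Triton's backends, not hand-written NumPy; the sketch only shows the tensor shapes the fusion step has to reconcile.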
The next frontier after mastering multimodal ensembles is managing stateful multimodal models, where the model’s internal state must be preserved across multiple requests, as in conversational AI; this is the territory of Triton’s sequence batcher.