Triton Ensemble Pipelines let you chain multiple models together on the server, but they aren’t executed as a client-driven sequence of calls: Triton schedules them as a dataflow graph.

Here’s a Triton ensemble pipeline in action. We’ll define an ensemble that takes an image, runs it through a feature extractor, and then feeds those features into a classifier.

# ensemble_config.pbtxt
name: "image_classifier_ensemble"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "input_image"
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]
output [
  {
    name: "class_probabilities"
    data_type: TYPE_FP32
    dims: [ 1, 1000 ]  # example class count; match your classifier's output
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "feature_extractor"
      model_version: -1
      input_map {
        key: "input_image"        # tensor name inside feature_extractor
        value: "input_image"      # ensemble-level tensor name
      }
      output_map {
        key: "extracted_features"
        value: "features"         # intermediate tensor, internal to the ensemble
      }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map {
        key: "input_features"
        value: "features"
      }
      output_map {
        key: "class_probabilities"
        value: "class_probabilities"
      }
    }
  ]
}

This ensemble_config.pbtxt file describes our "image_classifier_ensemble." It has two steps:

  1. feature_extractor: This model takes an input named input_image and produces an output named extracted_features; the output_map renames that tensor to features for use inside the ensemble.
  2. classifier: This model takes an input named input_features, which the input_map wires to the features tensor produced by the previous step, and produces the ensemble’s final output, class_probabilities. In every map, the key is the tensor name the underlying model declares, and the value is the ensemble-level tensor name.
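
For the mapping to line up, the underlying models must declare those tensor names in their own configs. A hypothetical config.pbtxt for feature_extractor (the backend, shapes, and feature dimension here are illustrative assumptions, not values from the ensemble above) might look like:

```
# models/feature_extractor/config.pbtxt (illustrative)
name: "feature_extractor"
backend: "onnxruntime"         # assumption; any Triton backend works
max_batch_size: 0
input [
  {
    name: "input_image"        # must match the ensemble's input_map key
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]
output [
  {
    name: "extracted_features" # must match the ensemble's output_map key
    data_type: TYPE_FP32
    dims: [ 1, 512 ]           # example feature dimension
  }
]
```

If a key in an input_map or output_map doesn’t match a tensor name in the underlying model’s config, Triton rejects the ensemble at load time rather than at inference time.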

When you send a request to image_classifier_ensemble, Triton’s ensemble scheduler takes over. It treats the steps as nodes in a dependency graph and dispatches each step to its backend as soon as all of that step’s input tensors are available: classifier starts the moment feature_extractor produces the features tensor, with no client round trip in between. Steps with no data dependency between them can run concurrently, and intermediate tensors like features stay in server memory (on the GPU where possible) instead of being serialized back to the client. The input_map and output_map entries define how data flows between the models; Triton’s scheduler manages those dependencies at runtime.

The problem this solves is reducing latency and increasing throughput. Instead of making two separate requests (one to feature_extractor, then another to classifier using the result), you make one request to the ensemble. Triton handles the internal orchestration, which can be significantly faster than client-side sequential calls, especially when dealing with many small models or models with high overhead per-request.
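
To see why this matters, here’s a back-of-the-envelope comparison. All numbers below are illustrative assumptions, not benchmarks:

```python
# Illustrative latency model: two client-driven calls pay the network
# round trip (and feature-tensor transfer) twice; the ensemble pays once.
rtt_ms = 2.0          # assumed client<->server round-trip time
feature_ms = 8.0      # assumed feature_extractor compute time
classifier_ms = 3.0   # assumed classifier compute time

# Client-side sequential calls: two round trips, and the intermediate
# features tensor crosses the wire in both directions.
sequential_ms = (rtt_ms + feature_ms) + (rtt_ms + classifier_ms)

# Ensemble: one round trip; features never leave server memory.
ensemble_ms = rtt_ms + feature_ms + classifier_ms

print(sequential_ms, ensemble_ms)  # 15.0 13.0 with these numbers
```

The saving grows with round-trip time and with the size of the intermediate tensors, which is why ensembles pay off most for chains of many small models.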

Here’s how you’d send a request to this ensemble using the Python client:

import numpy as np
import tritonclient.http as httpclient

# Assuming Triton is running on localhost:8000
triton_client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare the *initial* input only -- the intermediate "features" tensor
# never leaves the server, so the client never constructs it.
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example image tensor

# Create the inference request
infer_input = httpclient.InferInput("input_image", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

# The ensemble name is what we send the request to
ensemble_name = "image_classifier_ensemble"

# Request the output declared in the ensemble config
requested_output = httpclient.InferRequestedOutput("class_probabilities")

results = triton_client.infer(
    ensemble_name,
    inputs=[infer_input],
    outputs=[requested_output],
)

# Process the results
probabilities = results.as_numpy("class_probabilities")
print(probabilities)
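
Once the probabilities array is back on the client, postprocessing is ordinary NumPy. A minimal sketch, assuming a (1, num_classes) output (the array below is a stand-in for the real response):

```python
import numpy as np

# Stand-in for results.as_numpy("class_probabilities"): a fake (1, 5)
# probability vector, just to show the shape handling.
probabilities = np.array([[0.05, 0.10, 0.60, 0.20, 0.05]])

top_class = int(np.argmax(probabilities, axis=1)[0])  # index of the best class
confidence = float(probabilities[0, top_class])       # its probability

print(top_class, confidence)  # -> 2 0.6
```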

The mental model to hold onto is that the ensemble configuration defines a directed graph of dependencies, and Triton’s scheduler executes this graph. The input_map and output_map are the edges of this graph, specifying how the output tensor of one node (model_name) becomes the input tensor of another. Triton’s runtime is responsible for efficiently scheduling the execution of these nodes, potentially in parallel or with overlapping execution, to minimize overall latency. The key is that you, as the client, only see a single request and a single response, abstracting away the internal multi-model execution.
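
That graph-execution model can be sketched in a few lines of plain Python. This is a toy scheduler under simplifying assumptions, not Triton’s implementation: each step fires as soon as all of its input tensors exist in a shared pool, which is exactly the dependency structure the input_map/output_map edges encode.

```python
# Toy dataflow executor mirroring ensemble semantics (not Triton code).
# Map keys are the model's own tensor names; map values are pool names.

def run_ensemble(steps, pool):
    pending = list(steps)
    while pending:
        progressed = False
        for step in list(pending):
            if all(name in pool for name in step["input_map"].values()):
                inputs = {k: pool[v] for k, v in step["input_map"].items()}
                outputs = step["fn"](inputs)              # "run the model"
                for k, v in step["output_map"].items():
                    pool[v] = outputs[k]                  # publish outputs
                pending.remove(step)
                progressed = True
        if not progressed:
            raise RuntimeError("unsatisfiable dependency in ensemble graph")
    return pool

steps = [
    {   # classifier listed FIRST: declaration order doesn't drive execution
        "input_map": {"input_features": "features"},
        "output_map": {"class_probabilities": "class_probabilities"},
        "fn": lambda ins: {"class_probabilities": [p * 2 for p in ins["input_features"]]},
    },
    {
        "input_map": {"input_image": "input_image"},
        "output_map": {"extracted_features": "features"},
        "fn": lambda ins: {"extracted_features": [sum(ins["input_image"])]},
    },
]

pool = run_ensemble(steps, {"input_image": [1, 2, 3]})
print(pool["class_probabilities"])  # -> [12]
```

Note that the classifier step is declared first yet runs second, because its features input doesn’t exist until the feature_extractor step publishes it: execution order comes from data dependencies, not declaration order.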

What most people don’t realize is that model_version: -1 in the ensemble configuration doesn’t just mean "use whatever version was latest at load time." Triton resolves it to the newest available version of that model at inference time, so each request can pick up a freshly loaded version. This lets you update individual models within an ensemble without redeploying the ensemble configuration itself, provided their input/output tensor shapes and types remain compatible.
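
Concretely, with a model repository laid out like this (the version numbers and file names are illustrative), model_version: -1 resolves to the highest-numbered available version at the time each request arrives:

```
model_repository/
├── feature_extractor/
│   ├── config.pbtxt
│   ├── 1/model.onnx
│   └── 2/model.onnx      # -1 resolves here once version 2 is loaded
├── classifier/
│   ├── config.pbtxt
│   └── 1/model.onnx
└── image_classifier_ensemble/
    ├── config.pbtxt      # the ensemble config shown earlier
    └── 1/                # empty version directory; an ensemble has no weights
```

Dropping a 3/ directory into feature_extractor/ and loading it is enough for the ensemble to start using it, with no change to the ensemble itself.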

The next concept you’ll run into is handling conditional logic or branching within your pipelines.

Want structured learning?

Take the full Triton course →