Triton Concurrent Model Execution, often referred to as "model instances," lets you run the same model multiple times concurrently on the same GPU or across multiple GPUs.

Here’s Triton serving a TensorFlow ResNet-50 model, with two instances running on GPU 0 and one instance on GPU 1. Per-model settings live in a config.pbtxt file (protobuf text format, not JSON) inside the model’s directory in the model repository:

name: "resnet50_v1"
platform: "tensorflow_graphdef"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
version_policy: { all: {} }

Server-wide settings are passed as tritonserver command-line flags rather than in the model config, for example:

tritonserver --model-repository=/mnt/models \
    --model-control-mode=explicit \
    --backend-config=tensorflow,gpu-memory-fraction=0.7

This configuration tells Triton to load the resnet50_v1 model with two instances on GPU 0 and one instance on GPU 1, each executing batches of up to 8 requests.

The core problem Triton Concurrent Model Execution solves is maximizing GPU utilization and throughput for stateless models. If you have a model that doesn’t maintain state between requests (like most inference models), and a single instance on a GPU isn’t fully saturating it, you can run more instances of that same model to handle more requests simultaneously. This is especially powerful when you have multiple GPUs, allowing you to distribute instances across them for even greater parallelism.
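As a back-of-envelope illustration of why this helps (plain arithmetic, not a Triton API): if each instance serves one request at a time, the throughput ceiling scales with the number of instances that can genuinely run concurrently.

```python
def max_throughput(latency_ms: float, instances: int) -> float:
    """Idealized upper bound on requests/sec when each instance serves one
    request at a time and the hardware can run all instances concurrently."""
    return 1000.0 * instances / latency_ms

# A hypothetical model with 50 ms per-request latency:
assert max_throughput(50, 1) == 20.0  # one instance caps out near 20 req/s
assert max_throughput(50, 3) == 60.0  # three concurrent instances near 60 req/s
```

In practice the scaling is sub-linear once instances start contending for the same GPU's compute and memory bandwidth, which is why profiling with different instance counts is worthwhile.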

Internally, Triton creates a separate execution context for each model instance. On a GPU, instances share the device’s memory and compute, but each runs in its own context so their work can overlap. When a request arrives, Triton’s scheduler routes it to an available instance. For TensorFlow models, the backend’s gpu-memory-fraction option (set at startup, e.g. --backend-config=tensorflow,gpu-memory-fraction=0.7) limits the total GPU memory that TensorFlow models running on a GPU can consume. This prevents a single model or a combination of models from hogging all available GPU memory.
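The dispatch behavior can be mimicked with a toy scheduler: a shared request queue drained by one worker per instance. This is an illustrative sketch of the idea, not Triton's actual scheduler code.

```python
import queue
import threading

def serve(requests, num_instances):
    """Toy per-model scheduling: whichever instance is free pulls the
    next request from a shared queue, so no instance sits idle while
    work is pending."""
    pending = queue.Queue()
    for r in requests:
        pending.put(r)
    handled = {i: [] for i in range(num_instances)}

    def instance_worker(instance_id):
        while True:
            try:
                req = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this instance
            handled[instance_id].append(req)

    workers = [threading.Thread(target=instance_worker, args=(i,))
               for i in range(num_instances)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return handled

# All 10 requests are served exactly once, split across 3 "instances".
result = serve(list(range(10)), num_instances=3)
assert sorted(r for reqs in result.values() for r in reqs) == list(range(10))
```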

The instance_group is the key configuration parameter. You specify the count of instances, the kind (typically KIND_GPU or KIND_CPU), and, for KIND_GPU, the specific gpus to use. The count applies per GPU listed in the gpus array: {count: 2, kind: KIND_GPU, gpus: [0]} means two instances on GPU 0, while listing gpus: [0, 1] in the same group would create two instances on each of GPU 0 and GPU 1.
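Because count applies to each listed GPU, the total instance count follows a simple rule. A small helper (hypothetical, for illustration only) makes it concrete:

```python
def total_instances(instance_groups):
    """count applies to each GPU listed in a group's gpus array,
    so a group contributes count * len(gpus) instances."""
    return sum(g["count"] * len(g["gpus"]) for g in instance_groups)

# The example config: 2 instances on GPU 0 plus 1 on GPU 1 -> 3 total.
groups = [
    {"count": 2, "kind": "KIND_GPU", "gpus": [0]},
    {"count": 1, "kind": "KIND_GPU", "gpus": [1]},
]
assert total_instances(groups) == 3

# A single group with count: 2 and gpus: [0, 1] means 2 per GPU -> 4 total.
assert total_instances([{"count": 2, "kind": "KIND_GPU", "gpus": [0, 1]}]) == 4
```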

The most surprising thing about instance_group is that you can define multiple instance_group entries for the same model, which allows highly granular control. For example, you can run two instances of a model on GPU 0 and another instance of the same model on GPU 1, as shown in the example. You can even add KIND_CPU instances alongside GPU instances if the model’s backend supports CPU execution and you have specific latency or resource-allocation needs.
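A sketch of such a mixed layout in config.pbtxt (the KIND_CPU group assumes the model's backend can execute on CPU):

```
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] },
  { count: 1, kind: KIND_GPU, gpus: [ 1 ] },
  { count: 2, kind: KIND_CPU }
]
```

The CPU instances can absorb overflow traffic or serve as a fallback, at the cost of higher per-request latency than the GPU instances.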

The version policy determines which versions of the model Triton loads. { all: {} } means load every version found in the model’s repository directory, and each loaded version gets its own set of instances. A request is routed to the version it names, or to the latest version if it doesn’t specify one; Triton does not spread unversioned requests across versions. For concurrent execution of the same model, you typically want a single, latest version.
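If you want only the newest version loaded, the explicit form in config.pbtxt is:

```
version_policy: { latest: { num_versions: 1 } }
```

This is also Triton's default behavior when no version policy is specified.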

When you specify multiple instances for a model, Triton automatically handles the request distribution. The scheduler tries to balance the load across the available instances. If one instance is busy, requests are routed to other available instances of the same model. This provides a built-in form of load balancing for identical model deployments.

The max_batch_size and preferred_batch_size settings bound each individual batch, and each batch is executed by a single instance. With 2 instances of a model, two batches of up to 8 can be in flight at once, so the model’s aggregate batching throughput is higher than with a single instance. Triton’s dynamic batcher still sits in front of the instances, grouping incoming requests into batches (ideally a preferred_batch_size, never more than max_batch_size) before dispatching each batch to a free instance.
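A toy sketch of that batching logic (greatly simplified; the real dynamic batcher also weighs queue delay, timeouts, and in-flight instances):

```python
def form_batches(queued_requests, preferred_sizes=(8, 4), max_batch_size=8):
    """Greedily cut a request queue into batches, favoring the largest
    preferred size that fits and never exceeding max_batch_size.
    Illustrative only, not Triton's actual algorithm."""
    batches = []
    i = 0
    n = len(queued_requests)
    while i < n:
        remaining = n - i
        # Largest preferred size that fits; otherwise take what's left.
        size = next((p for p in preferred_sizes if p <= remaining),
                    min(remaining, max_batch_size))
        batches.append(queued_requests[i:i + size])
        i += size
    return batches

# 10 queued requests with preferred sizes (8, 4): one batch of 8, one of 2.
batches = form_batches(list(range(10)))
assert [len(b) for b in batches] == [8, 2]
```

Each resulting batch then goes to whichever instance is free, which is where the per-instance concurrency multiplies the batching gains.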

The next step after mastering concurrent model execution is often to explore Triton’s ensemble models, where you chain multiple models together to form a more complex inference pipeline.

Want structured learning?

Take the full Triton course →