The Triton Model Repository is more than just a directory; it’s a structured, versioned system for managing your machine learning models that Triton can load and serve.
Let’s see it in action. Imagine you have two versions of a PyTorch model, my_model, one trained for batch size 1 and another for batch size 8.
model_repository/
├── my_model/
│ ├── 1/
│ │ └── model.pt
│ ├── 2/
│ │ └── model.pt
│ └── config.pbtxt
├── another_model/
│ ├── 1/
│ │ └── model.savedmodel
│ └── config.pbtxt
└── ...
Here, my_model is the model name. Inside it, 1 and 2 are model versions. Each version directory contains the actual model artifacts. For PyTorch, this is model.pt; for TensorFlow SavedModel, it’s the saved_model.pb and variables directory. The crucial piece is config.pbtxt at the model’s root level.
The config.pbtxt file is the control center for each model. It tells Triton everything it needs to know: the model’s platform, how to handle input/output tensors, batching strategies, and more.
Here’s a config.pbtxt for our my_model:
name: "my_model"
platform: "pytorch_libtorch"
max_batch_size: 8
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [ 1, 224, 224 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]
name is straightforward. platform specifies the inference backend (e.g., pytorch_libtorch, tensorflow_savedmodel, onnxruntime). max_batch_size is a key optimization; Triton can dynamically batch requests up to this size if your model supports it. The input and output sections define the tensor shapes and data types. Note two details: because max_batch_size is greater than zero, the batch dimension is implicit and is not listed in dims, and a dimension of -1 marks a variable-size axis. Also, TorchScript models exported for the libtorch backend don't carry tensor names, so Triton expects the positional naming convention INPUT__0, INPUT__1, and so on.
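To actually turn on dynamic batching, you add a dynamic_batching block to the model config. A minimal sketch (the preferred batch sizes and queue delay here are illustrative values, not recommendations):

```
dynamic_batching {
  # Batch sizes Triton should try to form before dispatching.
  preferred_batch_size: [ 4, 8 ]
  # How long (in microseconds) a request may wait for batch-mates.
  max_queue_delay_microseconds: 100
}
```

The trade-off is latency versus throughput: a longer queue delay gives Triton more opportunity to form full batches at the cost of per-request latency.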
Version directories (1, 2, etc.) allow for seamless updates and rollbacks. Triton can be configured to load a specific version, the latest version, or even multiple versions simultaneously for A/B testing. When Triton starts, it scans this repository. If you add a new version, say 3, to my_model/3/, and restart or trigger a repository update, Triton will pick it up.
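The version-loading behavior described above is controlled by the version_policy field in config.pbtxt. As a sketch, the default is to serve only the newest version, but you can pin specific versions, e.g. for A/B testing:

```
# Serve only the single most recent version (the default behavior):
version_policy: { latest: { num_versions: 1 } }

# Or serve versions 1 and 2 side by side:
# version_policy: { specific: { versions: [ 1, 2 ] } }
```

Clients then select a version per request, or omit it to get the policy's default.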
Perhaps the most surprising thing about this system is that Triton doesn't require a config.pbtxt for many common model types. If you omit it, Triton attempts to auto-complete the configuration from the model artifacts themselves. For a TensorFlow SavedModel, it can often figure out inputs, outputs, and data types; for ONNX, it reads the graph definition. However, explicit configuration gives you fine-grained control and is essential for advanced features like dynamic batching, instance groups, and custom model parameters.
How Triton watches the repository depends on its model control mode, set at startup. In the default mode (none), all models are loaded once at startup. In poll mode, Triton periodically rescans the repository at an interval set by the --repository-poll-secs flag and reloads models whose versions or config.pbtxt have changed, without a full server restart. In explicit mode, models are loaded and unloaded only on request via the model control API.
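For example, a launch that rescans the repository every 30 seconds could look like this (the /models path is illustrative):

```
tritonserver --model-repository=/models \
             --model-control-mode=poll \
             --repository-poll-secs=30
```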
The exact levers you control are primarily within config.pbtxt: instance_group for running multiple copies of a model and placing them on specific GPUs or CPUs, dynamic_batching for tuning how requests are batched (preferred_batch_size, max_queue_delay_microseconds), optimization for enabling backend accelerators such as TensorRT, and parameters for backend-specific key/value settings.
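As a concrete sketch, here is an instance_group that runs two copies of the model, spread across two GPUs (the counts and GPU indices are illustrative):

```
instance_group [
  {
    # Two execution instances of this model...
    count: 2
    # ...placed on GPUs 0 and 1.
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```

More instances let Triton overlap requests on the same hardware, at the cost of additional GPU memory per copy.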
The next concept you’ll run into is managing model dependencies and different model formats within the same repository.