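Triton’s "control mode" (set with --model-control-mode) determines how it loads and unloads models. There are three modes: "none" (the default), "poll", and "explicit". This section focuses on poll and explicit.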

Let’s see this in action. Imagine you have a directory structure like this:

/models
  /model_a
    config.pbtxt
    1/model.plan
  /model_b
    config.pbtxt
    1/model.plan

If Triton is started with --model-repository=/models --model-control-mode=poll --repository-poll-secs=10, it loads model_a and model_b at startup and then rescans /models every 10 seconds. If you then add /models/model_c/config.pbtxt and /models/model_c/1/model.plan, Triton will discover and load model_c on its next poll. Remove model_a, and it will be unloaded on the subsequent poll.
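To make the poll cycle concrete, here is a minimal Python sketch of the bookkeeping a single poll performs at the model level: list the model directories, then diff them against what is currently loaded. It is a simplification, not Triton’s implementation. It treats any directory containing a config.pbtxt as a model and ignores versions and configuration changes. The helper names scan_models, poll_once, add_model and poll_demo are hypothetical.

```python
import os
import shutil
import tempfile


def scan_models(root):
    """Return names of subdirectories of root that contain a config.pbtxt.

    Simplification: real Triton can auto-complete configs for some backends,
    so a missing config.pbtxt does not always mean "not a model".
    """
    return {
        name
        for name in os.listdir(root)
        if os.path.isfile(os.path.join(root, name, "config.pbtxt"))
    }


def poll_once(root, loaded):
    """Diff the repository against the loaded set; return (to_load, to_unload)."""
    present = scan_models(root)
    return sorted(present - loaded), sorted(loaded - present)


def add_model(root, name):
    """Create a minimal layout: <name>/config.pbtxt and <name>/1/model.plan."""
    os.makedirs(os.path.join(root, name, "1"))
    for rel in ("config.pbtxt", os.path.join("1", "model.plan")):
        open(os.path.join(root, name, rel), "w").close()


def poll_demo():
    """Replay the scenario from the text and return each poll's result."""
    results = []
    with tempfile.TemporaryDirectory() as root:
        add_model(root, "model_a")
        add_model(root, "model_b")
        loaded = set()

        def cycle():
            to_load, to_unload = poll_once(root, loaded)
            loaded.update(to_load)            # pretend the loads succeeded
            loaded.difference_update(to_unload)
            results.append((to_load, to_unload))

        cycle()                                # startup: model_a, model_b
        add_model(root, "model_c")
        cycle()                                # model_c is discovered
        shutil.rmtree(os.path.join(root, "model_a"))
        cycle()                                # model_a is gone
    return results
```

Calling poll_demo() replays the walkthrough. The first cycle loads model_a and model_b, the second discovers model_c, and the third unloads model_a.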

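Now, contrast that with --model-control-mode=explicit. In this mode, Triton loads models only when they are explicitly requested, either at startup with --load-model or at runtime through the model repository API over HTTP/gRPC. The /models directory still has to be passed with --model-repository, but Triton does not scan it on its own: a model sitting there stays unloaded until you ask for it. You then ask Triton to load (or unload) a model through that API: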

# Load model_a (Triton's HTTP endpoint listens on port 8000 by default)
curl -X POST localhost:8000/v2/repository/models/model_a/load

# Unload model_a
curl -X POST localhost:8000/v2/repository/models/model_a/unload
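The same load and unload calls can be made from Python. The sketch below only builds the requests with the standard library’s urllib.request, so it runs without a server. The base URL assumes Triton’s default HTTP port, and the helper name repository_request is hypothetical. If you prefer the official client, the tritonclient package exposes the same operations as InferenceServerClient.load_model and unload_model in tritonclient.http.

```python
import urllib.parse
import urllib.request


def repository_request(action, model, base_url="http://localhost:8000"):
    """Build a POST request for Triton's model repository load/unload endpoint.

    base_url is an assumption (Triton's default HTTP port); adjust as needed.
    """
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    # Quote the model name so unusual characters cannot break the URL path.
    name = urllib.parse.quote(model, safe="")
    url = f"{base_url}/v2/repository/models/{name}/{action}"
    # An empty body plus an explicit method gives a plain POST, like curl -X POST.
    return urllib.request.Request(url, data=b"", method="POST")
```

Against a running server started in explicit mode, urllib.request.urlopen(repository_request("load", "model_a")) sends the load. A 200 status should indicate success, and on an error response urlopen raises urllib.error.HTTPError.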

The core problem Triton solves here is managing the lifecycle of potentially many ML models within a single inference server. You don’t want to restart the server every time you deploy a new model or update an existing one. Control mode dictates how Triton discovers and manages these models.

Poll mode is simple: Triton periodically rescans the model repository for changes. When it finds a new model or a new version (a new numbered subdirectory inside the model’s directory), it loads it. If a model is removed from the repository, poll mode detects that on a later scan and unloads it. The --repository-poll-secs flag sets how often the scan runs: a shorter interval means changes are picked up sooner, at the cost of more frequent filesystem scanning.

Explicit mode, on the other hand, offers granular, on-demand control: you are the orchestrator, and Triton waits for your API calls to load or unload models. This suits dynamic environments where automated CI/CD pipelines deploy or update models frequently, or where you need precise control over which models are active at any given moment. The --model-repository flag is still required in explicit mode because it tells Triton where to find the model files when you ask it to load them.

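Version handling is where this gets more nuanced. A load request names a model, not a version: which versions Triton serves is governed by the version_policy in that model’s config.pbtxt, in any control mode. The default policy serves only the latest version, but a policy such as version_policy: { specific: { versions: [1, 2] } } or version_policy: { all: {} } lets Triton serve model_a versions 1 and 2 side by side, with clients choosing a version per request. What explicit mode adds is control over timing: you decide when a model, and therefore its updated set of versions, is (re)loaded. Combined with a multi-version policy, this supports blue-green deployments or canary releases directly through the inference server’s API.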
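Which versions Triton serves is governed by the version_policy field of a model’s config.pbtxt. To see how each policy kind narrows the set of served versions, here is a small Python model of the three kinds (latest, all, specific). The dict shape mirrors the config field but is a hypothetical stand-in, not Triton’s parser, and select_versions is a hypothetical helper. The sketch does not model how Triton reacts to a requested version that is missing on disk; it simply skips it.

```python
def select_versions(available, policy=None):
    """Return the sorted versions a version policy would serve.

    policy is a hypothetical dict mirroring config.pbtxt's version_policy:
      {"latest": {"num_versions": n}}, {"all": {}}, or
      {"specific": {"versions": [...]}}.
    None (or {}) stands for the default: the single latest version.
    """
    if not policy:
        policy = {"latest": {"num_versions": 1}}
    versions = sorted(available)
    if "latest" in policy:
        n = policy["latest"].get("num_versions", 1)
        return versions[-n:] if n > 0 else []
    if "all" in policy:
        return versions
    if "specific" in policy:
        wanted = set(policy["specific"]["versions"])
        # Requested versions that are not on disk are skipped in this sketch.
        return [v for v in versions if v in wanted]
    raise ValueError(f"unknown version policy: {policy!r}")
```

With versions 1, 2 and 3 on disk, the default policy serves only 3, while {"specific": {"versions": [1, 2]}} serves 1 and 2 together, which is the shape a blue-green or canary setup needs.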

A subtle point about poll mode is how it handles model updates. If model_a has 1/ and 2/ and uses the default version policy (latest, one version), Triton serves only the highest-numbered version, 2. If you then remove model_a/2/, the next poll detects the change and Triton falls back to serving model_a/1/. Poll mode is not just about discovery; it actively keeps the served versions in line with what is currently in the repository.
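The fallback can be reproduced with a short Python sketch that scans a model directory for numbered version subdirectories and applies the default latest-one policy. The rule that only positive integer names count (non-numeric names and names starting with 0 are ignored) follows Triton’s repository layout rules as we understand them; list_versions, served_version and fallback_demo are hypothetical helpers.

```python
import os
import shutil
import tempfile


def list_versions(model_dir):
    """Return sorted version numbers from numeric subdirectory names.

    Assumption: like Triton's layout rules as we understand them, names
    that are not ASCII digits or that start with "0" are ignored.
    """
    return sorted(
        int(name)
        for name in os.listdir(model_dir)
        if name.isascii() and name.isdigit() and not name.startswith("0")
        and os.path.isdir(os.path.join(model_dir, name))
    )


def served_version(model_dir):
    """Default policy (latest, one version): the highest version, or None."""
    versions = list_versions(model_dir)
    return versions[-1] if versions else None


def fallback_demo():
    """Serve model_a with 1/ and 2/, remove 2/, and report both choices."""
    with tempfile.TemporaryDirectory() as root:
        model_dir = os.path.join(root, "model_a")
        for v in ("1", "2"):
            os.makedirs(os.path.join(model_dir, v))
        before = served_version(model_dir)          # both versions on disk
        shutil.rmtree(os.path.join(model_dir, "2"))
        after = served_version(model_dir)           # next poll sees only 1/
    return before, after
```

fallback_demo() returns (2, 1): version 2 is served while both directories exist, and version 1 once 2/ is removed.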

The next concept you’ll likely encounter is how Triton handles model configuration files (config.pbtxt) and the implications of their content on loading behavior, especially regarding dynamic batching and model dependencies.
