You can run multiple versions of a model simultaneously and route traffic to them based on your own logic.
Here’s how you can set up A/B testing for model versions in Triton Inference Server.
Let’s say you have two versions of a resnet50 model: 1 and 2. You want to send 80% of your inference requests to version 1 and 20% to version 2.
First, ensure both model versions are correctly placed in your model repository, with the model configuration at the model level rather than inside the version directories. Your repository structure would look something like this:
model_repository/
  resnet50/
    config.pbtxt
    1/
      model.plan
    2/
      model.plan
A key point: Triton does not read a routing or scheduling file from the model repository, and it has no built-in weighted traffic splitting between versions. What it gives you are two building blocks.
First, the version_policy field in config.pbtxt controls which versions are loaded and available. The default policy serves only the latest (highest-numbered) version, so to run versions 1 and 2 side by side your config.pbtxt needs something like:
name: "resnet50"
platform: "tensorrt_plan"
version_policy: { specific { versions: [ 1, 2 ] } }
(version_policy: { all { } } also works if you simply want every version in the repository to be live.)
Second, every inference request can address a specific model version. With the HTTP/REST API the version appears in the URL (POST /v2/models/resnet50/versions/1/infer), and the Python and C++ client libraries expose it as a model_version argument. If a request leaves the version unset, Triton routes it according to the version policy.
The 80/20 split itself is therefore implemented on the client side, or in a proxy or gateway in front of Triton: for each request, pick version "1" with probability 0.8 and version "2" with probability 0.2, and set that version on the request. Because both versions are loaded by the same server, the two arms share one endpoint and differ only in the version field.
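A minimal client-side chooser can be sketched like this (choose_version and the weights mapping are illustrative names, not part of any Triton client library):

```python
import random

def choose_version(weights):
    """Pick a model version key at random, proportional to its weight.

    weights: mapping of version string -> relative weight,
    e.g. {"1": 0.8, "2": 0.2} for an 80/20 split.
    """
    versions = list(weights)
    return random.choices(versions, weights=[weights[v] for v in versions], k=1)[0]

# 80/20 split between versions "1" and "2" of resnet50:
split = {"1": 0.8, "2": 0.2}
version = choose_version(split)
```

The returned string is then passed along with the request, for example as the model_version argument of tritonclient.http.InferenceServerClient.infer, so that Triton executes the chosen version.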
To verify this is working, send a large number of inference requests and observe the distribution across versions. Triton's Prometheus endpoint (by default on port 8002 at /metrics) reports per-version counters such as nv_inference_request_success, labeled with the model name and version, so you would expect roughly 80% of successful requests counted against version 1 and 20% against version 2.
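When checking those counts, allow for sampling noise rather than expecting an exact 80/20 ratio. A small helper like this (illustrative, not part of Triton) flags a split that drifts outside a binomial tolerance:

```python
import math

def split_within_tolerance(n_total, n_v1, expected=0.8, z=3.0):
    """Return True if the observed share of version-1 requests lies within
    z standard deviations of the expected binomial share."""
    observed = n_v1 / n_total
    stddev = math.sqrt(expected * (1.0 - expected) / n_total)
    return abs(observed - expected) <= z * stddev

# e.g. 8,012 of 10,000 requests on version 1 is consistent with an 80% split
```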
Because the routing logic lives in your own client or gateway, you are not limited to weight-based splits: you can route on user ID, request headers, or any other metadata you control, although weight-based distribution is the most common pattern for A/B testing.
The config.pbtxt file, including version_policy, is read at model load time. To change which versions are served without restarting Triton, run the server with --model-control-mode=explicit and use the model repository API to reload the model after editing its configuration. The traffic weights themselves live in your client code, so changing the split requires no server-side action at all.
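A reload against the repository API can be sketched with the standard library (the host and port are the server's HTTP defaults; reload_model and load_url are illustrative wrappers):

```python
import urllib.request

def load_url(model_name, host="localhost:8000"):
    """Build the Triton model-repository load endpoint for a model."""
    return f"http://{host}/v2/repository/models/{model_name}/load"

def reload_model(model_name, host="localhost:8000"):
    """Ask a Triton server started with --model-control-mode=explicit to
    (re)load a model, picking up any edits to its config.pbtxt."""
    req = urllib.request.Request(load_url(model_name, host), data=b"{}", method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status
```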
The model repository structure is crucial. Triton expects each version of a model to be in its own numeric subdirectory (e.g., 1, 2, 3) within the main model directory (e.g., resnet50), and config.pbtxt resides at the same level as these version directories.
Making a fresh weighted random choice per request, rather than alternating in fixed batches, gives a smooth distribution over a large number of requests. It is also stateless, so many client processes can apply the same weights without coordinating with each other.
A common pitfall is the default version policy: without an explicit version_policy, Triton serves only the latest (highest-numbered) version, so requests addressed to version 1 will fail because that version is never loaded. Another is leaving the version unset on the request, which sends every request to the policy's default version instead of your weighted choice.
Weights that do not sum to 1.0 are less dangerous than they sound, since a chooser built on relative weights (as Python's random.choices uses) normalizes them implicitly. Still, keeping the table normalized makes the intended split explicit, and you should re-check it whenever a version is added or removed so traffic is never weighted toward a version that is no longer loaded.
The next step you’ll likely explore is implementing canary deployments, where a small percentage of traffic is gradually shifted to a new model version to test its stability and performance before a full rollout.
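The same client-side weighting covers that case: a canary is just a split whose new-version share is ramped up over time. A minimal sketch, where canary_split and the ramp values are illustrative:

```python
def canary_split(new_share, old="1", new="2"):
    """Weights table that sends new_share of traffic to the new version."""
    return {old: 1.0 - new_share, new: new_share}

# Ramp the canary from 5% to full rollout in stages, advancing a stage
# only after the new version looks healthy at the current share:
ramp = [0.05, 0.25, 0.5, 1.0]
stages = [canary_split(share) for share in ramp]
```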