Triton’s priority scheduling is not about making some requests faster than others; it’s about ensuring that the most important requests don’t get starved by a flood of less important ones.
Let’s see this in action. Imagine a Triton inference server handling requests for two different models: image_classification (high priority) and object_detection (low priority). We’ve configured the scheduler so that image_classification sits at a higher priority level.
Here’s a snippet of a hypothetical configuration file (config.pbtxt) that illustrates the idea; note that the priority_level and priority_queue_config fields here are simplified for illustration, not literal Triton options:
name: "my_inference_server"
model_config_list {
  config {
    name: "image_classification"
    platform: "tensorflow_saved_model"
    max_batch_size: 8
    instance_group {
      count: 2
      kind: KIND_GPU
    }
    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100000
    }
    priority_level: 10  # Higher priority
  }
  config {
    name: "object_detection"
    platform: "tensorflow_saved_model"
    max_batch_size: 8
    instance_group {
      count: 2
      kind: KIND_GPU
    }
    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100000
    }
    priority_level: 1  # Lower priority
  }
}
# This is the key knob for priority scheduling
priority_queue_config {
  low_priority_threshold_count: 2
  high_priority_queue_size: 10
  low_priority_queue_size: 10
}
In this setup, image_classification is assigned priority_level: 10 and object_detection gets priority_level: 1. The priority_queue_config defines the behavior: low_priority_threshold_count: 2 means that once two or more requests are pending across all models, the server starts enforcing priority ordering. high_priority_queue_size: 10 and low_priority_queue_size: 10 cap how many requests can be buffered at each priority level before new arrivals are rejected.
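For comparison, real Triton expresses priorities quite differently from this hypothetical: they are configured per model under dynamic_batching, the client attaches a priority to each individual request, and lower numbers mean higher priority (level 1 is the top). A sketch based on the fields in Triton’s model_config.proto:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100000

  # Two priority levels; requests that don't specify one get level 2.
  priority_levels: 2
  default_priority_level: 2

  # Per-level queue policy: level 1 (the highest) buffers at most
  # 10 requests and rejects timed-out requests instead of delaying them.
  priority_queue_policy {
    key: 1
    value {
      max_queue_size: 10
      timeout_action: REJECT
      default_timeout_microseconds: 100000
    }
  }
}
```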
Now, imagine a surge of requests. If only object_detection requests arrive, they’re processed normally, filling their queue slots. But as soon as a few image_classification requests come in, Triton’s scheduler prioritizes them: even if the object_detection queue is full, a high-priority image_classification request is placed at the head of the line and dispatched to the next available GPU instance, effectively jumping ahead. This prevents a backlog of low-priority traffic from blocking critical, high-priority inferences.
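The jump-ahead behavior can be sketched with a plain Python priority queue. This is a simulation of the scenario above, not Triton internals; the model names and priority_level values mirror the hypothetical config:

```python
import heapq
import itertools

# Heap entries are (negated priority, arrival order, request). heapq pops
# the smallest tuple first, so negating priority_level makes higher-priority
# requests pop first, with arrival order breaking ties (FIFO per level).
queue = []
order = itertools.count()

def enqueue(request, priority_level):
    heapq.heappush(queue, (-priority_level, next(order), request))

def dispatch():
    _, _, request = heapq.heappop(queue)
    return request

# A backlog of low-priority object_detection requests builds up...
for i in range(4):
    enqueue(f"object_detection-{i}", priority_level=1)

# ...then two high-priority image_classification requests arrive.
enqueue("image_classification-0", priority_level=10)
enqueue("image_classification-1", priority_level=10)

# The high-priority requests are dispatched first, jumping the backlog.
print([dispatch() for _ in range(6)])
# → ['image_classification-0', 'image_classification-1',
#    'object_detection-0', 'object_detection-1',
#    'object_detection-2', 'object_detection-3']
```

The arrival counter matters: without it, two requests at the same priority would be compared by payload, losing the first-come, first-served order within a level.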
The problem Triton’s priority scheduling solves is resource starvation under mixed-priority workloads. Without it, a high-volume, low-value request stream could saturate the system, making it impossible for critical, low-volume requests to get through in a timely manner. In the hypothetical scheme above, the scheduler behaves like weighted fair queuing: when contention occurs, a higher priority_level grants a request a larger share of processing time, and the higher the number, the higher the priority. (Be aware that Triton’s actual model configuration reverses this convention: request priorities are set under dynamic_batching via priority_levels, and level 1 is the highest priority.)
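One way to picture weighted fair queuing is the classic smooth weighted round-robin algorithm. The sketch below is illustrative only (Triton does not document its dispatch logic at this level); the weights stand in for the priority_level values from the config:

```python
def smooth_wrr(weights, rounds):
    """Yield queue names in smooth weighted round-robin order.

    Each round, every queue's running score grows by its weight; the
    queue with the highest score is picked and its score is reduced by
    the total weight, which interleaves picks instead of bursting.
    """
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(rounds):
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks

# priority_level 10 vs 1: when both queues always have work pending,
# the high-priority queue is served ten times for every low-priority slot.
picks = smooth_wrr({"image_classification": 10, "object_detection": 1}, 11)
print(picks.count("image_classification"), picks.count("object_detection"))
# → 10 1
```

The same idea is used by nginx for upstream load balancing; its appeal is that the shares are exact over each cycle while individual picks stay evenly spread.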
The priority_queue_config is where the magic is tuned. low_priority_threshold_count acts as a gatekeeper. If the total number of pending requests across all models is below this threshold, the server might operate in a more "first-come, first-served" manner within each model’s queue. Once this threshold is met, the priority system becomes more active. high_priority_queue_size and low_priority_queue_size control how many requests of each priority level are buffered. If these queues overflow, Triton will start rejecting new requests for that priority level, preventing unbounded memory growth and signaling to clients that the system is overloaded.
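The overflow behavior can be sketched as a bounded queue that rejects new work when full. BoundedQueue and its offer/poll methods are hypothetical names mirroring the high_priority_queue_size / low_priority_queue_size idea, not a Triton interface:

```python
from collections import deque

class BoundedQueue:
    """FIFO queue that rejects new requests once it holds max_size items,
    bounding memory growth and signaling overload back to the client."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()

    def offer(self, request):
        if len(self._items) >= self.max_size:
            return False  # rejected: queue full, client sees overload
        self._items.append(request)
        return True

    def poll(self):
        return self._items.popleft() if self._items else None

# Mirroring high_priority_queue_size: 10 from the config above:
high = BoundedQueue(max_size=10)
accepted = [high.offer(f"req-{i}") for i in range(12)]
print(accepted.count(True), accepted.count(False))
# → 10 2
```

Rejecting at enqueue time is the load-shedding half of the design: a client that sees the rejection can back off or retry, instead of the server buffering unbounded work it will never finish in time.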
When you set priority_level for a model, you’re not just assigning a number; you’re defining its "weight" in the scheduler’s decision-making. In a weighted scheme like this, a request with priority_level: 10 carries roughly ten times the scheduling weight of a request with priority_level: 1 when both compete for a limited resource. The actual dispatch logic is more nuanced, factoring in current queue lengths and recent processing times, but priority_level is the primary knob.
What most people don’t realize is that the priority_level is a relative value. If all your models have priority_level: 1, they are all treated equally. If you have one model at priority_level: 100 and all others at priority_level: 1, the high-priority model will almost exclusively consume GPU resources during periods of contention, potentially leading to starvation for the lower-priority models. This means you need to carefully consider the relative importance of your models to avoid creating new bottlenecks.
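That starvation risk is easy to demonstrate: under strict highest-priority-first dispatch, a sustained high-priority stream keeps low-priority requests waiting indefinitely. A small simulation (again illustrative, not Triton internals):

```python
import heapq
import itertools

arrival = itertools.count()
pending = []

def submit(name, priority_level):
    # heapq is a min-heap, so negate priority_level: higher pops first.
    heapq.heappush(pending, (-priority_level, next(arrival), name))

# Two low-priority requests arrive first...
submit("object_detection-0", 1)
submit("object_detection-1", 1)

# ...but one new priority-100 request arrives for every dispatch slot:
served = []
for i in range(50):
    submit(f"critical-{i}", 100)
    served.append(heapq.heappop(pending)[2])  # strict highest-first

# After 50 slots, the low-priority requests have still never run.
print([n for n in served if n.startswith("object_detection")])
# → []
```

As long as a fresh priority-100 request arrives for every available slot, the priority-1 work sits in the queue forever; a weighted scheme with less extreme values (or a timeout that rejects stale requests) keeps the low-priority stream at least trickling through.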
The next step in managing Triton’s throughput is understanding dynamic batching and its interaction with priority scheduling.