Triton Inference Server can leverage TensorRT-LLM as a backend to achieve highly optimized inference for large language models.
Let’s see it in action. Imagine we have a model.plan file generated by TensorRT-LLM, and we want to serve it with Triton.
First, we need to create a config.pbtxt file for Triton.
name: "tensorrt_llm_model"
backend: "tensorrtllm"
max_batch_size: 8

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]

output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  }
]

parameters {
  key: "end_token_id"
  value {
    string_value: "100"  # example end token ID; config.pbtxt parameter values are always strings
  }
}
parameters {
  key: "start_token_id"
  value {
    string_value: "1"  # example start token ID
  }
}
parameters {
  key: "bad_words_ids"
  value {
    string_value: "10,11,12,13"  # example encoding: lists are serialized into the string for the backend to parse
  }
}
parameters {
  key: "stop_words_ids"
  value {
    string_value: "20,21"
  }
}
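With the config written, the files need to sit in a Triton model repository following Triton's standard layout (the engine file in a numbered version directory, config.pbtxt beside it). A minimal sketch, with file contents simulated via touch and the actual server launch left commented out:

```shell
# Sketch of a Triton model repository layout (files simulated with touch)
mkdir -p model_repository/tensorrt_llm_model/1
touch model_repository/tensorrt_llm_model/1/model.plan   # the TensorRT-LLM engine
touch model_repository/tensorrt_llm_model/config.pbtxt   # the config shown above
# tritonserver --model-repository=$(pwd)/model_repository
```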
The backend: "tensorrtllm" line is crucial here. It tells Triton to use the TensorRT-LLM backend.
The input and output sections define the expected tensors. input_ids is the sequence of tokens, and request_output_len specifies how many tokens to generate. output_ids will contain the generated sequence.
parameters let us control generation specifics: end_token_id signals completion, start_token_id marks the start of prompt processing, and bad_words_ids and stop_words_ids provide custom filtering and stopping conditions. Note that config.pbtxt parameter values are always strings, so numeric and list values must be encoded as strings for the backend to parse.
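To make the tensor shapes concrete, here is how a client-side request payload might be constructed with NumPy (the token IDs are made up; because max_batch_size is greater than zero, a leading batch dimension is added to the dims declared in the config):

```python
import numpy as np

# input_ids: dims [-1] plus the batch dimension -> shape [batch, seq_len]
input_ids = np.array([[1, 4523, 9, 883, 17]], dtype=np.int32)

# request_output_len: dims [1] plus the batch dimension -> shape [batch, 1]
request_output_len = np.array([[64]], dtype=np.int32)

print(input_ids.shape, request_output_len.shape)  # (1, 5) (1, 1)
```

Both tensors use TYPE_INT32, matching the data_type entries in the config.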
Now, let’s consider how Triton actually uses TensorRT-LLM. When a request arrives, Triton’s TensorRT-LLM backend:
- Parses the request: It extracts input_ids, request_output_len, and any specified generation parameters.
- Prepares input tensors: It converts the input data into the format expected by the TensorRT engine.
- Invokes the TensorRT-LLM engine: This is where the heavy lifting happens. TensorRT-LLM, leveraging its optimized kernels and TensorRT’s graph optimizations, performs the forward pass and the auto-regressive generation.
- Processes output tensors: The generated output_ids are formatted and returned.
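The four steps above can be sketched as a single function. Everything here is illustrative, not the backend's real API: a dict stands in for Triton's request object, and engine_fn stubs the TensorRT-LLM engine.

```python
import numpy as np

def handle_request(request, engine_fn):
    """Sketch of the backend's per-request flow; engine_fn stubs the engine."""
    # 1. Parse the request (a dict stands in for Triton's request object)
    input_ids = np.asarray(request["input_ids"], dtype=np.int32)
    out_len = int(request["request_output_len"][0])
    # 2. Prepare input tensors: add the batch dimension the engine expects
    batch = input_ids.reshape(1, -1)
    # 3. Invoke the engine (here a stub returning generated token IDs)
    generated = engine_fn(batch, out_len)
    # 4. Format output_ids as an int32 [batch, num_tokens] tensor
    return np.asarray(generated, dtype=np.int32).reshape(1, -1)
```

For example, calling it with a stub engine that emits a fixed token, `handle_request({"input_ids": [5, 6], "request_output_len": [3]}, lambda b, n: [7] * n)`, returns a `[1, 3]` int32 array.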
The real magic of this integration lies in TensorRT-LLM’s ability to fuse operations, optimize memory usage, and utilize specialized hardware capabilities. Triton acts as the orchestrator, managing requests, batching, and exposing a standardized API while offloading the actual LLM inference to the highly tuned TensorRT-LLM engine.
Here’s a detail that often surprises people: request_output_len is the hard cap on the auto-regressive loop, which produces one new token per engine step. If the model emits an end_token_id before reaching request_output_len tokens, generation stops early; if it never does, generation stops once request_output_len new tokens have been produced. Either way, it is the primary control for generation length, and therefore for bounding per-request latency.
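A toy loop makes the two stopping conditions concrete. Here step_fn stands in for a single engine forward step returning the next token ID; this is purely illustrative, not TensorRT-LLM's actual generation code.

```python
def generate(step_fn, prompt_ids, request_output_len, end_token_id):
    # Toy auto-regressive loop: step_fn stands in for one engine forward
    # step and returns the next token ID (purely illustrative).
    output_ids = list(prompt_ids)
    for _ in range(request_output_len):
        token = step_fn(output_ids)
        output_ids.append(token)
        if token == end_token_id:
            break  # stopping condition 1: end_token_id produced early
    return output_ids  # stopping condition 2: request_output_len reached
```

With a step function that eventually emits the end token, the loop stops early; with one that never does, it runs for exactly request_output_len steps.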
The next step is understanding how to configure TensorRT-LLM itself, including building the model.plan with specific quantization and optimization settings.