Triton LLM Serving can leverage TensorRT-LLM as a backend to achieve highly optimized inference for large language models.

Let’s see it in action. Imagine we have a model.plan file generated by TensorRT-LLM, and we want to serve it with Triton.

First, we need to create a config.pbtxt file for Triton.

name: "tensorrt_llm_model"
backend: "tensorrtllm"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  }
]
parameters {
  key: "end_token_id"
  value {
    string_value: "100" # Example end token ID
  }
}
parameters {
  key: "start_token_id"
  value {
    string_value: "1" # Example start token ID
  }
}
parameters {
  key: "bad_words_ids"
  value {
    string_value: "[[10, 11], [12, 13]]" # Example banned token sequences
  }
}
parameters {
  key: "stop_words_ids"
  value {
    string_value: "[[20, 21]]" # Example stop sequence
  }
}
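Triton doesn't load a config.pbtxt in isolation: it scans a model repository with one directory per model, containing the config file and numbered version subdirectories. Here's a minimal layout for this example, sketched with placeholder files — the model.plan filename and all paths are illustrative, and TensorRT-LLM deployments may instead point Triton at an engine directory via a config parameter:

```shell
# Minimal Triton model repository layout (placeholder files stand in
# for the real config and compiled engine).
mkdir -p model_repository/tensorrt_llm_model/1
touch model_repository/tensorrt_llm_model/config.pbtxt  # the config shown above
touch model_repository/tensorrt_llm_model/1/model.plan  # the TensorRT engine

# Then start the server against the repository:
#   tritonserver --model-repository=$(pwd)/model_repository
```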

The backend: "tensorrtllm" line is crucial here. It tells Triton to use the TensorRT-LLM backend.

The input and output sections define the expected tensors. input_ids is the tokenized prompt (dims: [ -1 ] marks a variable-length dimension, with the batch dimension implied by max_batch_size), and request_output_len specifies the maximum number of tokens to generate. output_ids will contain the generated sequence.
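Concretely, a request against these tensors can be expressed with Triton's KServe-v2 HTTP/JSON protocol. Here's a sketch of the request body only — the token IDs are made up rather than from a real tokenizer, and actually sending it assumes a server listening on localhost:8000:

```python
import json

# KServe-v2 inference request for the model defined above.
# Token IDs are illustrative, not produced by a real tokenizer.
payload = {
    "inputs": [
        {
            "name": "input_ids",
            "shape": [1, 4],       # batch of 1, four prompt tokens
            "datatype": "INT32",
            "data": [1, 2042, 318, 257],
        },
        {
            "name": "request_output_len",
            "shape": [1, 1],
            "datatype": "INT32",
            "data": [16],          # generate at most 16 tokens
        },
    ],
    "outputs": [{"name": "output_ids"}],
}

body = json.dumps(payload)
# POST body to http://localhost:8000/v2/models/tensorrt_llm_model/infer
```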

parameters let us control generation specifics: end_token_id tells the backend which token marks completion, start_token_id identifies the sequence-start token used when processing prompts, and bad_words_ids and stop_words_ids ban particular token sequences or add custom stopping conditions.

Now, let’s consider how Triton actually uses TensorRT-LLM. When a request arrives, Triton’s TensorRT-LLM backend:

  1. Parses the request: It extracts input_ids, request_output_len, and any specified generation parameters.
  2. Prepares input tensors: It converts the input data into the format expected by the TensorRT engine.
  3. Invokes the TensorRT-LLM engine: This is where the heavy lifting happens. TensorRT-LLM, leveraging its optimized kernels and TensorRT’s graph optimizations, performs the forward pass and the auto-regressive generation.
  4. Processes output tensors: The generated output_ids are formatted and returned.
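The four steps above can be sketched in plain Python, with a stub standing in for the engine — this is illustrative only; the real backend is implemented in C++ and trtllm_generate is not an actual API:

```python
# Illustrative sketch of the backend's four-step request path.
# `trtllm_generate` is a stand-in for the TensorRT-LLM engine, NOT a real API.
def trtllm_generate(input_ids, max_new_tokens):
    """Stub engine: 'generates' the prompt reversed, up to the token cap."""
    return list(reversed(input_ids))[:max_new_tokens]

def handle_request(raw_request):
    # 1. Parse the request.
    input_ids = raw_request["inputs"]["input_ids"]
    out_len = raw_request["inputs"]["request_output_len"]
    # 2. Prepare input tensors (here: plain lists; in reality, GPU tensors).
    prepared = [int(t) for t in input_ids]
    # 3. Invoke the engine (optimized kernels, auto-regressive generation).
    output_ids = trtllm_generate(prepared, out_len)
    # 4. Process output tensors into the response.
    return {"outputs": {"output_ids": output_ids}}

response = handle_request({"inputs": {"input_ids": [1, 2, 3],
                                      "request_output_len": 2}})
```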

The real magic of this integration lies in TensorRT-LLM’s ability to fuse operations, optimize memory usage, and utilize specialized hardware capabilities. Triton acts as the orchestrator, managing requests, batching, and exposing a standardized API while offloading the actual LLM inference to the highly tuned TensorRT-LLM engine.

Here’s a detail that often trips people up: request_output_len is an upper bound on new tokens, not an exact count. The auto-regressive loop produces one token per engine step, and after each step the backend checks two conditions: if the model emits end_token_id before request_output_len tokens have been generated, generation stops early; if the count reaches request_output_len without an end_token_id, generation stops there. This makes request_output_len the primary control over generation length, and over worst-case latency per request.
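A toy loop makes the two stop conditions concrete — everything here is illustrative, with next_token_fn standing in for a forward pass:

```python
END_TOKEN_ID = 100  # matches the end_token_id parameter in config.pbtxt

def generate(prompt, request_output_len, next_token_fn):
    """Toy auto-regressive loop showing both stop conditions."""
    tokens = list(prompt)
    out = []
    for _ in range(request_output_len):   # cap on new tokens
        tok = next_token_fn(tokens)
        out.append(tok)
        tokens.append(tok)
        if tok == END_TOKEN_ID:           # early stop on end token
            break
    return out

# "Model" that counts upward: emits END_TOKEN_ID (100) on its third step.
early = generate([97], 10, lambda t: t[-1] + 1)   # -> [98, 99, 100]
# "Model" that never emits the end token: capped at request_output_len.
capped = generate([0], 4, lambda t: t[-1] + 2)    # -> [2, 4, 6, 8]
```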

The next step is understanding how to configure TensorRT-LLM itself, including building the model.plan with specific quantization and optimization settings.

Want structured learning?

Take the full Triton course →