TensorRT-LLM is a library that optimizes large language models (LLMs) for inference on NVIDIA GPUs, and its throughput is typically measured in tokens per second.

Let’s see TensorRT-LLM in action with a quick benchmark setup. Imagine we have a model and we want to see how many tokens per second it can generate.

First, you’d typically build a TensorRT-LLM engine. This involves converting your model (e.g., from Hugging Face Transformers) into a TensorRT-optimized format. The exact script and flags vary between TensorRT-LLM releases (newer versions ship a trtllm-build CLI), but a build command might look something like this:

python3 /path/to/TensorRT-LLM/examples/llama/build.py \
    --model_dir /path/to/your/llama/model \
    --dtype float16 \
    --output_dir /path/to/output/engine \
    --max_batch_size 1 \
    --max_input_len 1024 \
    --max_output_len 1024

Here, --model_dir points to your pre-trained LLM weights, --dtype sets the numerical format (float16 is a common choice for performance), and --max_batch_size, --max_input_len, and --max_output_len cap the batch size and the input and output sequence lengths the engine will handle. The engine is compiled for these maxima, so they directly affect its memory footprint.
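To make the relationship between these limits concrete, here is a minimal sketch (a hypothetical helper, not part of TensorRT-LLM): the engine can never process more than the configured number of prompt tokens plus generated tokens per sequence, so the two length settings bound the total sequence length.

```python
# Hypothetical sanity check, not a TensorRT-LLM API: the engine built above
# can handle at most max_input_len prompt tokens plus max_output_len
# generated tokens for any single sequence.

def max_total_seq_len(max_input_len: int, max_output_len: int) -> int:
    """Longest prompt + generation the built engine can serve."""
    if max_input_len <= 0 or max_output_len <= 0:
        raise ValueError("sequence length limits must be positive")
    return max_input_len + max_output_len

# With the build settings above (1024 input, 1024 output):
print(max_total_seq_len(1024, 1024))  # 2048
```

Requests whose prompt plus requested output exceed this total will be rejected or truncated, which is why these build-time limits deserve as much attention as the runtime flags.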

Once the engine is built, you can run a benchmark. The TensorRT-LLM repository includes benchmarking scripts; again, the exact entry point and flags depend on your version, but a simplified, illustrative invocation might look like this:

python3 /path/to/TensorRT-LLM/examples/llama/bin/benchmark.py \
    --engine_dir /path/to/output/engine \
    --batch_sizes 1 4 8 \
    --input_len 512 \
    --output_len 128 \
    --num_runs 100 \
    --warmup_runs 10

This script will load the engine, run inference with specified batch sizes (--batch_sizes), input lengths (--input_len), and output lengths (--output_len), repeating the process (--num_runs) after some initial warm-up steps (--warmup_runs). The output will detail the average tokens per second achieved for each configuration.
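The warm-up/measurement split matters: the first runs pay one-time costs (CUDA context creation, kernel loading, memory pool growth) and would drag the average down. The aggregation logic can be sketched as follows (a hypothetical helper, not TensorRT-LLM’s actual benchmark.py):

```python
# Sketch of how a benchmark typically aggregates results: discard the
# warm-up iterations, then average throughput over the timed runs only.

def summarize_runs(latencies_s, batch_size, output_len, warmup_runs):
    """Average tokens/sec over the timed runs, skipping warm-up iterations."""
    timed = latencies_s[warmup_runs:]
    if not timed:
        raise ValueError("no timed runs left after warm-up")
    tokens_per_run = batch_size * output_len  # total tokens per iteration
    return sum(tokens_per_run / t for t in timed) / len(timed)

# 10 slow warm-up runs (discarded), then 3 timed runs of 0.5 s each,
# batch size 8, 128 output tokens per sequence:
latencies = [0.9] * 10 + [0.5, 0.5, 0.5]
print(summarize_runs(latencies, batch_size=8, output_len=128, warmup_runs=10))
# 2048.0
```

Without the warm-up exclusion, the same data would report a misleadingly low figure, which is why --warmup_runs defaults to a nonzero value in most benchmarking harnesses.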

The core problem TensorRT-LLM addresses is the computational overhead of LLM inference. LLMs involve massive matrix multiplications and complex attention mechanisms. Naively running these on standard frameworks can be slow due to suboptimal GPU utilization, memory access patterns, and lack of hardware-specific optimizations. TensorRT-LLM tackles this by:

  1. Graph Optimization: It analyzes the model’s computation graph and applies optimizations like layer fusion (combining multiple operations into a single kernel), kernel auto-tuning (selecting the best kernel implementation for the specific GPU architecture), and precision calibration (determining the optimal precision for each layer to balance accuracy and performance).
  2. Kernel Optimization: It provides highly optimized kernels for common LLM operations (like matrix multiplication, attention, and token sampling) that are tailored for NVIDIA Tensor Cores and specific GPU architectures.
  3. Quantization: It supports various quantization techniques (e.g., INT8, FP8) to reduce the memory footprint and computational cost of the model, often with minimal impact on accuracy.
  4. Efficient Memory Management: It employs strategies for managing GPU memory efficiently, reducing overhead associated with data movement and allocation.
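The quantization point is easy to quantify with back-of-the-envelope arithmetic (no TensorRT-LLM APIs involved): weight memory scales linearly with bytes per parameter, so halving the precision halves the footprint.

```python
# Rough weight-memory estimate: parameters times bytes per parameter.
# FP16 uses 2 bytes per parameter, INT8 uses 1, FP8 uses 1.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate model weight memory in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

params_7b = 7e9  # a 7-billion-parameter model, e.g. LLaMA-7B
print(weight_memory_gb(params_7b, 2))  # FP16: 14.0 GB
print(weight_memory_gb(params_7b, 1))  # INT8: 7.0 GB
```

That freed memory is not just capacity headroom; smaller weights also mean less data pulled through memory bandwidth per token, which is often the real bottleneck in the generation loop.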

When you’re benchmarking, you’re essentially measuring how many tokens the model can generate per second: the total number of generated tokens across all sequences in a batch divided by the total time taken for that generation. For example, if a batch of 8 sequences each generates 128 tokens, and this takes 0.5 seconds, the throughput is (8 * 128) / 0.5 = 2048 tokens per second.
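As a one-liner (assuming every sequence in the batch generates the same number of tokens):

```python
# Throughput formula from the text: total generated tokens / elapsed time.

def tokens_per_second(batch_size: int, tokens_per_seq: int, seconds: float) -> float:
    return batch_size * tokens_per_seq / seconds

print(tokens_per_second(8, 128, 0.5))  # 2048.0
```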

The most surprising thing about optimizing LLM throughput with TensorRT-LLM is how much performance can hinge on max_output_len. You might expect generation time to depend only on the tokens actually produced, but the engine is built for its configured maxima. With a contiguous KV cache, memory for up to max_output_len tokens is reserved for every sequence in the batch, whether or not those tokens are ever generated. If --max_output_len is set to, say, 1024, but your prompts are short and the model emits only 10 tokens, that over-allocation consumes GPU memory that could otherwise support larger batches, and it can hurt memory locality and kernel occupancy during the token-by-token generation loop. Setting --max_output_len close to what you actually expect (or slightly above) is therefore crucial for peak efficiency. TensorRT-LLM’s paged KV cache option also mitigates the problem by allocating cache blocks on demand rather than reserving the full maximum up front.
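The scale of this over-allocation is easy to estimate with the standard transformer KV-cache formula (plain arithmetic, not a TensorRT-LLM API): per sequence, the cache stores one key and one value vector per layer per token, so reserved memory grows linearly with the maximum sequence length.

```python
# Rough per-sequence KV-cache size: keys and values (factor of 2) for every
# layer, head, and reserved token position. bytes_per_elem=2 assumes FP16.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_seq_len,
                   bytes_per_elem=2):
    return 2 * num_layers * num_kv_heads * head_dim * max_seq_len * bytes_per_elem

# A LLaMA-7B-like configuration (32 layers, 32 KV heads, head_dim 128),
# reserving room for 1024 tokens versus only 16:
big = kv_cache_bytes(32, 32, 128, 1024)
small = kv_cache_bytes(32, 32, 128, 16)
print(big / 1e6, small / 1e6)  # MB per sequence: ~536.9 vs ~8.4
```

At batch size 8, the 1024-token reservation ties up over 4 GB of cache even if each sequence only ever generates a handful of tokens, which is exactly the waste that tighter limits (or paged allocation) avoid.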

The next challenge you’ll likely encounter is tuning for different batch sizes and sequence lengths to find the sweet spot for your specific hardware and workload.
