BERT and other Transformer models are notoriously slow for inference, and TensorRT is the go-to solution for speeding them up. But it’s not just about dropping a model into TensorRT and expecting magic. The real trick is understanding how TensorRT optimizes these specific architectures and what levers you can pull to get the most out of them.

Let’s see this in action. Imagine we have a pre-trained BERT model (bert-base-uncased) and we want to use its encoder to produce contextual embeddings, the backbone for downstream tasks like sequence classification.

import torch
from transformers import BertModel, BertTokenizer
from loguru import logger

# Load pre-trained model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
model.eval()

# Example input
text = "TensorRT is amazing for NLP inference!"
encoded_input = tokenizer(text, return_tensors='pt')

# Run inference with PyTorch
with torch.no_grad():
    pytorch_output = model(**encoded_input)

logger.info(f"PyTorch output shape: {pytorch_output.last_hidden_state.shape}")

This gives us the raw output, but it’s slow. To speed it up, we’d typically convert the PyTorch model into a TensorRT engine. The surprising part is that TensorRT doesn’t just optimize the exported graph node by node; it often rewrites parts of the Transformer architecture itself to leverage specialized kernels.
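Before optimizing anything, it helps to record a baseline latency to compare against. Here is a minimal timing helper, a hypothetical sketch in pure Python (the `benchmark` name and its defaults are ours, not part of any library):

```python
import time

def benchmark(fn, warmup=5, iters=50):
    """Time a zero-argument callable; returns mean latency in milliseconds."""
    for _ in range(warmup):      # warm up caches and lazy initialization
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0

# Usage with the model above (assumes `model` and `encoded_input` exist):
#   with torch.no_grad():
#       latency_ms = benchmark(lambda: model(**encoded_input))
#   logger.info(f"PyTorch latency: {latency_ms:.2f} ms")
```

The same helper can later wrap the TensorRT inference call, so both numbers are measured the same way.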

The core idea behind TensorRT optimization for Transformers is to fuse operations and use specialized kernels. For example, the multi-head self-attention mechanism, a bottleneck in Transformers, is a prime candidate for fusion. TensorRT can fuse the multiple matrix multiplications, softmax, and other operations within self-attention into a single, highly optimized kernel. This dramatically reduces kernel launch overhead and memory access.
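To see why fusion matters, here is a plain NumPy sketch of the operations inside scaled dot-product attention. Each step below would be a separate kernel launch (with intermediate tensors written to and read from memory) in a naive implementation; TensorRT's fused attention kernel performs them in one pass. The names and shapes are illustrative, not TensorRT internals:

```python
import numpy as np

def naive_attention(q, k, v):
    """Scaled dot-product attention written as separate steps -- each one
    a candidate kernel launch that an optimized kernel can fuse."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # MatMul + scale
    scores -= scores.max(axis=-1, keepdims=True)       # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # Softmax
    return weights @ v                                 # second MatMul

# Toy shapes: batch=2, seq_len=4, head_dim=8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 4, 8)) for _ in range(3))
out = naive_attention(q, k, v)
print(out.shape)  # (2, 4, 8)
```

Fusing these five steps into one kernel removes four rounds of intermediate memory traffic, which is exactly the saving the paragraph above describes.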

Here’s a simplified look at how you might convert and run a Hugging Face Transformer model with TensorRT. This involves several steps:

  1. Export to ONNX: PyTorch models are usually exported to ONNX first.
  2. Build TensorRT Engine: Use trtexec or the TensorRT Python API to build an engine from the ONNX file.
  3. Inference with TensorRT: Load the engine and run inference.
# --- Step 1: Export to ONNX (conceptual; the export itself only needs torch.onnx,
# while onnx/onnxruntime are optional, for validating the exported graph) ---
#
# # ... (load model and tokenizer as above) ...
#
# dummy_input = {
#     "input_ids": torch.randint(0, 30522, (1, 128)),
#     "attention_mask": torch.ones((1, 128), dtype=torch.long)
# }
#
# torch.onnx.export(
#     model,
#     (dummy_input["input_ids"], dummy_input["attention_mask"]), # Pass inputs as a tuple
#     "bert_model.onnx",
#     input_names=['input_ids', 'attention_mask'],
#     output_names=['last_hidden_state'], # Assuming you want the last hidden state
#     dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence_length'},
#                   'attention_mask': {0: 'batch_size', 1: 'sequence_length'},
#                   'last_hidden_state': {0: 'batch_size', 1: 'sequence_length'}}
# )
# logger.info("Model exported to bert_model.onnx")

# --- Step 2: Build TensorRT Engine using trtexec (command line) ---
# Assuming you have the ONNX file and TensorRT installed:
# trtexec --onnx=bert_model.onnx \
#         --saveEngine=bert_engine.trt \
#         --fp16 \
#         --minShapes=input_ids:1x16,attention_mask:1x16 \
#         --optShapes=input_ids:1x128,attention_mask:1x128 \
#         --maxShapes=input_ids:1x512,attention_mask:1x512 \
#         --workspace=4096
# Note: ONNX models use explicit batch dimensions, so the legacy --maxBatch flag
# does not apply here; recent TensorRT releases replace --workspace with
# --memPoolSize=workspace:4096.

# --- Step 3: Inference with TensorRT (conceptual, requires tensorrt) ---
# import tensorrt as trt
#
# # ... (TensorRT engine loading and inference logic) ...
# logger.info("TensorRT engine built and ready for inference.")

When building the TensorRT engine, several parameters are crucial. trtexec (or the TensorRT API) allows you to specify minShapes, optShapes, and maxShapes for dynamic batching and sequence lengths. This tells TensorRT the range of input dimensions it needs to optimize for. Without this, TensorRT might assume fixed shapes, limiting its applicability. The --fp16 flag is also key, enabling half-precision inference which significantly speeds up computation and reduces memory bandwidth requirements for compatible hardware.
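The contract these three shape sets establish is simple: at runtime, every input dimension must fall between minShapes and maxShapes, and TensorRT tunes its kernels for optShapes. A small, hypothetical helper (not part of TensorRT) makes the rule concrete:

```python
# Hypothetical helper: checks whether a runtime input shape falls inside
# the dynamic-shape range declared when the engine was built.
def shape_in_range(shape, min_shape, max_shape):
    return all(lo <= dim <= hi
               for dim, lo, hi in zip(shape, min_shape, max_shape))

# Profile from the trtexec command above: min 1x16, opt 1x128, max 1x512
MIN, MAX = (1, 16), (1, 512)

print(shape_in_range((1, 128), MIN, MAX))  # True  -- the optimal shape
print(shape_in_range((1, 600), MIN, MAX))  # False -- exceeds maxShapes
```

An engine built with this profile would reject a 600-token sequence outright, which is why choosing the max shape to cover your longest real inputs matters.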

The magic happens in the TensorRT optimizer. It analyzes the ONNX graph and identifies opportunities for optimization. For BERT, this includes:

  • Kernel Fusion: Combining multiple operations (e.g., MatMul + BiasAdd + GELU in BERT’s feed-forward network) into a single, highly efficient CUDA kernel.
  • Layer Normalization and Softmax Optimization: TensorRT has specialized kernels for these common Transformer operations that are faster than generic implementations.
  • Quantization: Converting weights and activations from FP32 to FP16 or INT8 can further boost performance, though it might require calibration for INT8.
  • Memory Optimizations: Efficiently reusing memory buffers to minimize data movement.
  • Kernel Auto-Tuning: TensorRT can select the best implementation for a given operation on your specific hardware.
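As a rough illustration of the quantization bullet, symmetric per-tensor INT8 quantization maps FP32 values to 8-bit integers via a scale factor; calibration is essentially TensorRT's way of picking that scale from representative data. This is a simplified NumPy sketch, not TensorRT's calibrator:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: scale from the max magnitude
    (a stand-in for what INT8 calibration determines from real activations)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()
print(f"max reconstruction error: {err:.5f} (one step = {scale:.5f})")
```

The reconstruction error stays within one quantization step; the practical question calibration answers is whether that error is tolerable for your model's accuracy.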

One subtle but powerful aspect is how TensorRT handles the attention mask. Instead of treating it as a separate operation that modifies intermediate tensors, TensorRT’s optimized attention kernels often integrate the mask logic directly into the matrix multiplication and softmax computations. This avoids extra memory reads and writes, making the masking operation almost "free" from a performance perspective.
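The folding trick can be sketched in NumPy: express the mask as an additive bias of -inf on padded positions, so the softmax itself zeroes them out and no separate masking pass over memory is needed. This is an illustrative sketch of the idea, not TensorRT's kernel:

```python
import numpy as np

def masked_softmax(scores, mask):
    """Fold the attention mask into softmax as an additive bias:
    masked positions receive -inf before exponentiation, so they come out
    as exactly zero weight without a separate masking operation."""
    bias = np.where(mask.astype(bool), 0.0, -np.inf)
    scores = scores + bias
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(2).standard_normal((1, 4, 4))
mask = np.array([1, 1, 1, 0])            # last token is padding
w = masked_softmax(scores, mask)
print(np.allclose(w[..., -1], 0.0))      # True: padded column gets zero weight
```

Because the bias is applied inside the same computation as the softmax, the mask never materializes as an extra intermediate tensor.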

The next hurdle is often optimizing for different input sequence lengths efficiently, which brings you into the realm of dynamic tensor shapes and efficient batching strategies.

Want structured learning?

Take the full TensorRT course →