Triton Inference Server can serve Hugging Face Transformers models, but it's more than a simple wrapper: it changes how you think about model deployment by abstracting away the underlying framework and hardware.
Let’s see it in action. Imagine you have a Hugging Face bert-base-uncased model. Normally, you’d load it with AutoModel.from_pretrained("bert-base-uncased") and then do a forward pass. With Triton, you’re going to package this model into a format Triton understands, using a config.pbtxt file and a model repository.
Here’s a snippet of what your Triton model repository might look like:
my_transformer_model/
├── config.pbtxt
└── 1/
    └── model.pt
And the config.pbtxt:
name: "my_transformer_model"
platform: "pytorch_libtorch"
max_batch_size: 8
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "attention_mask"
data_type: TYPE_INT32
dims: [ -1 ]
}
]
output [
{
name: "last_hidden_state"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
}
]
The platform: "pytorch_libtorch" tells Triton to use its PyTorch (LibTorch) backend. The input and output sections define tensor names, datatypes, and shapes; because max_batch_size is set, the dims exclude the leading batch dimension. The -1 entries mark variable dimensions, crucial for handling sequences of different lengths. One caveat: depending on your Triton version, the PyTorch backend may require I/O names to follow the <name>__<index> convention (e.g., INPUT__0, OUTPUT__0) so they map onto the traced model's positional inputs and outputs; newer releases can also match inputs by the forward() argument names used here. When you send a request to Triton, you specify these input tensors by name, matching what's in the config.
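To make the -1 wildcard concrete, here's a small illustrative helper (not part of Triton's API) that mimics how a request shape is checked against the configured dims, with the batch dimension already stripped:

```python
def shape_matches(config_dims, tensor_shape):
    """Return True if tensor_shape satisfies config_dims, where -1 matches any size.

    With max_batch_size set in config.pbtxt, the configured dims exclude the
    leading batch dimension, so compare against per-request shapes.
    """
    if len(config_dims) != len(tensor_shape):
        return False
    return all(d == -1 or d == s for d, s in zip(config_dims, tensor_shape))

# dims: [ -1 ] accepts any sequence length for input_ids / attention_mask
print(shape_matches([-1], (128,)))         # True
print(shape_matches([-1], (64,)))          # True
# dims: [ -1, -1 ] accepts any (seq_len, hidden) output shape
print(shape_matches([-1, -1], (128, 768))) # True
# a rank mismatch is rejected
print(shape_matches([-1], (2, 128)))       # False
```

This is why a single config can serve padded batches of length 64 in one request and 128 in the next.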
The real magic happens with the model.pt file. This isn't just your raw PyTorch model. For Triton, you need to export your Hugging Face model to TorchScript using torch.jit.trace or torch.jit.script. This creates a serializable, optimized version of your model that Triton can load efficiently. For Hugging Face models, a common approach is to load the model with torchscript=True (so it returns tuples instead of dicts, which tracing requires) and then trace it:
import torch
from transformers import AutoModel

model_name = "bert-base-uncased"
# torchscript=True makes the model return tuples instead of dicts,
# which torch.jit.trace requires
model = AutoModel.from_pretrained(model_name, torchscript=True)
model.eval()

# Dummy inputs to trace the model: batch size 1, sequence length 128
dummy_input_ids = torch.randint(0, 1000, (1, 128), dtype=torch.long)
dummy_attention_mask = torch.ones((1, 128), dtype=torch.long)

# Trace and save the model
traced_model = torch.jit.trace(model, (dummy_input_ids, dummy_attention_mask))
traced_model.save("model.pt")
Once model.pt is in the version subdirectory (1/) alongside config.pbtxt, you can start the Triton server. Then, using a client library (such as NVIDIA's tritonclient), you can send requests.
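One common way to launch the server is via NVIDIA's container image, mounting the model repository; this is a sketch, and the image tag and local path (model_repository/) are illustrative, so substitute your own:

```shell
# Launch Triton, mounting the directory that contains my_transformer_model/
docker run --rm --gpus=1 \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models
```

Ports 8000, 8001, and 8002 expose HTTP, gRPC, and metrics endpoints respectively.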
A typical client request might look like this:
import numpy as np
from tritonclient.http import InferenceServerClient, InferInput

triton_client = InferenceServerClient(url="localhost:8000")

input_ids = np.random.randint(0, 30522, size=(2, 128)).astype(np.int32)  # Batch size 2
attention_mask = np.ones((2, 128), dtype=np.int32)

inputs = [
    InferInput("input_ids", list(input_ids.shape), "INT32"),
    InferInput("attention_mask", list(attention_mask.shape), "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

results = triton_client.infer(model_name="my_transformer_model", inputs=inputs)
output_data = results.as_numpy("last_hidden_state")
This client code sends a batch of two sequences, each of length 128, to your deployed BERT model. Triton handles the batching, model loading, and execution on the GPU(s) specified in the config. The output output_data is a NumPy array of shape (2, 128, 768): the last hidden states for your batch.
The true power here is Triton’s ability to handle dynamic batching, model versioning, and concurrent model execution. You can have multiple versions of your model deployed, and Triton can automatically group incoming requests into batches for higher throughput, all while abstracting away the PyTorch specifics. The platform: "pytorch_libtorch" is just one of many backends Triton supports (TensorFlow, ONNX Runtime, TensorRT, etc.), allowing you to deploy models from different frameworks seamlessly.
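Versioning is driven by the numbered subdirectories in the model repository: each version lives in its own directory, and Triton's version policy (serving the latest version by default) decides which ones to load. A repository holding two versions might look like:

```
my_transformer_model/
├── config.pbtxt
├── 1/
│   └── model.pt
└── 2/
    └── model.pt
```

This lets you roll out a retrained model as version 2 while version 1 remains on disk for rollback.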
What most people miss is how max_batch_size in config.pbtxt interacts with dynamic batching. If you set max_batch_size: 8 but don't enable dynamic batching, Triton will only form batches up to size 8 when requests happen to arrive together. However, once you add a dynamic_batching block and set its max_queue_delay_microseconds, Triton will hold incoming requests for up to that delay to fill the batch toward max_batch_size, significantly improving utilization. This delay value is critical for balancing latency against throughput.
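A minimal dynamic batching stanza added to config.pbtxt might look like this (the delay and preferred sizes are illustrative; tune them against your latency budget):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

With this in place, a request arriving alone waits at most 100 microseconds for companions before executing, rather than always running as a batch of one.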
The next challenge you’ll likely face is optimizing these Transformer models for even faster inference, which often leads to exploring TensorRT.