Triton Inference Server can serve NLP models, but the data preprocessing pipeline is often the most overlooked part of getting it to work.
Let’s say you have a text classification model trained with Hugging Face’s transformers library, specifically using a Tokenizer and a Model. You want to serve this model using Triton.
Here’s a typical workflow when you’re first trying to get this set up:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load a pre-trained model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example text
text = "This is a great movie, I really enjoyed it!"
# Tokenize the text
inputs = tokenizer(text, return_tensors="pt")
# Get model outputs
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class_id = logits.argmax().item()
print(f"Predicted class ID: {predicted_class_id}")
print(f"Predicted label: {model.config.id2label[predicted_class_id]}")
This works perfectly fine when running locally. But when you move to Triton, you realize the model itself expects numeric tensors as input, not raw text strings. The tokenization step, which is crucial for NLP models, needs to happen somewhere before the model receives its input.
There are a few ways to tackle this, and the most robust approach involves using Triton’s Ensemble Model feature combined with a Python Backend.
The Problem: Where Does Tokenization Happen?
Your trained model (e.g., distilbert-base-uncased-finetuned-sst-2-english) expects numerical inputs: input_ids, attention_mask, and potentially token_type_ids. These are generated by the tokenizer. If you just try to send the raw text string to a Triton model directly, it will fail because the model doesn’t know how to process strings.
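To make this concrete, here is a toy sketch of what the tokenizer has to produce before the model can run. The vocabulary and IDs below are made up for illustration (the real model uses DistilBERT's WordPiece tokenizer), but the shapes and dtypes mirror what the model expects:

```python
import numpy as np

# Toy vocabulary standing in for the real WordPiece vocabulary (IDs are made up,
# except the conventional [CLS]=101 / [SEP]=102 special tokens)
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "great": 7, "movie": 8, "bad": 9}

def toy_tokenize(texts):
    """Map whitespace-split words to IDs and pad to the longest sequence in the batch."""
    seqs = [[vocab["[CLS]"]] + [vocab.get(w, 0) for w in t.split()] + [vocab["[SEP]"]]
            for t in texts]
    max_len = max(len(s) for s in seqs)
    input_ids = np.array([s + [vocab["[PAD]"]] * (max_len - len(s)) for s in seqs],
                         dtype=np.int64)
    # 1 for real tokens, 0 for padding
    attention_mask = (input_ids != vocab["[PAD]"]).astype(np.int64)
    return input_ids, attention_mask

ids, mask = toy_tokenize(["great movie", "bad"])
print(ids.shape)  # (2, 4): batch of 2, padded to the longest sequence
print(ids)        # [[101 7 8 102], [101 9 102 0]]
print(mask)       # [[1 1 1 1], [1 1 1 0]]
```

This is exactly the shape of work the serving pipeline has to do somewhere between the client's string and the model's first layer.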
The Solution: An Ensemble Model with a Python Backend
Triton’s Python backend allows you to execute arbitrary Python code as part of the inference pipeline. We can leverage this to perform the tokenization. We’ll create two components:
- A Python backend model: this takes the raw text string as input, uses the transformers tokenizer to convert it into the necessary tensors (input_ids, attention_mask), and then passes these tensors to the actual NLP model.
- A saved NLP model: this is your original transformers model, saved in a format Triton can understand (e.g., ONNX, TensorFlow SavedModel, PyTorch TorchScript).
Then, we’ll define an Ensemble Model that orchestrates these two. The ensemble will expose an interface that accepts raw text and internally calls the Python backend, which in turn calls the saved NLP model.
Step 1: Save Your NLP Model
First, you need to convert your PyTorch model into a format Triton can load. ONNX is a popular choice.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create a dummy input matching the tokenizer's expected format
# Max sequence length is often a good default, or use the model's config.max_position_embeddings
dummy_input = tokenizer("This is a test.", return_tensors="pt")
# Export to ONNX
torch.onnx.export(
    model,
    (dummy_input['input_ids'], dummy_input['attention_mask']),  # tuple of positional inputs
    "model.onnx",
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={
        'input_ids': {0: 'batch_size', 1: 'sequence_length'},
        'attention_mask': {0: 'batch_size', 1: 'sequence_length'},
        'logits': {0: 'batch_size'}  # keep the output's batch dimension dynamic too
    },
    opset_version=11  # or a version supported by your Triton/ONNX Runtime build
)
Save this model.onnx file in a directory that Triton will load.
Step 2: Create the Python Backend Model
Create a Python script that performs the tokenization. Triton's Python backend requires the script to be named model.py.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer

class TritonPythonModel:
    def initialize(self, args):
        # Load the tokenizer once, when Triton loads the model
        self.tokenizer = AutoTokenizer.from_pretrained(
            "distilbert-base-uncased-finetuned-sst-2-english"
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # Get the raw text input; the tensor name must match config.pbtxt
            text_input = pb_utils.get_input_tensor_by_name(request, "TEXT")
            input_texts = [
                t.decode("utf-8") if isinstance(t, bytes) else str(t)
                for t in text_input.as_numpy().flatten()
            ]
            # Tokenize the batch of texts
            tokenized = self.tokenizer(
                input_texts,
                padding=True,        # pad to the longest sequence in the batch
                truncation=True,     # truncate to the model's max length
                return_tensors="np"  # return NumPy arrays
            )
            # Wrap the arrays in Triton tensors; the names must match the
            # output names declared in this model's config.pbtxt
            input_ids = pb_utils.Tensor(
                "input_ids", tokenized["input_ids"].astype(np.int64))
            attention_mask = pb_utils.Tensor(
                "attention_mask", tokenized["attention_mask"].astype(np.int64))
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[input_ids, attention_mask])
            )
        return responses
Save this script as model.py inside the python_tokenizer model's version directory (the Python backend looks for that exact filename).
Step 3: Configure Triton
You’ll need a config.pbtxt for your ONNX model, a config.pbtxt for your Python backend, and an ensemble_config.pbtxt for the ensemble.
1. onnx_nlp_model/config.pbtxt:
name: "onnx_nlp_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]  # dynamic sequence length; the batch dimension is implicit
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1 ]  # number of classes (2 for this SST-2 model)
  }
]
2. python_tokenizer/config.pbtxt:
name: "python_tokenizer"
backend: "python"
max_batch_size: 8
input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ -1 ]  # per-request dimension; the batch dimension is implicit when max_batch_size > 0
  }
]
output [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]  # dynamic sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
3. ensemble_nlp/config.pbtxt:
name: "ensemble_nlp"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "python_tokenizer"
      model_version: -1  # use the latest version
      input_map {
        key: "TEXT"    # python_tokenizer input name
        value: "TEXT"  # ensemble tensor it reads from
      }
      output_map {
        key: "input_ids"    # python_tokenizer output name
        value: "input_ids"  # intermediate ensemble tensor name
      }
      output_map {
        key: "attention_mask"
        value: "attention_mask"
      }
    },
    {
      model_name: "onnx_nlp_model"
      model_version: -1
      input_map {
        key: "input_ids"    # onnx_nlp_model input name
        value: "input_ids"  # intermediate tensor produced by the tokenizer step
      }
      input_map {
        key: "attention_mask"
        value: "attention_mask"
      }
      output_map {
        key: "logits"    # onnx_nlp_model output name
        value: "logits"  # ensemble output returned to the client
      }
    }
  ]
}
Place these config.pbtxt files, along with model.onnx and model.py, into the correct directories within your Triton model repository structure.
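For reference, a typical model repository layout for this setup looks like the following (directory names match the model names in the configs; numbered subdirectories are model versions):

```
model_repository/
├── ensemble_nlp/
│   ├── config.pbtxt
│   └── 1/                # version directory; stays empty for ensembles
├── python_tokenizer/
│   ├── config.pbtxt
│   └── 1/
│       └── model.py      # the Python backend script
└── onnx_nlp_model/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```

Start the server with tritonserver --model-repository=/path/to/model_repository and all three models should load.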
How it Works
When you send a request to the ensemble_nlp model with a raw text string:
- The ensemble_nlp config receives the TEXT input.
- It forwards this TEXT input to the python_tokenizer model (the Python backend).
- The python_tokenizer's execute method runs. It decodes the strings, tokenizes them using transformers, and returns input_ids and attention_mask as NumPy arrays wrapped in Triton tensors.
- The ensemble then takes these intermediate input_ids and attention_mask tensors and sends them to the onnx_nlp_model.
- The onnx_nlp_model (your ONNX-exported Hugging Face model) processes these tensors and outputs logits.
- Finally, the ensemble maps the logits output from the ONNX model to its own output named logits and returns it to the client.
This setup cleanly separates the preprocessing (tokenization) from the model inference, allowing you to serve complex NLP models with a user-friendly text input.
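Once the server is running, a client can send raw text directly to the ensemble. A minimal sketch using the tritonclient HTTP client, assuming Triton is listening on localhost:8000 (BYTES is Triton's wire type for TYPE_STRING tensors):

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Shape [1, 1]: a batch of one request carrying one string
text = np.array([["This is a great movie, I really enjoyed it!"]], dtype=object)
inp = httpclient.InferInput("TEXT", text.shape, "BYTES")
inp.set_data_from_numpy(text)

# Call the ensemble by name; tokenization happens server-side
result = client.infer("ensemble_nlp", inputs=[inp])
logits = result.as_numpy("logits")
print(logits)  # raw class scores; argmax gives the predicted label ID
```

Note that the client never touches the tokenizer; the whole point of the ensemble is that text-in, logits-out is the public interface.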
The next challenge is often handling different sequence lengths and batching strategies efficiently.