Triton Inference Server can serve NLP models, but the data preprocessing pipeline is often the most overlooked part of getting it to work.

Let’s say you have a text classification model trained with Hugging Face’s transformers library, specifically using a Tokenizer and a Model. You want to serve this model using Triton.

Here’s a typical workflow when you’re first trying to get this set up:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load a pre-trained model and tokenizer
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "This is a great movie, I really enjoyed it!"

# Tokenize the text
inputs = tokenizer(text, return_tensors="pt")

# Get model outputs
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predicted_class_id = logits.argmax().item()
print(f"Predicted class ID: {predicted_class_id}")
print(f"Predicted label: {model.config.id2label[predicted_class_id]}")

This works perfectly fine when running locally. But when you move to Triton, you realize the model itself expects numerical tensors as input, not raw text strings. The tokenization step, which is crucial for NLP models, needs to happen somewhere before the model receives its input.

There are a few ways to tackle this, and the most robust approach involves using Triton’s Ensemble Model feature combined with a Python Backend.

The Problem: Where Does Tokenization Happen?

Your trained model (e.g., distilbert-base-uncased-finetuned-sst-2-english) expects numerical inputs: input_ids, attention_mask, and potentially token_type_ids. These are generated by the tokenizer. If you just try to send the raw text string to a Triton model directly, it will fail because the model doesn’t know how to process strings.
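You can see the gap concretely by inspecting the tokenizer output for a single sentence; the integer arrays below, not the string, are what must reach the model inside Triton:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
encoded = tokenizer("This is a great movie!", return_tensors="np")

# The model consumes these integer arrays, not the original string
print(encoded["input_ids"].shape)       # (1, sequence_length)
print(encoded["attention_mask"].shape)  # (1, sequence_length)
print(encoded["input_ids"].dtype)       # integer token IDs, typically int64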

The Solution: An Ensemble Model with a Python Backend

Triton’s Python backend allows you to execute arbitrary Python code as part of the inference pipeline. We can leverage this to perform the tokenization. We’ll create two components:

  1. A Python Backend Model: This model will take the raw text string as input, use the transformers tokenizer to convert it into the necessary tensors (input_ids, attention_mask), and return those tensors so the ensemble can forward them to the actual NLP model.
  2. A Saved NLP Model: This will be your original transformers model, saved in a format Triton can understand (e.g., ONNX, TensorFlow SavedModel, PyTorch TorchScript).

Then, we’ll define an Ensemble Model that orchestrates these two. The ensemble exposes an interface that accepts raw text, routes it through the Python backend for tokenization, and feeds the resulting tensors to the saved NLP model.

Step 1: Save Your NLP Model

First, you need to convert your PyTorch model into a format Triton can load. ONNX is a popular choice.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a dummy input matching the tokenizer's output format
# Any short sentence works; dynamic_axes below keeps batch size and sequence length flexible
dummy_input = tokenizer("This is a test.", return_tensors="pt")

# Export to ONNX
torch.onnx.export(
    model,
    (dummy_input['input_ids'], dummy_input['attention_mask']), # Tuple of inputs
    "model.onnx",
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={
        'input_ids': {0: 'batch_size', 1: 'sequence_length'},
        'attention_mask': {0: 'batch_size', 1: 'sequence_length'},
        'logits': {0: 'batch_size'} # the output batch dimension must be dynamic too
    },
    opset_version=11 # Or a version supported by your Triton/ONNX runtime
)

Save this model.onnx file in a directory that Triton will load.
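Before wiring anything into Triton, it’s worth a quick sanity check of the exported model with onnxruntime (assuming onnxruntime is installed locally):

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
session = ort.InferenceSession("model.onnx")

# Feed the same named inputs the Triton ONNX backend will receive
encoded = tokenizer("This is a great movie, I really enjoyed it!", return_tensors="np")
logits = session.run(
    ["logits"],
    {
        "input_ids": encoded["input_ids"].astype(np.int64),
        "attention_mask": encoded["attention_mask"].astype(np.int64),
    },
)[0]
print(logits.shape)      # (1, 2) for the SST-2 head
print(logits.argmax(-1)) # predicted class ID, should match the PyTorch model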

Step 2: Create the Python Backend Model

Create a Python script that will perform the tokenization. The Python backend requires the script to be named model.py and to define a TritonPythonModel class.

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load the tokenizer once, when Triton creates the model instance.
        # `args` contains the model configuration as JSON if you need it.
        self.tokenizer = AutoTokenizer.from_pretrained(
            "distilbert-base-uncased-finetuned-sst-2-english"
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # Get the raw text input from the request.
            # The input tensor is named 'TEXT' and holds UTF-8 encoded bytes.
            text_input = pb_utils.get_input_tensor_by_name(request, "TEXT")
            input_texts = [
                t.decode("utf-8") if isinstance(t, bytes) else str(t)
                for t in text_input.as_numpy().flatten()
            ]

            # Tokenize the input texts
            tokenized_inputs = self.tokenizer(
                input_texts,
                padding=True,        # Pad to the longest sequence in the batch
                truncation=True,     # Truncate to the model's max length
                return_tensors="np"  # Return NumPy arrays
            )

            # Build the output tensors the ONNX model expects;
            # the names must match this model's config.pbtxt outputs
            input_ids_tensor = pb_utils.Tensor(
                "input_ids", tokenized_inputs["input_ids"].astype(np.int64)
            )
            attention_mask_tensor = pb_utils.Tensor(
                "attention_mask", tokenized_inputs["attention_mask"].astype(np.int64)
            )

            # One InferenceResponse per request; the ensemble scheduler routes
            # these tensors to the next step in the pipeline
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[input_ids_tensor, attention_mask_tensor]
                )
            )
        return responses

Save this script as model.py inside the model’s version directory (e.g., python_tokenizer/1/model.py); the Python backend looks for that filename.
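Because the tokenization logic is plain Python, you can exercise it outside Triton before deploying; this is essentially the body of execute() run as a script, so dtype or shape surprises show up early:

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Simulate a batch of decoded text requests, as they would arrive in execute()
batch = ["This is a great movie, I really enjoyed it!", "Terrible. Would not watch again."]
tokenized = tokenizer(batch, padding=True, truncation=True, return_tensors="np")

# These should line up with the dtypes and shapes declared in config.pbtxt
print(tokenized["input_ids"].dtype)  # typically int64; the backend casts explicitly anyway
print(tokenized["input_ids"].shape)  # (2, padded_sequence_length)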

Step 3: Configure Triton

You’ll need a config.pbtxt for your ONNX model, a config.pbtxt for your Python backend, and an ensemble_config.pbtxt for the ensemble.

1. onnx_nlp_model/config.pbtxt:

name: "onnx_nlp_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ] # -1 indicates dynamic dimension for sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 2 ] # number of classes (2 for SST-2)
  }
]

2. python_tokenizer/config.pbtxt:

name: "python_tokenizer"
platform: "python"
max_batch_size: 8
input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ] # one string per batch item; the batch dimension is implicit
  }
]
output [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

3. ensemble_nlp/config.pbtxt:

name: "ensemble_nlp"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "python_tokenizer"
      model_version: -1 # Use the latest version
      input_map {
        key: "TEXT"
        value: "TEXT" # Map ensemble input 'TEXT' to python_tokenizer input 'TEXT'
      }
      output_map {
        key: "input_ids"
        value: "input_ids" # Map python_tokenizer output 'input_ids' to an intermediate name
      }
      output_map {
        key: "attention_mask"
        value: "attention_mask" # Map python_tokenizer output 'attention_mask' to an intermediate name
      }
    },
    {
      model_name: "onnx_nlp_model"
      model_version: -1
      input_map {
        key: "input_ids"
        value: "input_ids" # Map intermediate name 'input_ids' to onnx_nlp_model input 'input_ids'
      }
      input_map {
        key: "attention_mask"
        value: "attention_mask" # Map intermediate name 'attention_mask' to onnx_nlp_model input 'attention_mask'
      }
      output_map {
        key: "logits"
        value: "logits" # Map onnx_nlp_model output 'logits' to ensemble output 'logits'
      }
    }
  ]
}

Place these config.pbtxt files, along with model.onnx and model.py, into the correct directories within your Triton model repository.
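One plausible layout, using the model names from the configs above, looks like this:

model_repository/
  ensemble_nlp/config.pbtxt
  ensemble_nlp/1/               (empty version directory; the ensemble has no artifacts)
  python_tokenizer/config.pbtxt
  python_tokenizer/1/model.py
  onnx_nlp_model/config.pbtxt
  onnx_nlp_model/1/model.onnx

Start Triton pointing at this directory, e.g. tritonserver --model-repository=/path/to/model_repository.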

How it Works

When you send a request to the ensemble_nlp model with a raw text string:

  1. The ensemble_nlp config receives the TEXT input.
  2. It forwards this TEXT input to the python_tokenizer model (the Python backend).
  3. The python_tokenizer’s execute method runs. It decodes the string, tokenizes it using transformers, and returns input_ids and attention_mask as NumPy arrays wrapped in Triton tensors.
  4. The ensemble then takes these intermediate input_ids and attention_mask tensors and sends them to the onnx_nlp_model.
  5. The onnx_nlp_model (your ONNX-exported Hugging Face model) processes these tensors and outputs logits.
  6. Finally, the ensemble maps the logits output from the ONNX model to its own output named logits and returns it to the client.

This setup cleanly separates the preprocessing (tokenization) from the model inference, allowing you to serve complex NLP models with a user-friendly text input.
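From the client’s point of view, only the ensemble exists. Here is a minimal request sketch using the tritonclient HTTP API (assuming Triton is running locally on port 8000 and TEXT is declared with dims [ 1 ] as above):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One string per batch item, shaped [batch_size, 1] to match the config
text = np.array([["This is a great movie, I really enjoyed it!"]], dtype=object)
text_input = httpclient.InferInput("TEXT", list(text.shape), "BYTES")
text_input.set_data_from_numpy(text)

result = client.infer(model_name="ensemble_nlp", inputs=[text_input])
logits = result.as_numpy("logits")
print(logits)            # raw scores from the ONNX model
print(logits.argmax(-1)) # predicted class ID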

The next challenge is often handling different sequence lengths and batching strategies efficiently.

Want structured learning?

Take the full Triton course →