Triton can serve FP16 and INT8 quantized models, but its default behavior may surprise you in how it handles precision.
Let’s see Triton serving a quantized model. Imagine we have a simple ONNX model that’s been quantized to INT8.
import torch
import torch.nn as nn

# Create a dummy model
class DummyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(16 * 32 * 32, 10)  # assuming input image size 3x32x32

    def forward(self, x):
        x = self.conv(x)
        x = self.relu(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

model = DummyModel()
dummy_input = torch.randn(1, 3, 32, 32)
# Name the tensors explicitly so they match the names used in config.pbtxt.
torch.onnx.export(
    model,
    dummy_input,
    "dummy_model.onnx",
    opset_version=11,
    input_names=["input"],
    output_names=["output"],
)

# Now, let's simulate quantization (this is a placeholder; real quantization is more complex).
# For demonstration, we'll just export a float model and tell Triton it's INT8.
# In a real scenario, you'd use tools like ONNX Runtime's quantization API or TensorRT.
print("Model exported to dummy_model.onnx (as FP32 for now).")
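To see what a real quantization tool actually does to the weights, here is a minimal numpy sketch of symmetric per-tensor INT8 quantization, the scheme underlying tools like ONNX Runtime's quantization API. The function names are illustrative, not part of any library:

```python
import numpy as np

def quantize_symmetric_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = float(np.abs(weights).max()) / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(16, 27).astype(np.float32)  # e.g. flattened 3x3x3 conv filters
q, scale = quantize_symmetric_int8(w)
w_hat = dequantize(q, scale)

print(q.dtype)                   # int8
print(np.abs(w - w_hat).max())   # rounding error, at most half a quantization step
```

An INT8 ONNX file stores `q` and `scale` (and, for asymmetric schemes, a zero point) in place of the FP32 tensor; the quality of INT8 inference hinges on how well this reconstruction approximates the original weights.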
Now, we’ll configure Triton to serve this model, pretending it’s INT8. This is where the interesting part begins.
Here’s a simplified config.pbtxt for Triton:
name: "quantized_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32  # Even though we *want* INT8, ONNX Runtime often expects input in FP32
    dims: [ 3, 32, 32 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32  # Output is also typically FP32 from ONNX Runtime
    dims: [ 10 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
# The input and output names must match the tensor names in the ONNX model.
# Note there is no model_version field in config.pbtxt: versions come from
# the model repository layout (e.g. quantized_model/1/model.onnx).
# For INT8, you might also see specific TensorRT or ONNX Runtime
# configurations here, but with the onnxruntime_onnx backend it's often
# about how the model itself is quantized.
The surprise is how Triton’s onnxruntime_onnx backend interacts with quantized ONNX models. By default, ONNX Runtime might still perform computations in FP32 even if the model graph contains quantized operations. The data_type in config.pbtxt for inputs and outputs typically refers to the host data type (what your client sends and receives), not necessarily the internal computation type of the model.
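To make the host-versus-compute distinction concrete, here is a hedged numpy sketch of what an INT8 kernel roughly does inside a quantized linear layer: the caller passes and receives FP32, while the matrix multiply itself runs on INT8 operands accumulated in INT32. All names here are illustrative, not Triton or ONNX Runtime APIs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these were baked into the quantized model file.
w_fp32 = rng.standard_normal((10, 64)).astype(np.float32)
w_scale = float(np.abs(w_fp32).max()) / 127.0
w_q = np.clip(np.round(w_fp32 / w_scale), -127, 127).astype(np.int8)

def int8_linear(x_fp32):
    """Caller sends FP32; the compute happens on INT8 operands."""
    x_scale = float(np.abs(x_fp32).max()) / 127.0
    x_q = np.clip(np.round(x_fp32 / x_scale), -127, 127).astype(np.int8)
    # INT8 x INT8 matmul accumulated in INT32, then rescaled back to FP32.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    return (acc * (x_scale * w_scale)).astype(np.float32)

x = rng.standard_normal((1, 64)).astype(np.float32)
y_int8 = int8_linear(x)
y_fp32 = x @ w_fp32.T
print(y_int8.dtype)                        # float32: the caller never sees INT8
print(np.max(np.abs(y_int8 - y_fp32)))    # small quantization error
```

This mirrors the config.pbtxt situation: TYPE_FP32 describes what crosses the wire, while the INT8 arithmetic (if any) is an internal property of the model and the kernels the runtime selects.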
To truly leverage INT8 inference, the ONNX model itself must be quantized. This means the weights and activations within the ONNX file are represented as INT8. ONNX Runtime then needs to be configured to use these INT8 kernels. This often involves setting specific execution providers or environment variables.
For example, if you were using ONNX Runtime with the CUDAExecutionProvider and had a model quantized specifically for INT8 inference on GPU, your config.pbtxt might look slightly different, and you’d ensure the ONNX Runtime installation supports it.
name: "quantized_model_gpu"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32  # Client input is still FP32
    dims: [ 3, 32, 32 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32  # Client output is FP32
    dims: [ 10 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU  # Using GPU
    gpus: [ 0 ]
  }
]
# For ONNX Runtime on GPU with INT8, you'd typically need to ensure the
# ONNX Runtime build has TensorRT support enabled and that the model is
# compatible with TensorRT's INT8 calibration and kernels. Much of this is
# handled outside of the Triton config.pbtxt itself, by how the ONNX model
# was generated (e.g. via TensorRT's builder).
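When the ONNX Runtime build does include the TensorRT execution provider, Triton can be asked to use it, along with a precision mode, through the optimization block of config.pbtxt. Treat the fragment below as a hedged sketch rather than a drop-in config: the exact parameter keys vary across Triton and ONNX Runtime versions, and INT8 additionally requires a calibration table produced when the model was quantized.

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        # "FP16" is the widely documented value; INT8 typically also
        # requires calibration data generated at model-build time.
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}
```

If the requested accelerator isn't available in the backend's ONNX Runtime build, the model will either fail to load or silently run on the remaining providers, which is exactly the FP32 fallback discussed above.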
The key levers you control are:
- The ONNX model: the model must actually be quantized (INT8 weights and activations) for INT8 inference. Triton doesn't magically quantize a float model; you'd use tools like TensorRT, ONNX Runtime's quantization toolkit, or vendor-specific tools to produce the INT8 ONNX file.
- Triton backend configuration: for onnxruntime_onnx, this means ensuring ONNX Runtime is built with the necessary optimizations (such as the TensorRT execution provider for GPU) and that the model is discoverable. The platform field is crucial here.
- Client data types: the data_type fields in config.pbtxt define the data types for transfer between the client and Triton. Even if the model runs in INT8 internally, the client might still send FP32 and receive FP32, with Triton and the backend handling the conversion implicitly or explicitly.
The one thing most people don’t realize is that simply having an INT8 ONNX file doesn’t guarantee INT8 execution by default. ONNX Runtime (and by extension, Triton’s ONNX Runtime backend) needs to be explicitly told or configured to utilize INT8 kernels, often through specific execution providers (like TensorRT EP on NVIDIA GPUs) or by setting environment variables that guide the runtime’s kernel selection. If these aren’t set up correctly, ONNX Runtime might fall back to FP32 computations, negating the benefits of quantization.
The next step is to explore how to configure Triton to use specific execution providers for optimized quantized inference.