Deploying Automatic Speech Recognition (ASR) models with Triton Inference Server is less about magic and more about orchestrating a highly optimized C++ inference engine with a Python-friendly API.
Let’s see Triton serve a speech-to-text model. We’ll use a pre-trained Whisper model, specifically the tiny.en variant for speed.
First, we need to get the model into a form Triton understands. For the Python backend, that means a model.py implementing Triton's model interface and a config.pbtxt describing the model's inputs, outputs, and placement.
# model.py
import json

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import WhisperProcessor, WhisperForConditionalGeneration

class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the parsed config.pbtxt as a JSON string
        model_config = json.loads(args["model_config"])
        params = model_config.get("parameters", {})
        processor_path = params.get("processor_path", {}).get("string_value", "openai/whisper-tiny.en")
        model_path = params.get("model_path", {}).get("string_value", "openai/whisper-tiny.en")
        self.processor = WhisperProcessor.from_pretrained(processor_path)
        self.model = WhisperForConditionalGeneration.from_pretrained(model_path)
        # Move model to GPU if available
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        print(f"Model loaded: {model_path} on {self.device}")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Shape (batch, 16000): float32 samples in [-1.0, 1.0], as declared in config.pbtxt
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO_INPUT").as_numpy()
            # Convert waveforms to log-Mel spectrogram features
            input_features = self.processor(
                list(audio), sampling_rate=16000, return_tensors="pt"
            ).input_features.to(self.device)
            # Generate token IDs, then decode them to text
            predicted_ids = self.model.generate(input_features)
            transcriptions = self.processor.batch_decode(predicted_ids, skip_special_tokens=True)
            # One string per batch item, matching output dims [ 1 ]
            output_data = np.array([[t.encode("utf-8")] for t in transcriptions], dtype=object)
            output_tensor = pb_utils.Tensor("TRANSCRIPTION_OUTPUT", output_data)
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses

    def finalize(self):
        print("Cleaning up model resources.")
And the config.pbtxt:
name: "whisper_asr"
backend: "python"
max_batch_size: 8
input [
  {
    name: "AUDIO_INPUT"
    data_type: TYPE_FP32
    dims: [ 16000 ]  # one second of 16 kHz mono audio per sample
  }
]
output [
  {
    name: "TRANSCRIPTION_OUTPUT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
parameters {
  key: "processor_path"
  value {
    string_value: "openai/whisper-tiny.en"
  }
}
parameters {
  key: "model_path"
  value {
    string_value: "openai/whisper-tiny.en"
  }
}
To serve this, you’d place these files in a directory structure like:
models/
  whisper_asr/
    config.pbtxt
    1/
      model.py
And start Triton with tritonserver --model-repository=/path/to/models.
The model.py script defines the TritonPythonModel class, which Triton instantiates. initialize is called once at load time to set up Python objects (here, the Hugging Face WhisperProcessor and WhisperForConditionalGeneration). execute is where the actual inference happens for each batch of requests: it takes float32 audio samples, converts them to the features the model expects, runs generation, and returns the transcribed text as strings. finalize runs once at unload for cleanup.
The config.pbtxt tells Triton the model's name, which backend to use (Python), its batching capabilities, input/output tensor names and types, and, crucially, how many instances to run and on which hardware. The parameters section lets you pass configuration options to the model; they arrive in initialize as part of the JSON-serialized model config.
A key detail is the input dims: [16000]. This defines the expected shape of a single input sample; because max_batch_size is greater than zero, Triton adds the batch dimension itself. For ASR, this shape often corresponds to a fixed duration of audio (e.g., 30 seconds of 16 kHz audio would be [480000]). Triton's dynamic batching can group multiple such inputs into a single inference call to the model, but each individual input must conform to this declared dimension. If your audio varies in length, you'll need to pad or truncate it to this fixed size before sending it to Triton, or declare a variable dimension (dims: [-1]) and handle variable-length inputs inside the model.
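Client-side, padding or truncating to the declared length is a few lines of NumPy. A minimal sketch (the helper name fix_length is ours, not part of any Triton API):

```python
import numpy as np

TARGET_LEN = 16000  # must match dims: [ 16000 ] in config.pbtxt

def fix_length(audio: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Truncate or zero-pad a 1-D waveform to exactly target_len samples."""
    audio = audio.astype(np.float32)
    if len(audio) >= target_len:
        return audio[:target_len]
    return np.pad(audio, (0, target_len - len(audio)))

short = fix_length(np.ones(8_000, dtype=np.float32))   # zero-padded up to 16000
long_ = fix_length(np.ones(50_000, dtype=np.float32))  # truncated down to 16000
```

Zero-padding is what Whisper's own feature extractor does internally for short clips, so it is a safe default here.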
When you send a request to Triton, say via curl or a Python client, you're sending a tensor of audio samples. The AUDIO_INPUT tensor receives this, and Triton invokes the execute method of your TritonPythonModel instance. The WhisperProcessor converts the raw samples into the log-Mel spectrogram features the Whisper model expects. The model.generate method performs autoregressive decoding to produce token IDs, which the processor then decodes back into human-readable text.
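For illustration, this is roughly what the JSON body of a POST to /v2/models/whisper_asr/infer looks like under Triton's KServe-v2 HTTP protocol (the one-second silence input is just a placeholder):

```python
import json

import numpy as np

# Batch of one, one second of silence (matches AUDIO_INPUT dims [ 16000 ])
audio = np.zeros((1, 16000), dtype=np.float32)

payload = {
    "inputs": [
        {
            "name": "AUDIO_INPUT",
            "shape": list(audio.shape),   # [1, 16000]
            "datatype": "FP32",
            "data": audio.flatten().tolist(),
        }
    ],
    "outputs": [{"name": "TRANSCRIPTION_OUTPUT"}],
}
body = json.dumps(payload)  # send with e.g. requests.post(url, data=body)
```

In practice the tritonclient Python package handles this encoding (and a much more compact binary variant) for you; the raw JSON is shown only to make the tensor naming visible.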
The instance_group specifying KIND_GPU and gpus: [0] is vital for performance. For transformer-based models like Whisper, GPU acceleration is almost always necessary for reasonable latency; the same model running on CPU will be significantly slower.
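If no GPU is available, you can still declare the instances explicitly; a CPU-only variant (here with two instances so requests can be handled in parallel) might look like this in config.pbtxt:

```
instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
```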
The most surprising aspect for many is how Triton orchestrates Python execution. It doesn't run your model.py as a standalone script. Instead, the Python backend launches a separate stub process for each model instance, loads your model code into it, and exchanges tensors with it over shared memory. When a request arrives, Triton's scheduler dispatches it to one of these instances. This lets Triton keep its high-throughput, low-overhead C++ core while leveraging Python's rich ecosystem for model implementation.
The next step after getting basic ASR working is handling audio chunking and streaming for real-time transcription, which involves managing state across multiple inference requests.
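As a taste of what that involves, here is a simple overlapping chunker (window and overlap sizes are arbitrary illustrations; merging the overlapping transcripts afterwards is the genuinely hard, stateful part):

```python
import numpy as np

def chunk_audio(audio: np.ndarray, chunk_len: int = 16000, overlap: int = 1600):
    """Yield fixed-size overlapping windows; the final window is zero-padded."""
    step = chunk_len - overlap
    for start in range(0, max(len(audio), 1), step):
        chunk = audio[start:start + chunk_len]
        if len(chunk) < chunk_len:
            chunk = np.pad(chunk, (0, chunk_len - len(chunk)))
        yield chunk
        if start + chunk_len >= len(audio):
            break

# 2.5 seconds of audio at 16 kHz -> three 1-second windows with 0.1 s overlap
chunks = list(chunk_audio(np.zeros(40_000, dtype=np.float32)))
```

Each window can then be sent as a separate AUDIO_INPUT request, with the overlap giving the decoder context to stitch word boundaries back together.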