The most surprising thing about Triton’s client libraries is how much they abstract away the fundamental differences between HTTP and gRPC, allowing you to write nearly identical code for both.

Let’s see it in action. Imagine we have a simple model deployed on Triton. We’ll send a request using the Python client, first over HTTP, then over gRPC, and observe the results.

import numpy as np
import tritonclient.http as httpclient
import tritonclient.grpc as grpcclient

# Assume Triton is running on localhost:8000 for HTTP and localhost:8001 for gRPC,
# and a model named 'my_model' is deployed.

# --- HTTP Request ---
http_client = httpclient.InferenceServerClient(url="localhost:8000")
input_data = np.random.rand(1, 3).astype(np.float32)

inputs = [httpclient.InferInput("my_input_name", list(input_data.shape), "FP32")]
inputs[0].set_data_from_numpy(input_data)
outputs = [httpclient.InferRequestedOutput("my_output_name")]

http_results = http_client.infer("my_model", inputs, outputs=outputs)
print("HTTP Output:", http_results.as_numpy("my_output_name"))

# --- gRPC Request ---
grpc_client = grpcclient.InferenceServerClient(url="localhost:8001")
# Reusing input_data and the same model name
grpc_inputs = [grpcclient.InferInput("my_input_name", list(input_data.shape), "FP32")]
grpc_inputs[0].set_data_from_numpy(input_data)
grpc_outputs = [grpcclient.InferRequestedOutput("my_output_name")]

grpc_results = grpc_client.infer("my_model", grpc_inputs, outputs=grpc_outputs)
print("gRPC Output:", grpc_results.as_numpy("my_output_name"))

Notice how the core infer call and input/output object creation are remarkably similar. The tritonclient library handles the underlying protocol details.

At its heart, Triton’s Python client is a bridge. It translates your Python data structures and requests into the specific wire formats required by either the HTTP/REST API or the gRPC API of the Triton Inference Server. For HTTP, it constructs JSON payloads and handles HTTP headers. For gRPC, it serializes your data into Protobuf messages and manages the gRPC channel. The client’s job is to make the server’s inference endpoint feel like a unified API, regardless of the transport protocol.
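To make the HTTP side concrete, here is a sketch of the kind of JSON body the client builds for the v2 REST inference endpoint (`POST /v2/models/my_model/infer`). The tensor name and values are illustrative, and this shows only the plain-JSON form — the client can also use a binary tensor extension for large payloads.

```python
import json

# Illustrative input tensor: shape (1, 3), FP32 values.
values = [[0.1, 0.2, 0.3]]

# The HTTP client serializes each InferInput into a JSON object like this.
payload = {
    "inputs": [
        {
            "name": "my_input_name",
            "shape": [1, 3],
            "datatype": "FP32",
            "data": values,
        }
    ],
    # Requested outputs correspond to InferRequestedOutput objects.
    "outputs": [{"name": "my_output_name"}],
}

body = json.dumps(payload)
print(body)
```

The gRPC path carries the same logical fields, but packed into Protobuf messages instead of JSON, which is why the Python-level API can look identical for both.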

The tritonclient.http.InferenceServerClient and tritonclient.grpc.InferenceServerClient classes are your main entry points. You instantiate them with the server’s address. Then, you define your inputs using InferInput, specifying the name, shape, and data type. Crucially, you populate these inputs with your actual data using methods like set_data_from_numpy. The infer method is where the magic happens – it sends the request and returns a results object. Finally, you extract your model’s outputs from this results object with as_numpy, optionally narrowing which outputs the server returns by passing InferRequestedOutput objects to infer.

The key difference you’ll encounter when switching between HTTP and gRPC is often in the server configuration and port numbers. HTTP typically runs on port 8000, while gRPC uses port 8001. You also need to ensure your Triton server is launched with the appropriate endpoints enabled (--allow-http and --allow-grpc, with the ports set by --http-port and --grpc-port). The client library abstracts this, but understanding the server’s setup is vital for troubleshooting.
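As a reference point, a server launch that enables both endpoints on their default ports might look like this (a sketch; /models is a placeholder model-repository path):

```shell
# Hypothetical launch command; adjust the model repository path for your setup.
tritonserver --model-repository=/models \
    --allow-http=true  --http-port=8000 \
    --allow-grpc=true --grpc-port=8001
```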

One thing that often trips people up is how data types are handled. While you specify FP32, INT32, etc., when creating the input object, the underlying serialization needs to be precise. The set_data_from_numpy method is convenient because NumPy handles many of these conversions, but if you’re working with raw bytes or other complex data structures, you might need to be more explicit about byte order and memory layout to ensure the data arrives at Triton exactly as expected.
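A quick way to guard against such surprises with NumPy is to normalize the array before handing it to the client. The sketch below (array contents are illustrative) converts a big-endian float32 array to the little-endian, C-contiguous layout Triton expects for FP32:

```python
import numpy as np

# A big-endian float32 array, e.g. as loaded from a file written on a
# big-endian machine ('>f4' means big-endian float32).
raw = np.array([1.0, 2.0, 3.0], dtype=">f4")

# astype('<f4') byte-swaps the values to little-endian float32, and
# ascontiguousarray guarantees a C-contiguous memory layout, so the raw
# bytes passed via set_data_from_numpy match what Triton expects.
prepared = np.ascontiguousarray(raw.astype("<f4"))

print(prepared.dtype)                  # little-endian float32
print(prepared.flags["C_CONTIGUOUS"])  # True
```

The values are unchanged by the conversion; only the in-memory byte representation is normalized.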

The next logical step is to explore asynchronous requests, which are critical for maximizing throughput when dealing with multiple concurrent inference calls.

Want structured learning?

Take the full Triton course →