TensorFlow Serving can expose models via both gRPC and REST APIs, but most people don’t realize that the gRPC API is the primary, foundational interface, and REST is essentially a translation layer on top.
Let’s see it in action. Imagine we have a simple TensorFlow model exported in the SavedModel format, ready for serving. We’ll deploy it with TensorFlow Serving and then interact with it.
First, the model itself. Here’s a snippet of how you might save a simple model:
import tensorflow as tf
# Define a simple model
input_tensor = tf.keras.layers.Input(shape=(10,), name='input_layer')
dense_layer = tf.keras.layers.Dense(5, activation='relu')(input_tensor)
output_tensor = tf.keras.layers.Dense(1, activation='sigmoid', name='output_layer')(dense_layer)
model = tf.keras.models.Model(inputs=input_tensor, outputs=output_tensor)
# Save the model
export_path = "./my_model/1" # Version 1
tf.saved_model.save(model, export_path)
Now, we start TensorFlow Serving. The command to run it, assuming your model is in ./my_model and you want to expose it on default ports, looks like this:
tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=my_model --model_base_path=./my_model
Here, --port=8500 is for the gRPC API, and --rest_api_port=8501 is for the REST API. --model_name is the logical name of your model, and --model_base_path points to the directory containing versioned models.
Let’s query the gRPC endpoint first. We’ll use grpcurl, a command-line utility for interacting with gRPC services.
First, list the available services. One caveat: TensorFlow Serving typically does not enable gRPC server reflection, so grpcurl needs the .proto definitions (from the tensorflow/serving and tensorflow source trees) supplied explicitly:
grpcurl -plaintext -import-path /path/to/protos -proto tensorflow_serving/apis/prediction_service.proto localhost:8500 list
This will show you the structure. The key service is tensorflow.serving.PredictionService. To make a prediction, you construct a PredictRequest in protobuf JSON. Two details to note: int64 fields (the version number, tensor dimension sizes) are encoded as strings, and the Keras signature expects a leading batch dimension:
grpcurl -plaintext -import-path /path/to/protos -proto tensorflow_serving/apis/prediction_service.proto -d '{"model_spec":{"name":"my_model","version":{"value":"1"}},"inputs":{"input_layer":{"dtype":"DT_FLOAT","tensor_shape":{"dim":[{"size":"1"},{"size":"10"}]},"float_val":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]}}}' localhost:8500 tensorflow.serving.PredictionService/Predict
The output is the PredictResponse rendered as JSON (the value shown is illustrative):
{
  "outputs": {
    "output_layer": {
      "dtype": "DT_FLOAT",
      "tensor_shape": {
        "dim": [
          { "size": "1" },
          { "size": "1" }
        ]
      },
      "float_val": [
        0.55234567
      ]
    }
  }
}
Now, the REST API. It’s designed to be more user-friendly for web applications. The same prediction request, but over HTTP POST to the REST port:
curl -d '{"instances": [[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]]}' http://localhost:8501/v1/models/my_model:predict
The response is similar, but formatted as standard JSON:
{
"predictions": [
[0.55234567]
]
}
Notice how the REST API’s /v1/models/my_model:predict endpoint maps to the gRPC Predict method. The instances field in the REST request corresponds to the inputs field in the gRPC request, with a simpler structure for the tensor data: the REST layer converts the plain list of lists into the typed float_val tensor representation, and converts the response tensors back into nested lists.
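To make that mapping concrete, here is a small illustrative Python function (not TensorFlow Serving’s actual code, which does this in C++) that builds a protobuf-JSON body similar to the grpcurl example from one row of the REST request’s instances list:

```python
def rest_instance_to_grpc_body(model_name, version, input_name, values):
    """Sketch of the REST-to-gRPC translation for a single 1-D float input.

    `values` is one row from the REST "instances" list. Per protobuf JSON
    rules, int64 fields (version, dim sizes) become strings, and a leading
    batch dimension of 1 is added for the single instance.
    """
    return {
        "model_spec": {"name": model_name, "version": {"value": str(version)}},
        "inputs": {
            input_name: {
                "dtype": "DT_FLOAT",
                "tensor_shape": {
                    "dim": [{"size": "1"}, {"size": str(len(values))}]
                },
                "float_val": list(values),
            }
        },
    }

body = rest_instance_to_grpc_body("my_model", 1, "input_layer", [0.1] * 10)
```

The real translation layer also infers dtypes, handles batching across many instances, and supports a columnar "inputs" request format; this sketch covers only the simple single-instance case.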
The core problem TensorFlow Serving solves is decoupling model training from model inference. Instead of embedding models directly into application code or needing to manage complex deployment pipelines for every model update, you have a dedicated, high-performance service that can load and serve multiple versions of multiple models. It handles model versioning, rolling updates, and provides a consistent API for clients.
Internally, TensorFlow Serving uses a modular architecture built around a few core abstractions: Servables (the objects clients run computation against, such as a loaded SavedModel), Loaders (which standardize how a Servable is loaded and unloaded), Sources (which discover new model versions, e.g. by watching the filesystem), and Managers (which handle the full Servable lifecycle: loading, serving, and unloading versions). The PredictionService implementation takes incoming requests (either native gRPC or translated from REST) and dispatches them to the appropriate Servable for inference.
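A toy sketch of that lifecycle can make the roles clearer. This is deliberately simplified Python with made-up class bodies; the real components are C++ and considerably more involved:

```python
class Loader:
    """Knows how to load and unload one version of one model (toy version)."""
    def __init__(self, path):
        self.path = path
        self.model = None

    def load(self):
        self.model = f"weights-from-{self.path}"

    def unload(self):
        self.model = None


class Manager:
    """Tracks loaded versions and routes requests to the newest one."""
    def __init__(self):
        self.loaded = {}  # version number -> Loader

    def aspire(self, version, loader):
        loader.load()                  # load the new version first...
        self.loaded[version] = loader  # ...then make it available

    def serving_version(self):
        return max(self.loaded)        # default policy: serve the latest


manager = Manager()
manager.aspire(1, Loader("./my_model/1"))
manager.aspire(2, Loader("./my_model/2"))
print(manager.serving_version())  # -> 2
```

In the real system the Source emits "aspired versions", and the Manager applies a version policy to decide when to load them and when to retire old ones.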
The gRPC interface is strictly typed using Protocol Buffers. This means the request and response structures are precisely defined, leading to high performance and robustness. The REST API, on the other hand, uses JSON, which is more human-readable and widely supported by web frameworks, but it requires a translation layer within TensorFlow Serving to convert JSON payloads into the gRPC protobuf format before they reach the model. This translation adds a small overhead, which is why gRPC is generally preferred for high-throughput, low-latency scenarios.
A key detail most people miss is how TensorFlow Serving manages model versions and availability. When you point --model_base_path at a directory containing subdirectories named after version numbers (e.g., 1, 2, 3), TensorFlow Serving automatically discovers and loads them. By default it serves only the latest version: when a new version appears, it is loaded (and, if configured, warmed up) before traffic is switched to it, and only then is the old version unloaded. If the new version fails to load, the previously loaded version simply keeps serving. This makes model deployments resilient.
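Concretely, version discovery expects a layout like this, where each numeric subdirectory is a complete SavedModel (a saved_model.pb file plus a variables/ directory):

```
my_model/
├── 1/
│   ├── saved_model.pb
│   └── variables/
└── 2/
    ├── saved_model.pb
    └── variables/
```

Dropping a new numbered directory (e.g. 3/) into my_model/ is all it takes to trigger a new deployment.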
The next concept you’ll likely encounter is how to manage multiple models and complex versioning strategies, including canary deployments and A/B testing, using TensorFlow Serving’s configuration and API.
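As a taste of that, serving multiple models, or pinning specific versions, is done with a model config file passed via --model_config_file. A minimal sketch, with illustrative names and paths:

```
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
  }
  config {
    name: "another_model"
    base_path: "/models/another_model"
    model_platform: "tensorflow"
  }
}
```

Keeping two specific versions loaded side by side like this is the building block for canary and A/B-testing setups, where clients pin a version (or a version label) per request.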