TensorRT’s embedding table inference is surprisingly efficient because it treats embedding lookups as a single batched, massively parallel gather operation, not a series of individual key-value fetches.
Let’s see it in action. Imagine a recommendation system that needs to retrieve embeddings for user IDs and item IDs.
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # creates and activates a CUDA context
# Assume these are pre-trained embedding weights and input IDs
# For simplicity, let's make them small
user_embedding_weights = np.random.rand(1000, 64).astype(np.float32) # 1000 users, 64-dim embeddings
item_embedding_weights = np.random.rand(10000, 128).astype(np.float32) # 10000 items, 128-dim embeddings
# Example input: a batch of user IDs and item IDs
user_ids = np.array([10, 5, 100, 2], dtype=np.int32)
item_ids = np.array([500, 12, 8000, 3], dtype=np.int32)
# --- TensorRT Build Process ---
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
# Input tensors
user_id_input = network.add_input(name="user_ids", dtype=trt.int32, shape=(user_ids.shape[0],))
item_id_input = network.add_input(name="item_ids", dtype=trt.int32, shape=(item_ids.shape[0],))
# Embedding lookups
# TensorRT has no dedicated embedding layer. The idiomatic construction is a
# constant layer holding the weight table plus a gather layer that selects
# rows of that table by index.
user_weights_const = network.add_constant(
    shape=user_embedding_weights.shape, weights=trt.Weights(user_embedding_weights)
)
user_embedding_layer = network.add_gather(
    input=user_weights_const.get_output(0), indices=user_id_input, axis=0
)
item_weights_const = network.add_constant(
    shape=item_embedding_weights.shape, weights=trt.Weights(item_embedding_weights)
)
item_embedding_layer = network.add_gather(
    input=item_weights_const.get_output(0), indices=item_id_input, axis=0
)
# The output of the embedding layer is the desired embedding vectors.
# For a recommendation system, you might concatenate these or perform further operations.
# Let's just mark them as outputs for demonstration.
user_embedding_layer.get_output(0).name = "user_embeddings"
item_embedding_layer.get_output(0).name = "item_embeddings"
network.mark_output(user_embedding_layer.get_output(0))
network.mark_output(item_embedding_layer.get_output(0))
# Build the engine
serialized_engine = builder.build_serialized_network(network, config)
runtime = trt.Runtime(TRT_LOGGER)
engine = runtime.deserialize_cuda_engine(serialized_engine)
# --- Inference ---
context = engine.create_execution_context()
# Look up binding indices (TensorRT 8.x binding API; newer releases address
# tensors by name instead)
user_id_binding_idx = engine.get_binding_index("user_ids")
item_id_binding_idx = engine.get_binding_index("item_ids")
user_emb_binding_idx = engine.get_binding_index("user_embeddings")
item_emb_binding_idx = engine.get_binding_index("item_embeddings")
# Get input/output shapes and dtypes
user_id_shape = context.get_binding_shape(user_id_binding_idx)
item_id_shape = context.get_binding_shape(item_id_binding_idx)
user_emb_shape = context.get_binding_shape(user_emb_binding_idx)
item_emb_shape = context.get_binding_shape(item_emb_binding_idx)
user_id_dtype = engine.get_binding_dtype(user_id_binding_idx)
item_id_dtype = engine.get_binding_dtype(item_id_binding_idx)
user_emb_dtype = engine.get_binding_dtype(user_emb_binding_idx)
item_emb_dtype = engine.get_binding_dtype(item_emb_binding_idx)
# Copy inputs to the device
user_id_buffer = cuda.to_device(user_ids)
item_id_buffer = cuda.to_device(item_ids)
# Allocate output buffers on the device
user_emb_output_buffer = cuda.mem_alloc(int(np.prod(user_emb_shape)) * np.dtype(trt.nptype(user_emb_dtype)).itemsize)
item_emb_output_buffer = cuda.mem_alloc(int(np.prod(item_emb_shape)) * np.dtype(trt.nptype(item_emb_dtype)).itemsize)
# Create the bindings list: raw device pointers (ints), in binding order
bindings = [None] * engine.num_bindings
bindings[user_id_binding_idx] = int(user_id_buffer)
bindings[item_id_binding_idx] = int(item_id_buffer)
bindings[user_emb_binding_idx] = int(user_emb_output_buffer)
bindings[item_emb_binding_idx] = int(item_emb_output_buffer)
# Execute inference
stream = cuda.Stream()
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
stream.synchronize()
# Copy results back to host
user_embeddings_result = np.empty(tuple(user_emb_shape), dtype=trt.nptype(user_emb_dtype))
item_embeddings_result = np.empty(tuple(item_emb_shape), dtype=trt.nptype(item_emb_dtype))
cuda.memcpy_dtoh_async(user_embeddings_result, user_emb_output_buffer, stream=stream)
cuda.memcpy_dtoh_async(item_embeddings_result, item_emb_output_buffer, stream=stream)
stream.synchronize()
print("User Embeddings Shape:", user_embeddings_result.shape)
print("Item Embeddings Shape:", item_embeddings_result.shape)
print("Sample User Embedding:", user_embeddings_result[0, :5])
print("Sample Item Embedding:", item_embeddings_result[1, :5])
This code snippet implements embedding lookup with TensorRT’s gather layer (tensorrt.INetworkDefinition.add_gather), indexing into a weight table held in a constant layer. The key is that the engine doesn’t fetch individual rows from the table one at a time. When TensorRT builds the engine, it compiles the lookup into a single batched gather kernel and can fuse it with adjacent operations. Think of it as collecting all the required rows at once and then processing them as a batch, rather than fetching them one by one. This is where the speedup comes from.
The problem this solves is the bottleneck of retrieving large embedding tables for millions of users and items during inference. Traditional methods might involve slow CPU lookups or inefficient GPU memory accesses. TensorRT’s approach packs these lookups into a single, optimized kernel.
Internally, TensorRT implements this lookup as a gather over axis 0: the layer takes an input tensor of indices and the embedding weights (supplied here as a constant layer), and outputs a tensor in which each row is the embedding vector for the corresponding input index. The table’s second dimension is the embedding dimensionality, and its first dimension is the number of unique embeddings available (the size of your lookup table).
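The semantics of this lookup are exactly those of NumPy fancy indexing. A minimal host-side sketch of what the engine computes per batch:

```python
import numpy as np

# Toy embedding table: 8 embeddings, each a 4-dim vector
weights = np.arange(32, dtype=np.float32).reshape(8, 4)
ids = np.array([3, 0, 7], dtype=np.int32)

# The gather that TensorRT performs on the GPU, expressed on the host:
# one row of `weights` is selected per input index, preserving batch order.
embeddings = weights[ids]

print(embeddings.shape)  # (3, 4)
print(embeddings[0])     # row 3 of the table: [12. 13. 14. 15.]
```

The output batch dimension matches the index tensor, which is why larger ID batches amortize kernel launch overhead so well.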
The levers you control are primarily:
- Embedding dimension: the second axis of the weight table. This directly impacts the size of each embedding vector, and thus the memory footprint and computational cost.
- Number of embeddings: the first axis of the weight table, i.e. the size of your lookup table. Larger tables require more memory.
- Input data types: ensure your input IDs are int32 and your weights are float32 (or float16 for potential speedups, if supported and accuracy permits).
- Batch size: the input user_ids and item_ids tensors represent a batch of requests. Larger batch sizes generally lead to better GPU utilization and throughput, up to a point.
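The memory impact of these levers is easy to estimate. A back-of-the-envelope sketch (pure NumPy; the helper names are illustrative, not a TensorRT API):

```python
import numpy as np

def table_bytes(num_embeddings, embedding_dim, dtype=np.float32):
    """Size of the embedding table itself; this must fit in GPU memory."""
    return num_embeddings * embedding_dim * np.dtype(dtype).itemsize

def activation_bytes(batch_size, embedding_dim, dtype=np.float32):
    """Size of the gathered output for one batch of lookups."""
    return batch_size * embedding_dim * np.dtype(dtype).itemsize

# The item table from the example: 10,000 embeddings x 128 dims
fp32 = table_bytes(10_000, 128, np.float32)
fp16 = table_bytes(10_000, 128, np.float16)
print(fp32, fp16)  # 5120000 2560000 -- halving precision halves the footprint

# Per-batch output is tiny by comparison: 4 lookups x 64 dims in fp32
print(activation_bytes(4, 64))  # 1024 bytes
```

This is why the table size (rows × dims × dtype width) dominates the memory budget, while batch size mostly trades latency against throughput.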
When TensorRT optimizes the graph, it may fuse multiple gather operations together, or fuse them with subsequent matrix multiplications if, for instance, you’re calculating dot products between user and item embeddings. The engine generates CUDA kernels that perform these batched lookups and, where possible, the subsequent computation in a single pass, minimizing GPU memory traffic and maximizing parallelism. The underlying mechanism is careful indexing and specialized memory access patterns that fetch contiguous blocks of embedding data where possible, rather than issuing scattered reads.
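If user and item embeddings share a dimensionality (hypothetical here, since the running example uses 64 and 128), the fused lookup-plus-dot-product that TensorRT can generate is equivalent to this host-side sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # assumed shared dimensionality for scoring
user_table = rng.standard_normal((1000, dim)).astype(np.float32)
item_table = rng.standard_normal((10000, dim)).astype(np.float32)

user_ids = np.array([10, 5, 100, 2], dtype=np.int32)
item_ids = np.array([500, 12, 8000, 3], dtype=np.int32)

# Fused view: gather both tables, then score each (user, item) pair with a
# dot product -- conceptually one batched kernel instead of separate
# lookup and matmul passes.
scores = np.einsum("bd,bd->b", user_table[user_ids], item_table[item_ids])
print(scores.shape)  # (4,) -- one score per (user, item) pair in the batch
```

Keeping the gathered rows in registers or shared memory between the lookup and the dot product is exactly the memory-traffic saving fusion buys.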
The next concept you’ll likely encounter is optimizing the embedding weights themselves, perhaps using quantization or exploring mixed-precision inference to further reduce memory bandwidth and improve performance.
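As a preview, per-table symmetric int8 quantization can be sketched in a few lines of NumPy (illustrative only; TensorRT’s own INT8 path uses calibration and is more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
table = rng.standard_normal((1000, 64)).astype(np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto int8's [-127, 127].
scale = np.abs(table).max() / 127.0
q = np.clip(np.round(table / scale), -127, 127).astype(np.int8)  # 4x smaller
dequant = q.astype(np.float32) * scale

max_err = np.abs(table - dequant).max()
print(q.nbytes, table.nbytes)  # int8 table is a quarter of the fp32 size
# Rounding error is bounded by half a quantization step:
assert max_err <= scale / 2 + 1e-6
```

The bandwidth saved reading a 4x-smaller table is often worth the small dequantization error for embedding workloads.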