Batch size is the most counterintuitive throughput knob TensorRT gives you: it tempts people into assuming larger is always better, when at high utilization the opposite can be true.

Let’s see this in action. We’ll use a simple image classification engine, say, ResNet-50, and profile its throughput with different batch sizes.

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import time

# Initialize TensorRT logger
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Build a simple ResNet-50 engine (replace with your actual engine path)
# For demonstration, we'll assume an engine exists at 'resnet50.plan'
# In a real scenario, you'd build this using trt.Builder
try:
    with open('resnet50.plan', 'rb') as f:
        engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
except FileNotFoundError:
    print("Error: resnet50.plan not found. Please build a TensorRT engine first.")
    exit()

# Get input and output tensor shapes
context = engine.create_execution_context()
input_binding_shape = engine.get_binding_shape(0)
output_binding_shape = engine.get_binding_shape(1)

# Allocate device memory for inputs and outputs
input_dtype = trt.nptype(engine.get_binding_dtype(0))
output_dtype = trt.nptype(engine.get_binding_dtype(1))

# trt.nptype returns a NumPy type class, so go through np.dtype for itemsize
input_size = trt.volume(input_binding_shape) * np.dtype(input_dtype).itemsize
output_size = trt.volume(output_binding_shape) * np.dtype(output_dtype).itemsize

d_input = cuda.mem_alloc(input_size)
d_output = cuda.mem_alloc(output_size)

# Create a stream for asynchronous operations
stream = cuda.Stream()

# Prepare dummy input data (e.g., for batch size 1)
# Assuming input shape is (batch_size, C, H, W) and dtype is float32
# For ResNet-50, typically (batch_size, 3, 224, 224)
dummy_input_data = np.random.rand(*input_binding_shape).astype(input_dtype)
# pagelocked_empty takes an element count, not a byte count
h_input = cuda.pagelocked_empty(trt.volume(input_binding_shape), input_dtype)
np.copyto(h_input, dummy_input_data.ravel())

# Function to run inference and measure throughput
def run_inference(batch_size, num_iterations=100):
    # Note: this engine has a fixed batch size baked in at build time.
    # With a dynamic-shape engine you would instead call:
    #   context.set_binding_shape(0, (batch_size, *input_binding_shape[1:]))
    # and size the device buffers for the largest batch in the profile.
    engine_batch = input_binding_shape[0]
    if engine_batch != batch_size:
        print(f"Warning: engine built for batch size {engine_batch}; "
              f"measuring at batch size {engine_batch}, not {batch_size}.")

    # Prepare pinned host input matching the engine's expected shape
    # (pagelocked_empty takes an element count, not a byte count)
    h_input_for_engine = cuda.pagelocked_empty(trt.volume(input_binding_shape), input_dtype)
    np.copyto(h_input_for_engine, np.random.rand(*input_binding_shape).astype(input_dtype).ravel())

    cuda.memcpy_htod_async(d_input, h_input_for_engine, stream)

    # Warm up so first-launch overhead doesn't skew the timing
    for _ in range(10):
        context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    stream.synchronize()

    start_time = time.time()
    for _ in range(num_iterations):
        context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
        stream.synchronize()  # ensure completion before the next iteration
    end_time = time.time()

    avg_latency = (end_time - start_time) / num_iterations
    # Throughput is inferences (images) per second, so scale batches/sec by the batch dimension
    throughput = (num_iterations * engine_batch) / (end_time - start_time)
    return avg_latency, throughput

# Profile different batch sizes
batch_sizes_to_test = [1, 4, 8, 16, 32, 64] # Example batch sizes

print("Profiling TensorRT throughput for ResNet-50:")
print("-" * 60)
print(f"{'Batch Size':<15} {'Avg Latency (ms)':<20} {'Throughput (inf/sec)':<20}")
print("-" * 60)

for bs in batch_sizes_to_test:
    # With a fixed-batch engine this measures at the engine's build batch size;
    # a dynamic-shape engine would set the context's input shape per `bs` and
    # allocate device memory sized for the largest batch in the profile.
    avg_lat, tp = run_inference(bs)
    print(f"{bs:<15} {avg_lat * 1000:<20.4f} {tp:<20.2f}")

print("-" * 60)

# Clean up
del context
del engine
del d_input
del d_output

The core problem TensorRT batch size solves is parallelism within a single inference request. When you increase the batch size, you’re essentially telling TensorRT to process multiple independent inputs simultaneously. This is fantastic for GPU utilization because GPUs excel at parallel computation. Instead of one stream of work, you have batch_size streams running side-by-side.

This allows you to amortize the overhead of kernel launches, memory transfers, and the inherent latency of the model across multiple inputs. For small batch sizes (like 1), the GPU might be sitting idle waiting for the next operation, or the computation is so small it doesn’t saturate the core. With a larger batch, you fill up those compute units.
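This amortization effect can be sketched with a toy latency model. The constants below are made-up illustrative numbers, not measurements: per-batch latency is a fixed overhead (kernel launches, transfer setup) plus a per-image compute cost, and throughput is images divided by latency.

```python
# Toy model of batch-size amortization (illustrative numbers, not measurements)
FIXED_OVERHEAD_MS = 0.5  # hypothetical per-batch launch + transfer overhead
PER_IMAGE_MS = 0.2       # hypothetical per-image compute cost

def modeled_throughput(batch_size):
    latency_ms = FIXED_OVERHEAD_MS + PER_IMAGE_MS * batch_size
    return batch_size / (latency_ms / 1000.0)  # images per second

# Throughput climbs with batch size but saturates near 1000 / PER_IMAGE_MS
for bs in [1, 4, 16, 64]:
    print(bs, round(modeled_throughput(bs)))
```

The curve rises steeply while the fixed overhead dominates, then flattens as the per-image cost takes over; real engines add the saturation effects described below on top of this.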

However, there’s a sweet spot. As you increase the batch size, you also increase the total memory footprint for that batch. This can lead to:

  1. Memory Bandwidth Saturation: The GPU has to read and write more data for each batch. If the model’s computation is memory-bound (which many deep learning models are), you’ll hit a wall where the GPU is waiting for data to be fetched from or written to global memory.
  2. Compute Saturation: Eventually, you have enough work to keep all the GPU cores busy. Further increasing the batch size won’t give you more parallelism for the same computation; it just means each "iteration" of your loop is doing more work, which increases latency.
  3. Register and Memory Pressure: Larger batches mean larger intermediate activation tensors, and kernels tiled for bigger batches can demand more registers per thread. When that demand exceeds the hardware budget, registers spill to local memory, which significantly slows execution.
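One reason batching helps before these limits bite is that weights are read from memory once per batch, so arithmetic intensity (FLOPs per byte of DRAM traffic) rises with batch size until activation traffic dominates. The constants below are rough assumptions for a ResNet-50-class model, purely for illustration:

```python
# Rough roofline-style estimate; all constants are illustrative assumptions
WEIGHT_BYTES = 100e6        # ~25M FP32 params -> ~100 MB read per batch (approx.)
ACT_BYTES_PER_IMAGE = 30e6  # assumed activation traffic per image
FLOPS_PER_IMAGE = 8e9       # ~4 GFLOP of MACs -> ~8e9 FLOPs per image (approx.)

def arithmetic_intensity(batch_size):
    flops = batch_size * FLOPS_PER_IMAGE
    bytes_moved = WEIGHT_BYTES + batch_size * ACT_BYTES_PER_IMAGE
    return flops / bytes_moved  # FLOPs per byte of DRAM traffic

for bs in [1, 8, 32]:
    print(bs, round(arithmetic_intensity(bs), 1))
```

Intensity improves quickly at small batches (weight reuse) but asymptotes at FLOPS_PER_IMAGE / ACT_BYTES_PER_IMAGE, which is one way the memory-bandwidth wall shows up.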

The point where throughput stops increasing and starts decreasing is often the optimal batch size for maximum throughput. It’s a delicate balance between filling the GPU’s compute units and memory bandwidth without exceeding them, and also without introducing excessive latency or register pressure.

The most surprising thing about optimizing batch size is that the optimal value is heavily dependent on the specific model architecture, the target hardware (GPU), and even the input data dimensions. A batch size of 32 might be perfect for one model on a V100, while a batch size of 16 or 64 might be better for another model or a different GPU like an A100.

You’ll notice that as the batch size increases from 1, throughput generally climbs. Then, it will plateau and eventually start to drop. This drop-off is your signal that you’ve gone too far. The goal is to find the peak of that curve.
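Finding that peak from profiled data is a one-liner. The numbers below are fabricated to mimic the table the script above prints; plug in your own measurements:

```python
# (batch size -> throughput in inf/sec); fabricated example data
measurements = {1: 620.0, 4: 1900.0, 8: 2750.0, 16: 3100.0, 32: 3050.0, 64: 2800.0}

# The optimal batch size for throughput is simply the argmax of the curve
best_bs = max(measurements, key=measurements.get)
print(f"Peak throughput {measurements[best_bs]:.0f} inf/sec at batch size {best_bs}")
# The dip at 32 and 64 in this fabricated data is the drop-off described above.
```

In practice you would also cross-check the latency column: the throughput peak may sit at a latency your service-level objective cannot afford.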

The one thing most people don’t realize is that the "optimal" batch size is not a fixed characteristic of the hardware or the model in isolation, but rather a characteristic of the system at peak utilization. It’s the batch size that most effectively amortizes the fixed overheads of kernel launches and memory transfers over the variable computation and memory access costs of the model, until those costs themselves become the bottleneck.

The next logical step after finding your peak throughput batch size is to explore techniques like model precision optimization (FP16, INT8) which can further increase throughput by reducing memory bandwidth requirements and compute load.
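The bandwidth argument for lower precision is easy to quantify. As a hedged sketch with illustrative numbers (a hypothetical 900 GB/s GPU and an assumed activation volume), halving bytes per element roughly doubles the throughput ceiling of a purely memory-bound model:

```python
# Illustrative upper bounds for a purely memory-bound model; constants assumed
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}
ELEMENTS_PER_IMAGE = 10_000_000  # assumed activation elements per inference
GPU_BW_GBS = 900.0               # hypothetical memory bandwidth

def bandwidth_bound_throughput(precision):
    bytes_per_image = ELEMENTS_PER_IMAGE * BYTES_PER_ELEMENT[precision]
    return GPU_BW_GBS * 1e9 / bytes_per_image  # images/sec ceiling

for p in ("fp32", "fp16", "int8"):
    print(p, round(bandwidth_bound_throughput(p)))
```

Real speedups are smaller than these ceilings suggest, since compute, overheads, and accuracy-preserving calibration (for INT8) all enter the picture, but the direction holds: lower precision shifts the throughput curve up and often moves its peak.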

Want structured learning?

Take the full TensorRT course →