The most surprising thing about TensorRT calibration is that the dataset itself is far more impactful on INT8 accuracy than the specific calibration algorithm you choose.

Let’s see it in action. Suppose we’ve trained a model and now want to quantize it to INT8 for faster inference with TensorRT: we have a trained model.onnx and a directory of calibration images (/data/calibration_images/).

First, we need a Python script that builds the INT8 engine. During the build, TensorRT will call back into our calibrator, run inference on batches of calibration images, and capture the activation distributions.

import os

import numpy as np
import pycuda.autoinit  # noqa: F401 (initializes the CUDA context)
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, builder.create_builder_config() as config:
    # Parse ONNX model
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open("model.onnx", "rb") as model:
        if not parser.parse(model.read()):
            print("Failed to parse ONNX file")
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise SystemExit(1)

    # Create a calibrator. We use the MinMax variant for this example,
    # but the calibration dataset matters far more than the algorithm.
    class MyCalibrator(trt.IInt8MinMaxCalibrator):
        def __init__(self, calibration_data_dir, batch_size=16, cache_file="calibration.cache"):
            trt.IInt8MinMaxCalibrator.__init__(self)
            self.data_dir = calibration_data_dir
            self.batch_size = batch_size
            self.cache_file = cache_file
            self.image_list = [os.path.join(self.data_dir, img) for img in os.listdir(self.data_dir) if img.endswith(('.png', '.jpg', '.jpeg'))]
            self.current_index = 0
            self.calibration_data = np.zeros((batch_size, 3, 224, 224), dtype=np.float32) # Assuming NCHW, 224x224 input

            # Allocate device memory for one input batch
            self.device_input = cuda.mem_alloc(self.calibration_data.nbytes)

        def get_batch_size(self):
            return self.batch_size

        def get_batch(self, names):
            if self.current_index + self.batch_size > len(self.image_list):
                return None  # No more batches

            for i in range(self.batch_size):
                img_path = self.image_list[self.current_index + i]
                # Load and preprocess: this must match your training pipeline
                # (resize, channel order, normalization)
                img = Image.open(img_path).convert('RGB')
                img = img.resize((224, 224))
                img_np = np.array(img, dtype=np.float32)
                img_np = img_np.transpose((2, 0, 1))  # HWC -> CHW
                img_np = img_np / 255.0  # Normalize to [0, 1]
                self.calibration_data[i] = img_np

            # Copy the batch to device memory and hand TensorRT the pointer
            cuda.memcpy_htod(self.device_input, self.calibration_data)
            self.current_index += self.batch_size
            return [int(self.device_input)]

        def read_calibration_cache(self):
            if os.path.exists(self.cache_file):
                with open(self.cache_file, "rb") as f:
                    return f.read()
            return None

        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)

    calibrator = MyCalibrator("/data/calibration_images/")

    # Configure TensorRT builder
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = calibrator

    # Build the engine
    print("Building TensorRT engine with INT8 calibration...")
    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine:
        with open("model_int8.trt", "wb") as f:
            f.write(serialized_engine)
        print("INT8 engine built successfully.")
    else:
        print("Failed to build INT8 engine.")

The problem this solves is the accuracy degradation that happens when you naively convert floating-point weights and activations to lower-precision integers. INT8 inference is much faster and uses less memory, but the limited range and precision of integers can lead to significant errors if not handled carefully. Calibration is TensorRT’s way of finding the optimal scaling factors to map floating-point ranges to integer ranges, minimizing this accuracy loss.

Internally, TensorRT needs to understand the distribution of activation values for each layer in your network. For INT8 quantization, it determines a scaling factor per tensor. This scaling factor maps the observed floating-point activation range (e.g., -5.0 to 5.0) to the symmetric INT8 range of -127 to 127 (TensorRT uses symmetric quantization, so -128 is not used). The goal is to choose scaling factors such that the quantized activations most closely resemble the original floating-point activations.
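To make the mapping concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization. This is illustrative only; TensorRT derives the scale from calibration statistics across many batches, not from a single tensor's max:

```python
import numpy as np

def int8_symmetric_quantize(x):
    # Map the largest observed magnitude to 127, then round and clip.
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

activations = np.array([-5.0, -1.2, 0.3, 2.7, 5.0], dtype=np.float32)
q, scale = int8_symmetric_quantize(activations)
print(q)                     # [-127  -30    8   69  127]
print(dequantize(q, scale))  # close to the original values
```

Note that the round-trip error per value is at most half the scale, which is why a tighter (well-calibrated) range means less quantization noise.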

The calibrator object is where this happens. TensorRT calls get_batch repeatedly, asking for batches of data. Your calibrator reads these images, preprocesses them (resizing, normalization, etc. – crucial steps that must match your training preprocessing), and provides them to TensorRT. TensorRT then runs these batches through the network (in a special calibration mode) and collects statistics about the activation values in each layer. The read_calibration_cache and write_calibration_cache methods allow TensorRT to reuse previously computed calibration data, saving time on subsequent builds.

The different calibrator types (MinMax, Entropy, Legacy) primarily influence how TensorRT uses the collected activation distributions to derive the scaling factors. MinMax uses the absolute minimum and maximum observed values. Entropy uses a more sophisticated method that minimizes the KL divergence between the floating-point and quantized distributions. However, the quality and representativeness of the data provided by get_batch is paramount. If your calibration dataset doesn’t reflect the diversity and characteristics of the data your model will see in production, even the best calibration algorithm will fail to produce accurate INT8 results.
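A quick way to see why the range choice matters: on activations with a few large outliers, using the raw min/max stretches the scale and wastes precision, while clipping the outliers (the spirit of the entropy calibrator) reduces overall quantization error. A self-contained NumPy sketch, not TensorRT's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
# Mostly small activations plus two large outliers, a common real-world pattern
acts = np.concatenate([rng.normal(0.0, 1.0, 1_000_000), [30.0, -28.0]])

def quant_mse(x, clip):
    # Mean squared error after symmetric INT8 quantize/dequantize
    # with the given clipping threshold
    scale = clip / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.mean((q * scale - x) ** 2))

minmax_mse = quant_mse(acts, np.abs(acts).max())  # MinMax: cover everything
clipped_mse = quant_mse(acts, 4.0)                # clip the outliers instead
print(minmax_mse, clipped_mse)  # clipping yields lower error on this data
```

The clipping threshold of 4.0 here is hand-picked for the sketch; the entropy calibrator searches for such a threshold automatically by minimizing KL divergence.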

Here’s the one thing most people don’t know: if your calibration data is too small, or if it only covers a narrow subset of scenarios (e.g., only bright, clear images when your model should handle low-light conditions), your INT8 engine will likely perform poorly. TensorRT might pick scaling factors that are heavily skewed by outliers or by the limited range of your calibration set, leading to clipping or quantization noise that significantly degrades accuracy on real-world data. Aim for a calibration dataset that is at least as diverse as your expected inference data, ideally with thousands of samples.

The next concept you’ll run into is understanding and debugging INT8 accuracy regressions, which often involves comparing layer-wise activation distributions between FP32 and INT8 engines.
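When you reach that debugging stage, a simple first metric is the signal-to-quantization-noise ratio (SQNR) between dumped FP32 activations and dequantized INT8 activations for each layer. A hypothetical helper (the name and threshold are my own, not a TensorRT API):

```python
import numpy as np

def layer_sqnr_db(fp32_acts, int8_dequant_acts):
    # SQNR in dB for one layer's activations; layers with low values are
    # candidates for falling back to FP16/FP32 precision
    f = np.asarray(fp32_acts, dtype=np.float64)
    d = np.asarray(int8_dequant_acts, dtype=np.float64)
    signal = np.sum(f ** 2)
    noise = np.sum((f - d) ** 2)
    return 10.0 * np.log10(signal / max(noise, 1e-12))

# Example: simulate an INT8 round-trip with small quantization noise;
# a healthy layer typically scores well above 30 dB
fp32 = np.linspace(-1.0, 1.0, 1000)
int8 = np.round(fp32 * 127.0) / 127.0
print(round(layer_sqnr_db(fp32, int8), 1))
```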

Want structured learning?

Take the full TensorRT course →