The most surprising thing about TensorRT INT8 quantization is that it can sometimes improve accuracy, not just performance, by forcing your model to be more robust to small numerical perturbations.

Let’s see it in action. Imagine you have a trained FP32 model, resnet50.onnx. You want to quantize it to INT8 for faster inference on NVIDIA GPUs. The core of this process is calibration.

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

def build_engine(model_path, calibration_data, batch_size=1):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(EXPLICIT_BATCH)
    config = builder.create_builder_config()
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(model_path, 'rb') as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise RuntimeError("Failed to parse ONNX model")

    # Configure INT8 calibration
    config.set_flag(trt.BuilderFlag.INT8)
    config.int8_calibrator = MyCalibrator(calibration_data, batch_size=batch_size)

    # Build the engine
    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Failed to build TensorRT engine")

    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(serialized_engine)
    return engine

class MyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, calibration_data, batch_size=1):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batch_size = batch_size
        self.calibration_data = calibration_data
        self.current_index = 0

        # Allocate GPU memory for one batch (not the entire dataset)
        self.device_input = cuda.mem_alloc(
            self.calibration_data[0].nbytes * self.batch_size)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.current_index + self.batch_size > self.calibration_data.shape[0]:
            return None # No more batches

        batch = self.calibration_data[self.current_index : self.current_index + self.batch_size]
        # memcpy_htod needs a contiguous host buffer; guard against non-contiguous slices
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch, dtype=np.float32))
        self.current_index += self.batch_size
        return [self.device_input]

    def read_calibration_cache(self):
        # If you have a pre-existing calibration cache, load it here.
        # Otherwise, return None to generate a new one.
        return None

    def write_calibration_cache(self, cache):
        # Save the generated calibration cache.
        with open("calibration.cache", "wb") as f:
            f.write(cache)

# --- Example Usage ---
# Assume you have calibration data loaded into numpy array `calib_data`
# calib_data = np.random.rand(100, 3, 224, 224).astype(np.float32) # Example shape for ResNet50
# batch_size = 4
# engine = build_engine("resnet50.onnx", calib_data, batch_size=batch_size)
# print("Engine built successfully with INT8 calibration.")

Calibration is TensorRT’s way of figuring out how best to map the wide range of floating-point values in your FP32 model onto the much smaller range of INT8 values. It does this by feeding representative data through the network and observing the activation distributions. The MyCalibrator class above is the heart of this process: it supplies batches of calibration data to TensorRT. get_batch is called repeatedly until it returns None, which signals that the data is exhausted; get_batch_size tells TensorRT how many samples each batch contains; and read_calibration_cache / write_calibration_cache let you cache the calibration results, so you don’t have to re-calibrate every time if your model and data haven’t changed.
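To make that get_batch contract concrete, here is a minimal pure-Python sketch (no GPU, no TensorRT) of the loop TensorRT effectively runs against your calibrator. FakeCalibrator is a hypothetical stand-in for MyCalibrator that returns host arrays instead of device pointers; the batching logic is the same.

```python
import numpy as np

class FakeCalibrator:
    """Hypothetical stand-in for MyCalibrator: same batching logic, no GPU."""
    def __init__(self, data, batch_size=4):
        self.data, self.batch_size, self.current_index = data, batch_size, 0

    def get_batch(self):
        # Mirrors MyCalibrator.get_batch: return None once the data is exhausted.
        if self.current_index + self.batch_size > self.data.shape[0]:
            return None
        batch = self.data[self.current_index : self.current_index + self.batch_size]
        self.current_index += self.batch_size
        return batch

calib = FakeCalibrator(np.random.rand(10, 3, 8, 8).astype(np.float32), batch_size=4)
batches = []
while (b := calib.get_batch()) is not None:  # TensorRT loops roughly like this
    batches.append(b)
print(len(batches))  # 2 full batches; the trailing 2 samples are dropped
```

Note the edge case this exposes: with 10 samples and a batch size of 4, the last 2 samples are silently dropped, so size your calibration set as a multiple of the batch size if every sample matters.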

The problem this solves is that INT8 has only 256 representable values, while FP32 can represent roughly four billion distinct bit patterns. A naive mapping would severely crush the dynamic range of activations, leading to massive accuracy loss. Calibration finds a per-tensor scaling factor for each activation tensor that best preserves its distribution when quantized; weight scales, by contrast, are computed per-channel directly from the weights and need no calibration data. TensorRT’s entropy calibrator (hence trt.IInt8EntropyCalibrator2) chooses each activation scale by minimizing the KL divergence between the FP32 activation distribution and its quantized counterpart — essentially minimizing the information lost to quantization.
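A tiny NumPy sketch makes the scale-mapping concrete. This is the basic symmetric INT8 scheme (one scale per tensor, values mapped into [-127, 127]), not TensorRT’s internal implementation: the entropy calibrator’s job is precisely to pick a better scale than the naive abs-max used here when outliers would otherwise waste the INT8 range.

```python
import numpy as np

def quantize(x, scale):
    # Symmetric INT8: q = clip(round(x / scale), -127, 127)
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# A toy activation tensor with one large outlier.
acts = np.array([0.02, -0.5, 0.7, 1.1, -0.9, 6.0], dtype=np.float32)

# Naive per-tensor symmetric scale from the observed absolute max:
scale = np.abs(acts).max() / 127      # ~0.047: each INT8 step covers ~0.047
q = quantize(acts, scale)
recon = dequantize(q, scale)

# Round-trip error is bounded by half a quantization step for in-range values.
print(np.abs(recon - acts).max() <= scale / 2 + 1e-6)  # True
```

Notice that the single outlier (6.0) dictates the step size for the whole tensor, so the small values near zero get coarse resolution; clipping the range (what entropy calibration effectively decides for you) trades a little saturation error on outliers for finer resolution where most of the mass lives.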

The key levers you control are:

  • calibration_data: This is paramount. It must be representative of your actual inference data; if it doesn’t reflect the real-world inputs your model will see, your INT8 model’s accuracy will suffer. Diversity matters more than sheer volume — a few hundred representative samples are usually enough.
  • batch_size for calibration: A larger batch size can lead to more stable statistics for calibration but might also increase calibration time. Experiment to find a balance.
  • trt.BuilderFlag.INT8: This flag tells the builder to enable INT8 precision and the calibration process.
  • The calibrator implementation: trt.IInt8EntropyCalibrator2 is generally recommended for vision models; TensorRT also provides trt.IInt8MinMaxCalibrator (often a better fit for NLP workloads) and the older trt.IInt8LegacyCalibrator. Fully custom logic is rarely necessary.
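The first lever deserves a sketch: calibration samples must go through exactly the same preprocessing as your inference inputs, or the observed activation distributions will be wrong. The normalization constants and shapes below are illustrative assumptions for an ImageNet-style ResNet50 pipeline — substitute whatever your deployment actually does.

```python
import numpy as np

# Hypothetical preprocessing: must match your real inference pipeline exactly.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_for_calibration(images_hwc_uint8):
    """uint8 HWC images (N, 224, 224, 3) -> normalized NCHW float32 batch."""
    x = images_hwc_uint8.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD                 # broadcast over channels
    return np.ascontiguousarray(x.transpose(0, 3, 1, 2))   # NHWC -> NCHW

# In practice: a few hundred images sampled across your real input distribution.
raw = (np.random.rand(8, 224, 224, 3) * 255).astype(np.uint8)  # placeholder data
calib_data = preprocess_for_calibration(raw)
print(calib_data.shape, calib_data.dtype)  # (8, 3, 224, 224) float32
```

The resulting array can be passed straight to build_engine as calibration_data.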

The surprising part about occasional accuracy improvements comes from this: by forcing the model onto a limited range of values and finding scales that preserve the essential information, quantization can act as a mild form of regularization. It can make the model less sensitive to small perturbations in weights or activations, which sometimes translates into slightly better generalization on held-out data. Don’t count on it, but don’t be shocked by it either — it’s like keeping the most "essential" signal in your data and discarding some of the noise.

When you run this, TensorRT will iterate through your calibration data, compute activation statistics, and hand the resulting scaling factors to write_calibration_cache, which saves them to calibration.cache. On subsequent builds with the same model and data, TensorRT first calls read_calibration_cache; if that returns the cached bytes, the statistics-gathering pass is skipped entirely. (A serialized engine that you deserialize at inference time never calibrates at all — the scales are baked in.)
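The read_calibration_cache stub above always returns None, which forces recalibration on every build. A minimal file-backed version — plain Python, usable nearly verbatim as the bodies of the two cache methods — looks like this:

```python
import os

CACHE_PATH = "calibration.cache"

def read_calibration_cache():
    # Return the cached scales if present; None tells TensorRT to re-calibrate.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return f.read()
    return None

def write_calibration_cache(cache):
    # TensorRT hands us the serialized scales as bytes after calibration.
    with open(CACHE_PATH, "wb") as f:
        f.write(cache)

# Round-trip sanity check with placeholder bytes:
write_calibration_cache(b"fake-cache-bytes")
print(read_calibration_cache())  # b'fake-cache-bytes'
os.remove(CACHE_PATH)
```

One caveat: the cache is only valid for the same model and TensorRT version — delete calibration.cache whenever either changes, or you will silently reuse stale scales.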

Once you’ve successfully built your INT8 engine, the next hurdle is often dealing with layers that TensorRT can’t quantize automatically, leading to a mixed-precision engine or specific layer failures.

Want structured learning?

Take the full TensorRT course →