The core problem is that your TensorRT inference engine is producing NaN (Not a Number) or Inf (Infinity) values, which corrupt your model’s output and degrade accuracy. This usually indicates numerical instability somewhere in the computation graph.
## Common Causes and Fixes for NaN/Inf in TensorRT
1. **Numerical Instability in Operations**: Certain operations, especially those involving logarithms, divisions, or exponentiations of very small or very large numbers, can easily produce `NaN` or `Inf`.
   - **Diagnosis**: Examine your model’s layers. Layers like `Log`, `Exp`, `Softmax`, `Sigmoid`, or divisions by potentially zero values are prime suspects. If you can, inspect the intermediate outputs of these layers with a debugger or by logging values during inference.
   - **Fix**: For `Log`, add a small epsilon: `log(x + epsilon)` instead of `log(x)`. For division, ensure the denominator is never zero: `x / (y + epsilon)`. `Softmax` and `Sigmoid` usually have numerical stability built in, but custom implementations must handle edge cases themselves. A common epsilon value is `1e-6` or `1e-7`.
   - **Why it works**: A small epsilon keeps the input to sensitive functions from becoming exactly zero or negative (for `log`) and keeps denominators away from zero, avoiding undefined mathematical operations.
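The epsilon guards above can be sketched in plain Python (framework-agnostic; `EPS` is the `1e-6` value suggested above, and the stable softmax shown is the standard max-subtraction trick, not TensorRT-specific code):

```python
import math

EPS = 1e-6  # small constant from the text above; tune per model

def safe_log(x: float, eps: float = EPS) -> float:
    """log(x + eps): keeps the argument strictly positive (assumes x >= 0)."""
    return math.log(x + eps)

def safe_div(x: float, y: float, eps: float = EPS) -> float:
    """x / (y + eps): keeps the denominator away from zero (assumes y >= 0)."""
    return x / (y + eps)

def stable_softmax(logits: list) -> list:
    """Subtract the max before exponentiating so exp() cannot overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With these guards, `safe_log(0.0)` and `safe_div(1.0, 0.0)` return finite values, and `stable_softmax` stays finite even for very large logits where a naive `exp` would overflow.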
2. **Quantization Issues**: If you’re using INT8 or FP16 quantization, the reduced precision can exacerbate numerical issues that might not appear in FP32.
   - **Diagnosis**: Run inference in FP32 mode. If the `NaN`/`Inf` values disappear, quantization is likely the culprit. Check your calibration data: if it doesn’t adequately represent the range of values seen during actual inference, the quantization scales can be incorrect.
   - **Fix**:
     - **Calibration data**: Ensure your calibration dataset is diverse and representative of real-world inference inputs. A common mistake is using too small a calibration set, or one that doesn’t cover extreme values.
     - **Activation clipping**: Some quantization techniques allow clipping activation ranges. If your model has very wide activation ranges, clipping them to a reasonable bound (e.g., `[-10, 10]`) before quantization can help.
     - **Re-run calibration**: Re-run the calibration process with the improved data or settings.
   - **Why it works**: Correctly scaled quantization prevents values from being mapped outside the representable range of the lower-precision type, or to values that become unstable in subsequent layers.
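A quick, hedged way to sanity-check calibration coverage is to compare the calibration set’s value range against samples seen in production (plain-Python sketch; `calib_values` and `prod_values` are hypothetical flattened input or activation samples, not part of any TensorRT API):

```python
def range_coverage(calib_values, prod_values):
    """Report how many production values fall outside the calibration range.

    Values outside [calib_min, calib_max] get clipped or mis-scaled by INT8
    quantization scales derived from the calibration set.
    """
    calib_min, calib_max = min(calib_values), max(calib_values)
    outside = [v for v in prod_values if v < calib_min or v > calib_max]
    return {
        "calib_range": (calib_min, calib_max),
        "outliers": len(outside),
        "outlier_frac": len(outside) / len(prod_values),
    }

# Hypothetical example: calibration never saw values beyond ~3.9, but
# production inputs reach 50 and go negative -> scales will be wrong there.
report = range_coverage([0.1, 1.2, 3.9], [0.5, 2.0, 50.0, -1.0])
```

A high `outlier_frac` is a strong hint that the calibration set needs to be extended before re-running calibration.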
3. **Large Input Values**: Extremely large input values can propagate through the network, leading to overflow in subsequent layers, especially those with large weights or saturating activation functions.
   - **Diagnosis**: Profile your input data and look for outliers or unusually large values. You can also log the maximum absolute activation value after each layer during inference.
   - **Fix**: Normalize or clip your input data to a reasonable range before feeding it into the model. For example, if your input is expected to be in `[0, 1]` but you’re seeing values up to `1000`, scale them down.
   - **Why it works**: Prevents intermediate computations from exceeding the maximum representable floating-point value, thus avoiding `Inf`.
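The normalize-or-clip fix for the `[0, 1]` example above can be sketched like this (plain Python; in production you would normalize by your data’s known expected range, not the per-batch peak used here for illustration):

```python
def preprocess(values, lo=0.0, hi=1.0):
    """Scale non-negative inputs by their peak, then clip into [lo, hi]."""
    peak = max(values)
    if peak <= 0:
        peak = 1.0  # avoid dividing by zero on an all-zero batch
    return [min(max(v / peak, lo), hi) for v in values]
```

For instance, a batch with values up to `1000` is mapped back into `[0, 1]` before it can overflow downstream layers.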
4. **Batch Size**: While less common, certain operations can become numerically unstable at specific batch sizes, especially if they interact poorly with parallel processing or memory layouts.
   - **Diagnosis**: Test inference with different batch sizes (e.g., 1, 2, 4, 8). If `NaN`/`Inf` appears only at certain batch sizes, this is a strong indicator.
   - **Fix**: If a specific batch size is problematic, adjust your inference pipeline to use one that doesn’t trigger the issue. Batch size 1 is often the most stable for debugging.
   - **Why it works**: Different batch sizes can take different execution paths or memory access patterns, which can expose or mask underlying numerical sensitivities.
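The batch-size sweep can be automated with a small harness (sketch; `run_inference` stands in for a hypothetical wrapper around your TensorRT execution context, faked below purely for illustration):

```python
import math

def has_nonfinite(outputs):
    """True if any output value is NaN or Inf."""
    return any(not math.isfinite(v) for v in outputs)

def sweep_batch_sizes(run_inference, sample, batch_sizes=(1, 2, 4, 8)):
    """Run the same sample at several batch sizes and report the ones
    whose outputs contain NaN/Inf."""
    return [bs for bs in batch_sizes
            if has_nonfinite(run_inference([sample] * bs))]

# Fake inference function for illustration: misbehaves only at batch size 4.
def fake_infer(batch):
    return [float("inf")] * len(batch) if len(batch) == 4 else [0.5] * len(batch)
```

Here `sweep_batch_sizes(fake_infer, 0.0)` isolates batch size 4 as the problematic configuration; with a real engine you would plug in your own inference call.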
5. **Custom Layers or Operations**: If your TensorRT engine includes custom layers (plugins) that are not fully robust, they can be a source of `NaN`/`Inf`.
   - **Diagnosis**: If you’re using custom plugins, first try disabling them or replacing them with equivalent standard layers where possible. If the issue persists with standard layers, the problem is likely elsewhere; if it appears only with the custom plugins, focus your debugging there.
   - **Fix**: Thoroughly review your plugin implementation for numerical stability. Ensure every operation handles edge cases (division by zero, log of zero, etc.) gracefully, and test the plugin in isolation with boundary input values.
   - **Why it works**: Custom code bypasses the rigorous testing and numerical-stability hardening that TensorRT’s optimized built-in operations receive.
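Testing plugin math in isolation might look like this (plain-Python sketch; `plugin_normalize` is a hypothetical stand-in for a plugin’s internal computation, not a real TensorRT plugin API):

```python
import math

def plugin_normalize(v, eps=1e-12):
    """Hypothetical plugin logic: divide a vector by its L2 norm.
    The eps guard handles the all-zero boundary case gracefully."""
    norm = math.sqrt(sum(x * x for x in v)) + eps
    return [x / norm for x in v]

# Boundary cases that commonly expose NaN/Inf bugs in custom kernels.
BOUNDARY_CASES = [
    [0.0, 0.0, 0.0],   # all zeros: 0/0 without the eps guard
    [1e30, 1e30],      # very large magnitudes
    [-1e-30, 1e-30],   # tiny magnitudes near underflow
]

for case in BOUNDARY_CASES:
    assert all(math.isfinite(x) for x in plugin_normalize(case))
```

Running the same boundary set against the plugin and a trusted reference implementation is a cheap way to catch edge-case divergence before it corrupts full-model output.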
6. **TensorRT Version or Build Configuration**: Rarely, a bug in a specific TensorRT version or a particular build configuration (e.g., compiler flags) can lead to numerical issues.
   - **Diagnosis**: Check the release notes for your TensorRT version for known numerical-stability issues. Try building your engine with different optimization profiles or precision modes (e.g., `FP16` vs. `TF32` vs. `FP32`).
   - **Fix**: Consider upgrading or downgrading your TensorRT version, and ensure your build environment (CUDA, cuDNN, compiler) is compatible with the TensorRT version you’re using.
   - **Why it works**: Updates often include bug fixes and numerical-stability improvements in core kernels.
The next error you’ll likely encounter after fixing NaN/Inf is a slight discrepancy in output values compared to your original framework (like PyTorch or TensorFlow), often due to the inherent precision differences between frameworks and TensorRT’s optimizations.
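When you reach that stage, compare outputs with a tolerance rather than exact equality (sketch; the tolerance values are illustrative and depend on precision mode, with FP16 typically needing looser bounds than FP32):

```python
import math

def outputs_match(ref, trt, rel_tol=1e-3, abs_tol=1e-5):
    """Element-wise tolerant comparison between reference-framework
    outputs (e.g., PyTorch/TensorFlow) and TensorRT outputs."""
    return all(math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
               for a, b in zip(ref, trt))
```

Small relative discrepancies are expected after kernel fusion and precision changes; only failures at a generous tolerance point to a real bug.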