The most surprising thing about TensorRT production monitoring is that the most critical performance indicators often aren’t found in the typical application logs, but buried deep within the NVIDIA driver’s statistics.

Let’s see what that looks like in action. Imagine you’ve deployed a real-time object detection model using TensorRT on an NVIDIA Jetson AGX Xavier. You’re seeing intermittent latency spikes, and your application logs just show "Inference complete" with varying timestamps. To understand why, you need to reach into the GPU’s own world.

Here’s a typical workflow. First, make sure nvidia-smi is available; it ships with the NVIDIA driver, so on Debian/Ubuntu systems it arrives with the driver’s utilities package. (On Jetson boards the driver is bundled with JetPack and tegrastats is the native equivalent, but the workflow below is the same in spirit.) This is your primary tool for interacting with the GPU driver.

sudo apt-get update
sudo apt-get install nvidia-utils-535  # match the number to your installed driver branch

Now, to get a real-time stream of metrics, you’d run:

watch -n 0.5 nvidia-smi -q -d UTILIZATION,MEMORY,POWER,TEMPERATURE

This command, run every half-second (-n 0.5), queries (-q) the driver for detailed diagnostics (-d) on GPU utilization, memory usage, power draw, and temperature. You’ll see output like this, constantly updating:

        <output snipped for brevity>
        .
        .
        .
        Utilization
            Gpu                     : 85 %
            Memory                  : 30 %
            Encoder                 : 0 %
            Decoder                 : 0 %
        FB Memory Usage
            Total                   : 8192 MiB
            Used                    : 2450 MiB
            Free                    : 5730 MiB
        Power Readings
            Power Draw              : 12.50 W
            Default Power Limit     : 15.00 W
        Temperature
            GPU Current Temp        : 65 C
        .
        .
        .

This raw data is your starting point. To build a robust monitoring system, you’ll want to scrape these metrics and push them to a time-series database like Prometheus. You can use a simple script to parse nvidia-smi output or leverage existing exporters like dcgm-exporter.
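If you go the simple-script route, nvidia-smi’s machine-readable query mode (--query-gpu with --format=csv) is far easier to parse than the human-oriented -q output. Here is a minimal sketch; the field list and the decision to poll once per call are choices, not requirements:

```python
import subprocess

# Fields to request from the driver; these are standard --query-gpu field names.
FIELDS = ["utilization.gpu", "memory.used", "power.draw", "temperature.gpu"]

def parse_line(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, (float(v) for v in values)))

def sample_gpu_metrics():
    """Poll the driver once; returns one metrics dict per GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [parse_line(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    for gpu in sample_gpu_metrics():
        print(gpu)
```

From here, exposing the parsed dicts through a Prometheus client library is only a few more lines.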

Here’s a snippet of how you might parse and expose metrics for Prometheus using dcgm-exporter (which simplifies this significantly):

First, install dcgm-exporter: https://developer.nvidia.com/dcgm

Then tell it which metrics to collect. Rather than a key/value config file, dcgm-exporter reads a CSV file of DCGM field identifiers; a minimal counters file covering the metrics above might look like this:

# counters.csv (format: DCGM field, Prometheus metric type, help text)
DCGM_FI_DEV_GPU_UTIL,    gauge, GPU utilization (in %)
DCGM_FI_DEV_FB_USED,     gauge, Framebuffer memory used (in MiB)
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W)
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C)

And then run the exporter, pointing it at the counters file and a listen address:

dcgm-exporter -f counters.csv -a :9445

Now, Prometheus can scrape http://your-gpu-host:9445/metrics.
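On the Prometheus side, all that remains is a scrape job pointing at the exporter; the job name, interval, and hostname below are placeholders:

```
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "dcgm"
    scrape_interval: 5s
    static_configs:
      - targets: ["your-gpu-host:9445"]
```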

The core problem TensorRT monitoring solves is understanding why your inference isn’t meeting its latency targets or throughput requirements. It’s not just about the CPU or network; the GPU is a complex processing unit with its own bottlenecks.

Internally, TensorRT compiles your model into a sequence of optimized CUDA kernels. When you run inference, those kernels execute on the SMs (Streaming Multiprocessors) of the GPU. The GPU utilization metric reports the fraction of the sampling window in which at least one kernel was executing, not how fully the SMs were occupied, but it is still a useful first signal. If it’s consistently high (e.g., >95%), your model is likely maxing out the GPU’s compute capacity. If it’s low but latency is high, the bottleneck is elsewhere: perhaps memory bandwidth, or CPU pre/post-processing.
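That decision process can be sketched as a toy triage function; the thresholds and suggested remedies are illustrative rules of thumb, not canonical values:

```python
def triage(gpu_util_pct, latency_ms, latency_target_ms):
    """First-pass diagnosis from GPU utilization and observed latency.

    Thresholds are rough rules of thumb; tune them per model and GPU.
    """
    if latency_ms <= latency_target_ms:
        return "healthy"
    if gpu_util_pct > 95:
        # SMs saturated: the model itself is the bottleneck.
        return "compute-bound: consider FP16/INT8, a smaller model, or a bigger GPU"
    # GPU idle while latency is high: work isn't reaching the GPU fast enough.
    return "likely CPU pre/post-processing, data loading, or memory bandwidth"

print(triage(gpu_util_pct=98, latency_ms=40, latency_target_ms=25))
```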

GPU memory usage is crucial: large models or large batch sizes can exhaust VRAM, leading to performance degradation or outright OOM errors. Power draw and temperature are indicators of thermal throttling; if the GPU hits its thermal limit, it reduces its clock speeds, directly impacting performance.

The levers you control are primarily:

  • Batch Size: Increasing batch size can improve throughput by better utilizing GPU parallelism, but it also increases VRAM usage and can increase latency per inference if the GPU becomes compute-bound.
  • Model Precision: FP16 or INT8 precision (if supported by your model and hardware) can significantly boost performance and reduce VRAM usage compared to FP32.
  • TensorRT Builder Optimization Level: the builder exposes an optimization level (setBuilderOptimizationLevel, 0–5 in recent releases) that trades engine build time for runtime performance.
  • Hardware: The specific GPU model dictates its compute, memory, and power capabilities.
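The interaction between the first two levers is easy to quantify with back-of-envelope arithmetic. The element count below is purely hypothetical, but the bytes-per-element ratios are exact:

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}

def activation_mib(batch_size, elements_per_sample, precision):
    """Rough activation memory for one batch, in MiB.

    elements_per_sample is model-dependent; 50M below is illustrative only.
    """
    total_bytes = batch_size * elements_per_sample * BYTES_PER_ELEMENT[precision]
    return total_bytes / (1024 ** 2)

# Hypothetical model with 50M activation elements per sample, batch size 8:
for prec in ("fp32", "fp16", "int8"):
    print(prec, round(activation_mib(8, 50_000_000, prec)), "MiB")
```

Halving precision halves activation VRAM, which is often what lets you double the batch size without hitting the memory ceiling.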

When setting up alerts, a few patterns cover most failure modes:

  • Sustained high utilization (e.g., DCGM_FI_DEV_GPU_UTIL > 98 for 5m) signals a compute bottleneck that may require model optimization or a hardware upgrade.
  • Low GPU utilization coupled with high latency points instead to data loading, pre/post-processing, or inefficient kernel launches.
  • High temperature (DCGM_FI_DEV_GPU_TEMP > 80 for 2m) indicates thermal throttling from cooling issues or insufficient ventilation.
  • Framebuffer usage above 90% for 2m warns of impending OOM errors.
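Expressed as Prometheus alerting rules, those thresholds look like this (metric names assume dcgm-exporter’s default DCGM field naming):

```
# gpu-alerts.yml (Prometheus rule file)
groups:
  - name: gpu
    rules:
      - alert: GpuComputeSaturated
        expr: DCGM_FI_DEV_GPU_UTIL > 98
        for: 5m
      - alert: GpuThermalThrottleRisk
        expr: DCGM_FI_DEV_GPU_TEMP > 80
        for: 2m
      - alert: GpuMemoryNearlyFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
        for: 2m
```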

A key insight often missed is that nvidia-smi reports utilization averaged over its sampling interval. For very short, spiky workloads, you can miss brief periods of 100% utilization entirely if your polling interval is too long. Profilers like Nsight Systems or Nsight Compute (successors to the legacy nvprof) are essential for diagnosing behavior at that scale, but for production monitoring, nvidia-smi and its exported metrics provide the necessary high-level view.
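A quick simulation makes the averaging effect concrete; the workload shape is invented for illustration:

```python
def average_utilization(samples, window):
    """Average consecutive 1 ms utilization samples over `window`-ms polls."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples), window)]

# Spiky workload: 100% busy for 5 ms, idle for 45 ms, repeated for 1 second.
samples = ([100] * 5 + [0] * 45) * 20

# A 1000 ms poll reports a calm 10% average...
print(average_utilization(samples, 1000))     # [10.0]
# ...while a 10 ms poll catches 50% peaks within the busy bursts.
print(max(average_utilization(samples, 10)))  # 50.0
```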

The next challenge is correlating these GPU metrics with application-level performance, often leading to distributed tracing implementations.

Want structured learning?

Take the full TensorRT course →