NVIDIA’s TensorRT is a powerful SDK for high-performance deep learning inference, but understanding why it’s fast (or not fast enough) requires looking under the hood.
Let’s say you’ve got a TensorRT engine running, and you suspect a performance bottleneck. You’ve tried optimizing your model, quantizing it, and maybe even tweaking TensorRT’s own builder configurations, but you’re still not hitting your latency targets. The next step is to profile the entire system that’s running your inference, not just the TensorFlow or PyTorch model itself. This is where NVIDIA Nsight Systems comes in.
Nsight Systems is a system-wide performance analysis tool. It lets you visualize the execution of your application across CPUs, GPUs, and other system components over time. It’s invaluable for pinpointing where time is spent, whether it’s in your application code, CUDA kernels, TensorRT’s internal operations, or even OS-level scheduling.
Here’s a typical workflow:
- **Install Nsight Systems:** Download and install the latest version from the NVIDIA developer website. Make sure your NVIDIA driver is also up to date.
- **Prepare Your Application:** Ensure your application is built with debug symbols enabled (`-g` for GCC/Clang) if you want to see C++ source code correlation. For CUDA, it’s often enabled by default, but check your build flags.
- **Launch Nsight Systems:** On Linux, open a terminal and run:

  ```bash
  nsys profile --trace=cuda,cudnn,osrt,nvtx -o my_profile_output <your_application> <your_application_args>
  ```

  - `--trace=cuda,cudnn,osrt,nvtx`: This is crucial.
    - `cuda`: Traces CUDA API calls and kernel executions.
    - `cudnn`: Traces cuDNN API calls, which TensorRT relies on heavily.
    - `osrt`: Traces operating system runtime events (such as thread scheduling and syscalls).
    - `nvtx`: Traces NVIDIA Tools Extension (NVTX) ranges. We’ll use this to mark our TensorRT inference calls.
  - `-o my_profile_output`: Specifies the output file name.
  - `<your_application> <your_application_args>`: Your actual inference application and its arguments.
- **Add NVTX Ranges (Optional but Highly Recommended):** To easily identify your inference loop within the Nsight Systems timeline, wrap your inference calls with NVTX.
  - C++ Example:

    ```cpp
    #include <nvtx3/nvToolsExt.h>

    // ... inside your inference loop ...
    nvtxRangePushA("TensorRT Inference");
    // Your TensorRT inference code here
    // (e.g., context->enqueueV2(...), then stream synchronization)
    nvtxRangePop();
    // ...
    ```

  - Python Example (using `pycuda` or direct C API wrappers): You’ll need to find a Python binding for NVTX or use a library that automatically instruments CUDA calls. If you’re using `torch2trt` or similar, it might already add NVTX ranges.
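If you don’t have a ready-made NVTX binding, a small context manager keeps the push/pop pair balanced even when the wrapped code raises. This is a minimal sketch: the `push` and `pop` callables are placeholders for whatever binding you use (for example, PyTorch exposes `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop`).

```python
from contextlib import contextmanager

@contextmanager
def nvtx_range(name, push, pop):
    """Wrap a code region in an NVTX range, guaranteeing the pop
    even if the body raises. `push`/`pop` are whatever binding you
    have, e.g. torch.cuda.nvtx.range_push / range_pop."""
    push(name)
    try:
        yield
    finally:
        pop()

# Usage with stand-in callables (swap in a real NVTX binding):
calls = []
with nvtx_range("TensorRT Inference", calls.append, lambda: calls.append("pop")):
    pass  # ... enqueue inference, synchronize ...
```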
- **Analyze the `.nsys-rep` File:**
  - Open the generated `my_profile_output.nsys-rep` file in the Nsight Systems GUI.
  - Timeline View: This is your primary canvas. You’ll see rows for CPU threads, GPU activities, CUDA API calls, etc.
  - Identify Your NVTX Range: Look for the "TensorRT Inference" range (or whatever you named it) on the timeline. It marks the total time spent in your inference operation.
  - Zoom In: Click and drag on the timeline to zoom into the area of interest.
  - GPU Activity: Observe the GPU rows. You’ll see kernels being launched, their durations, and the time spent waiting for the GPU.
  - CUDA API Calls: Look at the `CUDA` row. Are there long gaps between API calls? Is `cudaStreamSynchronize` taking a significant amount of time? That indicates your CPU is waiting for the GPU to finish.
  - CPU Threads: Examine the CPU threads involved in inference. Are they busy with computation, or are they waiting on locks, I/O, or synchronization primitives?
  - cuDNN Calls: If you traced `cudnn`, you’ll see calls like `cudnnConvolutionForward`. These are building blocks TensorRT uses, and you can see how long each of them takes.
Common Bottlenecks and How to Spot Them:
- **GPU Underutilization:**
  - Diagnosis: On the timeline, you’ll see large gaps of idle time on the GPU during your "TensorRT Inference" NVTX range. The GPU utilization metric in the summary might also be low.
  - Cause: The CPU is not feeding the GPU work fast enough. This could be due to inefficient data loading, pre-processing bottlenecks on the CPU, or a blocked CPU thread that is responsible for launching kernels.
  - Fix: Optimize your data loading pipeline (e.g., using `torch.utils.data.DataLoader` with multiple workers), move pre-processing to the GPU if possible, or ensure your inference submission loop is efficient.
  - Why it works: The GPU is a highly parallel processing unit. If it’s waiting for instructions or data, its power is wasted. Keeping it busy with actual work is key.
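To make the "keep the GPU fed" idea concrete, here is a minimal sketch of a prefetching loop: a background thread preprocesses samples into a bounded queue while the main thread drains it and submits inference. `preprocess` and `infer` are hypothetical stand-ins for your own callables.

```python
import queue
import threading

def run_pipeline(samples, preprocess, infer, depth=4):
    """Overlap CPU preprocessing with inference submission: a
    background thread fills a bounded queue while the main thread
    drains it, so the inference loop is never starved for input."""
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def producer():
        for s in samples:
            q.put(preprocess(s))   # CPU-side work happens here
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(infer(item))  # submit to the GPU here
    return results

# Stand-in callables for illustration:
out = run_pipeline(range(5), preprocess=lambda x: x * 2, infer=lambda x: x + 1)
```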
- **Excessive CUDA Synchronization:**
  - Diagnosis: Look for long `cudaStreamSynchronize` calls within the `CUDA` API row, especially if they fall within your NVTX inference range.
  - Cause: Your application is explicitly waiting for the GPU to finish its current work before proceeding. This often happens if you’re not overlapping computation with data transfers, or not using asynchronous operations correctly.
  - Fix: Ensure you’re using CUDA streams correctly. Launch kernels asynchronously and only synchronize when absolutely necessary (e.g., at the end of a batch to read back results, or before the next batch’s data transfer if there’s a dependency). Try to overlap data transfers (e.g., `cudaMemcpyAsync`) with kernel execution.
  - Why it works: `cudaStreamSynchronize` is a blocking call. By minimizing these calls, you allow the CPU to prepare the next set of work or perform other tasks while the GPU is still busy.
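A quick back-of-envelope model shows why overlapping transfers with compute pays off. This is a sketch, and the numbers are illustrative rather than measured:

```python
def serialized_ms(copy_ms, compute_ms, n_batches):
    # One stream, synchronizing after every batch:
    # each batch pays the full copy + compute cost.
    return n_batches * (copy_ms + compute_ms)

def overlapped_ms(copy_ms, compute_ms, n_batches):
    # Double buffering on two streams: batch i+1's H2D copy runs
    # while batch i computes, so steady-state throughput is bounded
    # by the slower stage; only the first batch pays both in full.
    return copy_ms + compute_ms + (n_batches - 1) * max(copy_ms, compute_ms)

serial = serialized_ms(2.0, 5.0, 10)    # 10 * 7 ms
overlap = overlapped_ms(2.0, 5.0, 10)   # 7 + 9 * 5 ms
```

With a 2 ms copy fully hidden behind a 5 ms kernel, the overlapped schedule approaches being compute-bound.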
- **CPU Pre-processing/Post-processing Bottlenecks:**
  - Diagnosis: Observe the CPU threads that execute your application code. If a specific thread shows very high CPU utilization and its activity directly precedes or follows the "TensorRT Inference" NVTX range, that’s your culprit.
  - Cause: Complex data augmentation, image resizing, normalization, or deserialization of model outputs on the CPU is taking longer than the GPU inference itself.
  - Fix:
    - Batching: Process multiple inference requests in parallel on the CPU to better utilize its cores.
    - GPU Acceleration: Offload pre-/post-processing steps to the GPU using libraries like CUDA or cuDNN, or by writing custom CUDA kernels.
    - Efficient Libraries: Use optimized libraries for image manipulation (e.g., OpenCV with CUDA support) or data handling.
  - Why it works: Moving computationally intensive tasks from a potentially overloaded CPU to the parallel processing power of the GPU, or simply making the CPU work more efficiently, reduces the overall time spent outside of GPU inference.
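As a sketch of the batching idea, the helper below fans batches out across a small worker pool. `transform` is a placeholder for your per-item work; note that for pure-Python CPU-bound transforms you would want a `ProcessPoolExecutor` or a library that releases the GIL (NumPy, OpenCV), since plain threads won’t parallelize Python bytecode.

```python
from concurrent.futures import ThreadPoolExecutor

def batches(items, batch_size):
    """Yield consecutive slices of `items` of length <= batch_size."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def preprocess_in_parallel(items, transform, batch_size=4, workers=2):
    """Fan batches out across CPU workers so preprocessing keeps
    pace with GPU inference. `transform` is whatever per-item work
    you do (resize, normalize, ...)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves batch order, so results line up with inputs
        done = pool.map(lambda b: [transform(x) for x in b],
                        batches(items, batch_size))
        return [x for b in done for x in b]

out = preprocess_in_parallel(list(range(10)), transform=lambda x: x * x)
```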
- **TensorRT Kernel Overheads / Small Kernels:**
  - Diagnosis: Within the GPU activity, you might see many very short-running kernels. The total time spent launching kernels, plus the overhead between them, can become significant.
  - Cause: TensorRT might be breaking complex operations down into many smaller kernels, or your model architecture leads to frequent, small kernel launches.
  - Fix:
    - TensorRT Builder Optimization: Experiment with `IOptimizationProfile` and `IBuilderConfig` settings. Ensure `setMemoryPoolLimit` is set appropriately.
    - Engine Fusion: TensorRT usually fuses layers to reduce kernel launch overhead. If it’s not fusing as expected, this could be due to incompatible layer types or specific builder configurations.
    - Model Architecture: Sometimes a fundamental change in model structure (e.g., fewer, larger convolutional layers) can help.
  - Why it works: Reducing the number of kernel launches and increasing the work per kernel reduces the overhead associated with GPU scheduling and kernel invocation.
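A hedged sketch of those builder knobs via the TensorRT Python API (API names as of TensorRT 8.x; the tensor name `"input"` and the shapes are placeholders for your own network):

```python
import tensorrt as trt  # assumes TensorRT 8.x Python bindings are installed

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Give the tactic selector enough workspace to choose fused kernels
# (1 GiB here; tune for your GPU).
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# An optimization profile tells the builder which input shapes to
# tune for; "input" and the NCHW shapes below are illustrative.
profile = builder.create_optimization_profile()
profile.set_shape("input",
                  min=(1, 3, 224, 224),
                  opt=(8, 3, 224, 224),
                  max=(32, 3, 224, 224))
config.add_optimization_profile(profile)
```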
- **Memory Bandwidth Limitations:**
  - Diagnosis: While less directly visible without specific GPU counters enabled, high GPU utilization coupled with slow kernel execution times, and particularly long `cudaMemcpy` operations, can indicate memory bandwidth as a bottleneck. Nsight Systems can often show memory transfer rates.
  - Cause: The GPU is spending too much time waiting for data to be transferred to or from its global memory. This is common in models with large weight matrices or large feature maps.
  - Fix:
    - Quantization: Using INT8 or FP16 precision significantly reduces memory footprint and bandwidth requirements.
    - Kernel Optimization: TensorRT’s kernel auto-tuner tries to select the best kernels for your hardware, which can affect memory access patterns.
    - Data Layout: Ensure your tensors are in the optimal layout (e.g., NCHW vs. NHWC) for the operations being performed.
  - Why it works: By reducing the amount of data that needs to be moved, or by moving it more efficiently, the GPU can spend more time on computation.
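Some quick arithmetic makes the bandwidth argument concrete; the feature-map shape below is just an example:

```python
def tensor_bytes(shape, bytes_per_element):
    """Bytes needed to store one tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

# A single 1x256x56x56 feature map (NCHW), at different precisions:
shape = (1, 256, 56, 56)
fp32 = tensor_bytes(shape, 4)   # 4 bytes per element
fp16 = tensor_bytes(shape, 2)   # half the traffic of FP32
int8 = tensor_bytes(shape, 1)   # a quarter of the traffic of FP32
```

Every tensor read or written moves proportionally less data at FP16 or INT8, which is exactly the bandwidth relief quantization buys.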
- **CPU Thread Scheduling / Contention:**
  - Diagnosis: In the `OS Runtime` traces, you might see threads frequently being put to sleep, swapped out, or waiting on locks (`mutex` waits). This shows up as gaps in CPU thread activity.
  - Cause: The CPU threads responsible for managing inference are being preempted by other system processes, or they are contending for shared resources.
  - Fix:
    - Affinity: Pin inference threads to specific CPU cores using `taskset` or `sched_setaffinity`.
    - Prioritization: Increase the priority of your inference threads.
    - Reduce Concurrency: If too many threads are fighting for resources, reducing the number of worker threads can sometimes improve performance.
  - Why it works: Ensuring your critical inference threads have dedicated CPU time and minimal interruptions allows them to execute without being starved by the operating system scheduler or other processes.
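On Linux, pinning can also be done from inside the process. A minimal sketch using the standard library (the call is Linux-only):

```python
import os

def pin_to_core(core_id):
    """Linux-only: restrict the calling process (pid 0 = ourselves)
    to a single CPU core, much like launching under
    `taskset -c <core_id>`."""
    os.sched_setaffinity(0, {core_id})
    return os.sched_getaffinity(0)

# Pin to the first core we are currently allowed to run on:
first_core = min(os.sched_getaffinity(0))
allowed = pin_to_core(first_core)
```

For per-thread pinning in C++, `pthread_setaffinity_np` is the analogous call.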
After addressing these, you might find your next bottleneck is in the network if you’re running a distributed inference setup, or perhaps in the storage layer if you’re loading models dynamically.