TensorRT sparsity isn’t about making your models smaller; it’s about making them run faster by exploiting the pattern of zeros in their weights.

Let’s see this in action. Imagine we have a dense layer and a sparse layer.

import torch
import torch.nn as nn
import tensorrt as trt
import numpy as np

# --- Create a simple dense model ---
class DenseModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1024, 1024)

    def forward(self, x):
        return self.fc(x)

# --- Create a sparse model (simulated) ---
# In practice, sparsity comes from pruning. Here, we'll manually create zero weights.
class SparseModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1024, 1024)
        # Simulate pruning by setting a large percentage of weights to zero
        with torch.no_grad():
            mask = torch.rand_like(self.fc.weight) > 0.8 # 80% sparsity
            self.fc.weight.data.mul_(mask)

    def forward(self, x):
        return self.fc(x)

dense_model = DenseModel()
sparse_model = SparseModel()

# --- TensorRT Inference ---
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)

# --- Build for Dense Model ---
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network_dense = builder.create_network(EXPLICIT_BATCH)
config_dense = builder.create_builder_config()
parser_dense = trt.OnnxParser(network_dense, TRT_LOGGER)

# Export dense model to ONNX (requires onnx package)
# torch.onnx.export(dense_model, torch.randn(1, 1024), "dense_model.onnx", verbose=False)
# with open("dense_model.onnx", "rb") as model:
#     if not parser_dense.parse(model.read()):
#         print('ERROR: Failed to parse the ONNX file (dense)')
#         for error in range(parser_dense.num_errors):
#             print(parser_dense.get_error(error))

# --- Build for Sparse Model ---
network_sparse = builder.create_network(EXPLICIT_BATCH)
config_sparse = builder.create_builder_config()
# Sparse kernels must be explicitly enabled; TensorRT will not use them otherwise.
config_sparse.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
parser_sparse = trt.OnnxParser(network_sparse, TRT_LOGGER)

# Export sparse model to ONNX
# torch.onnx.export(sparse_model, torch.randn(1, 1024), "sparse_model.onnx", verbose=False)
# with open("sparse_model.onnx", "rb") as model:
#     if not parser_sparse.parse(model.read()):
#         print('ERROR: Failed to parse the ONNX file (sparse)')
#         for error in range(parser_sparse.num_errors):
#             print(parser_sparse.get_error(error))

# In a full script you would parse both ONNX files and call
# builder.build_serialized_network(network, config) for each.
# The key is that TensorRT exploits sparsity only when you opt in with
# trt.BuilderFlag.SPARSE_WEIGHTS, and only for layers whose weights follow
# the 2:4 structured pattern (at most 2 nonzeros in every group of 4
# consecutive weights), which maps onto the Sparse Tensor Cores of
# Ampere-class and newer GPUs.

print("TensorRT engine building set up for both dense and sparse models.")
print("With SPARSE_WEIGHTS set, the optimizer picks sparse kernels for eligible layers.")

# To actually *see* the difference, you'd build both engines and time them.
# For a properly 2:4-pruned model, TensorRT uses kernels that skip the zeroed
# half of each weight group, roughly doubling math throughput for those
# layers and reducing weight memory traffic.

This code sets up two identical models, one dense and one where we’ve manually zeroed out 80% of the weights to simulate pruning. One caveat: a random mask like this illustrates the idea, but TensorRT’s sparse kernels require the stricter 2:4 structured pattern, with at most two nonzeros in every group of four consecutive weights. The speedup happens when TensorRT builds an engine with the SPARSE_WEIGHTS flag set. It doesn’t just load the weights; it analyzes them, and for each layer whose weights match the supported pattern it can select specialized, faster kernels.

The problem TensorRT Sparsity solves is that traditional deep learning hardware and software are optimized for dense matrix operations. When you prune a model (setting many weights to zero), you create a sparse matrix. Processing a sparse matrix with dense operations is wasteful: you still perform multiplications and additions with zeros, consuming compute and memory bandwidth for no gain. TensorRT Sparsity identifies these zero weights and, where possible, uses algorithms that skip these zero-valued computations.
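To make the waste concrete, here is a small NumPy sketch (the variable names and the 80% figure are illustrative, matching the simulated model above): a dense kernel performs the same number of multiply-adds whether or not the weights are zero, so without sparse kernels, pruning changes the answer but not the cost.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))
x = rng.standard_normal((1024,))

# Zero out roughly 80% of the weights, as aggressive pruning would.
mask = rng.random(W.shape) > 0.8
W_sparse = W * mask

# A dense kernel performs rows * cols multiply-adds regardless of zeros.
dense_flops = 2 * W.size
# A sparse kernel could, in principle, touch only the nonzeros.
useful_flops = 2 * np.count_nonzero(W_sparse)

print(f"dense multiply-adds:  {dense_flops}")
print(f"useful multiply-adds: {useful_flops}")
print(f"wasted fraction:      {1 - useful_flops / dense_flops:.2f}")
```

Running this shows roughly 80% of the dense kernel’s work is spent multiplying by zero, which is exactly the headroom sparse kernels target.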

Internally, TensorRT checks for sparsity during the engine-building phase, but only when the SPARSE_WEIGHTS builder flag is set. It inspects the weight tensors of layers such as convolutions and matrix multiplications, and if a layer’s weights satisfy the 2:4 structured pattern, TensorRT may choose a sparse kernel for that layer. These kernels run on the Sparse Tensor Cores of Ampere-class and newer GPUs, skipping the zeroed half of each weight group and roughly doubling math throughput while also cutting weight memory traffic. The final choice is made by the TensorRT optimizer, which times candidate kernels and keeps the fastest valid one, so a sparse kernel is used only where it actually wins.
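As a mental model, the per-layer eligibility test can be sketched in plain Python (the function name is made up; TensorRT performs this check internally and per hardware kernel):

```python
def is_2_4_sparse(weights, group=4, max_nonzeros=2):
    """Check the 2:4 structured-sparsity pattern: every group of `group`
    consecutive weights contains at most `max_nonzeros` nonzeros.
    Illustrative sketch only; TensorRT's real check is internal."""
    for i in range(0, len(weights) - group + 1, group):
        block = weights[i:i + group]
        if sum(1 for w in block if w != 0.0) > max_nonzeros:
            return False
    return True

print(is_2_4_sparse([0.5, 0.0, 0.0, -1.2,  0.0, 0.3, 0.9, 0.0]))  # True
print(is_2_4_sparse([0.5, 0.7, 0.1, 0.0]))  # False: 3 nonzeros in one group
```

This is why the randomly masked model above, despite being 80% sparse overall, is not guaranteed to qualify: a single group of four with three surviving weights makes the layer ineligible.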

The primary lever you control is the pruning strategy applied to your model before exporting it to ONNX or another format TensorRT can ingest. TensorRT itself doesn’t prune; it accelerates already-pruned models. Crucially, the pattern of sparsity matters more than the amount. Unstructured sparsity (randomly scattered zeros) is generally not accelerated, because the dense kernels still read and multiply every weight. What the hardware path supports is 2:4 structured sparsity, where each group of four consecutive weights contains at most two nonzeros; tools like NVIDIA’s Automatic SParsity (ASP) library prune models into exactly this pattern and fine-tune them to recover accuracy.
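A minimal sketch of magnitude-based 2:4 pruning, in plain Python so the mechanics stay visible (a hypothetical helper, not a tuned recipe; real workflows would use a library such as ASP and then fine-tune):

```python
def prune_2_4(weights):
    """For each group of 4 consecutive weights, keep the 2 largest by
    magnitude and zero the rest. Assumes len(weights) % 4 == 0.
    Illustrative magnitude pruning; ties are broken arbitrarily."""
    pruned = []
    for i in range(0, len(weights), 4):
        block = weights[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(block[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(block))
    return pruned

print(prune_2_4([0.9, -0.1, 0.05, -1.3,  0.2, 0.4, -0.6, 0.1]))
# → [0.9, 0.0, 0.0, -1.3, 0.0, 0.4, -0.6, 0.0]
```

Every output group has exactly two nonzeros, so a check like the 2:4 pattern test always passes; in practice you would apply this per row of each weight matrix and retrain to recover the lost accuracy.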

What most people don’t realize is that, once the flag is set, sparsity detection and kernel selection are fully automated. You don’t manually tell TensorRT "use a sparse kernel for this layer." Instead, you provide a 2:4-pruned model, enable SPARSE_WEIGHTS, and TensorRT’s optimizer, as part of its search for the fastest execution plan, automatically selects sparse kernels wherever the pattern qualifies and the kernel offers a measured performance benefit. This means the acceleration is largely "free" once you’ve done the pruning and built the engine.

The next concept you’ll run into is how to effectively prune your models to maximize TensorRT’s sparsity benefits without significantly hurting accuracy.

Want structured learning?

Take the full TensorRT course →