Depthwise separable convolutions are a cornerstone of efficient deep learning, but TensorRT’s optimization of them can feel like a black box.

Let’s look at a typical inference scenario. Imagine a MobileNetV2 model running on an NVIDIA GPU with TensorRT. We’re processing a batch of 16 images, each 224x224 pixels with 3 color channels, through a layer employing a depthwise separable convolution (by this depth in the network, the stem convolution has already expanded the image’s 3 channels to 32).

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

# Model parameters (simplified for illustration)
filters_in = 32    # channels entering the depthwise stage (after the stem conv)
filters_out = 64
kernel_size = 3
stride = 1
padding = 1
input_shape = (16, filters_in, 224, 224)  # depthwise requires C_in == num_groups

# Create a TensorRT logger
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Build a TensorRT engine
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()

# Input tensor
input_tensor = network.add_input(name="input", dtype=trt.float32, shape=input_shape)

# Depthwise convolution
depthwise_kernel_shape = (filters_in, 1, kernel_size, kernel_size)
depthwise_kernel = trt.Weights(np.random.rand(*depthwise_kernel_shape).astype(np.float32))
depthwise_layer = network.add_convolution_nd(
    input=input_tensor,
    num_output_maps=filters_in,
    kernel_shape=(kernel_size, kernel_size),
    kernel=depthwise_kernel,
    bias=trt.Weights())  # depthwise layers typically have no bias
depthwise_layer.stride_nd = (stride, stride)
depthwise_layer.padding_nd = (padding, padding)
depthwise_layer.num_groups = filters_in  # crucial for depthwise

# Pointwise convolution
pointwise_kernel_shape = (filters_out, filters_in, 1, 1)
pointwise_kernel = trt.Weights(np.random.rand(*pointwise_kernel_shape).astype(np.float32))
pointwise_layer = network.add_convolution_nd(
    input=depthwise_layer.get_output(0),
    num_output_maps=filters_out,
    kernel_shape=(1, 1),
    kernel=pointwise_kernel,
    bias=trt.Weights(np.random.rand(filters_out).astype(np.float32)))

# Mark output
network.mark_output(pointwise_layer.get_output(0))

# Build the engine (TensorRT 8+: serialize, then deserialize with a runtime)
serialized_engine = builder.build_serialized_network(network, config)
runtime = trt.Runtime(TRT_LOGGER)
engine = runtime.deserialize_cuda_engine(serialized_engine)

# Execute the engine (simplified; allocate_buffers comes from the `common`
# helper module shipped with the TensorRT Python samples)
context = engine.create_execution_context()
inputs, outputs, bindings, stream = common.allocate_buffers(engine)
# ... (fill inputs with data, execute, retrieve outputs)

This code snippet illustrates the core structure: a depthwise convolution where num_groups equals the number of input channels, followed by a pointwise convolution. TensorRT’s optimization magic happens when it fuses these operations.

The fundamental problem depthwise separable convolutions solve is reducing computational cost. A standard convolution of size K x K with C_in input channels and C_out output channels has a complexity of O(K*K*C_in*C_out*H*W). A depthwise separable convolution breaks this into two steps:

  1. Depthwise Convolution: Applies a single filter per input channel. Complexity: O(K*K*C_in*H*W).
  2. Pointwise Convolution: A 1x1 convolution to combine information across channels. Complexity: O(C_in*C_out*H*W).

The total complexity is roughly O(K*K*C_in*H*W + C_in*C_out*H*W), which is significantly less than the standard convolution, especially for large kernel sizes (K) and channel counts (C_in, C_out).
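To make the savings concrete, a few lines of arithmetic with the layer dimensions from the example above show the reduction factor, which works out to roughly 1/(1/C_out + 1/K²):

```python
def conv_macs(k, c_in, c_out, h, w):
    # multiply-accumulates for a standard KxK convolution
    return k * k * c_in * c_out * h * w

def separable_macs(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w   # one KxK filter per input channel
    pointwise = c_in * c_out * h * w   # 1x1 cross-channel mixing
    return depthwise + pointwise

# Parameters from the example layer above
k, c_in, c_out, h, w = 3, 32, 64, 224, 224
standard = conv_macs(k, c_in, c_out, h, w)
separable = separable_macs(k, c_in, c_out, h, w)
print(f"reduction factor: {standard / separable:.2f}x")  # ~7.89x for this layer
```

For a 3x3 kernel and 64 output channels, the separable form needs roughly an eighth of the multiply-accumulates of the standard convolution.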

TensorRT’s primary optimization for depthwise separable convolutions is layer fusion. It recognizes the pattern of a depthwise convolution immediately followed by a pointwise convolution and fuses them into a single, highly optimized kernel. This avoids intermediate memory writes and reads, leading to substantial performance gains. The specific fusion strategy depends on the target GPU architecture, TensorRT version, and available optimizations (like INT8 or FP16 precision).

The internal representation within TensorRT might transform the DepthwiseConv -> PointwiseConv sequence into a single FusedConv or Convolution layer with specific parameters that achieve the same result. This fusion is automatically handled by the builder. You don’t explicitly tell TensorRT to fuse them; it detects the pattern.
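To see why the pattern is fusable in principle, note that when no nonlinearity sits between them, a depthwise convolution followed by a 1x1 pointwise convolution is mathematically equivalent to a single standard convolution whose weights are the product of the two filter sets. The NumPy sketch below verifies that identity on toy data; it illustrates the algebra only, not TensorRT’s actual fused kernel:

```python
import numpy as np

def conv2d(x, w):
    # Naive "valid" convolution: x is (C_in, H, W), w is (C_out, C_in, K, K)
    c_out, c_in, k, _ = w.shape
    h_out, w_out = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h_out, w_out))
    for o in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                y[o, i, j] = np.sum(w[o] * x[:, i:i + k, j:j + k])
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5))       # C_in = 2
dw = rng.standard_normal((2, 1, 3, 3))   # one 3x3 filter per channel
pw = rng.standard_normal((3, 2, 1, 1))   # 1x1 pointwise, C_out = 3

# Depthwise: each channel convolved with its own filter (num_groups == C_in)
depth = np.stack([conv2d(x[c:c + 1], dw[c:c + 1])[0] for c in range(2)])
# Pointwise: 1x1 convolution mixes information across channels
separable = conv2d(depth, pw)

# Fused equivalent: one standard convolution whose weights are the
# product of the pointwise and depthwise filters
fused_w = pw[:, :, 0, 0, None, None] * dw[None, :, 0]   # (C_out, C_in, K, K)
fused = conv2d(x, fused_w)

assert np.allclose(separable, fused)
```

In practice TensorRT keeps the two weight sets separate and generates an interleaved kernel rather than folding the weights, but the identity shows why the pair can be treated as one layer.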

The "levers" you control are primarily the network definition itself and the build configuration.

  • Network Definition: Ensuring the depthwise convolution has num_groups set to the number of input channels is non-negotiable. The kernel_size for the depthwise part should be greater than 1 (e.g., 3x3), and the pointwise part must have a 1x1 kernel.
  • Build Configuration:
    • config.set_flag(trt.BuilderFlag.FP16) or config.set_flag(trt.BuilderFlag.INT8): Using lower precision can significantly speed up computation, and TensorRT is highly optimized for these modes with depthwise separable convolutions.
    • config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, ...) (or config.max_workspace_size on TensorRT versions before 8.4): While not directly controlling the depthwise separable conv fusion, a larger workspace can enable more aggressive optimizations and fusions.
    • config.avg_timing_iterations: Setting this helps TensorRT profile layers accurately, which can influence its fusion decisions.

The most surprising thing about TensorRT’s optimization of depthwise separable convolutions is how it amortizes the cost of the pointwise 1x1 convolution. When fused, the 1x1 convolution isn’t computed as a separate GEMM (General Matrix Multiply) operation after the depthwise part has written its output. Instead, the pointwise weights are effectively broadcast or interleaved within the processing of the depthwise output, allowing the GPU’s compute units to work on both aspects of the separable convolution simultaneously. This avoids the memory bandwidth bottleneck that a naive sequential execution would incur for the 1x1 step.

If you’re using onnxruntime with the TensorRT execution provider and observe that your depthwise separable convolutions aren’t benefiting as much as expected, double-check that the ONNX graph correctly represents the depthwise convolution with group attribute set to the input channel count and the subsequent pointwise convolution as a 1x1 convolution. Sometimes, framework exporters can misrepresent this structure, preventing TensorRT from recognizing the pattern for fusion.

Want structured learning?

Take the full TensorRT course →