SmoothQuant makes LLMs run faster by enabling 8-bit quantization of both weights and activations, but it only works if you get the per-channel scaling factors just right.
Let’s see SmoothQuant in action with a quick example. Imagine we have a simple linear layer in a PyTorch model that we want to quantize using SmoothQuant.
import torch
import torch.nn as nn
import tensorrt as trt
# Assume a simple model with a linear layer
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(128, 128)

    def forward(self, x):
        return self.linear(x)
model = SimpleModel()
model.eval()
# --- SmoothQuant Activation ---
# This is a simplified representation of the SmoothQuant process.
# In a real scenario, you'd compute these scales based on calibration data.
# Simulate per-channel activation scales (one per input feature).
# In practice these are per-channel absolute maxima of the activations,
# collected over calibration data. Higher values mean a larger activation
# range in that channel.
activation_scales = torch.full((128,), 2.0)
# Apply SmoothQuant to the weights of the linear layer.
# The goal is to migrate the "difficulty" of quantization from activations
# to weights: each activation channel is divided by a factor s_j, and the
# matching input channel of the weight matrix is multiplied by the same
# factor, so the layer's output is mathematically unchanged. SmoothQuant
# computes s_j = max|X_j|^alpha / max|W_j|^(1-alpha), where alpha is the
# migration strength (typically 0.5).
alpha = 0.5
weight_scales = model.linear.weight.abs().max(dim=0).values.clamp(min=1e-5)
smooth_scales = activation_scales.pow(alpha) / weight_scales.pow(1 - alpha)
# Scale the input channels (columns) of the weight matrix.
model.linear.weight.data *= smooth_scales.unsqueeze(0)
# The matching division of the activations by smooth_scales is normally
# folded into the preceding LayerNorm or linear layer.
# --- TensorRT Quantization ---
# Now, we build a TensorRT engine with INT8 quantization enabled.
# TensorRT will use the already "smoothed" weights.
# Create a dummy input
dummy_input = torch.randn(1, 128).cuda()
# Build the TensorRT engine
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(EXPLICIT_BATCH)
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 28) # 256MiB
# Add the linear layer to the TensorRT network.
# The weights have already been smoothed by SmoothQuant; we pass them as
# float32 constants, and TensorRT quantizes them to INT8 for its kernels.
# Define input tensor
input_tensor = network.add_input("input", trt.float32, (1, 128))
# Convert the smoothed PyTorch weights to a TensorRT Weights object.
# We pass float32 data; TensorRT performs the INT8 quantization itself.
weight_data = model.linear.weight.detach().cpu().numpy()
trt_weights = trt.Weights(weight_data)
# Create the linear layer. add_fully_connected was deprecated and removed
# in TensorRT 10, so we express y = x @ W^T as a matrix multiply against a
# constant weight tensor.
weight_const = network.add_constant((128, 128), trt_weights)
fc_layer = network.add_matrix_multiply(
    input_tensor, trt.MatrixOperation.NONE,
    weight_const.get_output(0), trt.MatrixOperation.TRANSPOSE,
)
fc_layer.name = "smoothquant_fc"
# Request INT8 for this layer. TensorRT only honors per-layer precision
# when BuilderFlag.OBEY_PRECISION_CONSTRAINTS (or PREFER_...) is also set.
fc_layer.precision = trt.int8
# Define output tensor
output_tensor = fc_layer.get_output(0)
output_tensor.name = "output"
network.mark_output(output_tensor)
# Set up quantization configuration for INT8
# This is crucial. Without it, TensorRT might not use INT8 for the layer.
# For INT8, TensorRT typically uses its own calibration mechanism.
# SmoothQuant aims to make this calibration process more effective.
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
# In a real scenario, you'd provide an INT8 calibrator here:
# config.int8_calibrator = MyCalibrator(...)
# Without a calibrator, TensorRT needs explicit dynamic ranges on the
# activation tensors. We set rough hand-picked ranges to demonstrate the
# intent; these would normally come from calibration statistics.
input_tensor.dynamic_range = (-4.0, 4.0)
output_tensor.dynamic_range = (-64.0, 64.0)
# Build the engine (build_engine is deprecated in TensorRT 8 in favor of
# build_serialized_network, but keeps this example short)
engine = builder.build_engine(network, config)
# --- Inference ---
# Now we can run inference with the INT8 engine.
# The weights in the engine are quantized to INT8.
if engine:
    print("TensorRT engine built successfully with INT8 precision.")
    context = engine.create_execution_context()
    # Look up binding indices (TensorRT 10 replaces these with
    # name-based tensor APIs)
    input_binding_idx = engine.get_binding_index("input")
    output_binding_idx = engine.get_binding_index("output")
    # Allocate device memory for input and output
    d_input = torch.empty(1, 128, device="cuda", dtype=torch.float32)
    d_output = torch.empty(1, 128, device="cuda", dtype=torch.float32)
    # Copy input data to device
    d_input.copy_(dummy_input)
    # Execute inference
    bindings = [None] * engine.num_bindings
    bindings[input_binding_idx] = d_input.data_ptr()
    bindings[output_binding_idx] = d_output.data_ptr()
    context.execute_v2(bindings=bindings)
    # Copy output data back to host
    output_data = d_output.cpu().numpy()
    print("Inference completed. Output shape:", output_data.shape)
else:
    print("Failed to build TensorRT engine.")
The core idea behind SmoothQuant is to shift the burden of quantization from activations to weights. Activation distributions in LLMs contain large per-channel outliers that are hard to represent with low precision, while the weights are comparatively well behaved. Rather than quantizing the raw activations, SmoothQuant divides each activation channel by a smoothing factor and multiplies the corresponding weight channel by the same factor, leaving the layer's output mathematically unchanged. The scaled activations become much easier to quantize, and the weights, although their range grows slightly, remain tractable, so both can be quantized aggressively (to INT8) without significant accuracy degradation. TensorRT can then quantize these smoothed tensors more effectively, leading to faster inference. The "magic" is that by carefully choosing these scaling factors, you can maintain accuracy while gaining speed.
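The identity at the heart of this migration can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact recipe: the shapes are arbitrary and the scale choice is a simplified abs-max rule.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activations: 4 tokens, 8 channels
X[:, 3] *= 50.0                   # channel 3 carries large outliers
W = rng.standard_normal((8, 8))   # weights: (out_features, in_features)

# Simplified per-channel smoothing factor from activation abs-maxima
s = np.maximum(np.abs(X).max(axis=0), 1e-5) ** 0.5

Y_original = X @ W.T
Y_smoothed = (X / s) @ (W * s).T  # divide activations, scale weight columns

print(np.allclose(Y_original, Y_smoothed))  # True: an exact reparameterization
```

Because the division and multiplication cancel channel by channel, the smoothed network computes exactly the same function; only the quantization behavior changes.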
The critical levers you control are the calibration data and the smoothing-factor calculation. The calibration data is used to gather statistics about the activation distributions; from these statistics, you compute per-channel scaling factors. The formula from the SmoothQuant paper is s_j = max|X_j|^α / max|W_j|^(1−α), where max|X_j| is the observed absolute maximum of activation channel j, max|W_j| is the absolute maximum of the matching weight column, and α (the migration strength, typically 0.5) controls how much quantization difficulty is shifted onto the weights. Activations are then divided by s_j and weights multiplied by s_j.
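The scale computation can be sketched as follows, assuming calibration statistics are simply per-channel absolute maxima collected over a batch (real pipelines aggregate over many batches):

```python
import numpy as np

rng = np.random.default_rng(1)
calib_acts = rng.standard_normal((256, 16))  # calibration activations
calib_acts[:, 5] *= 30.0                     # one outlier channel
W = rng.standard_normal((16, 16))            # weights: (out_features, in_features)

alpha = 0.5                                  # migration strength
act_max = np.abs(calib_acts).max(axis=0)     # max|X_j| per input channel
w_max = np.abs(W).max(axis=0)                # max|W_j| per input channel
scales = act_max**alpha / np.maximum(w_max, 1e-5)**(1 - alpha)

# The outlier channel receives the largest factor, so its activations are
# shrunk the most before quantization.
print(int(scales.argmax()))
```

Raising α pushes more of the dynamic range into the weights; α = 0.5 splits the difficulty evenly and is the usual starting point.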
The most surprising thing about SmoothQuant is that the smoothing is a mathematically exact reparameterization: it decouples the difficulty of quantizing weights from that of quantizing activations, enabling a robust W8A8 (8-bit weights and 8-bit activations) strategy that significantly boosts inference performance in LLMs.
When you look at the implementation, you'll see that SmoothQuant essentially performs a pre-processing step on the model's weights before they are fed into a quantization-aware training framework or a tool like TensorRT. This pre-processing involves calculating per-channel scaling factors from activation statistics collected during a calibration phase. Each input channel j of a weight matrix W is multiplied by s_j, and the matching division of the activations by s_j is folded into the preceding operation (typically a LayerNorm or linear layer), so no extra work happens at runtime. This transformation flattens the activation outliers while only modestly widening the weight distribution, enabling INT8 quantization without the accuracy drop typically associated with quantizing LLM activations directly.
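As a sketch of that pre-processing step on a real nn.Linear, here is the transform applied in place, with the activation division simulated by dividing the input directly (in a full model it would be folded into the preceding layer). The helper name smooth_linear_ is illustrative, not from any library:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def smooth_linear_(linear: nn.Linear, act_max: torch.Tensor, alpha: float = 0.5):
    """Scale the input channels of `linear` in place; return the per-channel
    factors the caller must divide the activations by."""
    w_max = linear.weight.abs().max(dim=0).values.clamp(min=1e-5)
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)
    linear.weight.data *= s.unsqueeze(0)  # scale columns (input channels)
    return s

linear = nn.Linear(8, 4)
x = torch.randn(3, 8)
x[:, 2] *= 40.0                           # outlier channel

y_ref = linear(x)                         # output before smoothing
s = smooth_linear_(linear, x.abs().amax(dim=0))
y_smooth = linear(x / s)                  # fold the division into the input

print(torch.allclose(y_ref, y_smooth, atol=1e-4, rtol=1e-4))  # True
```

Note that the bias is untouched: only the input channels of the weight are rescaled, so the transform composes cleanly with whatever produced x.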
The exact way TensorRT ingests these "smoothed" weights is through its builder API. You provide the modified weights, and if INT8 precision is enabled via config.set_flag(trt.BuilderFlag.INT8), TensorRT will attempt to quantize these weights to INT8 using its internal calibration mechanisms or pre-quantized data. The success of SmoothQuant hinges on TensorRT being able to effectively quantize these pre-conditioned weights.
The one thing most people don't know is that the "smoothing" is applied per input channel, aligning with the structure of linear (and convolutional) layers where different channels have vastly different activation ranges. This fine-grained scaling is crucial for maximizing the benefits of SmoothQuant across diverse LLM architectures.
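The difference is easy to see with abs-max statistics alone. This sketch contrasts a single per-tensor factor with per-channel factors when one channel dominates; with per-tensor scaling, every small channel is divided by the outlier's range, crushing its resolution:

```python
import numpy as np

rng = np.random.default_rng(2)
acts = rng.standard_normal((512, 8))
acts[:, 0] *= 100.0                      # channel 0 dominates the range

per_channel = np.abs(acts).max(axis=0)   # one factor per channel
per_tensor = np.abs(acts).max()          # one factor for the whole tensor

# After per-channel scaling every channel stays near full scale; after
# per-tensor scaling the non-outlier channels collapse toward zero.
print((acts / per_channel).std(axis=0).round(3))
print((acts / per_tensor).std(axis=0).round(3))
```

The per-tensor factor equals the largest per-channel factor, so it is only appropriate for the outlier channel itself; all other channels are over-shrunk by roughly the outlier's magnitude.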
The next step after successfully implementing SmoothQuant with TensorRT is often exploring mixed-precision inference, where some layers might remain in FP16 while others are in INT8, to find the optimal balance between performance and accuracy.