Quantization in TensorRT-LLM isn’t just about making models smaller; it’s about unlocking performance by using lower-precision numbers without sacrificing much accuracy.

Let’s see AWQ in action. Imagine a large language model. We’re going to load it using TensorRT-LLM, but with AWQ quantization applied.

import os

import tensorrt_llm
from tensorrt_llm.quantization import QuantMode
from tensorrt_llm.models import PretrainedConfig

# Assume 'model_dir' points to a Hugging Face model directory
model_dir = "path/to/your/hf/model"
config = PretrainedConfig.from_json_file(os.path.join(model_dir, "config.json"))

# Load the model with AWQ quantization.
# NOTE: this is a conceptual sketch; `LLMEngine.from_pretrained`, `quant_mode`,
# and `quant_policy` are illustrative placeholders, not the exact
# TensorRT-LLM API. The goal: INT4 weights and INT8 activations.
llm_engine = tensorrt_llm.LLMEngine.from_pretrained(
    model_dir,
    quant_mode=QuantMode.AWQ,
    quant_policy={
        "weight_dtype": "int4",
        "activation_dtype": "int8"
    },
    # other engine build parameters...
)

# Now, when you run inference with llm_engine,
# the computations will leverage the INT4 weights and INT8 activations.
# The TensorRT-LLM runtime handles the dequantization and computation.

The core problem AWQ addresses is that naive low-bit quantization often degrades accuracy on large models. Its key observation is that a small fraction of weights matters far more than the rest, and that these "salient" weights are best identified by the magnitude of the activations they multiply, not by the weights themselves. Rather than keeping salient weights in higher precision, which would fragment the compute into hardware-unfriendly mixed-precision kernels, AWQ applies a per-channel scaling factor before quantization so those channels land on the quantization grid with less error. This selective protection means you get the speedup of lower precision for all weights while preserving the ones that matter most.
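To make the idea concrete, here is a toy numpy sketch (not TensorRT-LLM code) of group-wise INT4 quantization with one scale shared across a group. Channel 0 is "salient": its weight is tiny but the activation flowing through it is huge. Scaling that weight up before quantization, and folding the inverse into the activation, leaves the math unchanged but cuts the output error:

```python
import numpy as np

def quantize_group(w):
    # Symmetric round-to-nearest INT4 with ONE scale shared by the whole group.
    step = np.abs(w).max() / 7.0
    return np.clip(np.round(w / step), -8, 7) * step

# One quantization group. Channel 0 is "salient": its weight is tiny but the
# activation multiplying it is huge, so it dominates the layer's output.
w = np.array([0.03, 0.9, -0.7, 0.8, -0.6, 0.5, 0.95, -0.85])
x = np.array([50.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

ref = x @ w
err_naive = abs(ref - x @ quantize_group(w))  # w[0] rounds all the way to 0

# AWQ-style protection: scale the salient weight up by s before quantizing
# and fold 1/s into its activation. The exact product is unchanged, but the
# channel now lands on finer steps of the shared INT4 grid.
s = np.ones_like(w)
s[0] = 8.0
err_awq = abs(ref - (x / s) @ quantize_group(w * s))
```

Real AWQ derives these per-channel scales (and clipping thresholds) automatically from calibration data; the fixed `s[0] = 8.0` here is purely for illustration.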

SmoothQuant, on the other hand, takes a different approach. Instead of selectively protecting weights, it makes the activation distribution more amenable to quantization by smoothing out the activation outliers and migrating that difficulty into the weights, which tolerate it far better.

import os

import tensorrt_llm
from tensorrt_llm.quantization import QuantMode
from tensorrt_llm.models import PretrainedConfig

# Assume 'model_dir' points to a Hugging Face model directory
model_dir = "path/to/your/hf/model"
config = PretrainedConfig.from_json_file(os.path.join(model_dir, "config.json"))

# Load the model with SmoothQuant quantization.
# NOTE: this is a conceptual sketch; `LLMEngine.from_pretrained`, `quant_mode`,
# and `quant_policy` are illustrative placeholders, not the exact
# TensorRT-LLM API. The goal: INT8 weights and INT8 activations.
llm_engine = tensorrt_llm.LLMEngine.from_pretrained(
    model_dir,
    quant_mode=QuantMode.SMOOTHQUANT,
    quant_policy={
        "weight_dtype": "int8",
        "activation_dtype": "int8"
    },
    # other engine build parameters...
)

# During the SmoothQuant calibration phase (often done during engine building),
# TensorRT-LLM analyzes activation distributions. It identifies outlier values
# and applies a smoothing factor to reduce their dynamic range.
# This factor is then applied to the corresponding weights, effectively
# "moving" the outlier problem from activations to weights, which are
# generally more robust to quantization.

SmoothQuant starts from the observation that activation outliers are a primary cause of INT8 quantization error. During calibration it identifies the input channels whose activation values are unusually high and divides those activations by a per-channel smoothing factor, lowering their peak values. The same factor is multiplied into the corresponding weight columns, so the layer's output is mathematically unchanged. The net effect is that the activations become "smoother" (their dynamic range shrinks) and quantize to INT8 with minimal accuracy loss, while the weights absorb a modest increase in range that they are robust enough to handle.
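The transform itself is tiny. Here is a toy numpy sketch of the per-channel smoothing, with `alpha = 0.5` as the migration-strength knob from the SmoothQuant paper (all names here are illustrative, not TensorRT-LLM API):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))   # 16 tokens, 4 input channels
X[:, 0] *= 40.0                # channel 0 carries the activation outliers
W = rng.normal(size=(4, 8))    # linear layer: 4 inputs, 8 outputs

# Per-input-channel smoothing factor. alpha controls how much difficulty
# migrates from activations to weights; 0.5 is the paper's common default.
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_s = X / s            # smoothed activations (outlier channel shrinks)
W_s = W * s[:, None]   # the same factor folded into the weight rows
```

The product `X_s @ W_s` equals `X @ W` up to floating-point rounding, but the activation tensor's dynamic range (and therefore its INT8 quantization error) is much smaller.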

The key difference is where the effort is focused. AWQ identifies and protects critical weights, while SmoothQuant smooths out the activation distribution so that weights and activations alike become easy to quantize. In practice, AWQ targets INT4 weights (activations often remain FP16, though W4A8 variants exist), while SmoothQuant targets INT8 for both weights and activations; either way the lower precision brings significant performance gains.

The surprising part about these quantization schemes is how little accuracy they give up at such low precision, INT4 included. It’s not just about reducing the number of bits; it’s about intelligently distributing the quantization error. AWQ, by measuring weight importance through activation statistics, ensures that the most impactful weights retain enough effective precision to prevent catastrophic accuracy drops. SmoothQuant, by taming activation outliers, makes the whole matrix multiplication "quantization-friendly" so that both operands can be aggressively reduced in precision.
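A quick back-of-the-envelope numpy check shows why INT4 is so much less forgiving than INT8 for the same tensor, and hence why the error has to be steered away from the weights that matter most:

```python
import numpy as np

def mean_quant_error(x, bits):
    # Symmetric round-to-nearest quantization at the given bit width.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    xq = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
    return np.abs(x - xq).mean()

rng = np.random.default_rng(0)
w = rng.normal(size=100_000)   # a stand-in weight tensor

err_int8 = mean_quant_error(w, 8)
err_int4 = mean_quant_error(w, 4)
# Dropping from 8 to 4 bits grows the step size by roughly 127/7 (about 18x),
# and the average rounding error grows with it.
```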

When you build a TensorRT-LLM engine with these quantization modes, TensorRT itself takes over. It analyzes the model graph, identifies the layers and weight matrices suitable for quantization, and applies the chosen scheme. For AWQ, this involves identifying per-channel or per-group scaling factors and clipping thresholds during the calibration phase. For SmoothQuant, it involves calculating the smoothing factors for activations and applying the inverse to weights. The resulting engine then uses specialized kernels that can perform computations directly on the quantized weights and activations, significantly reducing memory bandwidth and compute requirements.
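The value of those per-channel scaling factors is easy to demonstrate. In this toy numpy sketch (illustrative only), one scale shared across the whole tensor forces small-magnitude channels onto a grid sized for the largest channel, while per-channel scales give each row its own grid:

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows = output channels; per-channel magnitudes vary wildly, as in real LLMs.
W = rng.normal(size=(4, 256)) * np.array([0.01, 0.1, 1.0, 10.0])[:, None]

def int4_dequant(W, scale):
    # Symmetric round-to-nearest INT4 quantize/dequantize round trip.
    return np.clip(np.round(W / scale), -8, 7) * scale

scale_tensor = np.abs(W).max() / 7.0                        # one scale for all
scale_channel = np.abs(W).max(axis=1, keepdims=True) / 7.0  # one per row

err_tensor = np.abs(W - int4_dequant(W, scale_tensor)).mean()
err_channel = np.abs(W - int4_dequant(W, scale_channel)).mean()
```

With the shared scale, the two smallest channels quantize almost entirely to zero; per-channel scales recover them at no extra bit cost beyond storing one scale per row.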

The choice between AWQ and SmoothQuant often comes down to the specific model architecture and the desired trade-off between accuracy and performance. Some models might benefit more from AWQ’s targeted weight protection, while others might see greater gains from SmoothQuant’s global activation smoothing. It’s also worth noting that both methods are typically applied during the engine building process, not at runtime. You select the quantization mode when you compile your model into a TensorRT engine.

A detail that often gets overlooked is how the choice of calibration dataset can influence the effectiveness of both AWQ and SmoothQuant. The statistical properties of the data used during the calibration phase directly impact the calculated scaling factors and outlier thresholds. A dataset that doesn’t accurately reflect the real-world inference distribution can lead to suboptimal quantization, even with advanced techniques.
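A small numpy experiment makes the point: calibrate an INT8 scale on a distribution narrower than real traffic, and the real outliers get clipped (function names here are illustrative):

```python
import numpy as np

def calib_scale(samples, qmax=127):
    # Per-tensor symmetric INT8 scale from the calibration set's max magnitude.
    return np.abs(samples).max() / qmax

def int8_roundtrip(x, scale, qmax=127):
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
calib = rng.normal(0.0, 1.0, size=10_000)   # calibration set: narrow distribution
real = rng.normal(0.0, 3.0, size=10_000)    # real traffic: much wider

# A scale calibrated on unrepresentative data clips the real outliers.
err_mismatched = np.abs(real - int8_roundtrip(real, calib_scale(calib))).mean()
err_matched = np.abs(real - int8_roundtrip(real, calib_scale(real))).mean()
```

The mismatched scale produces a far larger mean error, dominated by clipping rather than rounding, which is exactly the failure mode a representative calibration set avoids.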

The next step after mastering quantization is often exploring different tensor parallelism strategies within TensorRT-LLM to further distribute the workload across multiple GPUs.

Want structured learning?

Take the full TensorRT course →