Marlin and AWQ are two complementary techniques that let you run large language models (LLMs) on less hardware by storing their weights in INT4: AWQ is a quantization algorithm, while Marlin is a highly optimized GPU kernel for serving 4-bit models. The real appeal is that they achieve this with minimal accuracy loss, and often a net speed gain.

Let’s see Marlin in action: first producing an INT4 model, then serving it with vLLM, which uses the Marlin kernels under the hood.

First, we need to install vLLM and the necessary dependencies.

pip install vllm

Now, let’s produce an INT4 version of a small model, say facebook/opt-125m. vLLM itself doesn’t ship a quantization CLI; instead, you quantize offline with a library such as AutoGPTQ (or llm-compressor) and then load the result in vLLM, which automatically selects its Marlin kernels for compatible GPTQ checkpoints (4-bit, symmetric quantization, group size 128 or per-channel). One caveat about the desc_act (activation-order) option: it reorders weight columns by activation magnitude during quantization to improve accuracy, but the original Marlin kernel requires it to be disabled, so we quantize with desc_act=False here.

Once quantized, you can load and run inference with this model using the vLLM engine.

from vllm import LLM, SamplingParams

llm = LLM(model="opt-125m-marlin")
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

This script initializes an LLM with our Marlin-served model and generates text for a few prompts. Inference is fast even on consumer-grade hardware because the INT4 weights cut memory traffic roughly 4x; the arithmetic still runs in FP16 after fast on-the-fly dequantization, so the win comes from bandwidth, not from fewer FLOPs.

AWQ (Activation-aware Weight Quantization) takes a different approach. It is still weight-only quantization, but it analyzes activations on a calibration set to determine which weight channels matter most to the model’s output, and protects those channels from aggressive quantization error.

Here’s how you might quantize a model using AWQ. AWQ requires a calibration dataset to measure activation magnitudes; the widely used AutoAWQ library bundles a default calibration set, so a basic run needs no extra data. As with Marlin, vLLM itself doesn’t provide a quantization command — it only serves checkpoints that were quantized offline.

Then, loading and running inference with an AWQ-quantized model is identical to the Marlin example:

from vllm import LLM, SamplingParams

llm = LLM(model="opt-125m-awq") # Assuming you have an AWQ model saved here
prompts = [
    "Explain the concept of quantum entanglement in simple terms:",
    "Write a short story about a robot who discovers emotions:",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The core problem both Marlin and AWQ solve is the massive memory footprint and computational cost of LLMs. Standard FP16 or BF16 models require significant VRAM, making them inaccessible for many users. Quantization to INT4 reduces the model size by 4x, but naive INT4 quantization often leads to a drastic drop in accuracy and perplexity.
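The arithmetic behind the 4x figure, for a hypothetical 7B-parameter model (weights only — activations and the KV cache are extra):

```python
params = 7e9

fp16_gb = params * 2 / 1e9    # 2 bytes per weight
int4_gb = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight

# Group-wise scales add a small overhead: one FP16 scale per 128 weights.
scales_gb = (params / 128) * 2 / 1e9

print(f"FP16: {fp16_gb:.1f} GB, INT4: {int4_gb:.2f} GB (+{scales_gb:.2f} GB of scales)")
```

So a model that needs a 16 GB GPU in FP16 fits comfortably in under 4 GB of weight memory at INT4, before the modest scale overhead.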

Marlin achieves its speed through a carefully engineered mixed-precision GEMM: activations stay in FP16 while weights stay in INT4 until the last possible moment. The kernel stores weights in a custom interleaved layout matched to the GPU’s tensor cores, pipelines global-to-shared memory loads asynchronously, and fuses dequantization directly into the matrix multiplication, so the INT4-to-FP16 conversion is effectively hidden behind memory transfers. This is how it approaches the theoretical 4x bandwidth-bound speedup over FP16, even at moderate batch sizes.
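The group-wise INT4 storage format that this builds on can be sketched in plain NumPy. This illustrates only the storage and dequantization math; the real kernels do the dequantize step inside a fused CUDA GEMM:

```python
import numpy as np

GROUP_SIZE = 128  # one FP16 scale per 128 consecutive weights

def quantize_int4(w):
    """Symmetric group-wise INT4: each group shares a scale; values land in [-7, 7]."""
    groups = w.reshape(-1, GROUP_SIZE)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q, scales):
    """What a W4A16 kernel does on the fly before multiplying by FP16 activations."""
    return (q * scales).astype(np.float32).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, scales = quantize_int4(w)
w_hat = dequantize_int4(q, scales)

# Round-to-nearest keeps each element within half a scale step of the original.
max_err = np.abs(w - w_hat).max()
print(f"max abs error: {max_err:.4f} (bound: {scales.max() / 2:.4f})")
```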

AWQ, on the other hand, focuses on preserving the salient weights. By observing activation magnitudes on a calibration set, it identifies the small fraction of weight channels (on the order of 1%) that most affect the model’s output. Rather than keeping those channels in higher precision — mixed precision is hardware-unfriendly — AWQ scales them up before quantization and folds the inverse scale into the preceding activations. The product is mathematically unchanged, but the salient channels’ relative quantization error shrinks. The per-group scaling factors are chosen to minimize quantization error under the expected activations, hence "activation-aware." All weights still end up in INT4, so the full memory and speed benefits are preserved.
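The effect AWQ exploits can be shown numerically with a toy, hypothetical example (plain round-to-nearest quantization, not the actual AWQ implementation): a small weight paired with a large activation contributes a large output error under naive quantization, and scaling that channel up before quantizing — with the inverse scale folded into the activation — shrinks it:

```python
import numpy as np

def quant_dequant(w, qmax=7):
    """Symmetric round-to-nearest INT4 quantization over the whole vector."""
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# One output channel: weights and an input where channel 0 carries a huge
# activation (a "salient" channel), even though its weight is small.
w = np.array([0.10, 1.00, -1.00, 0.55])
x = np.array([100.0, 1.0, 1.0, 1.0])
y_ref = w @ x

# Naive INT4: channel 0's rounding error is amplified by the large activation.
err_naive = abs(y_ref - quant_dequant(w) @ x)

# AWQ-style: scale channel 0's weight up before quantizing, fold the inverse
# scale into the activation; y is mathematically unchanged, but channel 0's
# rounding error shrinks by roughly the scale factor.
s = np.array([5.0, 1.0, 1.0, 1.0])
err_awq = abs(y_ref - (quant_dequant(w * s) / s) @ x)

print(f"naive error: {err_naive:.3f}, awq-style error: {err_awq:.3f}")
```

In practice AWQ searches for the scaling factors per channel group instead of hand-picking them, but the mechanism is the same.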

A critical detail often overlooked is how these formats interact with the hardware. Although the weights are stored in INT4, the actual matrix multiplication runs at higher precision: weights are dequantized on the fly to FP16 and multiplied against FP16 activations (so-called W4A16). Both Marlin and AWQ’s kernels are engineered to make this dequantization step as cheap as possible, often leveraging hardware-specific bit tricks. Marlin hides the conversion behind asynchronous memory transfers and fuses it into the tensor-core GEMM; AWQ’s scaling ensures that the rounding error introduced along the way lands mostly on the channels that matter least, so overall quality holds up.

The performance difference between Marlin and AWQ depends on the specific model architecture and the hardware. Marlin often shows superior throughput thanks to its fused dequantization path and tensor-core-friendly weight layout, especially at small-to-moderate batch sizes. AWQ, by spending its effort on accuracy-preserving scales, may offer slightly better perplexity on some models, while its optimized kernels keep speed competitive. The two are also not mutually exclusive: recent vLLM versions can serve AWQ checkpoints through Marlin-style kernels on supported GPUs.

The next step after mastering INT4 quantization is exploring techniques for even more aggressive quantization or optimizing inference for specific hardware accelerators.

Want structured learning?

Take the full vLLM course →