Quantization isn’t only about making models smaller; it’s also about making them faster by running them on hardware that natively understands lower-precision numbers.

Let’s see this in action. We’ll take a simple Keras model and quantize it using TensorFlow Lite.

import tensorflow as tf
import numpy as np

# 1. Define a simple Keras model
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model

model = create_model()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Generate some dummy data
dummy_x = np.random.rand(100, 32).astype(np.float32)
dummy_y = np.random.randint(0, 10, 100)

# Train the model briefly (post-training quantization starts from trained weights)
model.fit(dummy_x, dummy_y, epochs=2, batch_size=32)

# Convert the trained Keras model to TensorFlow Lite format
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# --- INT8 Quantization ---
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Define a representative dataset for calibration
def representative_dataset_gen():
    for _ in range(100):  # In practice, use a subset of your real training data
        yield [np.random.rand(1, 32).astype(np.float32)]

converter.representative_dataset = representative_dataset_gen

# Ensure integer-only quantization
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8 # or tf.uint8
converter.inference_output_type = tf.int8 # or tf.uint8

tflite_quant_int8_model = converter.convert()

# Save the INT8 quantized model
with open('model_quant_int8.tflite', 'wb') as f:
    f.write(tflite_quant_int8_model)

print("INT8 quantized model saved as model_quant_int8.tflite")

# --- Float16 Quantization ---
# Reset converter for Float16
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16] # Specify Float16

tflite_quant_float16_model = converter.convert()

# Save the Float16 quantized model
with open('model_quant_float16.tflite', 'wb') as f:
    f.write(tflite_quant_float16_model)

print("Float16 quantized model saved as model_quant_float16.tflite")

# You can then load and run these models using the TFLite interpreter:
# interpreter_int8 = tf.lite.Interpreter(model_path="model_quant_int8.tflite")
# interpreter_int8.allocate_tensors()
# input_details = interpreter_int8.get_input_details()
# output_details = interpreter_int8.get_output_details()
# interpreter_int8.set_tensor(input_details[0]['index'], int8_batch)  # int8 input
# interpreter_int8.invoke()
# predictions = interpreter_int8.get_tensor(output_details[0]['index'])

This script first defines a simple neural network, trains it briefly, and then uses tf.lite.TFLiteConverter to generate two quantized versions: one using INT8 precision and another using Float16. The INT8 conversion requires a "representative dataset" for calibration, which helps the converter determine the range of values for each layer to map them to the INT8 range. Float16, on the other hand, is a simpler conversion as it directly maps Float32 to Float16.

Post-training quantization is a method to reduce model size and latency by converting the weights and, optionally, the activations of a trained model to a lower-precision format, typically INT8 or Float16. The goal is to leverage hardware accelerators that perform computations much more efficiently on these lower-precision types compared to standard 32-bit floating-point numbers. INT8 quantization, specifically, can offer significant speedups and memory savings because INT8 operations are often natively supported and much faster on many mobile and edge CPUs and specialized AI accelerators (like NPUs or TPUs). Float16 quantization also provides benefits, offering a good balance between model size reduction and accuracy preservation, and is well-supported on GPUs.
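The size side of that trade-off is easy to estimate from parameter counts alone. For the toy model above, the arithmetic works out as follows (the actual .tflite files also carry graph metadata and quantization parameters, so on-disk sizes differ somewhat):

```python
# Back-of-the-envelope weight storage for the tutorial's model:
# Dense(32 -> 64) and Dense(64 -> 10), weights plus biases.
params = (32 * 64 + 64) + (64 * 10 + 10)

bytes_float32 = params * 4  # 4 bytes per weight
bytes_float16 = params * 2  # half the storage
bytes_int8 = params * 1     # a quarter of the storage

print(params, bytes_float32, bytes_float16, bytes_int8)
```

The speed benefit is separate from the size benefit: it comes from the hardware executing INT8 or Float16 arithmetic with higher throughput, not from the file being smaller.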

The tf.lite.TFLiteConverter is your primary tool. For INT8 quantization, you must provide a representative_dataset. This dataset, a small sample of your training or validation data, is used by the converter to analyze the distribution of activations and weights. It calculates the minimum and maximum values (the range) for each tensor. This range is crucial for defining the mapping from the original Float32 range to the 8-bit integer range (-128 to 127 for INT8, or 0 to 255 for UINT8). Without this calibration step, the converter wouldn’t know how to accurately represent the model’s dynamic values in a fixed 8-bit integer format, leading to severe accuracy degradation. The converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8] line restricts the converter to integer-only operations (conversion fails if any op lacks an INT8 implementation), and inference_input_type and inference_output_type specify that the model’s inputs and outputs should also be quantized to INT8.
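The mapping that calibration produces can be sketched with plain NumPy. This is the standard asymmetric affine scheme (a scale and zero-point derived from the observed min/max), a simplification of what the converter computes per tensor:

```python
import numpy as np

# Hypothetical calibration data for one tensor, spanning roughly [-1, 5].
x = (np.random.RandomState(0).rand(1000) * 6.0 - 1.0).astype(np.float32)

# Derive scale and zero-point from the observed range, as calibration does.
qmin, qmax = -128, 127
x_min, x_max = float(x.min()), float(x.max())
scale = (x_max - x_min) / (qmax - qmin)
zero_point = int(round(qmin - x_min / scale))

# Quantize float32 -> int8, then dequantize back to measure the error.
q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
x_hat = (q.astype(np.float32) - zero_point) * scale

max_abs_err = float(np.abs(x - x_hat).max())
print(f"scale={scale:.5f}, zero_point={zero_point}, max abs error={max_abs_err:.5f}")
```

The round-trip error is bounded by roughly one scale step, which is why a representative range matters: an unrepresentative min/max stretches the scale and wastes precision.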

Float16 quantization is generally more straightforward. You enable it by setting converter.target_spec.supported_types = [tf.float16]. This tells the converter to convert all Float32 weights to Float16. Activations can also be converted to Float16 if the target hardware supports it efficiently. The primary benefit here is reduced model size (weights take up half the space) and potential speedups on hardware that has faster Float16 computation units (like many GPUs). It typically results in less accuracy loss than INT8 quantization because Float16 still retains a much wider dynamic range and precision compared to INT8.
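The effect is easy to demonstrate with NumPy alone: casting a float32 weight matrix to float16 halves its storage while the round-off error stays around float16’s ~3 decimal digits of precision (random values below stand in for trained weights):

```python
import numpy as np

# Stand-in for a trained weight matrix (random values for illustration).
w32 = np.random.RandomState(42).randn(64, 10).astype(np.float32)
w16 = w32.astype(np.float16)  # what Float16 quantization stores

print(w32.nbytes, w16.nbytes)  # storage: 2560 vs 1280 bytes

# Relative round-off introduced by the narrower format.
rel_err = float(np.abs(w32 - w16.astype(np.float32)).max() / np.abs(w32).max())
print(f"max relative error: {rel_err:.2e}")
```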

The key trade-off with quantization, especially INT8, is accuracy. While INT8 offers the most significant performance gains, it can lead to a noticeable drop in model accuracy if not applied carefully. The calibration step with the representative dataset is vital for mitigating this. For Float16, the accuracy impact is usually much smaller. You often need to experiment and evaluate the quantized models on a hold-out dataset to ensure the performance gains are worth any potential accuracy decrease.
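One quick way to build intuition for that evaluation, without the TFLite interpreter, is to simulate INT8 weight quantization in NumPy and check how often the predicted class changes. This is a toy, per-tensor symmetric scheme on random data; the converter’s real scheme is per-axis and also quantizes activations:

```python
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(32, 10).astype(np.float32)  # stand-in trained weights
x = rng.rand(100, 32).astype(np.float32)  # stand-in hold-out inputs

# Symmetric per-tensor int8 quantization of the weights.
scale = float(np.abs(W).max()) / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_deq = W_int8.astype(np.float32) * scale

# Compare class predictions from the original vs quantized weights.
pred_f32 = (x @ W).argmax(axis=1)
pred_int8 = (x @ W_deq).argmax(axis=1)
agreement = float(np.mean(pred_f32 == pred_int8))
print(f"prediction agreement: {agreement:.2%}")
```

For a real model, run the same comparison with the TFLite interpreter on your actual hold-out set and track task accuracy, not just agreement.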

The most surprising fact about post-training quantization is that for many models, especially those with ReLU activations and dense layers, the accuracy drop after INT8 quantization is often negligible; occasionally accuracy even improves slightly, an effect sometimes attributed to the implicit regularization of clipping outliers. The converter is remarkably good at finding a suitable mapping if you provide it with a representative dataset that covers the typical input distribution.

The next step after quantization is often exploring quantization-aware training, which integrates the quantization process directly into the model training loop.

Want structured learning?

Take the full TensorFlow course →