Product Quantization is a clever way to compress high-dimensional vectors, allowing you to store and search massive datasets of embeddings in memory without needing a supercomputer.
Let’s see it in action. Imagine you have a bunch of 128-dimensional vectors, each representing an image or a piece of text. Storing these as float32 (4 bytes per dimension) would take 128 * 4 = 512 bytes per vector. If you have a billion vectors, that’s 512 GB! Product Quantization (PQ) can slash this down significantly.
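The arithmetic is easy to sanity-check in a few lines (using the same 8-bytes-per-vector PQ configuration as the worked example):

```python
# Back-of-the-envelope memory math: raw float32 storage vs. PQ codes
d = 128                    # dimensions per vector
n = 1_000_000_000          # one billion vectors
float_bytes = n * d * 4    # float32 = 4 bytes per dimension
pq_bytes = n * 8           # PQ with 8 sub-vectors x 8 bits = 8 bytes per vector

print(float_bytes // 10**9, "GB for raw float32")   # 512 GB
print(pq_bytes // 10**9, "GB for PQ codes")         # 8 GB
```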
Here’s a simplified Python example using the faiss library, a popular choice for efficient similarity search.
import numpy as np
import faiss
# 1. Generate some random 128-dimensional vectors
d = 128 # dimension
nb = 100000 # database size
nq = 100 # nb of queries
np.random.seed(1234)
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
# 2. Product Quantization setup
# We'll split the 128D vector into 8 sub-vectors of 16 dimensions each.
# Each sub-vector will be quantized using 256 centroids (8 bits).
m = 8 # number of subquantizers
nbit = 8 # number of bits per subquantizer (2^nbit = number of centroids)
k = 1 << nbit # number of centroids, 256 in this case
# Train the PQ codebook
# The training data is used to learn the centroids for each sub-vector.
# We'll use a subset of our vectors for training.
pq = faiss.ProductQuantizer(d, m, nbit)
pq.train(xb[:10000]) # Train on first 10000 vectors
# Encode the database vectors
# Each vector is now represented by a sequence of codes (integers).
# The size of each code is nbit bits, and there are m codes per vector.
# Total bits per vector = m * nbit.
# For our example: 8 * 8 = 64 bits = 8 bytes per vector.
# This is a massive reduction from 512 bytes!
codes = pq.compute_codes(xb) # shape (nb, m): one byte per sub-quantizer
# 3. Build an Index for fast search
# We'll use an IndexIVFPQ, which combines Inverted File (IVF) indexing
# with Product Quantization. IVF partitions the vector space into cells.
# When searching, we first find the nprobe cells whose centroids are closest
# to the query, and then only search within those cells, using PQ codes for
# the distance calculations.
# First, create an IndexFlatL2 to serve as the coarse quantizer
nlist = 100 # number of inverted lists (cells)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbit)
# IndexIVFPQ training learns both the coarse quantizer's cell centroids
# and the PQ codebooks. It's common to reuse the same training data as above.
index.train(xb[:10000])
# Add the database vectors to the index
# The index compresses each vector to its PQ codes internally and stores
# them under sequential vector IDs.
index.add(xb)
# 4. Search
# When searching, the query is first assigned to its nprobe nearest cells.
# Then, the raw (uncompressed) query vector is compared against the compressed
# database vectors within those cells, using the PQ's approximate distances
# (Asymmetric Distance Computation, ADC).
k_search = 4 # number of nearest neighbors to find
index.nprobe = 10 # number of cells to search (trade-off between speed and accuracy)
D, I = index.search(xq, k_search) # D = distances, I = indices of nearest neighbors
print("Search results (indices of nearest neighbors for first query):")
print(I[0])
print("\nSearch results (distances for first query):")
print(D[0])
The core idea behind Product Quantization is to decompose a high-dimensional vector into several lower-dimensional sub-vectors. Instead of quantizing the whole vector at once, we quantize each sub-vector independently.
Let’s say your original vector is v in R^d. We split v into m sub-vectors: v = [v_1, v_2, ..., v_m], where each v_i has dimension d/m. For each sub-vector v_i, we train a separate codebook of k centroids. This codebook is essentially a lookup table of k representative vectors for that specific sub-space. The training involves taking a large set of vectors, splitting them, and clustering each set of sub-vectors (e.g., using k-means) to find these k centroids.
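As a toy illustration of this training step, here is a pure-NumPy sketch. This is not faiss's implementation; the function name and the small Lloyd-style k-means loop are my own, and it assumes d is divisible by m and that you have at least k training vectors:

```python
import numpy as np

def train_pq_codebooks(x, m, k, iters=20, seed=0):
    """Train one k-means codebook per sub-space (toy Lloyd's algorithm, n >= k)."""
    n, d = x.shape
    ds = d // m                                   # dimensions per sub-vector
    rng = np.random.default_rng(seed)
    codebooks = []
    for i in range(m):
        sub = x[:, i * ds:(i + 1) * ds]           # (n, ds) slice for sub-space i
        cent = sub[rng.choice(n, k, replace=False)].copy()  # init from data
        for _ in range(iters):
            # Assign each sub-vector to its nearest centroid.
            d2 = ((sub[:, None, :] - cent[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            # Move each centroid to the mean of its cluster (skip empty ones).
            for c in range(k):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(axis=0)
        codebooks.append(cent)
    return codebooks                              # m arrays, each (k, ds)
```

Libraries like faiss use a far more optimized k-means, but the structure is the same: m independent clustering problems, one per sub-space.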
Once trained, each v_i is replaced by the ID of its nearest centroid in its respective codebook. So, an original d-dimensional float vector (e.g., 512 bytes) becomes a compact representation of m centroid IDs. If we use 8 bits per ID (meaning k=256 centroids per sub-vector), the total compressed size is m * 8 bits. For m=8 and d=128, this is 8 * 8 = 64 bits, or just 8 bytes per vector. This is a massive memory saving.
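The encoding step can be sketched in a few lines of NumPy. This is an illustrative helper, not faiss's API; it assumes `codebooks` is a list of m arrays of shape (k, d/m), one per sub-space, with k at most 256 so each ID fits in one byte:

```python
import numpy as np

def pq_encode(x, codebooks):
    """Replace each sub-vector with the ID of its nearest centroid (k <= 256)."""
    m = len(codebooks)
    ds = codebooks[0].shape[1]
    codes = np.empty((x.shape[0], m), dtype=np.uint8)   # one byte per sub-vector
    for i in range(m):
        sub = x[:, i * ds:(i + 1) * ds]
        d2 = ((sub[:, None, :] - codebooks[i][None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = d2.argmin(axis=1)
    return codes   # shape (n, m): m bytes instead of d * 4 bytes per vector
```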
The memory reduction comes from this decomposition. If you quantized all d dimensions with a single k-means model, matching PQ's granularity would require k^m centroids, a number that grows exponentially with m. By splitting the problem, you train m small, independent quantizers instead. The "product" in Product Quantization refers to the fact that the overall code space is the Cartesian product of the individual sub-quantizer spaces: m codebooks of k centroids each can represent k^m distinct reconstructions. The distance between two compressed vectors is then approximated by summing the distances between their corresponding sub-vector centroids.
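The Cartesian-product arithmetic is worth spelling out for the example configuration:

```python
# m codebooks of k centroids each can represent k**m distinct reconstructions,
# while only m * k centroid vectors ever need to be stored.
m, k = 8, 256
stored = m * k            # 2048 centroids actually kept in memory
representable = k ** m    # 256**8 = 2**64 distinct codes
print(stored, representable)
```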
The critical insight that most people miss is how the distances are approximated during search. When you have a query vector q and want to find its nearest neighbors among compressed database vectors y, you don’t decompress y. Instead, you compute the distance between q and the centroids that represent y. A common and efficient method is Asymmetric Distance Computation (ADC). For each sub-vector q_i of the query q, you pre-compute the distances to all k centroids in the i-th codebook. Then, for a database vector y represented by codes c_1, ..., c_m, the approximate squared Euclidean distance d(q, y)^2 is calculated as the sum of pre-computed distances: sum(dist(q_i, centroid(c_i))^2 for i=1 to m). This lookup-and-sum operation is much faster than full distance calculation on decompressed vectors.
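A minimal NumPy sketch of ADC, assuming the hypothetical helpers above: `codes` is an (n, m) array of centroid IDs and `codebooks` a list of m (k, d/m) centroid arrays. The function name and shapes are my own choices, not faiss's API:

```python
import numpy as np

def adc_search(q, codes, codebooks, topk=4):
    """Asymmetric Distance Computation: table lookups + sums, no decompression."""
    m = len(codebooks)
    ds = codebooks[0].shape[1]
    # Pre-compute the squared distance from each query sub-vector to every
    # centroid of the matching codebook: tables has shape (m, k).
    tables = np.stack([
        ((q[i * ds:(i + 1) * ds] - codebooks[i]) ** 2).sum(axis=1)
        for i in range(m)
    ])
    # Approximate d(q, y)^2 for every database vector by summing m lookups.
    dists = tables[np.arange(m), codes].sum(axis=1)   # codes: (n, m) -> (n,)
    order = np.argsort(dists)[:topk]
    return order, dists[order]
```

Note the cost profile: the tables take m * k distance computations once per query, after which each database vector costs only m lookups and additions, regardless of d.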
The next step after mastering Product Quantization is often exploring how to combine it with other indexing structures for even faster and more scalable search, such as Hierarchical Navigable Small Worlds (HNSW) or optimizing the parameters m and nbit for your specific dataset and accuracy requirements.