HNSW and IVF are the two main families of vector indexes, and picking between them isn’t just about speed; it’s about how you want your approximate nearest neighbor search to fail.
Let’s see HNSW in action. Imagine we have a thousand vectors, each 128 dimensions, representing some kind of data. We want to build an index to quickly find vectors similar to a query vector.
import numpy as np
from hnswlib import Index
# Generate some random data
num_vectors = 1000
dim = 128
data = np.random.rand(num_vectors, dim).astype('float32')
# Initialize HNSW index
# M: number of neighbors per node, a higher M means more connections, better accuracy, slower build, larger index
# ef_construction: size of the dynamic list for the candidate set during construction
p = Index(space='l2', dim=dim)
p.init_index(max_elements=num_vectors, ef_construction=200, M=16)
# Add data to the index
labels = np.arange(num_vectors)
p.add_items(data, labels)
# Set ef for search (larger ef means more accurate but slower search)
p.set_ef(50)
# Query the index
query_vector = np.random.rand(dim).astype('float32')
k = 10 # Number of nearest neighbors to find
labels, distances = p.knn_query(query_vector, k=k)
print(f"Found {k} nearest neighbors for query vector. Labels: {labels}, Distances: {distances}")
This code sets up an HNSW index. You initialize it with M and ef_construction. M is like the branching factor in a tree: it determines how many other nodes each node connects to. A higher M means more connections and better accuracy, but a slower build and a larger index. ef_construction controls how many candidates are considered when each vector is inserted, so it governs how thoroughly the graph is built. Then, during search, you set ef, the size of the candidate list the search maintains. A higher ef means the search explores more of the graph, leading to higher accuracy but slower query times.
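Before tuning these knobs, it helps to have an exact answer to measure against. The sketch below is a standalone example with its own randomly generated data (not a continuation of the index code above): it computes exact nearest neighbors by brute force and defines a recall@k helper. `recall_at_k` is a name invented here for illustration, not part of hnswlib.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 128)).astype('float32')
query = rng.random(128).astype('float32')
k = 10

# Exact k-NN by brute force: squared L2 distance to every stored vector
dists = ((data - query) ** 2).sum(axis=1)
exact_labels = np.argsort(dists)[:k]

def recall_at_k(approx_labels, exact_labels):
    """Fraction of the true top-k that an approximate search recovered."""
    return len(set(approx_labels) & set(exact_labels)) / len(exact_labels)

# Sanity check: the exact result has perfect recall against itself
print(recall_at_k(exact_labels, exact_labels))  # 1.0
```

Feeding the labels returned by `knn_query` at different ef values into a helper like this is the standard way to see the speed/recall trade-off on your own data.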
IVF, or Inverted File Index, works on a different principle. It partitions the vector space into clusters, and then for each query, it only searches the clusters closest to the query vector.
from faiss import IndexFlatL2, IndexIVFFlat
import numpy as np
# Generate some random data
num_vectors = 1000
dim = 128
data = np.random.rand(num_vectors, dim).astype('float32')
# Create an IndexFlatL2 to serve as the coarse quantizer (it needs no training of its own)
quantizer = IndexFlatL2(dim)
# Initialize IndexIVFFlat
# nlist: number of Voronoi cells (clusters) the space is partitioned into
index_ivf = IndexIVFFlat(quantizer, dim, 10)  # nlist=10 means 10 clusters
# IVF needs to be trained to learn the cluster centroids
index_ivf.train(data)
# Add data to the index
index_ivf.add(data)
# Set nprobe for search (number of cells to probe)
# nprobe controls the trade-off between speed and accuracy for IVF
index_ivf.nprobe = 5 # Probe 5 nearest cells
# Query the index
query_vector = np.random.rand(dim).astype('float32').reshape(1, -1) # Faiss expects 2D arrays for queries
k = 10
distances, labels = index_ivf.search(query_vector, k)
print(f"Found {k} nearest neighbors for query vector. Labels: {labels[0]}, Distances: {distances[0]}")
Here, nlist defines the number of clusters (Voronoi cells). The index is "trained" with k-means to learn these cluster centroids. During search, nprobe determines how many of those clusters are examined. A higher nprobe means better recall, but a slower search.
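To make the two-stage mechanism concrete, here is a minimal numpy sketch of IVF-style search. It is an illustration, not Faiss's implementation: the "training" step just picks random stored vectors as centroids, where Faiss would run k-means.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 128)).astype('float32')
query = rng.random(128).astype('float32')
nlist, nprobe, k = 10, 5, 10

# Crude "training": use random stored vectors as centroids (Faiss learns them with k-means)
centroids = data[rng.choice(len(data), nlist, replace=False)]

# Build the inverted lists: assign every vector to its nearest centroid
assignments = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)

# Stage 1: rank cells by centroid-to-query distance, keep the nprobe closest
cell_dists = ((centroids - query) ** 2).sum(axis=1)
probed = np.argsort(cell_dists)[:nprobe]

# Stage 2: brute-force search only within the probed cells
candidates = np.flatnonzero(np.isin(assignments, probed))
cand_dists = ((data[candidates] - query) ** 2).sum(axis=1)
top = candidates[np.argsort(cand_dists)[:k]]
print(top)
```

Everything outside the probed cells is never touched, which is exactly where IVF gets both its speed and its failure mode.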
The core difference lies in their approximation strategies. HNSW builds a graph where nodes are vectors and edges represent proximity. Search traverses this graph greedily toward the query vector, maintaining a candidate list whose size is controlled by ef (the randomness in HNSW is in construction, where each node's layer is assigned probabilistically, not in the search itself). It's good at finding very close neighbors but can miss slightly more distant ones if the graph isn't dense enough or ef is too low. IVF, on the other hand, partitions the space. It's excellent when your data has distinct clusters, as it can quickly narrow the search to relevant regions. Its failure mode is missing relevant vectors that fall into un-probed clusters.
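That IVF failure mode is easy to reproduce deterministically. In the 1-D toy example below (all numbers chosen purely for illustration), the query's true nearest neighbor lands in a cell that a search with nprobe=1 never looks at:

```python
import numpy as np

# Two cells with centroids at 0 and 10
centroids = np.array([0.0, 10.0])
# One stored vector at 5.5: it is closer to centroid 10, so it lands in cell 1
vectors = np.array([5.5])
assignments = np.abs(vectors[:, None] - centroids[None, :]).argmin(1)  # -> [1]

# A query at 4.9 is closer to centroid 0, so nprobe=1 probes only cell 0
query = 4.9
probed_cell = np.abs(centroids - query).argmin()  # -> 0

# The true nearest neighbor (5.5, distance 0.6) sits in un-probed cell 1
print(assignments[0], probed_cell)
```

The stored vector and the query sit on opposite sides of the cell boundary at 5.0, so with nprobe=1 the true neighbor is invisible; raising nprobe to 2 would find it.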
If you have highly uniform data where density is consistent across the vector space, HNSW often shines; with proper tuning it can maintain high recall across the board. For data that naturally forms distinct groups or clusters, IVF can be significantly faster, especially with a large number of vectors, because it dramatically reduces the search space. The choice of M in HNSW and nlist in IVF is critical. A high M in HNSW produces a more connected graph, increasing build time and memory but improving search accuracy. A high nlist in IVF means smaller clusters and potentially faster search if nprobe is small, but it requires more training data and raises the chance that a query's true neighbors sit just across a cell boundary in an un-probed cell.
Many people think of HNSW's ef_construction and ef as directly analogous to IVF's nlist and nprobe, but the underlying mechanisms are quite different. ef_construction influences the structure of the HNSW graph, while ef influences the search path within that structure. nlist defines the partitions of the vector space for IVF, and nprobe dictates how many of those partitions are scanned. A common mistake is to set nlist too low in IVF, producing a few enormous clusters, and then expecting nprobe=1 to be sufficient. This often yields both poor recall, because neighbors just outside the probed cluster are missed entirely, and slow queries, because the one probed cluster still contains a large fraction of the dataset.
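A back-of-the-envelope calculation shows why. With roughly balanced cells, each query scans about n * nprobe / nlist vectors; the sketch below assumes perfectly balanced cells, which k-means only approximates in practice:

```python
def vectors_scanned(n, nlist, nprobe):
    """Approximate candidates scanned per query, assuming balanced cells."""
    return n * nprobe // nlist

n = 1_000_000
# nlist too low: two huge cells, so even nprobe=1 scans half the dataset
print(vectors_scanned(n, nlist=2, nprobe=1))      # 500000
# A finer partition scans far fewer vectors while probing more cells
print(vectors_scanned(n, nlist=1000, nprobe=10))  # 10000
```

The second configuration probes ten times as many cells yet scans fifty times fewer vectors, which is why nlist and nprobe have to be tuned together rather than in isolation.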
The next thing you’ll likely grapple with is how to optimize these parameters for your specific dataset and recall requirements, often involving extensive experimentation.