Choosing the right vector index type is less about picking a "faster" one and more about understanding the fundamental trade-offs between search accuracy, memory usage, and build/query latency.
Let’s see an index in action. Imagine we have a dataset of 100,000 128-dimensional vectors, each representing a small image. We want to find the 10 nearest neighbors for a new query vector.
The sketch below uses the FAISS library (the `faiss-cpu` package), which provides both an exact flat index and an HNSW index; the parameter values are illustrative:

```python
import numpy as np
import faiss

# Generate dummy data
num_vectors = 100_000
dimension = 128
rng = np.random.default_rng(0)
vectors = rng.random((num_vectors, dimension), dtype=np.float32)
query_vector = rng.random((1, dimension), dtype=np.float32)  # FAISS expects a 2-D batch

# --- Flat Index ---
# Brute force: every query scans ALL vectors, computing the distance to each.
# It's guaranteed to return the exact nearest neighbors.
index_flat = faiss.IndexFlatL2(dimension)
index_flat.add(vectors)

# Indexes can be persisted and reloaded
faiss.write_index(index_flat, 'flat_index.faiss')
index_flat = faiss.read_index('flat_index.faiss')

distances_flat, neighbors_flat = index_flat.search(query_vector, 10)
print(f"Flat index found {neighbors_flat.shape[1]} neighbors.")

# --- HNSW Index ---
# Graph-based: layers of linked nodes, searched greedily from the top layer.
# Higher efConstruction -> more accurate graph but longer build times.
# efSearch controls how much of the graph is explored per query.
index_hnsw = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M, max links per node
index_hnsw.hnsw.efConstruction = 40
index_hnsw.hnsw.efSearch = 16
index_hnsw.add(vectors)
distances_hnsw, neighbors_hnsw = index_hnsw.search(query_vector, 10)
print(f"HNSW index found {neighbors_hnsw.shape[1]} neighbors.")

# --- IVF (Inverted File) Index - conceptual ---
# Partition the vectors into n_clusters cells with k-means. For a query, find
# the closest cluster centroids and search only the vectors in those cells.
# This drastically reduces the search space but introduces approximation.
# Parameters: n_clusters, n_probe (how many cells to search).

# --- PQ (Product Quantization) - conceptual ---
# PQ compresses vectors by splitting them into sub-vectors and quantizing each
# sub-vector. This reduces memory but makes distance calculations approximate.
# Often combined with IVF (IVF-PQ).
```
The core problem these indexes address is scale compounded by the "curse of dimensionality." As dimensions increase, distances between random vectors concentrate around the same value, which defeats classical space-partitioning structures such as k-d trees; exact nearest-neighbor search then degenerates into a linear scan of the whole dataset. Approximate indexes deliberately sacrifice perfect accuracy for speed and memory efficiency.
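The concentration effect is easy to see empirically. A quick NumPy sketch (the point counts and dimensions are arbitrary):

```python
import numpy as np

# As dimensionality grows, the nearest and farthest neighbors of a random
# query become nearly equidistant: min/max distance ratio approaches 1.
rng = np.random.default_rng(0)

def concentration_ratio(dim, n=2000):
    points = rng.random((n, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()  # near 1.0 means "everything is equally far"

for dim in (2, 32, 512):
    print(f"dim={dim:4d}  min/max distance ratio: {concentration_ratio(dim):.3f}")
```

In low dimensions the nearest point is dramatically closer than the farthest; in high dimensions the gap nearly vanishes, which is why pruning strategies based on geometry alone stop working.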
Flat Index: This is your baseline, the brute-force method. Every vector is stored as is. When you query, the system calculates the distance from your query vector to every single vector in the index.
- Pros: 100% accurate recall. Simple to understand.
- Cons: Extremely slow and memory-intensive for large datasets. Scales linearly with dataset size and dimension.
- When to use: Very small datasets where accuracy is paramount and performance is not a concern, or as a ground truth for evaluating other indexes.
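For intuition, the whole "index" is just these two steps per query — a minimal brute-force sketch in plain NumPy (sizes are arbitrary):

```python
import numpy as np

# A flat index is essentially one distance per stored vector, then a sort.
rng = np.random.default_rng(1)
vectors = rng.random((10_000, 64), dtype=np.float32)
query = rng.random(64).astype(np.float32)

distances = np.linalg.norm(vectors - query, axis=1)  # O(n * d) work on EVERY query
nearest_10 = np.argsort(distances)[:10]              # exact top-10, no approximation
print(nearest_10)
```

Because it is exact, this is also the standard ground-truth oracle when measuring the recall of approximate indexes.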
HNSW (Hierarchical Navigable Small Worlds): This is a graph-based approach. It builds a multi-layered graph where each layer is a "navigable small world" (meaning you can reach any node from any other node relatively quickly). Higher layers have fewer nodes and longer links, while lower layers have more nodes and shorter links. Search starts at a high layer and greedily traverses down to the lower layers.
- Pros: Excellent balance of speed, accuracy, and memory usage. Very fast query times. Good recall.
- Cons: Can have higher memory overhead than some other approximate methods. Build times can be significant, especially with high ef_construction.
- Key Parameters:
  - M: The maximum number of connections per node. Higher M means more memory but potentially better graph connectivity.
  - ef_construction: Controls the quality of the graph during build. Higher values lead to better search performance but longer build times.
  - ef_search: Controls the size of the search scope during queries. Higher values increase accuracy but also latency.
- When to use: General-purpose approximate nearest neighbor search where you need fast queries and good recall without excessive memory use. This is often the default choice for many applications.
IVF (Inverted File Index): This method partitions your vector space into n_clusters cells (using an algorithm like k-means); each cluster is a Voronoi cell around its centroid. When you query, the system first finds the nearest cluster centroids to your query vector and then searches only the vectors belonging to those clusters.
- Pros: Significantly reduces the search space, leading to faster queries. Lower memory footprint than flat indexes.
- Cons: Accuracy depends on the quality of the clustering and on the n_probe parameter; neighbors that fall in a cluster that is never probed are missed.
- Key Parameters:
  - n_clusters: The number of partitions (centroids) to create. More clusters mean finer granularity but more overhead.
  - n_probe: The number of nearest clusters to search. Higher values increase recall but also query latency.
- When to use: Large datasets where memory is a concern, and a slight drop in recall is acceptable for significant speed improvements. Often combined with other techniques like PQ.
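The partition-then-probe idea can be sketched in plain NumPy. This is a toy implementation, not an optimized index; the sizes and parameters are arbitrary:

```python
import numpy as np

# Minimal IVF: k-means partitions the vectors, queries scan only the
# n_probe cells whose centroids are closest to the query.
rng = np.random.default_rng(3)
n, d, n_clusters, n_probe = 10_000, 32, 64, 4
xb = rng.random((n, d), dtype=np.float32)
q = rng.random((1, d)).astype(np.float32)

def sq_dists(a, b):
    # pairwise squared Euclidean distances, shape (len(a), len(b))
    return (a**2).sum(1)[:, None] - 2 * a @ b.T + (b**2).sum(1)[None, :]

# Crude k-means (a few Lloyd iterations are plenty for a demo)
centroids = xb[rng.choice(n, n_clusters, replace=False)].copy()
for _ in range(10):
    assign = sq_dists(xb, centroids).argmin(1)
    for c in range(n_clusters):
        if np.any(assign == c):
            centroids[c] = xb[assign == c].mean(0)

# Query: rank the cells, then scan only the vectors in the closest n_probe
cells = sq_dists(q, centroids)[0].argsort()[:n_probe]
candidates = np.flatnonzero(np.isin(assign, cells))
order = sq_dists(q, xb[candidates])[0].argsort()
top10 = candidates[order[:10]]
print(f"scanned {len(candidates)} of {n} vectors")
```

With 4 of 64 cells probed, each query touches only a small fraction of the dataset, which is exactly where the speedup — and the recall risk — comes from.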
PQ (Product Quantization): This is a compression technique. It splits each vector into M sub-vectors. For each of the M sub-spaces, it then runs k-means over the dataset's sub-vectors to create a codebook, and each sub-vector is replaced by the ID of its nearest codebook entry. This drastically reduces the memory needed to store vectors, and distance calculations are then performed on these compressed representations.
- Pros: Extremely memory efficient. Can enable searches on datasets that wouldn’t fit in memory otherwise.
- Cons: Introduces approximation errors due to quantization, impacting accuracy. Distance calculations are approximate.
- Key Parameters:
  - M: Number of sub-vectors.
  - k: Number of centroids per sub-vector codebook.
- When to use: When memory is the primary constraint, and you need to store billions of vectors. It’s almost always used in conjunction with IVF (IVF-PQ) for better performance.
The one thing most people don’t realize is that the "distance" computed with PQ is not the true Euclidean distance: it is assembled from precomputed distances between the query’s sub-vectors and the codebook centroids. That table lookup is what makes PQ fast, and it is also exactly where the accuracy loss comes from.
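Both the compression and the table-driven approximate distance can be sketched in plain NumPy. This is a toy implementation; M, k, and the data sizes are arbitrary:

```python
import numpy as np

# Product quantization: split each vector into M slices, learn a small
# codebook per slice, store one code byte per slice instead of the floats.
rng = np.random.default_rng(4)
n, d, M, k = 5_000, 32, 4, 16           # 4 slices of 8 dims, 16 centroids each
xb = rng.random((n, d), dtype=np.float32)
slices = xb.reshape(n, M, d // M)

codebooks = np.empty((M, k, d // M), dtype=np.float32)
codes = np.empty((n, M), dtype=np.uint8)
for m in range(M):
    sub = slices[:, m, :]
    cb = sub[rng.choice(n, k, replace=False)].copy()
    for _ in range(10):                 # a few Lloyd iterations per slice
        a = ((sub[:, None] - cb[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(a == c):
                cb[c] = sub[a == c].mean(0)
    codebooks[m] = cb
    codes[:, m] = ((sub[:, None] - cb[None]) ** 2).sum(-1).argmin(1)

print(f"{xb.nbytes // n} B/vector compressed to {codes.nbytes // n} B/vector")

# Approximate distance: per slice, precompute query-to-codebook distances,
# then sum one table entry per stored code -- no full vectors are touched.
q = rng.random(d).astype(np.float32)
q_sub = q.reshape(M, d // M)
tables = ((q_sub[:, None, :] - codebooks) ** 2).sum(-1)   # shape (M, k)
approx_d2 = tables[np.arange(M), codes].sum(1)            # per-vector squared dist
true_d2 = ((xb - q) ** 2).sum(1)
print("mean |approx - true| squared distance:", float(np.abs(approx_d2 - true_d2).mean()))
```

Here 128 bytes per vector shrink to 4 code bytes, and the gap between `approx_d2` and `true_d2` is the quantization error the paragraph above describes.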
The next rabbit hole is understanding how these indexes are typically combined, such as IVF-PQ, to get the best of both worlds: partitioning for speed (IVF) and compression for memory (PQ).