The most surprising thing about Approximate Nearest Neighbor (ANN) search is that it’s fundamentally a trade-off between speed and accuracy, and the best algorithms actively lie to you about the true nearest neighbors.

Imagine you have a massive collection of song embeddings – each song represented by a high-dimensional vector capturing its sonic characteristics. You want to find songs similar to a given query song. A brute-force approach would compare the query vector to every single song in your database, calculating the distance. This is exact but prohibitively slow for millions or billions of songs. ANN algorithms, like Hierarchical Navigable Small Worlds (HNSW) and Inverted File Index (IVF), offer a shortcut.
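The brute-force baseline is worth seeing concretely. Here is a minimal NumPy sketch, using a made-up five-song database (the vectors, dimensions, and values are invented for illustration):

```python
import numpy as np

# Hypothetical toy "database" of song embeddings: 5 songs, 4 dimensions.
database = np.array([
    [0.1, 0.1, 0.9, 0.2],
    [0.8, 0.7, 0.1, 0.3],
    [0.2, 0.2, 0.8, 0.3],
    [0.9, 0.9, 0.2, 0.1],
    [0.5, 0.5, 0.5, 0.5],
])
query = np.array([0.15, 0.1, 0.85, 0.25])

# Brute force: compute the distance from the query to every vector, then
# sort. Exact, but requires O(n) distance computations for every query.
distances = np.linalg.norm(database - query, axis=1)
nearest = np.argsort(distances)[:2]  # indices of the 2 closest songs
print(nearest)
```

At five vectors this is instant; at a billion, the single `np.linalg.norm` over the whole database is exactly the cost ANN algorithms are designed to avoid.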

Let’s see HNSW in action. HNSW builds a multi-layered graph. The top layer is a sparse graph connecting a handful of distant points; each layer below is denser, connecting closer neighbors. A search starts at a fixed entry point in the top layer and greedily traverses the graph, always moving to the neighbor closest to the query vector. When no neighbor in the current layer is closer than the current node, the algorithm drops down to the next layer and continues the greedy search, repeating until it bottoms out in the densest layer.

Here’s a simplified representation of how a search might look in a 2D space (though real embeddings are hundreds of dimensions):

Query Vector: [0.5, 0.5]

HNSW Graph (Layer 2 - Sparse):
Nodes: A (0.1, 0.1), B (0.9, 0.9), C (0.2, 0.8), D (0.8, 0.2)
Edges: A-B, C-D

HNSW Graph (Layer 1 - Denser):
Nodes: A, B, C, D, E (0.6, 0.6), F (0.4, 0.4)
Edges: A-B, C-D, A-F, B-E, E-F

HNSW Graph (Layer 0 - Densest):
Nodes: A, B, C, D, E, F, G (0.51, 0.49), H (0.49, 0.51)
Edges: ... (many more connections)

Search Path:
1. Start at the entry point in Layer 2, say A (distance to query ~0.57).
2. A’s only Layer 2 neighbor is B, also at distance ~0.57, so there is no improvement. Drop to Layer 1.
3. In Layer 1, A’s neighbors are B (~0.57) and F (~0.14). Move to F.
4. From F, the remaining neighbor E is also at ~0.14; nothing is strictly closer. Drop to Layer 0.
5. In Layer 0, F’s neighbors include G and H (both at ~0.01). Move to G.
6. From G, H is equally close but no closer. Stop.
Nearest Neighbors found: G, H
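This layered greedy descent can be sketched in plain Python over the toy graph above. The Layer 0 adjacency lists here are an assumption (the text only says “many more connections”), so only the edges relevant to this walk are included:

```python
import math

# Coordinates of the toy nodes from the example above.
coords = {
    "A": (0.1, 0.1), "B": (0.9, 0.9), "C": (0.2, 0.8), "D": (0.8, 0.2),
    "E": (0.6, 0.6), "F": (0.4, 0.4), "G": (0.51, 0.49), "H": (0.49, 0.51),
}
# Adjacency lists per layer, top (sparse) to bottom (dense).
# Layer 0 edges are assumed; only those relevant to this walk are listed.
layers = [
    {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"]},
    {"A": ["B", "F"], "B": ["A", "E"], "C": ["D"], "D": ["C"],
     "E": ["B", "F"], "F": ["A", "E"]},
    {"A": ["F"], "E": ["F", "G", "H"], "F": ["A", "E", "G", "H"],
     "G": ["F", "H", "E"], "H": ["F", "G", "E"]},
]

def dist(name, q):
    x, y = coords[name]
    return math.hypot(x - q[0], y - q[1])

def greedy_search(query, entry="A"):
    """Descend layer by layer, moving only to strictly closer neighbors."""
    node = entry
    for layer in layers:
        while True:
            best = min(layer.get(node, []), key=lambda n: dist(n, query), default=node)
            if dist(best, query) < dist(node, query):
                node = best
            else:
                break  # local minimum in this layer: drop to the next one
    return node

print(greedy_search((0.5, 0.5)))  # lands on G (H is equally close)
```

Note how few distance computations this takes compared to checking all eight nodes, and the gap only widens as the graph grows.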

The core problem HNSW solves is efficiently navigating a high-dimensional space. By building these layered graphs, it prunes vast portions of the search space at each step, dramatically reducing the number of distance calculations. The "navigable small world" aspect means that even though the graph is huge, any node can be reached from any other node with a relatively small number of hops, thanks to the layered structure.

IVF, on the other hand, partitions your data into clusters (or "cells") using a clustering algorithm like k-means. Each cluster is represented by a centroid. When you search, you first find the few clusters closest to your query vector. Then, you only search within those selected clusters for the nearest neighbors.

Consider our song embeddings again. If we have 1000 clusters, and our query song’s embedding is closest to centroids of cluster #5, cluster #23, and cluster #87, we only compare the query vector to the songs within those three clusters, not the entire database. This is the "Inverted File" part: you’re inverting the mapping, so instead of "song -> vector," you have "cluster ID -> list of songs in that cluster."
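A NumPy sketch of the IVF idea, with invented data and parameters (1000 random vectors, 16 clusters, and a hand-rolled k-means loop rather than a production implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1000 "song embeddings" in 8 dimensions.
data = rng.normal(size=(1000, 8)).astype(np.float32)

# --- Index build: partition the data with a few iterations of k-means. ---
n_clusters = 16
centroids = data[rng.choice(len(data), n_clusters, replace=False)]
for _ in range(10):
    assign = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
    for c in range(n_clusters):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Final assignment against the settled centroids, then build the
# "inverted file": cluster ID -> list of vector IDs in that cluster.
assign = np.argmin(np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
inverted_file = {c: np.where(assign == c)[0] for c in range(n_clusters)}

# --- Search: probe only the n_probes clusters with the closest centroids. ---
def ivf_search(query, k=5, n_probes=4):
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probes]
    candidates = np.concatenate([inverted_file[c] for c in probe])
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

# Querying with a vector that is in the index: its own id, 0, comes back first.
print(ivf_search(data[0], k=3))
```

With `n_probes=4` out of 16 clusters, each query scans roughly a quarter of the database instead of all of it; the trade-off is that a true neighbor sitting in an unprobed cluster is silently missed.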

The exact levers you control in HNSW are efConstruction (the size of the candidate list explored while inserting each point; higher means better graph quality but a slower build), efSearch (the size of the candidate list explored during a query; higher means better recall but slower search), and M (how many neighbors each node keeps; higher means a better-connected but larger graph). For IVF, the levers are n_clusters (how many clusters to create; more clusters means each one holds fewer vectors, but the coarse search over centroids costs more) and n_probes (how many clusters to scan per query; more probes means higher recall but slower search).

The one thing most people don’t realize is that the effectiveness of these algorithms hinges on the distribution of your data. If your vectors are highly clustered, IVF shines. If they form more of a continuous, interconnected structure, HNSW often performs better. The "lie" comes from the greedy search: it assumes the closest neighbor found so far is on the optimal path, which isn’t always true in high dimensions due to the "curse of dimensionality."
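That failure mode is easy to reproduce. Here is a contrived single-layer graph (all coordinates invented) where the only route to the true nearest neighbor T passes through a node farther from the query, so greedy search stops early at a local minimum:

```python
import math

# Node T is the true nearest neighbor of the query, but the only path to it
# runs through W, which is *farther* from the query than the local minimum M,
# so a pure greedy walk never reaches T.
coords = {
    "S": (0.0, 0.0),   # entry point
    "M": (0.4, 0.0),   # local minimum: no neighbor is strictly closer
    "W": (1.5, 0.0),   # the detour that greedy search refuses to take
    "T": (0.55, 0.0),  # true nearest neighbor
}
edges = {"S": ["M"], "M": ["S", "W"], "W": ["M", "T"], "T": ["W"]}
query = (0.5, 0.0)

def dist(n):
    x, y = coords[n]
    return math.hypot(x - query[0], y - query[1])

def greedy(entry="S"):
    node = entry
    while True:
        best = min(edges[node], key=dist)
        if dist(best) < dist(node):
            node = best
        else:
            return node  # stuck: no neighbor improves on the current node

print(greedy(), min(coords, key=dist))  # prints: M T
```

Real HNSW mitigates exactly this by keeping a candidate list of efSearch nodes instead of a single current node, which is why raising efSearch buys recall at the cost of speed.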

The next problem you’ll likely encounter is how to choose the right ANN algorithm and tune its parameters for your specific dataset and query workload.
