The most surprising truth about matching embedding models to vector database indexes is that the "best" index isn’t determined by the model’s dimensionality alone, but by the distribution of your data and the specific query patterns you expect.

Let’s see this in action. Imagine we’re building a recommendation engine for a music streaming service. We’ve got a collection of songs, and for each song, we have an embedding generated by a model like all-MiniLM-L6-v2, which produces 384-dimensional vectors.

Here’s a snippet of what that data might look like (simplified):

[
  {"id": "song_1", "embedding": [0.123, -0.456, ..., 0.789]},
  {"id": "song_2", "embedding": [-0.987, 0.654, ..., -0.321]},
  // ... millions more songs
]

We want to be able to quickly find songs similar to a given query song. This is where vector databases and their indexes come in.

The Problem: Naive Search is Too Slow

If we didn’t use an index, finding similar songs would mean calculating the distance (e.g., cosine similarity) between our query vector and every single song embedding in our database. For millions of songs, this is computationally prohibitive for real-time applications.
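To see why, here is a minimal pure-Python sketch of exact search (exact_top_k and cosine_similarity are invented names for illustration; a real system would use optimized vector math, but the O(N·d) cost per query is the same):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); assumes non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, corpus, k=10):
    # Scores every vector in the corpus: O(N * d) work per query.
    scored = [(cosine_similarity(query, vec), song_id)
              for song_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [song_id for _, song_id in scored[:k]]
```

For a few thousand songs this is fine; for millions of 384-dimensional vectors, every query pays for hundreds of millions of multiply-adds, which is why an index is needed.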

The Solution: Approximate Nearest Neighbor (ANN) Indexes

Vector databases use ANN algorithms to trade a tiny bit of accuracy for massive speed gains. Instead of checking everything, they build structures that allow them to quickly narrow down the search space to a small subset of likely candidates.

Matching Models to Indexes: It’s Not Just About Dimension

You might think that if your embedding model outputs 384-dimensional vectors, you’d pick an index that’s optimized for 384 dimensions. While dimensionality is a factor, it’s often secondary to:

  1. Data Distribution: Is your data clustered tightly in some areas and sparse in others? Or is it relatively uniformly distributed?
  2. Query Patterns: Are you mostly looking for very close matches, or are you okay with finding "good enough" matches that might be slightly further away? Are you doing single-vector searches or batch searches?
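One quick way to get a feel for point 1 is to compare nearest-neighbor distances against average pairwise distances on a sample of your embeddings. This is a toy diagnostic, not part of any library (clustering_ratio is a made-up helper):

```python
import math
import random

def clustering_ratio(vectors, sample_size=100, seed=0):
    """Compare mean nearest-neighbour distance to mean pairwise distance
    on a sample. A ratio well below 1 suggests tight clusters (ivfflat-
    friendly); a ratio near 1 suggests a fairly uniform distribution."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest, pairwise = [], []
    for i, v in enumerate(sample):
        ds = [dist(v, w) for j, w in enumerate(sample) if j != i]
        nearest.append(min(ds))
        pairwise.extend(ds)
    return (sum(nearest) / len(nearest)) / (sum(pairwise) / len(pairwise))
```

It is only a heuristic, but running something like this before choosing an index is cheaper than benchmarking every index type blind.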

Let’s look at some common index types and how they relate to embedding models and data characteristics. We’ll use pgvector as an example, but the concepts apply broadly to other vector databases.

1. ivfflat (Inverted File Flat)

  • How it works: ivfflat partitions your vector space into N clusters using k-means. At query time, it searches only the clusters whose centroids are closest to your query vector; the number of clusters scanned is controlled in pgvector by the probes setting (e.g., SET ivfflat.probes = 10;).
  • When to use it: This is a good general-purpose index, especially when your data has some degree of clustering. It offers a good balance between speed and accuracy. It’s also relatively robust to different data distributions.
  • Configuration:
    CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
    
    • lists = 100: The number of k-means clusters the index builds. The pgvector docs suggest roughly rows / 1000 lists for up to ~1 million rows (about a thousand vectors per list) and sqrt(rows) beyond that. For a dataset of 1 million vectors, 100 lists is too few: each list would hold around 10,000 vectors, so every probe scans far more candidates than necessary.
    • vector_cosine_ops: The operator class, which determines the distance metric the index supports. It must match the operator you use in your queries (<=> for cosine distance) and the metric your embedding model is intended for; all-MiniLM-L6-v2 is typically used with cosine similarity.
  • Why it works: By only scanning a subset of the data (the probes closest lists), it drastically reduces the number of distance calculations. The lists parameter controls the granularity of the partitioning.
  • Matching to Model: Works well for models like all-MiniLM-L6-v2 (384D) or even higher dimensional models, as the k-means clustering is dimension-agnostic.
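The mechanism above can be sketched as a toy in-memory index. This is a simplified stand-in for ivfflat-style partitioning, not pgvector's actual implementation; kmeans and ivf_search are invented names for illustration:

```python
import math
import random

def l2(a, b):
    # Euclidean distance between two vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, n_lists, iters=10, seed=0):
    # Simplified k-means: random initial centroids, then a fixed number of
    # assignment / centroid-update rounds.
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_lists)
    for _ in range(iters):
        lists = [[] for _ in range(n_lists)]
        for v in vectors:
            nearest = min(range(n_lists), key=lambda i: l2(v, centroids[i]))
            lists[nearest].append(v)
        centroids = [
            [sum(col) / len(lst) for col in zip(*lst)] if lst else centroids[i]
            for i, lst in enumerate(lists)
        ]
    return centroids, lists

def ivf_search(query, centroids, lists, probes=1, k=5):
    # Scan only the `probes` lists whose centroids are closest to the query,
    # mirroring ivfflat's query-time behaviour; exact search within them.
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [v for i in order[:probes] for v in lists[i]]
    return sorted(candidates, key=lambda v: l2(query, v))[:k]
```

With probes equal to the number of lists, this degrades to exact search; with probes = 1 it scans only the single nearest cluster, which is the speed/recall trade at the heart of ivfflat.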

2. hnsw (Hierarchical Navigable Small World)

  • How it works: hnsw builds a multi-layer graph. Each layer is a graph where nodes are vectors and edges connect "close" vectors. Higher layers have fewer nodes and longer-range connections, acting as an index to quickly find a region in the graph. Lower layers have more nodes and shorter-range connections for finer-grained searching.
  • When to use it: hnsw generally offers higher recall (better accuracy) at speeds comparable to ivfflat, especially for high-dimensional data and latency-sensitive queries; it’s the go-to choice for many applications. It’s also less sensitive to data distribution than ivfflat and handles uniform distributions better.
  • Configuration:
    CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
    
    • m = 16: The maximum number of bidirectional links each node gets in a graph layer (the bottom layer allows 2m). A higher m produces a denser, more robust graph but increases build time and memory usage.
    • ef_construction = 64: During index building, this parameter controls the size of the dynamic list for exploring neighbors. Higher values mean more exploration, potentially leading to a better graph but slower construction.
    • During search, you control hnsw.ef_search (e.g., SET hnsw.ef_search = 100; before your SELECT). Higher values mean more exploration at query time, increasing both accuracy and latency.
  • Why it works: The hierarchical graph structure allows for efficient traversal. Starting from the top layer, it quickly navigates to the relevant region, then drills down to the more detailed layers for precise neighbor finding.
  • Matching to Model: Excellent for models like all-MiniLM-L6-v2 (384D) and can scale to even higher dimensions, though performance may degrade beyond a few thousand dimensions.
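The within-layer traversal at the heart of hnsw can be sketched as a greedy graph search with a bounded candidate list, which is essentially what the ef parameters size. This single-layer toy is a simplification (greedy_layer_search is an invented name; real HNSW adds the multi-layer hierarchy on top):

```python
import heapq
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_layer_search(query, nodes, edges, entry, ef=4):
    """Simplified within-layer HNSW search: keep at most `ef` best results,
    repeatedly expand the closest unexplored candidate's neighbours, and stop
    once no candidate can improve on the worst result kept so far.
    `nodes` maps id -> vector, `edges` maps id -> neighbour ids."""
    visited = {entry}
    d0 = l2(query, nodes[entry])
    candidates = [(d0, entry)]   # min-heap: closest candidate first
    results = [(-d0, entry)]     # max-heap (negated): worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0] and len(results) >= ef:
            break  # nothing left that can beat the current results
        for nb in edges[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = l2(query, nodes[nb])
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the worst result
    return [n for _, n in sorted((-d, n) for d, n in results)]
```

A larger ef lets the search keep more partial results alive, which is why raising ef_search improves recall at the cost of more distance computations.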

3. Exact Search (Brute Force, No Index)

  • How it works: This isn’t an ANN index at all. When you query without an index, Postgres performs a full sequential scan, evaluating the distance operator (<-> for L2/Euclidean distance, <=> for cosine distance, <#> for negative inner product) against every row.
  • When to use it: Never for production datasets of any significant size. Only useful for very small datasets for testing or initial development.
  • Configuration: No index creation needed, just a query:
    SELECT id FROM items ORDER BY embedding <-> '[query_vector]' LIMIT 10; -- L2 (Euclidean) distance
    SELECT id FROM items ORDER BY embedding <=> '[query_vector]' LIMIT 10; -- cosine distance (1 - cosine similarity)
    
  • Why it works: It’s guaranteed to be 100% accurate because it checks every single item.
  • Matching to Model: Works with any dimension, but query time grows linearly with the number of rows, which makes it unusable at scale.

The Counterintuitive Lever: ef_search and m in HNSW

The most impactful parameters for tuning hnsw performance and accuracy aren’t always obvious. For hnsw, the ef_search parameter at query time is your primary knob for trading accuracy against speed: a high ef_search (e.g., 200) gives near-perfect recall but takes longer, while a low ef_search (e.g., 10) is very fast but might miss some genuinely close neighbors. The m parameter at construction time influences how well the graph is built, and a higher m can support better recall at higher ef_search values. For ivfflat, the equivalent knobs are lists at build time and the number of lists scanned at query time, which pgvector controls via SET ivfflat.probes = ...;.
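When tuning these knobs, it helps to measure recall directly: run a sample of queries through the index, compute the true top-k with an exact scan, and compare. A minimal sketch (recall_at_k is an invented helper):

```python
def recall_at_k(exact_ids, approx_ids, k=10):
    """Fraction of the true top-k neighbours that the ANN index returned.
    Sweep a parameter like hnsw.ef_search over a query sample and plot
    recall against latency to pick an operating point."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k
```

Averaging this over a few hundred representative queries, at several parameter settings, gives you the recall/latency curve that the abstract advice above is really about.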

Choosing Your Index

  • Start with hnsw: For most modern applications, hnsw is a strong default. It handles various data distributions and high dimensions well. Tune ef_construction and ef_search based on your recall and latency requirements.
  • Consider ivfflat: If your data is known to be highly clustered and you need a simpler index, or if memory is a significant constraint (as hnsw can be more memory-intensive), ivfflat is a viable option. Tune lists to get a good balance.
  • Embeddings and Index Parameters: The dimensionality of your embedding model (e.g., 384 from all-MiniLM-L6-v2) is a factor, but the distribution of those embeddings is more critical. If your 384D embeddings are all over the place, hnsw will likely perform better than ivfflat. If they form distinct, well-separated clusters, ivfflat might be competitive.

The next step is often understanding how to perform batch queries efficiently and how to tune the index parameters based on empirical testing with your specific dataset and query load.

Want structured learning?

Take the full Vector-databases course →