The surprising truth about benchmarking vector databases is that the "best" database isn’t a fixed entity; it’s a moving target defined by your specific query patterns and tolerance for trade-offs.

Let’s see this in action. Imagine we have a dataset of 1 million product descriptions, each embedded into a 128-dimensional vector. We want to build a "similar products" recommendation engine. Here’s a simplified look at how a vector database might handle a search for similar items:

When a user views "Product A," its vector V_A is used to query the database. The database, instead of exact matching, finds vectors V_B, V_C, etc., that are "closest" to V_A in the high-dimensional space. This "closeness" is typically measured by a similarity or distance function such as cosine similarity or Euclidean distance.

# Conceptual example, not actual DB query
from sklearn.metrics.pairwise import cosine_similarity

# Assume vector_db is a connected vector database client
# Assume get_embedding("Product A") returns V_A
query_vector = get_embedding("Product A")

# This is what the database does conceptually:
# It iterates through its indexed vectors (V_1, V_2, ... V_N)
# and calculates similarity/distance to query_vector.
# It then returns the top-K most similar vectors.

# Example of calculating cosine similarity in Python
all_vectors = get_all_vectors_from_db() # In reality, this is a DB operation
similarities = cosine_similarity([query_vector], all_vectors)[0]

# Find indices of the top 5 most similar products
# (in practice you'd also exclude Product A itself, which matches with similarity 1.0)
top_k_indices = similarities.argsort()[-5:][::-1]
top_k_products = get_products_by_indices(top_k_indices)

print(f"Products similar to Product A: {top_k_products}")

The core problem vector databases solve is efficient similarity search in high dimensions. Traditional databases excel at exact matches (WHERE user_id = 123) or range queries (WHERE price BETWEEN 10 AND 20). They struggle with "find me things like this" when "like" is defined by complex, multi-dimensional relationships. Vector databases use specialized indexing structures (like HNSW, IVF, or Annoy) to approximate nearest neighbors, making these searches feasible at scale.

You control the behavior of a vector database primarily through its index configuration and query parameters.

Index Configuration:

  • Index Type: HNSW (Hierarchical Navigable Small World) is common for its balance of speed and accuracy. IVF (Inverted File Index) is often faster but can be less precise.
  • ef_construction (HNSW): Controls the trade-off between build time and the quality of the graph constructed. Higher values mean longer build times but potentially better search performance later.
  • M (HNSW): The number of bi-directional links created for each node during construction. Higher M leads to a more connected graph, improving recall but increasing memory usage and build time.
  • nlist (IVF): The number of clusters (centroids) to use. A larger nlist means more precise partitioning but can lead to more probes during search.
  • nprobe (IVF): The number of clusters to search during a query. Higher nprobe increases recall but also latency.

Query Parameters:

  • k: The number of nearest neighbors to return.
  • ef_search (HNSW): During search, this parameter controls the size of the dynamic candidate list. Higher ef_search increases recall but also latency.

When you’re benchmarking, you’re essentially exploring the Pareto frontier among Queries Per Second (QPS), Recall, and Latency.

  • QPS: How many queries the database can handle per second. Higher is better.
  • Recall: The percentage of true nearest neighbors (found by a brute-force exhaustive search) that your approximate nearest neighbor (ANN) search returns. A recall of 0.95 means you’re finding 95% of the actual closest items.
  • Latency: The time it takes for a single query to complete. Lower is better.

These three metrics are almost always in tension. To get higher QPS, you might reduce ef_search or nprobe, which lowers recall. To increase recall, you’ll likely increase ef_search or nprobe, which increases latency and decreases QPS.

Let’s say you’re using Pinecone and want to tune for a balance. Pinecone abstracts away the ANN index internals (you don’t set HNSW parameters like ef_construction or M directly); instead you choose a pod_type such as s1.x1 (storage-optimized) or p1/p2 (performance-optimized), a pod count, and a distance metric, then tune on the query side with top_k. You might start with s1.x1, observe initial QPS and latency, and scale pods or switch pod types from there.

// Example Pinecone index creation request (legacy pod-based API)
{
  "name": "my-vector-index",
  "dimension": 128,
  "metric": "cosine",
  "pod_type": "s1.x1",
  "pods": 1
}

Your benchmark script would look something like this:

import time
import random

import pinecone

# Assume Pinecone is initialized for your environment
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("my-vector-index")

num_queries = 1000
dimension = 128
# Random vectors as a stand-in for your test data
query_vectors = [[random.random() for _ in range(dimension)] for _ in range(num_queries)]
top_k_to_request = 10

# A sequential loop measures per-query latency; to saturate QPS,
# you'd issue queries from multiple concurrent clients.
latencies = []
for vector in query_vectors:
    start = time.perf_counter()
    index.query(
        vector=vector,
        top_k=top_k_to_request,
        include_values=False,
        # Experiment with 'filter' here if applicable
    )
    latencies.append(time.perf_counter() - start)

total_latency = sum(latencies)
avg_latency = total_latency / num_queries
qps = num_queries / total_latency

print(f"Average Latency: {avg_latency:.4f}s")
print(f"QPS: {qps:.2f}")

# To measure recall, you'd need a ground truth dataset
# and compare the returned IDs against the true nearest neighbors.
# This is a more involved process.
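The comparison itself is simple once you have both ID lists. A minimal sketch, using toy ID values made up for illustration:

```python
def recall_at_k(true_ids, returned_ids):
    """Fraction of the true top-k neighbors that the ANN search returned."""
    true_set = set(true_ids)
    return len(true_set & set(returned_ids)) / len(true_set)

# Ground truth from a brute-force pass vs. what the ANN index returned
true_top5 = [17, 42, 3, 99, 58]
ann_top5 = [17, 42, 3, 99, 61]  # one true neighbor missed

print(recall_at_k(true_top5, ann_top5))  # → 0.8
```

The involved part is producing `true_top5`: an exhaustive scan over the full dataset for every test query, which is exactly the cost the ANN index exists to avoid, so ground truth is typically computed once offline and cached.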

The crucial insight for benchmarking is that your query distribution matters. If your actual use case involves many small k values (e.g., finding 5 similar items), benchmarking with a large k won’t reflect reality. Similarly, if your queries are always clustered in one region of the vector space, performance might differ significantly from a uniform distribution. Many benchmarks use synthetic data; real-world data, with its inherent biases and clusters, will behave differently.
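One way to stress this in a benchmark is to generate both query distributions and measure each separately; a numpy sketch, where the cluster count and noise scale are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_queries = 128, 1_000

# Uniform queries: spread evenly across the space.
uniform_queries = rng.random((n_queries, dim), dtype=np.float32)

# Clustered queries: sampled tightly around a few "hot spots",
# mimicking real traffic concentrated on popular regions.
centroids = rng.random((5, dim), dtype=np.float32)
assignments = rng.integers(0, 5, size=n_queries)
noise = rng.normal(0.0, 0.02, (n_queries, dim)).astype(np.float32)
clustered_queries = centroids[assignments] + noise

# Run the same benchmark against each set and compare recall/latency.
```

Better still, sample queries from your actual production logs when you have them; synthetic distributions are a fallback, not a substitute.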

The one thing that often trips people up is that an ANN index’s performance isn’t static; it degrades gracefully as you query further away from the "densest" areas of your training data, or as the data itself drifts. This means that even if your initial benchmark shows excellent recall, performance might degrade over time or under specific query loads that hit these less-trafficked areas of the index. You need to design benchmarks that stress these edge cases and monitor performance in production, not just rely on a single, pristine test run.

After optimizing for QPS, Recall, and Latency, your next challenge will be managing the cost implications of different vector database configurations and scaling strategies.
