A vector database can process millions of queries per second, not by making individual nodes infinitely fast, but by distributing the load.
Let’s see what that looks like. Imagine we have a collection of 100 million vectors, each 128-dimensional. A single node might struggle to serve 10,000 QPS (queries per second) for similarity search on this dataset.
{
  "collection_name": "my_embeddings",
  "description": "User embeddings for recommendation engine",
  "vector_params": {
    "dimension": 128,
    "distance": "Cosine"
  },
  "index_params": {
    "type": "HNSW",
    "m": 16,
    "ef_construction": 200,
    "ef": 500
  },
  "sharding_enabled": true,
  "num_shards": 10,
  "replication_factor": 3
}
This configuration tells our vector database to create 10 shards for my_embeddings. Each shard will hold roughly 10 million vectors. If each shard can handle 1,000 QPS, then 10 shards give us 10,000 QPS. The replication_factor: 3 means each shard will have 3 copies distributed across different nodes for high availability and read scaling.
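The arithmetic behind this sizing is worth making explicit. A minimal sketch, using the illustrative numbers from the configuration above (the per-shard QPS figure is an assumption, not a measurement):

```python
# Back-of-the-envelope capacity math for the configuration above.
# All numbers are illustrative assumptions, not measured figures.
TOTAL_VECTORS = 100_000_000
NUM_SHARDS = 10
REPLICATION_FACTOR = 3
QPS_PER_SHARD = 1_000          # assumed single-shard throughput

vectors_per_shard = TOTAL_VECTORS // NUM_SHARDS
aggregate_qps = NUM_SHARDS * QPS_PER_SHARD
total_copies = NUM_SHARDS * REPLICATION_FACTOR

print(vectors_per_shard)  # 10000000 vectors per shard
print(aggregate_qps)      # 10000 QPS across the cluster
print(total_copies)       # 30 physical shard copies
```

The useful habit here is treating num_shards as a throughput multiplier and replication_factor as a storage multiplier: both scale linearly, and both cost hardware.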
The problem this solves is simple: a single machine has finite CPU, RAM, and I/O. As your dataset grows or your query volume increases, you hit those limits. Sharding breaks a large dataset into smaller, manageable chunks, each residing on a different node or set of nodes. Replication provides redundancy and allows read requests to be spread across multiple copies of the same data.
Internally, when a query comes in, the database’s query router (or a coordinating node) determines which shard(s) hold the relevant data. For a broad similarity search across the entire dataset, the query is broadcast to all shards. Each shard performs the search on its subset of vectors and returns its results. The router then aggregates these results, performs a final ranking if necessary, and returns the top-K results to the client.
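The aggregation step is a classic scatter-gather merge. A minimal sketch of the router's final ranking, assuming each shard returns its local top results as (score, id) pairs sorted by descending similarity (the function name and data shapes are hypothetical, not a real database API):

```python
import heapq

def aggregate_top_k(shard_results, k):
    """Merge per-shard result lists into a global top-K.

    shard_results: list of lists of (score, vector_id) pairs,
    each list already sorted by descending similarity score.
    """
    # heapq.merge needs ascending order, so merge on the negated score.
    merged = heapq.merge(
        *[[(-score, vid) for score, vid in shard] for shard in shard_results]
    )
    return [(-neg, vid) for neg, vid in list(merged)[:k]]

# Three shards each return their local top-2 by cosine similarity.
shard_a = [(0.97, "u41"), (0.88, "u17")]
shard_b = [(0.93, "u99"), (0.52, "u03")]
shard_c = [(0.91, "u58"), (0.90, "u72")]

print(aggregate_top_k([shard_a, shard_b, shard_c], k=3))
# [(0.97, 'u41'), (0.93, 'u99'), (0.91, 'u58')]
```

Because each shard only needs to return its own top-K, the payload the router merges is small even when the underlying dataset is enormous.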
For an exact match or a query targeting specific metadata that can prune shards, the router might only send the query to a subset of shards. This is where sharding offers more than just raw throughput; it can also reduce latency by avoiding unnecessary computation on irrelevant data.
The levers you control are num_shards and replication_factor. num_shards is directly tied to how you want to partition your data. A common strategy is to shard based on a tenant ID, a user ID prefix, or even a hash of a primary key. The goal is to distribute data and query load as evenly as possible. replication_factor dictates how many copies of each shard exist. A factor of 3 is common for production, balancing redundancy with resource utilization. If one node holding a replica fails, others can take over seamlessly. For read-heavy workloads, you can increase the replication_factor to allow more read requests to be served concurrently.
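Hash-based shard assignment, one of the strategies mentioned above, can be sketched in a few lines. This is an illustrative stand-in for the database's internal routing, not any specific product's implementation; the key format is hypothetical:

```python
import hashlib

NUM_SHARDS = 10

def shard_for_key(key: str, num_shards: int = NUM_SHARDS) -> int:
    # Use a stable digest rather than Python's built-in hash(),
    # which is salted per process and would break routing on restart.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# A write and any later lookup for the same key land on the same shard,
# while keys overall spread roughly evenly across all shards.
print(shard_for_key("user:12345"))
print(shard_for_key("user:12345") == shard_for_key("user:12345"))  # True
```

The same function serves both writes and pruned reads: if a query filters on the shard key, the router can hash it and contact exactly one shard instead of broadcasting.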
The index parameters (m, ef_construction, ef) and the vector dimension are critical to performance within a shard. A poorly chosen index or a very high dimension will limit the QPS a single shard can achieve, meaning you’ll need more shards or replicas to compensate.
The query router intelligently dispatches requests. With 10 shards and 3 replicas each, up to 30 nodes could serve any given query; typically the router executes it on one replica per shard, though it may hedge by sending the same sub-query to multiple replicas and keeping whichever answers first. To return a complete result it still needs one response per shard, but it doesn't have to wait for every replica to finish.
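This "first response per shard wins" pattern can be sketched with a thread pool. The replica call is a simulated stand-in with random latency; all names here are hypothetical, not a real client API:

```python
import concurrent.futures
import random
import time

NUM_SHARDS = 10
REPLICAS_PER_SHARD = 3

def query_replica(shard_id: int, replica_id: int):
    """Stand-in for a network call to one replica of one shard."""
    time.sleep(random.uniform(0.001, 0.01))  # simulated network latency
    return shard_id, f"results-from-shard-{shard_id}"

def scatter_gather(num_shards: int = NUM_SHARDS,
                   replicas_per_shard: int = REPLICAS_PER_SHARD):
    results = {}
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=num_shards * replicas_per_shard
    ) as pool:
        # Hedged requests: ask every replica, keep the first answer per shard.
        futures = [
            pool.submit(query_replica, s, r)
            for s in range(num_shards)
            for r in range(replicas_per_shard)
        ]
        for fut in concurrent.futures.as_completed(futures):
            shard_id, payload = fut.result()
            results.setdefault(shard_id, payload)  # first response wins
            if len(results) == num_shards:
                break  # one answer per shard is enough; ignore stragglers
    return results

print(sorted(scatter_gather().keys()))  # every shard contributed exactly once
```

Hedging trades extra load on replicas for lower tail latency: a slow replica never blocks the query, because a sibling replica's answer arrives first.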
When you increase num_shards, you’re not just adding more storage; you’re adding more processing units to handle queries. Each shard runs its own search algorithm and index structures. More shards mean more independent search operations happening in parallel across your cluster.
If you have 10 shards and a replication factor of 3, you physically have 30 copies of your data spread across nodes. A query that needs to scan the entire dataset will be sent to all 10 shards. Each shard, in turn, has 3 replicas. The system will likely pick one replica from each shard to execute the query. If one of those chosen replicas is slow or unavailable, the query router can dynamically switch to another replica of that specific shard to ensure the overall query completes.
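Replica selection with failover can be sketched as follows. The node names and health set are illustrative; a real router would track health via heartbeats rather than a static set:

```python
import random

REPLICAS = {  # shard_id -> replica node addresses (illustrative)
    0: ["node-a", "node-b", "node-c"],
    1: ["node-d", "node-e", "node-f"],
}
DOWN = {"node-a"}  # nodes the router currently believes are unhealthy

def query_shard(shard_id: int) -> str:
    """Try replicas of one shard in random order, failing over as needed."""
    candidates = REPLICAS[shard_id][:]
    random.shuffle(candidates)  # spread read load across healthy replicas
    for node in candidates:
        if node in DOWN:
            continue  # skip replicas known to be unavailable
        return f"shard {shard_id} served by {node}"
    raise RuntimeError(f"all replicas of shard {shard_id} are down")

print(query_shard(0))  # never served by node-a while it is marked down
```

Randomizing the candidate order doubles as crude load balancing, while the fallback loop is what lets a query complete even when its first-choice replica has failed.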
The next hurdle is optimizing the index build process across a sharded and replicated cluster.