Vitess can shard a table across multiple MySQL instances based on the value of a sharding column, which is often the primary key or the leading column of a composite primary key.
Let’s see this in action. Imagine we have a users table and we want to shard it by user_id.
CREATE TABLE users (
    user_id BIGINT NOT NULL,
    name VARCHAR(100),
    email VARCHAR(100),
    PRIMARY KEY (user_id)
);
To tell Vitess to shard this table by user_id, we define a vschema.json file:
{
  "sharded": true,
  "vindexes": {
    "hash": {
      "type": "hash"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "hash"
        }
      ]
    }
  }
}
In this vschema, we declare the keyspace as sharded and list users as one of its tables. We define a vindex named hash of type hash, a built-in vindex type that distributes values evenly across shards. Then we link the user_id column of the users table to this vindex. Vitess will use this mapping to determine which shard a given user_id belongs to.
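Because the table entry refers to the vindex by name, a typo in either place leaves the table pointing at a vindex that does not exist. The following is a minimal sketch of a consistency check you could run before applying a vschema. The helper check_vschema is hypothetical and is not part of the Vitess tooling; it only verifies that every column_vindexes entry names a vindex defined under vindexes.

```python
import json

VSCHEMA = """
{
  "sharded": true,
  "vindexes": {"hash": {"type": "hash"}},
  "tables": {
    "users": {
      "column_vindexes": [{"column": "user_id", "name": "hash"}]
    }
  }
}
"""


def check_vschema(vschema: dict) -> list[str]:
    """Return a list of problems found; an empty list means the check passed.

    Hypothetical helper for illustration, not a Vitess API.
    """
    problems = []
    defined = set(vschema.get("vindexes", {}))
    for table, spec in vschema.get("tables", {}).items():
        for entry in spec.get("column_vindexes", []):
            # Each entry must point at a vindex declared at the top level.
            if entry["name"] not in defined:
                problems.append(f"{table}: unknown vindex {entry['name']!r}")
    return problems


problems = check_vschema(json.loads(VSCHEMA))
print(problems or "vschema looks consistent")
```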
The core problem Vitess solves with sharding is the scalability limitation of a single database instance. As your data grows and your query load increases, a single machine can become a bottleneck. Sharding, or horizontal partitioning, distributes your data across multiple database instances (shards). Vitess automates this process, making it transparent to your application.
Internally, Vitess uses a concept called vindexes. A vindex is a layer of indirection between your application and the underlying MySQL shards: it maps a column value (like user_id) to a keyspace ID, and each shard owns a range of keyspace IDs, so the mapping determines which shard holds the row. Vitess offers several vindex types, including hash, numeric, reverse_bits, and the lookup family, each suited for different sharding strategies and access patterns.
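The mapping can be pictured in two steps: the vindex turns a column value into a keyspace ID, and the shard whose range contains that keyspace ID holds the row. The sketch below illustrates the idea under stated assumptions. The four-shard layout is invented for the example, and BLAKE2 is only a stand-in: the Vitess hash vindex uses its own function, so the keyspace IDs printed here will not match what Vitess computes.

```python
import hashlib

# Shards named by the keyspace-ID ranges they own, as (name, start, end)
# with end exclusive. A four-shard layout is assumed for illustration.
SHARDS = [
    ("-40", 0x00, 0x40),
    ("40-80", 0x40, 0x80),
    ("80-c0", 0x80, 0xC0),
    ("c0-", 0xC0, 0x100),
]


def standin_hash_vindex(user_id: int) -> bytes:
    """Map a non-negative column value to an 8-byte keyspace ID.

    Stand-in only: the Vitess hash vindex uses its own function, not BLAKE2.
    """
    raw = user_id.to_bytes(8, "big")
    return hashlib.blake2b(raw, digest_size=8).digest()


def shard_for(keyspace_id: bytes) -> str:
    """Find the shard whose range contains the keyspace ID's first byte."""
    first = keyspace_id[0]
    for name, start, end in SHARDS:
        if start <= first < end:
            return name
    raise ValueError("shard ranges do not cover the keyspace")


for user_id in (1, 2, 3):
    ksid = standin_hash_vindex(user_id)
    print(user_id, ksid.hex(), shard_for(ksid))
```

The application only ever supplies user_id; the keyspace ID and the shard name are derived, which is what lets the shard layout change without changing the queries.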
The sharded: true flag marks the keyspace described by this vschema as sharded; it applies to every table listed, not to a single table. The column_vindexes section then specifies which column(s) of each table drive sharding, and the first entry is the table’s primary vindex. Vitess uses that vindex (in this case, hash) to compute a keyspace ID for each row. When you insert a row, Vitess computes the keyspace ID from the vindex column and routes the statement to the shard that owns it. When you query with a filter on the vindex column, it uses the same mapping to find the correct shard; without such a filter, it has to send the query to all shards.
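The routing decision described above can be reduced to one question: does the query pin the sharding column to a single value? The following is a deliberately simplified sketch of that decision, not the Vitess query planner. The functions sharding_column and plan_route are hypothetical, and equality_filters stands in for the column = value predicates of a WHERE clause.

```python
VSCHEMA = {
    "sharded": True,
    "vindexes": {"hash": {"type": "hash"}},
    "tables": {
        "users": {"column_vindexes": [{"column": "user_id", "name": "hash"}]}
    },
}


def sharding_column(vschema: dict, table: str) -> str:
    """Return the column of the first column_vindexes entry for the table."""
    return vschema["tables"][table]["column_vindexes"][0]["column"]


def plan_route(vschema: dict, table: str, equality_filters: dict) -> str:
    """Classify a query by whether it pins the sharding column to one value.

    Hypothetical helper: real query planning handles many more cases
    (IN lists, joins, lookup vindexes, and so on).
    """
    if sharding_column(vschema, table) in equality_filters:
        return "single-shard"
    return "scatter"


print(plan_route(VSCHEMA, "users", {"user_id": 42}))
print(plan_route(VSCHEMA, "users", {"email": "a@example.com"}))
```

The first call is classified as single-shard and the second as scatter, which is why queries that filter only on email are more expensive in this layout than queries that filter on user_id.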
When a table has a composite primary key, like (user_id, order_id), and you want to shard by user_id, the vschema looks much the same. The primary key itself is declared in the MySQL table definition, not in the vschema; the vschema only names the column that drives sharding, so you still use a hash vindex on user_id.
{
  "sharded": true,
  "vindexes": {
    "user_id_hash": {
      "type": "hash"
    }
  },
  "tables": {
    "orders": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "user_id_hash"
        }
      ]
    }
  }
}
Here, orders is sharded by user_id. Vitess will ensure all orders for a given user_id reside on the same shard, even though order_id is also part of the primary key. This matters for queries that filter by user_id, because they can be answered by a single shard.
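Co-location follows directly from the fact that only user_id feeds the vindex. The sketch below makes that concrete with a handful of invented rows. The rows, the four-shard count, and the BLAKE2-based standin_shard function are all assumptions for illustration, and the shard numbers it prints are not the ones Vitess would choose.

```python
import hashlib
from collections import defaultdict

# Hypothetical rows of the orders table: (user_id, order_id).
ORDERS = [(7, 1), (7, 2), (7, 3), (9, 1), (9, 2)]


def standin_shard(user_id: int, shard_count: int = 4) -> int:
    """Pick a shard index from user_id alone (stand-in for the hash vindex)."""
    digest = hashlib.blake2b(user_id.to_bytes(8, "big"), digest_size=8).digest()
    return digest[0] * shard_count // 256


shards_per_user = defaultdict(set)
for user_id, order_id in ORDERS:
    # order_id is deliberately ignored: only the vindex column decides placement.
    shards_per_user[user_id].add(standin_shard(user_id))

for user_id, shards in sorted(shards_per_user.items()):
    print(f"user {user_id}: orders on shard(s) {sorted(shards)}")
```

Each user ends up with exactly one shard in its set, regardless of how many orders it has.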
You choose the number of shards in a keyspace and the range of keyspace IDs each one covers. Vitess handles routing to those shards, and its VReplication and vtctl tooling moves data when the shard layout changes. The vschema is the central piece of configuration that tells Vitess how rows map to shards.
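Vitess names shards after the keyspace-ID range each one owns, written in hex with an open end on either side, so a two-shard keyspace uses -80 and 80-. The sketch below generates such names for an evenly split keyspace. The helper shard_names is hypothetical, and it assumes a power-of-two shard count of at most 256 so that every boundary fits in one hex byte.

```python
def shard_names(shard_count: int) -> list[str]:
    """Name equal-sized shards by the keyspace-ID range each one owns.

    Assumes a power-of-two count up to 256 so boundaries fit in one hex byte.
    """
    if shard_count < 1 or shard_count > 256 or shard_count & (shard_count - 1):
        raise ValueError("shard_count must be a power of two between 1 and 256")
    if shard_count == 1:
        return ["-"]  # a single shard covering the whole keyspace-ID range
    step = 256 // shard_count
    bounds = [f"{i * step:02x}" for i in range(1, shard_count)]
    # The first range has no lower bound and the last has no upper bound.
    edges = [""] + bounds + [""]
    return [f"{lo}-{hi}" for lo, hi in zip(edges, edges[1:])]


print(shard_names(2))
print(shard_names(4))
```

Splitting a shard later means replacing one range with two narrower ones, for example 80- with 80-c0 and c0-, which is the kind of change resharding carries out.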
A common misconception is that sharding by a simple hash vindex is always optimal. While it provides excellent distribution, it deliberately scatters adjacent values, which makes range queries on the sharding column inefficient. For example, if you shard users by user_id using a hash vindex and then frequently query for users within a specific user_id range (e.g., WHERE user_id BETWEEN 1000 AND 2000), Vitess may have to query every shard to satisfy that request. The same is true of any query that does not filter on the sharding column at all. In such cases, an order-preserving vindex such as numeric, or a composite sharding strategy, might be more appropriate, at the cost of a less even distribution.
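The trade-off is easy to see by counting how many shards a contiguous range of IDs touches under each kind of mapping. The sketch below compares a hash-style mapping with an order-preserving one that uses the value itself as the keyspace ID. The four-shard layout and the BLAKE2 stand-in are assumptions for illustration, and the sketch only shows where the rows live, not how the Vitess planner would route the query.

```python
import hashlib

SHARD_COUNT = 4  # assumed layout: four equal keyspace-ID ranges


def shard_of(keyspace_id: bytes) -> int:
    """Index of the equal-sized range that holds this keyspace ID."""
    return keyspace_id[0] * SHARD_COUNT // 256


def standin_hash(user_id: int) -> bytes:
    """Stand-in for the hash vindex; not the function Vitess uses."""
    return hashlib.blake2b(user_id.to_bytes(8, "big"), digest_size=8).digest()


def order_preserving(user_id: int) -> bytes:
    """Use the value itself as the keyspace ID, keeping neighbours together."""
    return user_id.to_bytes(8, "big")


user_ids = range(1000, 2001)  # WHERE user_id BETWEEN 1000 AND 2000
hashed = {shard_of(standin_hash(u)) for u in user_ids}
ordered = {shard_of(order_preserving(u)) for u in user_ids}

print("hash-style mapping touches shards:", sorted(hashed))
print("order-preserving mapping touches shards:", sorted(ordered))
```

The order-preserving mapping keeps the whole range on one shard, but it also places every small user_id on that same shard, which is the uneven distribution the hash vindex is designed to avoid.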
The next step in managing your Vitess data after defining your sharding strategy is to understand how to perform resharding operations.