Vitess VSchema design is less about defining tables and more about defining how your sharded tables relate to each other and how Vitess should manage that relationship.
Let’s see Vitess in action. Imagine a simple e-commerce scenario: users and their orders.
    {
      "sharded": true,
      "vindexes": {
        "user_id_vdx": {
          "type": "numeric"
        }
      },
      "tables": {
        "users": {
          "column_vindexes": [
            {
              "column": "id",
              "name": "user_id_vdx"
            }
          ]
        },
        "orders": {
          "column_vindexes": [
            {
              "column": "user_id",
              "name": "user_id_vdx"
            }
          ]
        }
      }
    }
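In this VSchema, we’ve declared users and orders as sharded tables. user_id_vdx is a numeric vindex: a function that maps a column value to a keyspace ID, which in turn determines the shard. The vindex itself is not tied to any column; the column_vindexes entries do that. users binds its id column to user_id_vdx, and because this is the first vindex listed for the table, it is the primary vindex, the one that decides which shard each row lives on. Crucially, we’ve also bound orders.user_id to user_id_vdx. This is not a foreign key constraint, but it has the effect that matters for sharding: a given user ID maps to the same keyspace ID in both tables, so Vitess co-locates a user’s orders on the same shard as the user row. This is key for efficient joins and lookups across these tables.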
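The co-location argument can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Vitess code: it assumes the numeric vindex uses the value itself as an 8-byte big-endian keyspace ID, and it assumes a hypothetical two-shard layout named -80 and 80-. The function names are made up for the example.

```python
import struct

# Hypothetical two-shard layout (assumption). Each shard owns a half-open
# range [start, end) of keyspace IDs; an empty bound means "unbounded".
SHARDS = {
    "-80": (b"", bytes.fromhex("80")),
    "80-": (bytes.fromhex("80"), b""),
}


def numeric_keyspace_id(value: int) -> bytes:
    """Assumed numeric-vindex behavior: the value as 8 big-endian bytes."""
    return struct.pack(">Q", value)


def shard_for(keyspace_id: bytes) -> str:
    """Return the shard whose key range contains the keyspace ID."""
    for name, (start, end) in SHARDS.items():
        if keyspace_id >= start and (end == b"" or keyspace_id < end):
            return name
    raise ValueError("no shard covers this keyspace ID")


# users.id = 123 and orders.user_id = 123 go through the same vindex,
# so they produce the same keyspace ID and therefore the same shard.
user_row_shard = shard_for(numeric_keyspace_id(123))
order_row_shard = shard_for(numeric_keyspace_id(123))
print(user_row_shard, order_row_shard)  # -80 -80
```

Nothing here depends on the two tables knowing about each other: co-location falls out of applying one function to equal values.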
The problem Vitess VSchema design solves is distributed data management at scale. Without it, sharding complex relational data would be a nightmare of manual data placement, complex routing logic, and difficult cross-shard operations. Vitess VSchema allows you to define these relationships declaratively. Vitess then takes this definition and manages the data distribution, query routing, and schema changes across your shards.
Internally, when a query arrives, Vitess consults the VSchema. For a query like SELECT * FROM orders WHERE user_id = 123, Vitess sees that orders is sharded by user_id using user_id_vdx. It then applies the vindex to the value 123 to get a keyspace ID, finds the shard whose key range contains it, and routes the query directly to that shard. If the query joins users and orders on users.id = orders.user_id, Vitess knows that both columns are bound to the same vindex, so matching rows are always on the same shard and the join can run entirely within each shard instead of being stitched together across shards.
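A much-simplified sketch of that routing decision follows. It is not the Vitess planner, which handles many more cases (secondary vindexes, IN lists, subqueries); the helper names route and join_is_shard_local are hypothetical and only cover a single equality filter and a two-table equi-join.

```python
# The VSchema from this article, as a Python dict.
VSCHEMA = {
    "sharded": True,
    "vindexes": {"user_id_vdx": {"type": "numeric"}},
    "tables": {
        "users": {"column_vindexes": [{"column": "id", "name": "user_id_vdx"}]},
        "orders": {"column_vindexes": [{"column": "user_id", "name": "user_id_vdx"}]},
    },
}


def primary_vindex(table):
    """The first column_vindexes entry is treated as the primary vindex."""
    return VSCHEMA["tables"][table]["column_vindexes"][0]


def route(table, filter_column):
    """Equality filter on the primary vindex column -> one shard, else all shards."""
    if primary_vindex(table)["column"] == filter_column:
        return "single-shard"
    return "scatter"


def join_is_shard_local(left, right):
    """True when both join columns are primary vindex columns sharing one vindex."""
    (left_table, left_col), (right_table, right_col) = left, right
    lv, rv = primary_vindex(left_table), primary_vindex(right_table)
    return (
        lv["column"] == left_col
        and rv["column"] == right_col
        and lv["name"] == rv["name"]
    )


print(route("orders", "user_id"))  # single-shard
print(route("orders", "status"))   # scatter ("status" is a made-up column)
print(join_is_shard_local(("users", "id"), ("orders", "user_id")))  # True
```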
The levers you control are the vindexes and tables sections. vindexes declares named functions that map a column value to a keyspace ID. tables then specifies which columns in your tables use which vindexes. Vitess ships with functional vindexes such as hash, xxhash, and numeric, and with lookup vindexes such as lookup_unique and consistent_lookup, which are backed by a separate lookup table; you can also plug in your own. For multi-column sharding, you can use a multi-column vindex. The decision of which column to shard on (your sharding key) is paramount. It should be a column frequently used in WHERE clauses and JOIN conditions, and ideally one whose values the chosen vindex distributes evenly across shards.
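To see why even distribution depends on the vindex as much as on the column, the sketch below compares an identity-style mapping (assumed to match how numeric behaves) against a hashing mapping for sequential IDs. SHA-256 is used purely as a stand-in; it is not the function Vitess's hash vindex uses. The four-shard layout is also an assumption.

```python
import hashlib
import struct
from collections import Counter

NUM_SHARDS = 4  # hypothetical: four equal key ranges, split on the top two bits


def shard_index(keyspace_id: bytes) -> int:
    """Pick one of four equal key ranges from the first byte's top two bits."""
    return keyspace_id[0] >> 6


def identity_ksid(value: int) -> bytes:
    """Identity-style mapping: the value as 8 big-endian bytes."""
    return struct.pack(">Q", value)


def hashed_ksid(value: int) -> bytes:
    """Stand-in hashing mapping (SHA-256, NOT Vitess's hash function)."""
    return hashlib.sha256(struct.pack(">Q", value)).digest()


def spread(to_ksid, ids):
    """Count how many IDs land in each of the four key ranges."""
    counts = Counter(shard_index(to_ksid(i)) for i in ids)
    return [counts.get(s, 0) for s in range(NUM_SHARDS)]


ids = range(1, 10001)  # auto-increment style IDs
print(spread(identity_ksid, ids))  # [10000, 0, 0, 0]
print(spread(hashed_ksid, ids))    # roughly even across the four ranges
```

With small sequential IDs, the identity mapping puts every row in the lowest key range, so an identity-style vindex suits values that are already well spread (for example, random 64-bit IDs), while a hashing vindex spreads sequential IDs for you.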
A common pitfall is sharding on a primary key that doesn’t align with your access patterns. For example, sharding orders by order_id might seem natural, but if you always query orders by user_id, Vitess cannot tell which shards hold a given user’s orders, so most order-related queries become scatter-gather operations across every shard, defeating the purpose of sharding. In such cases, make user_id the primary vindex of the orders table, sharing the vindex used by users so the two tables stay co-located. If no existing column both matches your access patterns and distributes well, consider a synthetic sharding key.
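The cost of the mismatch can be made concrete with a toy placement model. Everything here is hypothetical: four shards, a SHA-256 based placement function standing in for a real vindex, and twenty made-up orders for user 42.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_of(key: int) -> int:
    """Stand-in placement function (not a real Vitess vindex)."""
    digest = hashlib.sha256(key.to_bytes(8, "big")).digest()
    return digest[0] >> 6  # top two bits select one of four shards


# Twenty made-up orders, all belonging to user 42: (order_id, user_id).
orders = [(order_id, 42) for order_id in range(1000, 1020)]

# Sharded by order_id: the user's rows end up scattered, and since the
# router cannot derive a shard from user_id, it has to ask every shard.
shards_by_order_id = {shard_of(order_id) for order_id, _ in orders}

# Sharded by user_id: every row for user 42 maps to the same shard,
# so WHERE user_id = 42 can be sent to exactly one shard.
shards_by_user_id = {shard_of(user_id) for _, user_id in orders}

print(len(shards_by_order_id), "shards hold user 42's orders when keyed by order_id")
print(len(shards_by_user_id), "shard holds them when keyed by user_id")
```

The second count is always 1 by construction; the first depends on the placement function but grows toward the full shard count as a user accumulates orders.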
The next concept you’ll encounter is managing resharding operations, particularly when your sharding key needs to change or your data distribution becomes uneven.