A hash vindex is not what you think it is. It doesn't create or define shards, and it doesn't store a mapping anywhere. It is a pure function that maps a column value to a keyspace ID; the shard layout, key ranges carved out of the keyspace-ID space, exists independently of it.
Let’s see it in action. Imagine we have a users table sharded by user_id, with a hash vindex on user_id as its primary vindex: the vindex that decides which shard owns each row. Later we’ll also want to look up users efficiently by email, which turns out to be a different problem.
-- Schema definition
CREATE TABLE users (
user_id BIGINT NOT NULL,
email VARCHAR(255) NOT NULL,
name VARCHAR(255),
PRIMARY KEY (user_id)
) ENGINE=InnoDB;
-- Vindex definition (VSchema DDL, issued through vtgate)
-- This declares the built-in hash vindex on user_id as the table's
-- primary vindex. It maps each user_id to a keyspace ID; the shard
-- boundaries (key ranges) are defined separately, when the keyspace
-- is sharded. The vindex does NOT create the shards itself.
ALTER VSCHEMA ON users ADD VINDEX hash(user_id) USING hash;
When you query for a user by user_id, Vitess hashes the value, gets a keyspace ID, and routes the query to the single shard whose key range covers it. Because hash is a unique vindex, this is an exact, single-shard route, not a set of likely candidates. A query that constrains no vindexed column, by contrast, has to be scattered to every shard.
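That routing step can be sketched in a few lines of Python. This is a toy model, not Vitess code: SHA-256 truncated to eight bytes stands in for Vitess's real hash vindex (a 3DES-based block cipher over the 64-bit value), and the two shard names mirror a typical -80/80- key-range split.

```python
import hashlib

def keyspace_id(user_id: int) -> bytes:
    # Stand-in for the hash vindex: a pure function from value to an
    # 8-byte keyspace ID. (Vitess's real 'hash' vindex uses a
    # 3DES-based block cipher, not SHA-256; the shape is the same.)
    return hashlib.sha256(user_id.to_bytes(8, "big")).digest()[:8]

def route(user_id: int) -> str:
    # Shards own key ranges of the keyspace-ID space: "-80" covers
    # IDs whose first byte is below 0x80, "80-" covers the rest.
    return "-80" if keyspace_id(user_id)[0] < 0x80 else "80-"
```

Every user_id resolves to exactly one shard name, which is the point: the vindex is a unique mapping, not a candidate filter.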
The core problem Vitess solves is managing large datasets that exceed the capacity of a single database instance. Sharding is the primary mechanism for this, splitting data across multiple independent database servers. However, efficiently querying this sharded data, especially by non-sharding keys, presents a challenge. Vindexes are Vitess’s solution to this.
A hash vindex, specifically, is a functional vindex: the keyspace ID is computed directly from the column value by a hash function, with no stored mapping and no extra query. That makes it the usual choice for a table's primary vindex, the one that determines where each row lives. What it does not do is define the shard boundaries themselves. Those are key ranges over the keyspace-ID space, and the same hash function keeps working unchanged as ranges are split or merged. This is different from a lookup vindex.
A lookup vindex, on the other hand, behaves like a secondary index. It maps a value (e.g., email) to a keyspace ID by storing that mapping in a real lookup table that Vitess maintains. When you query by email, Vitess first reads the lookup table to find the keyspace ID, then routes to the shard whose key range owns it. This is a two-step process: first resolve the mapping, then route.
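In a real VSchema, that two-step mapping is declared as a lookup vindex type. A minimal sketch of what this might look like for the email case (the vindex and lookup-table names here are illustrative; `consistent_lookup_unique` is one of the lookup vindex types Vitess provides, and `owner` tells Vitess to maintain the lookup rows as part of writes to `users`):

```json
{
  "sharded": true,
  "vindexes": {
    "hash": { "type": "hash" },
    "users_email_idx": {
      "type": "consistent_lookup_unique",
      "params": {
        "table": "users_email_idx",
        "from": "email",
        "to": "keyspace_id"
      },
      "owner": "users"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        { "column": "user_id", "name": "hash" },
        { "column": "email", "name": "users_email_idx" }
      ]
    }
  }
}
```

The lookup table itself lives in MySQL like any other table, keyed by email and storing the owning row's keyspace ID, which is why every insert into users also costs a write to the lookup table.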
The mental model for a hash vindex is a stateless function: value in, keyspace ID out, nothing stored, nothing to keep in sync. For a lookup vindex, it’s stored state: a maintained mapping from a non-primary column to a keyspace ID, which then determines the shard.
The difference between a hash vindex and a lookup vindex becomes critical when you weigh read patterns against write overhead. A hash vindex costs nothing to maintain and resolves in a single step, but it only helps queries that constrain the hashed column. A lookup vindex adds a row to the lookup table on every insert (and removes it on delete), plus an extra read at routing time, but in exchange it gives you single-shard routing on a column the table is not sharded by.
Vitess routes a query by checking whether the WHERE clause constrains a column covered by a vindex. If the column has a unique vindex (hash, or a unique lookup type), the query is sent to exactly one shard. If no vindex applies, Vitess falls back to a scatter-gather: the query goes to every shard and vtgate merges the results.
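A toy sketch of that planning decision, with illustrative shard names, an illustrative set of vindexed columns, and SHA-256 standing in for the real hash function:

```python
import hashlib

SHARDS = ["-80", "80-"]   # key ranges over the keyspace-ID space
VINDEXED = {"user_id"}    # columns with a unique vindex (illustrative)

def shards_to_query(column: str, value: int) -> list[str]:
    """Toy planner: unique vindex -> one shard, otherwise scatter."""
    if column not in VINDEXED:
        # No vindex covers the column: every shard must be queried.
        return list(SHARDS)
    # Stand-in hash (Vitess's real hash vindex is 3DES-based).
    ksid = hashlib.sha256(value.to_bytes(8, "big")).digest()
    return ["-80" if ksid[0] < 0x80 else "80-"]
```

The asymmetry is the whole story: a constrained vindexed column collapses the plan to one shard, while anything else pays the scatter-gather cost.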
You might assume that declaring a hash vindex is what shards the table. It isn’t. The vindex decides which keyspace ID each row gets, but the shards themselves, key ranges such as -80 and 80-, come from the keyspace topology. Declaring a hash vindex on a keyspace doesn’t split anything; resharding does. The vindex’s real job is placement: if you want rows placed by email, you’d make a hash vindex on email the table’s primary vindex, not a secondary one.
The most surprising thing is the indirection: a hash vindex never names a shard. It produces a keyspace ID, and the shard is simply whichever key range currently covers that ID. That one level of indirection is what lets Vitess reshard, splitting or merging key ranges, without rehashing or rewriting a single row’s keyspace ID.
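A toy illustration of why that matters for resharding (stand-in hash, one-byte keyspace IDs, illustrative boundaries): splitting a shard changes only the boundary list, never any row's hash.

```python
import hashlib

def keyspace_id_byte(v: int) -> int:
    # Stand-in hash (Vitess's real hash vindex is a 3DES-based
    # cipher); one byte of keyspace ID is enough to illustrate.
    return hashlib.sha256(v.to_bytes(8, "big")).digest()[0]

def route(v: int, boundaries: list[int]) -> int:
    # Index of the key range covering keyspace_id(v): counting how
    # many boundaries the ID is at or above.
    return sum(keyspace_id_byte(v) >= b for b in boundaries)

rows = list(range(20))
# Before resharding: one shard "-", i.e. no boundaries at all.
assert all(route(v, []) == 0 for v in rows)
# After splitting at 0x80 into "-80" and "80-": some rows now fall
# in range 1, but no keyspace ID was recomputed; only the boundary
# list moved, and the rows it reassigns are the ones that migrate.
moved = [v for v in rows if route(v, [0x80]) == 1]
```

This is the payoff of the keyspace-ID indirection: a resharding operation copies the rows whose IDs fall in the new ranges, while every vindex, and every stored keyspace ID, stays exactly as it was.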
The next concept you’ll grapple with is how to handle multi-column vindexes and their interplay with sharding strategies.