Ordered Scatter Queries in Vitess are a way to execute ORDER BY statements across multiple shards, something that’s not inherently possible with standard scatter-gather operations.
Here’s a Vitess cluster set up for ordered scatter queries:
# Schema for sharded tables
CREATE TABLE customers (
customer_id BIGINT AUTO_INCREMENT,
name VARCHAR(100),
shard_key INT, -- Assuming shard_key is part of the sharding
PRIMARY KEY (customer_id, shard_key)
) ENGINE=InnoDB;
CREATE TABLE orders (
order_id BIGINT AUTO_INCREMENT,
customer_id BIGINT,
order_date DATETIME,
amount DECIMAL(10, 2),
shard_key INT,
PRIMARY KEY (order_id, shard_key)
) ENGINE=InnoDB;
-- Sharding configuration (example)
{
"sharding_key_based_on_column": "shard_key",
"vindexes": {
"hash": {
"type": "hash",
"params": {
"num_shards": "4"
}
}
},
"tables": {
"customers": "hash",
"orders": "hash"
}
}
Now, let’s imagine we have a query like:
SELECT c.customer_id, c.name, o.order_id, o.order_date, o.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id AND c.shard_key = o.shard_key
WHERE c.shard_key BETWEEN 1 AND 2 -- Targeting specific shards for this example
ORDER BY o.order_date DESC
LIMIT 10;
Without Ordered Scatter, Vitess would execute this as a scatter query. Each shard would independently sort its orders and return results. Then, Vitess would gather these results and perform a final sort on the aggregated data to apply the ORDER BY clause. This final sort can become a significant bottleneck if the total result set before the final sort is large.
Ordered Scatter Queries optimize this by pushing down the sorting logic. When Vitess detects an ORDER BY clause that can be satisfied entirely by the sharding key or a prefix of the sharding key (or if the ORDER BY columns are part of the join condition and can be correlated across shards), it can instruct each shard to not only sort its local data but also to return only the top N (where N is related to the LIMIT) results already sorted. Vitess then only needs to merge these already sorted, limited result sets from each shard. This dramatically reduces the amount of data that needs to be processed in the final aggregation step, often to just N * num_shards rows, or even fewer if further filtering occurs.
The key here is that the ORDER BY clause must be "shard-aware" or "shard-optimal." This typically means the ORDER BY columns are part of the sharding key, or are directly correlated with the sharding key in a way that allows each shard to produce a correctly ordered subset. For example, if orders is sharded by shard_key and customer_id, and we query ORDER BY order_date, this won’t automatically trigger Ordered Scatter unless order_date is also somehow tied to the sharding. However, if we were sorting by customer_id or shard_key in a specific way that aligns with the sharding, Vitess could leverage this.
The real magic happens when Vitess analyzes the query plan. If it sees an ORDER BY that can be satisfied by the sharding scheme (e.g., ORDER BY shard_key), it will execute the query in an ordered scatter mode. This involves sending a modified query to each shard, typically including a LIMIT clause on the shard itself. For instance, if the overall query has LIMIT 10, Vitess might send SELECT ... ORDER BY o.order_date DESC LIMIT 100 to each shard (the 100 is a heuristic, often related to LIMIT * N or a configured value). Each shard returns its top 100 sorted results, and Vitess then merges these num_shards * 100 rows and picks the final top 10. This avoids sorting potentially millions of rows in the vttablet aggregation layer.
The most surprising true thing about Ordered Scatter Queries is that they don’t require any special configuration changes to Vitess itself; they are an optimization that Vitess automatically applies when the query structure and schema allow it. The primary lever you control is the design of your schema and the way you write your queries, ensuring that ORDER BY clauses align with sharding keys where possible.
The next concept you’ll likely run into is understanding how Vitess handles ORDER BY clauses that span multiple columns and how that interacts with sharding.