A TimescaleDB distributed hypertable isn’t just a single database instance with sharding; it’s a coordinated network of nodes where any node can own any chunk, making data placement and retrieval surprisingly dynamic.
Let’s see this in action. Imagine we have three nodes: node1 (coordinator), node2 (data node), and node3 (data node).
First, install the timescaledb extension on every node, then register the data nodes from node1 with add_data_node (under the hood, this creates foreign servers backed by timescaledb_fdw).
-- On node1 (coordinator / access node)
CREATE EXTENSION IF NOT EXISTS timescaledb;
-- Register node2 as a data node
SELECT add_data_node('node2', host => 'node2_hostname', port => 5432, database => 'timescaledb', password => 'your_password');
-- Register node3 as a data node
SELECT add_data_node('node3', host => 'node3_hostname', port => 5432, database => 'timescaledb', password => 'your_password');
Now, we create the distributed hypertable on node1, telling it about the available data nodes.
-- On node1 (coordinator)
CREATE TABLE sensor_data (
time TIMESTAMPTZ NOT NULL,
device_id INT NOT NULL,
temperature DOUBLE PRECISION
);
SELECT create_distributed_hypertable('sensor_data', 'time', 'device_id',
number_partitions => 4,
replication_factor => 2,
data_nodes => ARRAY['node2', 'node3']);
Here’s where it gets interesting: replication_factor => 2 means each chunk is written to two different data nodes. number_partitions => 4 sets the number of hash partitions for the space dimension (device_id); the time dimension is still range-partitioned into intervals, so each chunk is the intersection of a time range and a device_id hash partition.
When you insert data, TimescaleDB automatically figures out which nodes should receive which chunks based on the partition key (time, device_id) and the replication_factor.
-- On node1 (coordinator)
INSERT INTO sensor_data (time, device_id, temperature) VALUES
('2023-10-27 10:00:00+00', 1, 25.5),
('2023-10-27 10:00:00+00', 2, 26.1),
('2023-10-28 11:00:00+00', 1, 25.8);
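The routing of those three rows can be sketched as a simplified model: a row's chunk is determined by its time bucket plus a hash of device_id. This is an illustration only — it assumes a 1-day chunk_time_interval (TimescaleDB's default is 7 days), and uses a modulo as a stand-in for TimescaleDB's internal hash function.

```python
from datetime import datetime, timezone

CHUNK_INTERVAL_DAYS = 1  # assumed for illustration; TimescaleDB defaults to 7 days
NUM_PARTITIONS = 4       # matches number_partitions => 4 above

def chunk_key(ts: datetime, device_id: int) -> tuple:
    """Map a row to a logical chunk: (time bucket, space partition).

    Simplified model: TimescaleDB range-partitions on time and
    hash-partitions on device_id with its own hash function; the
    modulo here is just a stand-in.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    time_bucket = (ts - epoch).days // CHUNK_INTERVAL_DAYS
    space_partition = device_id % NUM_PARTITIONS  # stand-in for the real hash
    return (time_bucket, space_partition)

# The three rows inserted above
rows = [
    (datetime(2023, 10, 27, 10, tzinfo=timezone.utc), 1),
    (datetime(2023, 10, 27, 10, tzinfo=timezone.utc), 2),
    (datetime(2023, 10, 28, 11, tzinfo=timezone.utc), 1),
]
for ts, dev in rows:
    print(f"({ts}, device {dev}) -> chunk {chunk_key(ts, dev)}")
```

The first two rows share a time bucket but hash to different space partitions, so they land in different chunks; the third row falls into the next day's bucket.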
To see where the data actually lives, query the timescaledb_information.chunks view; for a distributed hypertable, its data_nodes column lists every node holding a copy of each chunk.
-- On node1 (coordinator)
SELECT
chunk_name,
range_start,
range_end,
data_nodes
FROM timescaledb_information.chunks
WHERE hypertable_name = 'sensor_data';
Assuming a 1-day chunk_time_interval, the output might show something like this:
| chunk_name | range_start | range_end | data_nodes |
|---|---|---|---|
| _dist_hyper_1_1_chunk | 2023-10-27 00:00:00+00 | 2023-10-28 00:00:00+00 | {node2,node3} |
| _dist_hyper_1_2_chunk | 2023-10-27 00:00:00+00 | 2023-10-28 00:00:00+00 | {node2,node3} |
| _dist_hyper_1_3_chunk | 2023-10-28 00:00:00+00 | 2023-10-29 00:00:00+00 | {node2,node3} |
There are two chunks for October 27 because devices 1 and 2 hash to different space partitions. Notice how data_nodes lists both node2 and node3 for every chunk: with replication_factor => 2 and exactly two data nodes, each chunk is fully replicated on both. This is the replication in action. When you query sensor_data, the coordinator routes each part of the query to a node holding the relevant chunks, and because every chunk has a second replica, it can fall back to the other node if one becomes unavailable.
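That failover behavior can be captured in a tiny routing sketch. This is an illustration of the invariant, not TimescaleDB's actual replica-selection policy: to serve a chunk, the coordinator only needs one live node among that chunk's replicas.

```python
def route_chunk(replicas: list[str], alive: set[str]) -> str:
    """Pick a data node to serve a chunk read: first live replica wins.

    Illustration only -- this is not TimescaleDB's actual selection
    policy, just the invariant that any one live replica suffices.
    """
    for node in replicas:
        if node in alive:
            return node
    raise RuntimeError("no live replica holds this chunk")

# A chunk replicated on node2 and node3, as in the example output above
replicas = ["node2", "node3"]
print(route_chunk(replicas, alive={"node2", "node3"}))  # both nodes up
print(route_chunk(replicas, alive={"node3"}))           # node2 down, node3 serves it
```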
The mental model here is that create_distributed_hypertable doesn’t pre-assign data to nodes. Instead, it defines a partitioning scheme and a replication strategy; TimescaleDB then dynamically creates and places chunks across the specified data nodes according to those rules. The coordinator (the “access node” in TimescaleDB’s terminology) orchestrates data distribution and query routing.
The data_nodes array is crucial: it tells the coordinator which data nodes may host chunks for this hypertable. If you omit data_nodes, TimescaleDB distributes data across all data nodes attached to the database by default, but explicitly listing them lets you control exactly where this hypertable’s data resides.
One aspect that often surprises people is how replication_factor interacts with number_partitions. replication_factor doesn’t split chunks; it copies them: with replication_factor => 3, each chunk is written in full to 3 different data nodes, which also means replication_factor can never exceed the number of data nodes. With 2 data nodes and replication_factor => 2, placement is trivial: every chunk lives on both nodes. With 3 data nodes and replication_factor => 2, each chunk’s two replicas land on two distinct nodes, but some nodes may end up holding more replicas than others depending on the total number of chunks.
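To make the uneven-spread point concrete, here's a round-robin placement sketch for 4 chunks across 3 data nodes with 2 replicas each. The round-robin policy and the extra node name node4 are assumptions for illustration; TimescaleDB's real placement algorithm may differ.

```python
from collections import Counter

def place_replicas(num_chunks: int, nodes: list[str], replication_factor: int) -> dict:
    """Round-robin replica placement: chunk i starts at node i mod N and
    takes the next replication_factor distinct nodes. Illustrative only."""
    assert replication_factor <= len(nodes), \
        "replication_factor cannot exceed the number of data nodes"
    return {
        chunk: [nodes[(chunk + r) % len(nodes)] for r in range(replication_factor)]
        for chunk in range(num_chunks)
    }

# 4 chunks, 3 data nodes (node4 is hypothetical), 2 replicas per chunk
placement = place_replicas(4, ["node2", "node3", "node4"], 2)
counts = Counter(node for replicas in placement.values() for node in replicas)
for chunk, replicas in placement.items():
    print(f"chunk {chunk}: {replicas}")
print("replicas per node:", dict(counts))
```

Every chunk's two replicas land on distinct nodes, yet node2 and node3 each hold three replicas while node4 holds only two — exactly the unevenness described above.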
The next concept you’ll likely grapple with is how to manage node failures and rebalancing data when nodes are added or removed.