TimescaleDB’s hash partitioning on a column doesn’t actually distribute data evenly across chunks by default; it distributes the hashes of the partition-key values evenly.

Let’s see how this plays out. Imagine you have a table sensor_readings with a device_id column, and you’ve hash-partitioned it on device_id. You might think this means each device_id gets its own chunk, or that the data volume is spread out evenly. Neither is quite right.

Here’s a simplified CREATE TABLE statement:

CREATE TABLE sensor_readings (
    time TIMESTAMPTZ NOT NULL,
    device_id INT NOT NULL,
    temperature DOUBLE PRECISION
);

SELECT create_hypertable('sensor_readings', 'time',
    partitioning_column => 'device_id',
    number_partitions => 4,
    chunk_time_interval => INTERVAL '1 day');

In this setup, device_id is the space-partitioning column. TimescaleDB calculates a hash for each device_id and spreads those hash values across the configured number of space partitions (4 here); combined with the one-day chunk_time_interval, each day’s data is split into up to 4 chunks.

The critical part is that the hash value of the device_id is what’s distributed, not the device_id itself in a way that guarantees even spread of data points. If you have 1000 devices, but 99% of your data comes from device 123, all data for 123 will land in the same chunk, regardless of how many other devices exist or how their hashes might fall into different chunks.
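One way to see this concretely is a toy simulation. The sketch below uses Python’s built-in hash as a stand-in for TimescaleDB’s internal partition hash (which it does not replicate), and the device counts are invented; it only illustrates the principle that distinct keys spread evenly while row volume does not.

```python
# Toy model of hash partitioning. Python's hash() stands in for
# TimescaleDB's internal partition hash; the numbers are invented.
from collections import Counter

NUM_PARTITIONS = 4

def partition_for(device_id: int) -> int:
    return hash(device_id) % NUM_PARTITIONS

# 1000 distinct devices, but device 123 produces 99% of all rows.
rows = [123] * 99_000 + list(range(1000))

keys_per_partition = Counter(partition_for(d) for d in range(1000))
rows_per_partition = Counter(partition_for(d) for d in rows)

# Distinct keys land evenly across partitions, but nearly all of the
# row volume piles into the partition that device 123 hashes to.
print("keys:", dict(sorted(keys_per_partition.items())))
print("rows:", dict(sorted(rows_per_partition.items())))
```

The key counts come out flat while one partition holds almost all the rows, which is exactly the skew described above.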

Let’s look at what’s happening under the hood with a sample device_id and its hash. We can see this by querying TimescaleDB’s internal catalog.

First, let’s insert some data:

INSERT INTO sensor_readings (time, device_id, temperature)
VALUES
    ('2023-10-26 10:00:00 UTC', 101, 22.5),
    ('2023-10-26 10:01:00 UTC', 102, 23.1),
    ('2023-10-26 10:02:00 UTC', 101, 22.7),
    ('2023-10-26 10:03:00 UTC', 103, 24.0),
    ('2023-10-26 10:04:00 UTC', 101, 22.6);

Now, let’s inspect which chunk each row landed in. Every chunk is an ordinary PostgreSQL table under the hood, so the tableoid system column tells us which chunk physically holds each row. (Depending on your version, the internal hash itself is exposed as _timescaledb_internal.get_partition_hash, moved to the _timescaledb_functions schema in newer releases.)

-- Group rows by the chunk (child table) they physically live in.
-- tableoid is a system column naming the table each row was read from.
SELECT
    tableoid::regclass AS chunk,
    device_id,
    count(*) AS num_rows
FROM sensor_readings
GROUP BY 1, 2
ORDER BY 1, 2;

In the output, all three rows for device 101 appear under the same chunk. If 101 is a hot device producing millions of records, every one of those records lands in whichever chunk 101’s hash maps to within each time interval.

The problem arises when a few device_ids are vastly more active than others. If device 101 sends 1000x more data than all other devices combined, the chunk its hash maps to in each time interval becomes enormous, while its sibling chunks stay small. This leads to:

  • Uneven disk usage: One chunk (and thus one underlying table) grows much larger than others.
  • Slow queries: Queries filtering on device_id might hit a huge chunk, negating the benefits of partitioning.
  • Performance bottlenecks: Operations on the overloaded chunk become slow.

The mental model to build here is that hash partitioning distributes partition keys (or their hashes) across chunks, not necessarily the data volume evenly. The number of distinct values for your partition key matters, but so does the distribution of data among those values.
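To make that distinction concrete, here is a sketch comparing two workloads with the same number of distinct keys but different volume distributions (again using Python’s hash as a stand-in for the internal partition hash, with invented numbers):

```python
# Same distinct-key count, different volume distributions.
from collections import Counter

NUM_PARTITIONS = 4
DEVICES = range(1000)

def partition_for(device_id: int) -> int:
    return hash(device_id) % NUM_PARTITIONS

# Scenario A: every device sends 100 readings (uniform volume).
uniform = Counter(partition_for(d) for d in DEVICES for _ in range(100))

# Scenario B: device 7 alone sends 90% of all readings (skewed volume).
skewed_rows = [7] * 90_000 + [d for d in DEVICES for _ in range(10)]
skewed = Counter(partition_for(d) for d in skewed_rows)

def max_share(counts: Counter) -> float:
    return max(counts.values()) / sum(counts.values())

# Uniform volume lands near the ideal 1/4 per partition; skewed volume
# concentrates most rows in a single partition despite identical keys.
print(f"uniform: biggest partition holds {max_share(uniform):.0%} of rows")
print(f"skewed:  biggest partition holds {max_share(skewed):.0%} of rows")
```

Both scenarios hash the same 1000 keys; only the per-key row volume differs, and that alone decides whether the partitions balance.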

If you have a few "hot" partition keys that dominate your data volume, hash partitioning on that key will concentrate data. For such workloads, consider:

  1. Partitioning on a different column: If you have a timestamp or another dimension that is more evenly distributed, use that.
  2. Leaning on the time dimension: The example above already partitions by time first and by device_id hash second. A smaller chunk_time_interval bounds how large any single hot chunk can grow, because even a hot device’s data is split across time intervals.
  3. App-level sharding: If TimescaleDB’s partitioning isn’t granular enough, you might need to shard your data before it even hits the database, assigning device_ids to different database instances or schemas.

The one thing most people don’t realize is that hash partitioning is about distributing the keys, not the data points directly. So, if your key distribution is skewed (e.g., one device_id produces 90% of the data), your data distribution will also be skewed, even with hash partitioning.

The next hurdle you’ll likely encounter is dealing with "hot" chunks, where one partition’s data overwhelms the system.

Want structured learning?

Take the full TimescaleDB course →