The optimal chunk_interval in TimescaleDB isn’t about making chunks as large as possible, but about aligning them with your data’s ingestion and query patterns to minimize overhead and maximize performance.
Let’s see this in action. Imagine a typical IoT sensor scenario. We have devices sending readings every minute. We want to query for average temperature per hour over the last day.
Here’s a simplified data insertion:
-- Assume 'readings' hypertable is created with 'time' as the time dimension
INSERT INTO readings (time, device_id, temperature) VALUES
('2023-10-27 10:00:00', 1, 22.5),
('2023-10-27 10:01:00', 1, 22.6),
('2023-10-27 10:02:00', 1, 22.7),
-- ... millions more rows ...
('2023-10-28 10:00:00', 1, 23.1);
And a common query:
SELECT
time_bucket('1 hour', time) AS hour_bucket,
AVG(temperature) AS avg_temp
FROM readings
WHERE time >= NOW() - INTERVAL '1 day'
GROUP BY hour_bucket
ORDER BY hour_bucket;
The chunk_interval dictates how TimescaleDB partitions your hypertable into smaller, manageable tables called chunks. Each chunk contains data for a specific time range. The default chunk_interval is 7 days.
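You can see this partitioning directly: TimescaleDB's show_chunks function lists the chunk tables backing a hypertable (chunk names are auto-generated by the system).

```sql
-- List the chunks currently backing the 'readings' hypertable.
-- Names are auto-generated, e.g. _timescaledb_internal._hyper_1_1_chunk.
SELECT show_chunks('readings');
```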
The Problem: Mismatched chunk_interval
If your chunk_interval is far too small for your workload (say, 1 hour for minute-interval data that you routinely query a day or more at a time), you’ll end up with a massive number of tiny chunks. This leads to:
- High Metadata Overhead: TimescaleDB needs to manage metadata for every single chunk. Too many chunks means more work for the system to track them.
- Slower Queries: When you query a time range, TimescaleDB needs to scan the metadata to identify relevant chunks. If there are millions of small chunks, this scanning becomes a bottleneck.
- Inefficient Data Pruning: Dropping old data is fastest when entire chunks fall outside the retention window, because TimescaleDB can drop whole chunk tables (via drop_chunks()) instead of deleting rows. Conversely, with a very large chunk_interval, a single chunk can straddle the retention boundary, forcing slower row-by-row deletes from large chunks.
Conversely, if your chunk_interval is set to 1 year and you query data for the last hour, TimescaleDB has to scan a very large chunk, potentially loading more data than necessary.
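As a sketch of whole-chunk pruning, the drop_chunks function removes every chunk whose data is entirely older than the given cutoff:

```sql
-- Drop all chunks containing only data older than 30 days
SELECT drop_chunks('readings', older_than => INTERVAL '30 days');
```

Because this drops whole chunk tables rather than deleting rows, it runs in roughly constant time per chunk regardless of row count.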
The Solution: Aligning chunk_interval with Data Granularity and Query Patterns
The sweet spot for chunk_interval is typically a multiple of your data’s natural time granularity and large enough to encompass a reasonable number of data points for your common queries. A good starting point is often a multiple of your most frequent data ingestion interval, adjusted for your typical query windows.
For our sensor example, if data arrives every minute, and we often query by hour, a chunk_interval of 1 day or 3 days might be suitable.
Let’s say we choose 1 day. Each chunk will then contain approximately 24 hours * 60 minutes/hour = 1,440 rows per device (assuming a steady one-reading-per-minute rate).
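To sanity-check a chosen interval against real data volumes, you can inspect per-chunk on-disk sizes. A sketch using the chunks_detailed_size function (available in TimescaleDB 2.x):

```sql
-- Per-chunk storage breakdown for the 'readings' hypertable
SELECT chunk_name, pg_size_pretty(total_bytes) AS total_size
FROM chunks_detailed_size('readings')
ORDER BY chunk_name;
```

If the chunks come out tiny relative to available memory, the interval is probably too small; if a single chunk dwarfs memory, it is probably too large.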
How to Set/Change chunk_interval
You set chunk_interval when creating a hypertable:
CREATE TABLE readings (
time TIMESTAMPTZ NOT NULL,
device_id INT NOT NULL,
temperature DOUBLE PRECISION
);
SELECT create_hypertable('readings', 'time', chunk_time_interval => INTERVAL '1 day');
To change it on an existing hypertable, use the set_chunk_time_interval function. The new interval applies only to chunks created after the call; existing chunks keep their original time ranges.
-- Example: Changing the chunk interval to 3 days for future chunks
SELECT set_chunk_time_interval('readings', INTERVAL '3 days');
Internal Mechanics: The Chunk Catalog
TimescaleDB maintains a catalog of all your chunks in its internal _timescaledb_catalog schema, exposed to users through the timescaledb_information.chunks view. Each row in this view represents one chunk and records, among other things, its chunk_schema, chunk_name, and, importantly, its time range (range_start and range_end).
When you run a query with a WHERE time BETWEEN ... clause, the planner performs chunk exclusion: it consults this catalog to identify which chunks’ time ranges overlap your query’s window, and skips the rest entirely. The fewer chunks there are to consider, the faster this lookup and selection process.
For example, with a chunk_interval of 1 day, a query for the last 24 hours touches at most two chunks. If your chunk_interval were 1 minute, the planner would have to consider roughly 1,440 chunks for the same 24-hour window.
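You can inspect the inputs to this chunk selection yourself by querying the timescaledb_information.chunks view. For example, to list the chunks that could overlap the last day:

```sql
-- Chunks whose time range may overlap the last 24 hours
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'readings'
  AND range_end > NOW() - INTERVAL '1 day'
ORDER BY range_start;
```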
The key is that a chunk_interval that aligns with your query patterns means each chunk is "just right" – large enough to reduce metadata overhead and query planning time, but not so large that individual chunk scans become inefficient.
The next logical step after optimizing chunk sizing is understanding how to effectively manage data retention and lifecycle using TimescaleDB’s built-in policies.