TimescaleDB compression doesn’t just save disk space; it fundamentally changes how your data is accessed and can dramatically improve query performance on historical data.
Here’s a hypertable with some data, and let’s see what compression looks like before we do anything:
CREATE TABLE metrics (
    time        TIMESTAMPTZ NOT NULL,
    device      INT NOT NULL,
    metric_name TEXT NOT NULL,
    value       DOUBLE PRECISION
);

SELECT create_hypertable('metrics', 'time');
-- Insert some sample data
INSERT INTO metrics
SELECT
    time,
    device,
    'temperature' AS metric_name,
    random() * 50 + 10
FROM generate_series(
    '2023-01-01 00:00:00'::timestamptz,
    '2023-12-31 23:59:59'::timestamptz,
    '1 minute'::interval
) AS time
CROSS JOIN generate_series(1, 100) AS device;

INSERT INTO metrics
SELECT
    time,
    device,
    'humidity' AS metric_name,
    random() * 100
FROM generate_series(
    '2023-01-01 00:00:00'::timestamptz,
    '2023-12-31 23:59:59'::timestamptz,
    '1 minute'::interval
) AS time
CROSS JOIN generate_series(1, 100) AS device;
-- Let's see the current state
\d+ metrics
The output of \d+ metrics will show you the basic table structure, but to see compression details, you need TimescaleDB’s informational views and stats functions.
To understand compression, we need to talk about chunks. A hypertable is a logical table that TimescaleDB automatically partitions into smaller physical tables called chunks. Compression is applied at the chunk level.
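You can inspect those chunks directly through the standard timescaledb_information.chunks view; for example, to see each chunk's time range and whether it has been compressed yet:

```sql
-- List the physical chunks backing the metrics hypertable
SELECT chunk_schema, chunk_name, range_start, range_end, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'metrics'
ORDER BY range_start;
```

With the default chunk interval of 7 days, a year of data produces roughly 52 chunks, each a regular PostgreSQL table under the hood.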
Measuring Compression
To measure the effectiveness of compression, use the chunk_compression_stats() function. For each chunk of a hypertable, it reports the size before compression, the size after compression, and the chunk's compression status.
-- First, let's compress the data.
-- We'll create a compression policy that compresses data older than 7 days.
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device, metric_name',
    timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('metrics', INTERVAL '7 days');
-- Wait a bit for compression to kick in.
-- For demonstration, you might need to manually compress if the policy hasn't run.
-- SELECT compress_chunk(format('%I.%I', chunk_schema, chunk_name)::regclass)
--   FROM timescaledb_information.chunks
--  WHERE hypertable_name = 'metrics' AND NOT is_compressed;
-- Now, let's check the compression status.
SELECT
    chunk_name,
    pg_size_pretty(before_compression_total_bytes) AS uncompressed_size,
    pg_size_pretty(after_compression_total_bytes)  AS compressed_size,
    ROUND(100.0 * (1 - after_compression_total_bytes::numeric
                       / NULLIF(before_compression_total_bytes, 0)), 2) AS space_savings_pct
FROM chunk_compression_stats('metrics')
WHERE compression_status = 'Compressed'
ORDER BY chunk_name;
This query lists each compressed chunk, its size before compression, its size after compression, and the percentage of space saved. For time-series data you’ll often see reductions of 80-95% or more.
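For a single hypertable-wide summary, the companion hypertable_compression_stats() function aggregates the same numbers across all chunks:

```sql
-- One-row summary of compression across the whole hypertable
SELECT
    total_chunks,
    number_compressed_chunks,
    pg_size_pretty(before_compression_total_bytes) AS before,
    pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('metrics');
```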
The key to understanding this is that TimescaleDB compresses columns within a chunk, not the whole chunk as a single unit. The compress_segmentby clause tells TimescaleDB how to group data before compression. If you have a device column and a metric_name column, segmenting by these means that all rows for a specific device and metric within a chunk will be compressed together. This is incredibly effective because data for a single device and metric tends to be very similar (e.g., temperature readings from device 1 are all numbers in a similar range).
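Once compression is enabled, you can verify which columns are used for segmenting and ordering through the timescaledb_information.compression_settings view (the exact columns can vary slightly between TimescaleDB versions):

```sql
-- Which columns segment and order the compressed data?
SELECT attname, segmentby_column_index, orderby_column_index, orderby_asc
FROM timescaledb_information.compression_settings
WHERE hypertable_name = 'metrics';
```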
Tuning Compression
Tuning compression is about finding the right balance between compression ratio, query performance, and the overhead of compression itself.
The most impactful settings are:
- compress_segmentby: This is crucial. Choose columns with low cardinality (few unique values) whose groups contain highly repetitive values. For sensor data, device and metric_name are almost always good candidates. If you have a very large number of devices or metric names, and they change frequently, you may see less benefit.
- compress_orderby: Controls how rows are sorted within each compressed batch. Ordering by the time column keeps consecutive readings adjacent, which is exactly what the delta-based column encodings exploit.
- compress_chunk_time_interval: Optionally rolls adjacent chunks together at compression time. Fewer, larger compressed chunks give the compressor more data to find patterns in, but queries that need only a small slice of a large chunk may slow down; more, smaller chunks add metadata overhead. A common starting point is 1 month.

There is no compression level or algorithm to pick: TimescaleDB chooses a specialized encoding per column automatically (delta-of-delta for timestamps and integers, Gorilla for floats, dictionary- or LZ-based compression for text), and these defaults are usually excellent.
How to tune:
- Start with defaults: For most use cases, compress_segmentby = 'device, metric_name' and a compress_chunk_time_interval of 1 month are excellent starting points.
- Monitor performance: After enabling compression, run your typical queries. If queries on historical data become slower, your compress_segmentby may be too broad or your compress_chunk_time_interval too large, producing large chunks that still require significant decompression.
- Experiment with compress_segmentby: If device and metric_name aren’t highly repetitive in your dataset, consider adding or removing columns. For example, if you have different types of sensors, you might add a sensor_type column to compress_segmentby.
- Experiment with compress_chunk_time_interval: If your ingest volume is very high, try smaller intervals (e.g., 7 days) to reduce the size of individual chunks being compressed and decompressed. Conversely, with low volume and long-range queries, larger intervals (e.g., 3 months) might help.
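Note that changing these settings only affects chunks compressed afterwards. To re-segment data that is already compressed, you have to decompress and recompress it. A sketch, assuming the metrics table from above (the new segmentby choice here is purely illustrative):

```sql
-- Decompress existing chunks, change the setting, then recompress
SELECT decompress_chunk(format('%I.%I', chunk_schema, chunk_name)::regclass)
FROM timescaledb_information.chunks
WHERE hypertable_name = 'metrics' AND is_compressed;

ALTER TABLE metrics SET (
    timescaledb.compress_segmentby = 'device'  -- hypothetical new choice
);

SELECT compress_chunk(format('%I.%I', chunk_schema, chunk_name)::regclass)
FROM timescaledb_information.chunks
WHERE hypertable_name = 'metrics' AND NOT is_compressed;
```

On a large hypertable this rewrites a lot of data, so schedule it like any other bulk maintenance operation.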
The most counterintuitive aspect of TimescaleDB compression is how it interacts with queries. When you query compressed data, TimescaleDB automatically decompresses only the necessary columns and rows on the fly, so you never need to explicitly "uncompress" anything. The query planner figures out what needs to be decompressed: if your query only needs device and value for a specific time range, only those columns of the relevant chunks are touched. This selective decompression is why compression can often speed up queries on historical data; less data is read from disk, which outweighs the decompression overhead.
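You can watch this happen in a query plan. Assuming some chunks are already compressed, an EXPLAIN over a historical range will show TimescaleDB's custom DecompressChunk scan nodes only for the chunks the time filter touches:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT device, avg(value)
FROM metrics
WHERE metric_name = 'temperature'
  AND time >= '2023-03-01' AND time < '2023-04-01'
GROUP BY device;
```

Because metric_name is a segmentby column, the filter on it can be applied to the compressed batch metadata directly, skipping non-matching segments without decompressing their value column at all.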
The next step after optimizing compression is often exploring different data retention policies and how they interact with compressed data.