TimescaleDB compression doesn’t just save disk space; it converts data into a columnar layout, which fundamentally changes how your data is accessed and processed and can make analytical queries significantly faster, because only the columns a query actually needs are read.
Let’s see it in action. Imagine a table iot_data with readings from thousands of sensors every second. Without compression, a query like SELECT avg(temperature) FROM iot_data WHERE time > '2023-10-01T10:00:00Z' would read entire rows, every column, for every row in that time range, even though it only needs temperature.
CREATE TABLE iot_data (
time TIMESTAMPTZ NOT NULL,
sensor_id INT NOT NULL,
temperature DOUBLE PRECISION,
humidity DOUBLE PRECISION,
pressure DOUBLE PRECISION
);
-- Populate with some data (simplified for example)
INSERT INTO iot_data (time, sensor_id, temperature, humidity, pressure)
SELECT
'2023-10-01'::timestamptz + (i * INTERVAL '1 second'),
(i % 1000) + 1,
random() * 20 + 10, -- Temperature between 10 and 30
random() * 50 + 40, -- Humidity between 40 and 90
random() * 10 + 1000 -- Pressure between 1000 and 1010
FROM generate_series(0, 999999) AS i;
-- Select a specific time range
EXPLAIN ANALYZE
SELECT avg(temperature) FROM iot_data WHERE time BETWEEN '2023-10-01T00:00:00Z' AND '2023-10-01T01:00:00Z';
Now, let’s compress it. TimescaleDB’s compression is hybrid row-columnar. Data lives in chunks (managed by TimescaleDB’s hypertable system), and when a chunk is compressed, its rows are grouped into segments and each column within a segment is compressed independently as a columnar array. This is crucial: it’s not just compressing whole rows together.
Setting up compression is a two-step configuration: first you enable compression on the hypertable (and choose how rows are segmented), then you add a policy that tells TimescaleDB when to compress chunks.
-- Convert the existing, already-populated table into a hypertable.
-- (In practice, create the hypertable before loading data;
-- migrate_data => true handles a table that already has rows.)
SELECT create_hypertable('iot_data', 'time', migrate_data => true);
-- Enable compression, segmenting by sensor_id and ordering by time
ALTER TABLE iot_data SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'sensor_id',
timescaledb.compress_orderby = 'time DESC'
);
-- Add a compression policy: compress chunks whose data is older than 1 day
SELECT add_compression_policy('iot_data', INTERVAL '1 day');
-- The policy's background job runs on a schedule. To compress eligible
-- chunks immediately for this example, do it manually:
SELECT compress_chunk(c)
FROM show_chunks('iot_data', older_than => INTERVAL '1 day') AS c;
The timescaledb.compress_segmentby = 'sensor_id' clause is key. It tells TimescaleDB to group rows with the same sensor_id together before applying compression within a chunk. This is highly effective for time-series data where many rows share common dimensions.
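To see segmenting pay off, filter on the segmentby column itself. Because each compressed segment holds exactly one sensor_id, TimescaleDB can skip every segment belonging to other sensors without decompressing them (the query shape here is illustrative; the exact plan depends on your version):

EXPLAIN ANALYZE
SELECT avg(temperature) FROM iot_data
WHERE sensor_id = 42
AND time BETWEEN '2023-10-01T00:00:00Z' AND '2023-10-01T01:00:00Z';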
Now, when you run that same query:
EXPLAIN ANALYZE
SELECT avg(temperature) FROM iot_data WHERE time BETWEEN '2023-10-01T00:00:00Z' AND '2023-10-01T01:00:00Z';
The query planner, knowing that only the temperature column is needed, instructs the storage engine to read just the compressed temperature data (plus per-segment metadata used to satisfy the time filter) for the relevant chunks. The humidity and pressure columns are never touched by this query. The result is dramatically reduced I/O and faster query execution.
The compression itself uses a variety of codecs, selected automatically by TimescaleDB based on the data type of each column. DOUBLE PRECISION columns like temperature are compressed with Gorilla-style XOR encoding of floating-point values; integers and timestamps use delta-of-delta encoding combined with Simple-8b run-length encoding; text and other types fall back to dictionary or LZ-based compression.
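Codec choice is automatic and not directly configurable, but you can inspect how compression is configured for a hypertable via the timescaledb_information.compression_settings view. A sketch (column names follow recent TimescaleDB versions; check the docs for yours):

-- Which columns are used for segmentby/orderby on iot_data?
SELECT attname, segmentby_column_index, orderby_column_index, orderby_asc
FROM timescaledb_information.compression_settings
WHERE hypertable_name = 'iot_data';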
The true power of TimescaleDB compression, beyond simple space savings, lies in its columnar nature after compression. When a query only needs a subset of columns, TimescaleDB can efficiently decompress and return only those columns. This is a fundamental difference from traditional row-based storage, where every column for a given row must be read, even if only one column is ultimately used.
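A quick way to observe this columnar effect is to compare buffer usage between a wide query and a narrow one over the same compressed range (a sketch; exact plan output varies by version):

-- Touches every compressed column
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM iot_data
WHERE time BETWEEN '2023-10-01T00:00:00Z' AND '2023-10-01T01:00:00Z';

-- Touches only the compressed temperature column (plus metadata)
EXPLAIN (ANALYZE, BUFFERS)
SELECT avg(temperature) FROM iot_data
WHERE time BETWEEN '2023-10-01T00:00:00Z' AND '2023-10-01T01:00:00Z';

The second query should report far fewer buffers read, since the humidity and pressure arrays are never decompressed.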
A common point of confusion is that compression is applied per chunk, not per row or per table. TimescaleDB automatically manages chunking based on your time partitioning, and once a chunk’s data ages past the policy’s compress_after interval, a background worker compresses it. You can monitor compression status with the chunk_compression_stats() function or the is_compressed column of the timescaledb_information.chunks view.
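For example, to check how much each chunk shrank (function and column names as documented in recent TimescaleDB releases):

SELECT chunk_name,
pg_size_pretty(before_compression_total_bytes) AS before,
pg_size_pretty(after_compression_total_bytes) AS after
FROM chunk_compression_stats('iot_data');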
The most surprising thing about TimescaleDB compression is that it doesn’t require ANALYZE or VACUUM to maintain its effectiveness; the compression and decompression are handled transparently by the query planner and storage engine, and the background compression jobs work independently of these traditional PostgreSQL maintenance tasks.
The next step after mastering compression is tuning compress_segmentby and compress_orderby to match your query patterns, for even better compression ratios and query speed.