A TimescaleDB skip scan index can dramatically speed up DISTINCT queries, but not in the way you’d intuitively expect.

Let’s say you have a table sensor_data with a device_id and a timestamp, and you want to find all the unique device_ids that have data within a specific time range. A naive approach might look like this:

SELECT DISTINCT device_id
FROM sensor_data
WHERE timestamp >= '2023-10-01' AND timestamp < '2023-10-02';

Without any special indexing, PostgreSQL (and by extension, TimescaleDB) would likely scan a large portion of the sensor_data table, collect all device_ids, and then perform a sort and unique operation. This can be very slow on large tables.
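The plan for the unindexed case typically looks something like this (an illustrative sketch; exact node names, costs, and row counts depend on your data and PostgreSQL version):

```sql
-- Illustrative plan shape for the unindexed query:
--   HashAggregate
--     Group Key: device_id
--     ->  Seq Scan on sensor_data
--           Filter: (timestamp >= '2023-10-01' AND timestamp < '2023-10-02')
EXPLAIN ANALYZE
SELECT DISTINCT device_id
FROM sensor_data
WHERE timestamp >= '2023-10-01' AND timestamp < '2023-10-02';
```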

Now, let’s introduce a skip scan index. TimescaleDB, built on PostgreSQL, leverages PostgreSQL’s indexing capabilities. A common pattern for time-series data is a composite index on (time, device_id).

CREATE INDEX sensor_data_time_device_idx ON sensor_data (timestamp, device_id);

If you query for a specific device_id within a time range, this index still helps: the leading timestamp column bounds the scan, and device_id is checked within the matching entries. But for DISTINCT device_id across a time range it's not ideal, because the index is ordered by timestamp first, so every index entry in the range must be visited just to collect the device_ids.
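For instance, a query like this is well served by the (timestamp, device_id) index (the device_id value 42 is just an illustrative placeholder):

```sql
-- Served well by sensor_data_time_device_idx: the time range bounds the
-- index scan, and device_id is checked within the matching entries.
SELECT timestamp, device_id, temperature
FROM sensor_data
WHERE timestamp >= '2023-10-01' AND timestamp < '2023-10-02'
  AND device_id = 42;
```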

The magic happens when the columns in the DISTINCT clause are the leading columns of the index. The index then stores all entries for a given value contiguously, so a scan that can "skip" from one distinct value to the next touches only a handful of entries per value instead of every row in the range.

Consider this index:

CREATE INDEX sensor_data_device_time_idx ON sensor_data (device_id, timestamp);

Now, if you query SELECT DISTINCT device_id FROM sensor_data WHERE timestamp >= '2023-10-01' AND timestamp < '2023-10-02';, a straightforward B-tree scan still isn't ideal, because the WHERE clause is on timestamp, the second index column, so it cannot be used to bound the scan.

This is where skip scan comes into play. It is not a separate index type: TimescaleDB implements SkipScan as a custom executor node that drives an ordinary B-tree index whose leading column is the DISTINCT column, jumping from one distinct value directly to the next. (Plain PostgreSQL has historically lacked this optimization and reads every matching index entry.)

The most effective way to speed up DISTINCT device_id within a time range is often to have an index starting with the column you want distinct values of, and a way to efficiently filter by time.

Let’s say you have a hypertable sensor_data with device_id and timestamp.

-- Assume this is a hypertable
CREATE TABLE sensor_data (
    timestamp TIMESTAMPTZ NOT NULL,
    device_id INT NOT NULL,
    temperature DOUBLE PRECISION
    -- other columns
);

SELECT create_hypertable('sensor_data', 'timestamp');

If your primary query pattern is SELECT DISTINCT device_id FROM sensor_data WHERE timestamp BETWEEN 'start_time' AND 'end_time', the ideal index is:

CREATE INDEX sensor_data_device_id_ts_idx ON sensor_data (device_id, timestamp);

Here’s how this index works for DISTINCT device_id:

  1. Index Scan: PostgreSQL can use this index, and TimescaleDB's SkipScan node can drive it.
  2. device_id as the Leading Column: Because device_id is the first column, all entries for a given device are adjacent in the index. The scan only needs to find one qualifying entry per device, then skip ahead to the next distinct device_id; it can often do this as an index-only scan.
  3. Filtering by timestamp: Within each device_id block, entries are ordered by timestamp, so the WHERE timestamp BETWEEN 'start_time' AND 'end_time' filter becomes a cheap range probe per device rather than a scan of all that device's entries.

When you run EXPLAIN ANALYZE on SELECT DISTINCT device_id FROM sensor_data WHERE timestamp BETWEEN '2023-10-01' AND '2023-10-02'; with the sensor_data_device_id_ts_idx index in place, you'll typically see an Index Scan or Index Only Scan using sensor_data_device_id_ts_idx, or, on TimescaleDB with SkipScan active, a Custom Scan (SkipScan) node. In all of these, the implicit grouping is handled by the index order itself, with no separate Sort step.
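If you are on plain PostgreSQL without TimescaleDB's SkipScan node, the same skipping behavior can be emulated by hand with a recursive CTE, the classic "loose index scan" pattern. A sketch against the same hypothetical table and time range:

```sql
-- Loose index scan: each recursive step jumps to the next device_id that has
-- at least one row in the time range, using sensor_data_device_id_ts_idx.
WITH RECURSIVE devices AS (
    (SELECT device_id FROM sensor_data
     WHERE timestamp >= '2023-10-01' AND timestamp < '2023-10-02'
     ORDER BY device_id LIMIT 1)
    UNION ALL
    SELECT (SELECT d2.device_id FROM sensor_data d2
            WHERE d2.device_id > d.device_id
              AND d2.timestamp >= '2023-10-01'
              AND d2.timestamp < '2023-10-02'
            ORDER BY d2.device_id LIMIT 1)
    FROM devices d
    WHERE d.device_id IS NOT NULL   -- stop once no further device is found
)
SELECT device_id FROM devices WHERE device_id IS NOT NULL;
```

Each step touches only one index entry per distinct device, which is essentially what SkipScan does automatically.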

Common Causes of Slow DISTINCT Queries on Time-Series Data:

  1. Missing or Inefficient Index: The most common culprit. A full table scan or a scan on an index not optimized for DISTINCT on the desired column.

    • Diagnosis: EXPLAIN ANALYZE SELECT DISTINCT device_id FROM sensor_data WHERE timestamp BETWEEN '2023-10-01' AND '2023-10-02'; Look for Seq Scan or Index Scan on an inappropriate index.
    • Fix: CREATE INDEX sensor_data_device_id_ts_idx ON sensor_data (device_id, timestamp);
    • Why it works: This index allows PostgreSQL to efficiently find all device_id entries and then filter by timestamp within those entries, effectively giving you distinct device_ids that fall within the time range without a full table sort.
  2. device_id Not Leading in the Index: If your index is (timestamp, device_id), it’s great for time-range queries but not for DISTINCT device_id within a time range.

    • Diagnosis: Same as above. The EXPLAIN plan will likely show a Sort operation after an Index Scan on (timestamp, device_id), indicating PostgreSQL had to sort the results to find distinct values.
    • Fix: DROP INDEX IF EXISTS sensor_data_time_device_idx; CREATE INDEX sensor_data_device_id_ts_idx ON sensor_data (device_id, timestamp);
    • Why it works: By placing device_id first, the index inherently groups data by device_id. When querying for distinct device_ids, PostgreSQL can traverse the index, and the timestamp filter is applied to the entries within each device_id block.
  3. Large Number of Unique device_ids: If every record has a unique device_id (unlikely for sensor data, but possible), DISTINCT will always have to process many values. The index helps, but the nature of the data is the bottleneck.

    • Diagnosis: EXPLAIN ANALYZE will still show an index scan, but the number of rows left after deduplication will be very close to the number of rows scanned, meaning DISTINCT is removing almost nothing.
    • Fix: This is often a data modeling or query design issue. If possible, use a more aggregated approach or ensure your device_id is truly representative of a device. If not, the index is still the best you can do.
    • Why it works: The index minimizes the work to get to the distinct values, but it can’t create distinctness where none exists in the data.
  4. Data Skew: If one device_id has an extremely high volume of data within the queried time range compared to others, the query planner might still opt for a less optimal plan if it misestimates the cost.

    • Diagnosis: EXPLAIN ANALYZE might show a plan that seems reasonable but is slow in practice. The actual row counts in the plan might differ significantly from the estimated ones.
    • Fix: Sometimes, a manual ANALYZE sensor_data; or VACUUM ANALYZE sensor_data; can help PostgreSQL’s statistics. In extreme cases, consider partial indexes or separate tables for extremely high-volume devices if query patterns diverge.
    • Why it works: Better statistics help the query planner make more informed decisions about index usage and join strategies.
  5. Incorrect DISTINCT ON Usage (or not using it): Sometimes, the intent isn’t a pure DISTINCT but rather "the latest entry for each device." DISTINCT ON is powerful here.

    • Diagnosis: The query is written as SELECT DISTINCT device_id, timestamp FROM ... when SELECT DISTINCT ON (device_id) device_id, timestamp FROM ... ORDER BY device_id, timestamp DESC; would be more appropriate and faster with the right index.
    • Fix: Rewrite the query to use DISTINCT ON (device_id). Ensure the ORDER BY clause matches the index (device_id, timestamp DESC).
    • Why it works: DISTINCT ON is a PostgreSQL-specific, highly optimized way to get unique rows based on certain columns, leveraging the ORDER BY clause.
  6. Large Time Range: Even with the best index, if the time range covers a massive portion of your data, the number of distinct device_ids to process will be large, and the index scan will still touch many index entries.

    • Diagnosis: EXPLAIN ANALYZE shows an efficient Index Scan, but the rows examined count is extremely high, and the final rows count is also high, indicating many distinct devices were found.
    • Fix: Refine the time range if possible. If not, consider pre-aggregating distinct device counts for wider time buckets if your application can tolerate less real-time data.
    • Why it works: You’re asking for a lot of information. The index makes it as efficient as possible, but the sheer volume of distinct devices within a vast time window is an inherent data characteristic.
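To make point 5 concrete, here is a sketch of the "latest reading per device" pattern with a matching index (the index name is illustrative):

```sql
-- Index ordered to match the DISTINCT ON query below.
CREATE INDEX sensor_data_device_ts_desc_idx
    ON sensor_data (device_id, timestamp DESC);

-- DISTINCT ON keeps the first row per device_id in ORDER BY order,
-- i.e. the newest timestamp within the range.
SELECT DISTINCT ON (device_id) device_id, timestamp, temperature
FROM sensor_data
WHERE timestamp >= '2023-10-01' AND timestamp < '2023-10-02'
ORDER BY device_id, timestamp DESC;
```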

After the (device_id, timestamp) index is in place and the query is written correctly, the next bottleneck tends to appear when you select additional columns alongside DISTINCT device_id. If those columns are not stored in the index, PostgreSQL must fetch them from the table (heap) for each matching row, and an index-only scan is no longer possible.
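One mitigation, if the extra columns are small, is a covering index that stores them in the index leaf pages via INCLUDE (available since PostgreSQL 11), keeping index-only scans possible:

```sql
-- Covering index: temperature is stored in the index but is not part of the
-- key, so index-only scans can return it without visiting the heap.
CREATE INDEX sensor_data_device_ts_covering_idx
    ON sensor_data (device_id, timestamp) INCLUDE (temperature);
```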

Want structured learning?

Take the full TimescaleDB course →