TimescaleDB’s COPY command, when used for bulk loading historical data, can be surprisingly inefficient if you’re not careful about how you structure your input.

Let’s say you’ve got a CSV file with years of sensor readings, and you want to shove it into a TimescaleDB hypertable. A naive COPY might look like this:

COPY sensor_data FROM '/path/to/your/sensor_data.csv' WITH (FORMAT CSV, HEADER);

This seems straightforward, right? But if that CSV is massive, or if your hypertable has a complex partitioning scheme, you might find it grinding to a halt or, worse, erroring out with out-of-memory issues. (Also note that a server-side COPY ... FROM '/path' reads the file on the database server and requires superuser or pg_read_server_files privileges; psql’s \copy streams the file from the client instead.) The real magic (and potential pain) lies in how TimescaleDB handles the COPY operation under the hood, especially with hypertables.

Here’s how to actually make it sing.

1. Pre-Sort Your Data:

TimescaleDB hypertables are partitioned by time. If your CSV is not sorted chronologically, COPY has to do a lot of extra work to figure out which chunk each row belongs to. This involves random disk seeks and can kill performance.

  • Diagnosis: Before copying, check whether your CSV is already in chronological order. On Linux/macOS, sort -c verifies order without rewriting anything:
    tail -n +2 sensor_data.csv | sort -t, -k1,1 -c
    
    (This skips the header row and assumes the timestamp is the first column and the delimiter is a comma. Adjust -k1,1 and -t, as needed. A non-zero exit status means the file is out of order.)
  • Fix: Always pre-sort your data by the time column, keeping the header row in place:
    head -n 1 sensor_data.csv > sorted_sensor_data.csv
    tail -n +2 sensor_data.csv | sort -t, -k1,1 >> sorted_sensor_data.csv
    
    Then copy from the sorted file:
    COPY sensor_data FROM '/path/to/your/sorted_sensor_data.csv' WITH (FORMAT CSV, HEADER);
    
  • Why it works: When data is sorted by time, TimescaleDB can process it sequentially, writing to chunks in a predictable order and reducing the overhead of chunk discovery and data placement.
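To see the whole check-then-sort flow end to end, here’s a small self-contained sketch (the file name, column layout, and sample rows are illustrative): it generates an unsorted CSV with a header, detects the disorder with sort -c, then produces a header-preserving sorted copy.

```shell
#!/bin/sh
cd "$(mktemp -d)"

# Generate a tiny unsorted sample CSV (timestamp in column 1).
cat > sensor_data.csv <<'EOF'
time,device_id,reading
2023-01-02T00:00:00,dev1,10.5
2023-01-01T00:00:00,dev2,9.8
2023-01-03T00:00:00,dev1,11.2
EOF

# Diagnosis: -c makes sort check order without sorting; non-zero exit = unsorted.
if tail -n +2 sensor_data.csv | sort -t, -k1,1 -c 2>/dev/null; then
    echo "already sorted"
    cp sensor_data.csv sorted_sensor_data.csv
else
    echo "not sorted -- sorting now"
    # Fix: keep the header, sort only the data rows by the time column.
    head -n 1 sensor_data.csv > sorted_sensor_data.csv
    tail -n +2 sensor_data.csv | sort -t, -k1,1 >> sorted_sensor_data.csv
fi

head -n 2 sorted_sensor_data.csv
```

The resulting sorted_sensor_data.csv keeps its header line first, so the COPY ... WITH (FORMAT CSV, HEADER) invocation above works unchanged.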

2. Batch Your Imports:

Copying billions of rows in a single COPY command is a recipe for disaster: one bad row late in the file rolls back the entire load, and memory use grows with the number of chunks the server has to keep open during the operation.

  • Diagnosis: If your COPY fails with out-of-memory errors, or a single malformed row near the end of the file forces a multi-hour load to roll back in its entirety, you’re likely trying to do too much at once.
  • Fix: Split your large CSV into smaller files, each representing a manageable time range (e.g., a day, a week, or a month, depending on your data volume). Then, run COPY for each file sequentially.
    # Example: split by year (timestamp in the first column, YYYY-MM-DD HH:MM:SS).
    # Strip the header first so each split file contains only data rows.
    tail -n +2 sensor_data.csv | awk -F, '$1 ~ /^2022/ {print > "sensor_data_2022.csv"}
                                          $1 ~ /^2023/ {print > "sensor_data_2023.csv"}'
    
    # Then copy each file (no HEADER option -- the split files have no header row)
    COPY sensor_data FROM '/path/to/your/sensor_data_2022.csv' WITH (FORMAT CSV);
    COPY sensor_data FROM '/path/to/your/sensor_data_2023.csv' WITH (FORMAT CSV);
    
  • Why it works: Each COPY command runs in its own transaction. Smaller transactions are less likely to exhaust memory or hit transaction limits. This also allows TimescaleDB to manage chunk creation and data insertion more granularly.
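The splitting step generalizes to any granularity. Here’s a sketch at month level (file names and sample rows are illustrative): awk keys each row on the YYYY-MM prefix of its timestamp, so a single pass produces one file per month, each ready for its own COPY.

```shell
#!/bin/sh
cd "$(mktemp -d)"

# Tiny sample spanning two months (timestamp in column 1, no header for simplicity).
cat > sensor_data.csv <<'EOF'
2023-01-05 00:00:00,dev1,10.5
2023-01-20 00:00:00,dev2,9.8
2023-02-03 00:00:00,dev1,11.2
EOF

# One pass: route each row to sensor_data_YYYY-MM.csv based on its timestamp prefix.
# (awk's ">" truncates a file on first use, then appends within the same run.)
awk -F, '{ out = "sensor_data_" substr($1, 1, 7) ".csv"; print > out }' sensor_data.csv

ls sensor_data_20*.csv
```

Each resulting file can then be loaded with its own COPY command, giving you one transaction per month of data.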

3. Disable Unnecessary Indexes (Temporarily):

Indexes on your hypertable are great for querying, but they add significant overhead during bulk inserts.

  • Diagnosis: If your COPY is slow even with sorted data and small batches, run \d+ sensor_data in psql and count how many indexes every inserted row has to update.
  • Fix: Drop indexes that aren’t needed during the load. (Indexes backing a PRIMARY KEY or UNIQUE constraint can’t be removed with DROP INDEX; leave those in place.)
    DROP INDEX IF EXISTS sensor_data_time_idx; -- Assuming this is your primary time index
    DROP INDEX IF EXISTS sensor_data_device_id_idx;
    -- ... drop other indexes ...
    
    After the load, recreate them:
    CREATE INDEX sensor_data_time_idx ON sensor_data (time DESC); -- Or whatever your hypertable uses
    CREATE INDEX sensor_data_device_id_idx ON sensor_data (device_id);
    -- ... recreate other indexes ...
    
  • Why it works: During COPY, every inserted row needs to be indexed. Dropping indexes means fewer operations per row, drastically speeding up the ingest. Recreating them afterwards ensures query performance.
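If the table has more than a couple of indexes, it’s easy to lose track of their exact definitions between the drop and the recreate. One way to avoid that (a sketch; adjust the table name to yours) is to save the DDL from the pg_indexes catalog view before dropping anything, so the recreate step is a copy-paste rather than a reconstruction:

```sql
-- Save this output before dropping anything: each indexdef row is a
-- ready-to-run CREATE INDEX statement.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'sensor_data';
```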

4. Use timescaledb.compress Wisely:

If you intend to compress your data, doing it after bulk loading is much more efficient than trying to compress data as it’s being inserted via COPY.

  • Diagnosis: If you’re seeing high disk I/O and CPU usage during COPY even on a well-indexed and sorted dataset, check whether a compression policy is already active — its background jobs may be compressing chunks while you’re still loading into them.
  • Fix: Load the raw, uncompressed data first. Then, use TimescaleDB’s compression features to compress the chunks after the load is complete.
    -- Load data as described above (sorted, batched, without indexes)
    -- ... COPY commands ...
    
    -- Enable compression on the hypertable
    ALTER TABLE sensor_data SET (
      timescaledb.compress,
      timescaledb.compress_segmentby = 'device_id',
      timescaledb.compress_orderby = 'time DESC'
    );
    
    -- Compress the freshly loaded chunks immediately
    SELECT compress_chunk(c) FROM show_chunks('sensor_data', older_than => INTERVAL '1 week') AS c;
    
    -- Or schedule a background policy to compress chunks as they age
    SELECT add_compression_policy('sensor_data', INTERVAL '1 week');
    
  • Why it works: Compression is a CPU-intensive operation. Batching it post-load allows you to dedicate resources to it without interfering with raw data ingestion speed. compress_segmentby and compress_orderby control how rows are grouped and ordered inside each compressed chunk, which drives both the compression ratio and how efficiently compressed data can be queried.
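To confirm the post-load compression actually paid off, TimescaleDB 2.x ships a stats function; a quick check (the hypertable name follows this article’s example) looks like:

```sql
-- Compare on-disk size before and after compression for the hypertable.
SELECT pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('sensor_data');
```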

5. Tune PostgreSQL Settings:

Bulk loading is a heavy operation that benefits from specific PostgreSQL configurations.

  • Diagnosis: If all else fails, or if your COPY is still sluggish, check your postgresql.conf (or postgresql.auto.conf).
  • Fix: Temporarily adjust these parameters for the duration of your bulk load:
    • maintenance_work_mem: Increase significantly (e.g., 512MB or 1GB) to speed up the index rebuild after the load (COPY itself doesn’t use this setting).
    • wal_level: Set to minimal if you don’t need replication or point-in-time recovery during the load; this also requires max_wal_senders = 0. Optionally set fsync to off as well. WARNING: With fsync = off, a crash mid-load can corrupt the entire cluster, not just lose recent writes. Revert these changes immediately after the load.
    • synchronous_commit: Set to off.
    • max_wal_size: Increase substantially (e.g., 4GB or 8GB) to avoid frequent checkpoints.
    # In postgresql.conf or via ALTER SYSTEM
    maintenance_work_mem = 1GB
    wal_level = minimal
    max_wal_senders = 0
    fsync = off
    synchronous_commit = off
    max_wal_size = 8GB
    
    Note that wal_level and max_wal_senders only take effect after a full server restart; the remaining settings apply after pg_ctl reload or SELECT pg_reload_conf();.
  • Why it works: These settings reduce Write-Ahead Log (WAL) overhead and give maintenance operations like index builds more memory, letting the database process the load much faster.
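The tune-load-revert cycle is easy to script. Below is a sketch, not a production tool: connection details and the file list are placeholders, and with DRY_RUN=1 (the default) it only prints the psql commands it would run, which is also a safe way to review the plan before touching a real server. It sticks to the settings that can be changed with a reload; wal_level and fsync are deliberately left out because they need a restart and carry crash-corruption risk.

```shell
#!/bin/sh
cd "$(mktemp -d)"
DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # printf, not echo: echo in some shells mangles backslashes like \copy
        printf 'would run: psql -c "%s"\n' "$1" | tee -a load_plan.log
    else
        psql -c "$1"
    fi
}

# 1. Relax durability-related settings for the bulk load (reverted below).
run "ALTER SYSTEM SET maintenance_work_mem = '1GB'"
run "ALTER SYSTEM SET synchronous_commit = off"
run "ALTER SYSTEM SET max_wal_size = '8GB'"
run "SELECT pg_reload_conf()"

# 2. Load each pre-split, pre-sorted file in its own transaction.
#    \copy streams from the client, so no server file access is needed.
for f in sensor_data_*.csv; do
    [ -e "$f" ] || continue
    run "\\copy sensor_data FROM '$f' WITH (FORMAT CSV)"
done

# 3. Put the settings back.
run "ALTER SYSTEM RESET maintenance_work_mem"
run "ALTER SYSTEM RESET synchronous_commit"
run "ALTER SYSTEM RESET max_wal_size"
run "SELECT pg_reload_conf()"
```

Running it with DRY_RUN=0 would execute the statements against whatever database your psql environment points at, so review the dry-run output first.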

Once you’ve successfully loaded your historical data, you might encounter a new, albeit less critical, issue: "Out of memory while processing query" if you try to query across all your historical data without proper indexing or filtering.

Want structured learning?

Take the full TimescaleDB course →