TimescaleDB’s event deduplication isn’t about preventing duplicate writes at the application layer; it’s about ensuring a consistent state after potentially concurrent or retried writes have already hit the database.
Let’s look at how this works in practice. Imagine you have a sensor sending readings every second, but sometimes network glitches cause a sensor to retry sending the same reading twice. You want only one record per sensor ID and timestamp.
Here’s a simplified setup:
CREATE TABLE sensor_readings (
time TIMESTAMPTZ NOT NULL,
device_id INT NOT NULL,
temperature DOUBLE PRECISION
);
-- Create a hypertable
SELECT create_hypertable('sensor_readings', 'time');
-- Add a unique constraint for deduplication
ALTER TABLE sensor_readings ADD CONSTRAINT unique_reading UNIQUE (time, device_id);
Now, if you try to insert the same reading twice, the second insert will fail. This is the application-level prevention. But what if your application doesn’t check for this failure and just retries? Or what if two different application instances try to insert the same data concurrently?
This is where TimescaleDB’s ON CONFLICT clause comes into play, specifically with DO NOTHING.
Consider this scenario:
-- First insert (succeeds)
INSERT INTO sensor_readings (time, device_id, temperature)
VALUES ('2023-10-27 10:00:00+00', 1, 25.5);
-- Second insert (duplicate, but we want to handle it gracefully)
INSERT INTO sensor_readings (time, device_id, temperature)
VALUES ('2023-10-27 10:00:00+00', 1, 25.5)
ON CONFLICT (time, device_id) DO NOTHING;
The DO NOTHING clause tells PostgreSQL (and by extension, TimescaleDB) that if the INSERT statement violates the unique_reading constraint (because a row with time = '2023-10-27 10:00:00+00' and device_id = 1 already exists), it should simply do nothing and not raise an error. The statement will complete successfully, but no new row will be inserted.
The result? Your sensor_readings table still contains only one row for that specific time and device.
The mental model here is that the UNIQUE constraint is the gatekeeper. The ON CONFLICT clause is how you instruct the database on what to do when that gatekeeper raises its hand. DO NOTHING means "if the gate is closed for this entry, just walk away quietly."
The key levers you control are:
- The
UNIQUEconstraint: This defines what constitutes a duplicate. It must include your time column and any other columns that, together, should uniquely identify a logical record. - The
ON CONFLICTclause: This determines the action taken when a conflict is detected.DO NOTHINGis for simple deduplication where you don’t need to update any existing data. - The
INSERTstatement: This is where you apply theON CONFLICTclause.
This is incredibly powerful for systems that might produce duplicate events, like IoT devices with unreliable networks or distributed systems that might retry operations. It allows your application to be more resilient without needing complex application-level logic to track what has already been successfully written.
What many people don’t realize is that ON CONFLICT isn’t limited to DO NOTHING. You can also use DO UPDATE SET ... to perform an "upsert" – insert if it doesn’t exist, or update if it does. This is crucial if, for example, a later event for the same unique identifier might contain more up-to-date information (e.g., a status update or a corrected reading). For deduplication purposes though, DO NOTHING is the direct answer.
The next step after mastering deduplication is often handling data that should be updated rather than just ignored, which leads to exploring ON CONFLICT DO UPDATE.